S5.04 - Dependability

flowchart LR
    A[One-off success is not enough] --> B[RAIDT - run-level evidence framework]
    A2[Drift, variance, and unmanaged change] --> B
    B --> C[[Dependability]]
    C --> D[Repeat runs and variance evidence]
    C --> E[Evidence pack]
    C --> F[Score profile]
    C --> G[Controlled change and drift monitoring]
    E --> H[Reviewer reconstruction]
    F --> I[Governance readiness]
    G --> I
    J[Healthcare, finance, public services, enterprise productivity] --> C
    K[Threshold checks and change logs] --> C

Star S5 - RAIDT Pillars and Scoring

Star context: Places Dependability within RAIDT's five-pillar score profile so that stability, consistency, and controlled change can be judged through run-level evidence rather than assumed through policy claims alone.

Academic picture

Definition / background

Dependability in RAIDT asks whether a particular generative AI configuration behaves stably, safely, and predictably enough for a defined organisational purpose when it is used repeatedly, exposed to foreseeable variation, or changed over time. It is therefore a governance judgement about the reliability of a run in context, not an abstract claim that a model is generally trustworthy.

Conceptually, the term sits close to ideas such as reliability, robustness, consistency, resilience, and operational assurance, but RAIDT gives it a more specific scope. Reliability often refers to whether a system works as intended; robustness often refers to resistance to perturbation; resilience often refers to recovery after failure. Dependability in RAIDT gathers these concerns into a practical review question: can this run be used again, under known conditions, with confidence that its behaviour will remain within acceptable bounds?

This matters in generative AI governance because many organisational failures arise not from a single spectacular error, but from unstable performance across time, users, prompts, integrations, or model versions. A run that looks acceptable once may still be unsuitable for operational deployment if it produces wide variation, degrades under workload, drifts after an update, or fails silently when upstream conditions change.

Within RAIDT, Dependability belongs inside the five-pillar model because run-level governance requires more than documenting intent or preserving logs. A run-level evidence pack should show not only what was done, but whether the run can be relied upon under expected conditions. The resulting score profile then makes Dependability visible alongside Responsibility, Auditability, Interpretability, and Traceability, allowing reviewers to see trade-offs rather than hiding them.

Why this concept matters

Dependability solves a central governance problem: organisations often approve generative AI uses on the basis of demos, vendor assurances, or one-off successful outputs, even though operational use depends on repeatability and controlled change. Without an explicit Dependability lens, governance can confuse isolated success with stable readiness.

The concept also helps avoid a common confusion between output quality and operational reliability. A single high-quality answer does not prove that a system is dependable. Dependability asks whether acceptable performance can be reproduced, monitored, and defended across the conditions that matter for real work.

If Dependability is missing, organisations risk adopting systems that behave inconsistently, fail unpredictably under pressure, or become unsafe after model, prompt, or workflow changes. This weakens reviewability, makes incidents harder to interpret, and leaves management without a sound basis for deciding whether a use case should scale, pause, or be redesigned.

For RAIDT, Dependability is one of the mechanisms that shifts AI governance from principles to operational control. It forces the governance conversation to ask what evidence demonstrates stability, what thresholds define acceptable variation, and what change-control process protects continued readiness.

Key idea: Dependability matters because RAIDT must show that a run is not only documented, but stable enough to be reviewed, repeated, and governed with confidence.

What this item measures

Stability of behaviour across repeat runs of the same or closely comparable task.
Sensitivity of outputs to foreseeable variation in prompts, inputs, context, users, or supporting tools.
Whether performance remains within agreed thresholds after updates, reconfiguration, or environmental change.
The presence and quality of drift monitoring, variance tracking, and incident review.
Whether change-control records support confidence that a previously acceptable run remains acceptable.
The degree to which claims of readiness are backed by evidence rather than anecdote.
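
As a rough illustration of the first and third measures above, the sketch below summarises repeat-run scores for one configured run and checks them against agreed thresholds. The function name, the 0.0 to 1.0 scoring scale, and the threshold values are all assumptions made for illustration; RAIDT as described here does not prescribe a particular statistic or cut-off.

```python
from statistics import mean, pstdev


def summarize_repeat_runs(scores, min_mean, max_spread):
    """Summarise repeat-run scores for one configured run.

    scores: quality scores (assumed 0.0-1.0) from rerunning the same task.
    min_mean, max_spread: hypothetical thresholds agreed by reviewers.
    """
    if len(scores) < 2:
        raise ValueError("need at least two repeat runs to estimate variance")
    avg = mean(scores)
    spread = pstdev(scores)  # population standard deviation of the runs
    return {
        "runs": len(scores),
        "mean": avg,
        "spread": spread,
        "within_thresholds": avg >= min_mean and spread <= max_spread,
    }


# Illustrative (made-up) scores: both sets clear the mean threshold,
# but the second varies too widely to count as stable.
stable = summarize_repeat_runs([0.90, 0.92, 0.88, 0.91], min_mean=0.80, max_spread=0.05)
unstable = summarize_repeat_runs([0.99, 0.60, 0.95, 0.70], min_mean=0.80, max_spread=0.05)
print(stable["within_thresholds"], unstable["within_thresholds"])
```

The second set shows why average quality alone is a weak basis for a Dependability judgement: its mean is acceptable, yet the spread across runs breaches the variance threshold.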

Practical example / likely audience question

Audience question

What evidence supports a Dependability score, and how is that different from simply saying the model worked well in testing?

Answer

The concern behind the question is that many AI evaluations focus on point performance rather than operational stability. A model may perform well in a benchmark or in a small pilot, yet still be unsuitable for live organisational use because outputs vary too widely across runs, edge cases are not controlled, or later changes are not monitored.

In RAIDT, evidence for Dependability includes repeat runs, variance measures, stability thresholds, drift monitoring, and change-control records. The direct question is not merely, "Did it work once?" but, "Can we show that this configured run behaves consistently enough, under expected conditions, for governance to rely on it?" A strong answer therefore combines technical evidence with governance evidence: repeated execution results, acceptable error ranges, records of updates, and documented responses when thresholds are breached.

For example, a team using a large language model to draft procurement summaries may find that the tool performs well on five sampled cases. That is encouraging, but it does not yet establish Dependability. RAIDT would ask whether the same configuration has been rerun across a wider sample, whether summary quality varies across document formats, whether updates to the model or prompt template changed performance, and whether these shifts are logged and reviewed. This is stronger than a generic AI governance approach because it ties the judgement to a specific run and produces an auditable basis for continued use.
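
One way to picture the "did an update change performance, and was it logged?" question is a change record that stores scores from before and after a change, plus a simple drift check. This is a minimal sketch under stated assumptions: `ChangeRecord`, `check_drift`, the identifier format, and the `max_drop` tolerance are hypothetical names and values, not part of RAIDT itself.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChangeRecord:
    """Hypothetical change-log entry for one configured run."""
    change_id: str
    description: str
    baseline_scores: list      # scores recorded before the change
    post_change_scores: list   # scores recorded after the change


def check_drift(record, max_drop):
    """Report the drop in mean score after a change and whether it exceeds max_drop."""
    drop = mean(record.baseline_scores) - mean(record.post_change_scores)
    return {"change_id": record.change_id, "drop": drop, "breach": drop > max_drop}


# Illustrative (made-up) figures for a prompt template revision.
record = ChangeRecord(
    change_id="CHG-001",
    description="Prompt template revised",
    baseline_scores=[0.90, 0.92, 0.88],
    post_change_scores=[0.80, 0.78, 0.82],
)
result = check_drift(record, max_drop=0.05)
print(result["breach"])
```

A breach here would not decide the outcome on its own; it would mark the point at which the change is reviewed and the response documented.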

Practical example in RAIDT terms

Consider a healthcare use case in which a generative AI system drafts clinic follow-up letters from structured consultation notes. The run-level issue is not simply whether one output looks fluent; it is whether the configured workflow produces consistently accurate and clinically safe letters across repeated cases, across different clinicians, and after updates to the prompt template or underlying model.

In RAIDT terms, the evidence pack would need repeat-run results, variance measures for omissions and wording changes, clinically defined acceptance thresholds, a record of model or template changes, and drift checks after deployment. Responsibility is affected because weak Dependability can expose patients to harm. Auditability is affected because reviewers need evidence of repeat testing and threshold breaches. Interpretability is affected because unexplained variance reduces confidence in why outputs change. Dependability itself is scored directly on the stability and change-control evidence. Traceability is affected because the organisation must connect changes in output behaviour to identifiable run conditions and system versions.
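
The evidence items listed above lend themselves to a simple completeness check before a reviewer scores the run. The sketch below is an assumption-laden illustration: the key names, the dictionary layout, and the sample values are invented for this example and are not a RAIDT-defined schema.

```python
# Hypothetical keys mirroring the evidence items named in the text.
REQUIRED_EVIDENCE = (
    "repeat_run_results",
    "variance_measures",
    "acceptance_thresholds",
    "change_records",
    "drift_checks",
)


def missing_evidence(evidence_pack):
    """List required evidence items that are absent or empty in the pack."""
    return [item for item in REQUIRED_EVIDENCE if not evidence_pack.get(item)]


# Illustrative (made-up) pack for the clinic follow-up letter workflow.
pack = {
    "repeat_run_results": ["batch-1", "batch-2"],
    "variance_measures": {"omission_rate_spread": 0.02},
    "acceptance_thresholds": {"max_omission_rate": 0.01},
    "change_records": [],  # present but empty, so still counted as a gap
    # "drift_checks" has not been supplied yet
}
gaps = missing_evidence(pack)
print(gaps)
```

Treating an empty entry as a gap is a deliberate choice in this sketch: a change log with no records gives a reviewer nothing with which to reconstruct what changed.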

By making these signals explicit, the item improves governance readiness: the organisation can justify deployment, limit the workflow to lower-risk cases, or halt use pending remediation. Dependability therefore supports a better operational decision than either enthusiasm or caution alone.

Detailed link to RAIDT

Dependability links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be based on evidence about actual use, not broad statements about AI systems in the abstract.
Second, it attaches that judgement to the run, meaning one configured use for one task, at one time, in one organisational context.
Third, it feeds directly into the evidence pack and the five-pillar score profile by showing whether repeated use and controlled change remain within acceptable bounds.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning because unstable behaviour becomes visible, discussable, and actionable.

Dependability → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

This chain matters because RAIDT does not treat dependable performance as an informal impression. It treats it as something to be evidenced, scored, reviewed, and improved over time.

Link to the five RAIDT pillars

Responsibility

Dependability supports Responsibility because an organisation cannot responsibly deploy a generative AI workflow if its behaviour is unstable under normal use or foreseeable change.