S5.04 - Dependability

flowchart LR
    A[One-off success is not enough] --> B[RAIDT - run-level evidence framework]
    A2[Drift, variance, and unmanaged change] --> B
    B --> C[[Dependability]]
    C --> D[Repeat runs and variance evidence]
    C --> E[Evidence pack]
    C --> F[Score profile]
    C --> G[Controlled change and drift monitoring]
    E --> H[Reviewer reconstruction]
    F --> I[Governance readiness]
    G --> I
    J[Healthcare, finance, public services, enterprise productivity] --> C
    K[Threshold checks and change logs] --> C

Star S5 - RAIDT Pillars and Scoring

Star context: Places Dependability within RAIDT's five-pillar score profile so that stability, consistency, and controlled change can be judged through run-level evidence rather than assumed through policy claims alone.


Academic picture
Definition / background

Dependability in RAIDT asks whether a particular generative AI configuration behaves stably, safely, and predictably enough for a defined organisational purpose when it is used repeatedly, exposed to foreseeable variation, or changed over time. It is therefore a governance judgement about the reliability of a run in context, not an abstract claim that a model is generally trustworthy.

Conceptually, the term sits close to ideas such as reliability, robustness, consistency, resilience, and operational assurance, but RAIDT gives it a more specific scope. Reliability often refers to whether a system works as intended; robustness often refers to resistance to perturbation; resilience often refers to recovery after failure. Dependability in RAIDT gathers these concerns into a practical review question: can this run be used again, under known conditions, with confidence that its behaviour will remain within acceptable bounds?

This matters in generative AI governance because many organisational failures arise not from a single spectacular error, but from unstable performance across time, users, prompts, integrations, or model versions. A run that looks acceptable once may still be unsuitable for operational deployment if it produces wide variation, degrades under workload, drifts after an update, or fails silently when upstream conditions change.

Within RAIDT, Dependability belongs inside the five-pillar model because run-level governance requires more than documenting intent or preserving logs. A run-level evidence pack should show not only what was done, but whether the run can be relied upon under expected conditions. The resulting score profile then makes Dependability visible alongside Responsibility, Auditability, Interpretability, and Traceability, allowing reviewers to see trade-offs rather than hiding them.

Why this concept matters

Dependability solves a central governance problem: organisations often approve generative AI uses on the basis of demos, vendor assurances, or one-off successful outputs, even though operational use depends on repeatability and controlled change. Without an explicit Dependability lens, governance can confuse isolated success with stable readiness.

The concept also helps avoid a common confusion between output quality and operational reliability. A single high-quality answer does not prove that a system is dependable. Dependability asks whether acceptable performance can be reproduced, monitored, and defended across the conditions that matter for real work.

If Dependability is missing, organisations risk adopting systems that behave inconsistently, fail unpredictably under pressure, or become unsafe after model, prompt, or workflow changes. This weakens reviewability, makes incidents harder to interpret, and leaves management without a sound basis for deciding whether a use case should scale, pause, or be redesigned.

For RAIDT, Dependability is one of the mechanisms that shifts AI governance from principles to operational control. It forces the governance conversation to ask what evidence demonstrates stability, what thresholds define acceptable variation, and what change-control process protects continued readiness.

Key idea: Dependability matters because RAIDT must show that a run is not only documented, but stable enough to be reviewed, repeated, and governed with confidence.

What this item measures
Practical example / likely audience question

Audience question

What evidence supports a Dependability score, and how is that different from simply saying the model worked well in testing?

Answer

The concern behind the question is that many AI evaluations focus on point performance rather than operational stability. A model may perform well in a benchmark or in a small pilot, yet still be unsuitable for live organisational use because outputs vary too widely across runs, edge cases are not controlled, or later changes are not monitored.

In RAIDT, evidence for Dependability includes repeat runs, variance measures, stability thresholds, drift monitoring, and change-control records. The direct question is not merely, "Did it work once?" but, "Can we show that this configured run behaves consistently enough, under expected conditions, for governance to rely on it?" A strong answer therefore combines technical evidence with governance evidence: repeated execution results, acceptable error ranges, records of updates, and documented responses when thresholds are breached.
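
The combination of repeat runs, a variance measure, and a stability threshold can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a RAIDT-specified method: the function name `summarise_repeat_runs`, the 0-to-1 reviewer quality scores, and the threshold values are all hypothetical.

```python
from statistics import mean, pstdev


def summarise_repeat_runs(scores, min_mean=0.80, max_stdev=0.05):
    """Summarise quality scores from repeat runs of one configured run.

    The thresholds are illustrative assumptions; a real review would
    set them for the specific task, context, and risk level.
    """
    if len(scores) < 2:
        raise ValueError("need at least two repeat runs to assess variation")
    avg = mean(scores)
    spread = pstdev(scores)  # population standard deviation across runs
    return {
        "runs": len(scores),
        "mean": round(avg, 3),
        "stdev": round(spread, 3),
        "within_threshold": avg >= min_mean and spread <= max_stdev,
    }


# Two hypothetical configurations with similar average quality.
stable = summarise_repeat_runs([0.88, 0.90, 0.87, 0.89, 0.91])
unstable = summarise_repeat_runs([0.99, 0.70, 0.95, 0.78, 0.98])
print(stable)
print(unstable)
```

Both sample configurations have a similar mean score, but only the first stays inside the assumed variation limit. That gap is the difference between point performance and stability that a Dependability score is meant to expose.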

For example, a team using a large language model to draft procurement summaries may find that the tool performs well on five sampled cases. That is encouraging, but it does not yet establish Dependability. RAIDT would ask whether the same configuration has been rerun across a wider sample, whether summary quality varies across document formats, whether updates to the model or prompt template changed performance, and whether these shifts are logged and reviewed. This is stronger than a generic AI governance approach because it ties the judgement to a specific run and produces an auditable basis for continued use.

Practical example in RAIDT terms

Consider a healthcare use case in which a generative AI system drafts clinic follow-up letters from structured consultation notes. The run-level issue is not simply whether one output looks fluent; it is whether the configured workflow produces consistently accurate and clinically safe letters across repeated cases, across different clinicians, and after updates to the prompt template or underlying model.

In RAIDT terms, the evidence pack would need repeat-run results, variance measures for omission and wording changes, clinically defined acceptance thresholds, a record of model or template changes, and drift checks after deployment. Responsibility is affected because weak Dependability can expose patients to harm. Auditability is affected because reviewers need evidence of repeat testing and threshold breaches. Interpretability is affected because unexplained variance reduces confidence in why outputs change. Dependability is directly scored through stability and change control. Traceability is affected because the organisation must connect changes in output behaviour to identifiable run conditions and system versions.
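
One way to make the evidence items listed above checkable is to hold them in a simple structure and report which are still missing. The sketch below is an assumption for illustration only: the class name `DependabilityEvidence`, its field names, and the sample values are hypothetical and do not come from a RAIDT schema.

```python
from dataclasses import dataclass, field


@dataclass
class DependabilityEvidence:
    """Hypothetical container for the Dependability part of an evidence pack."""

    run_id: str
    repeat_run_results: list = field(default_factory=list)
    variance_measures: dict = field(default_factory=dict)
    acceptance_thresholds: dict = field(default_factory=dict)
    change_records: list = field(default_factory=list)
    drift_checks: list = field(default_factory=list)

    def missing_items(self):
        """Return the names of evidence items that are still empty."""
        required = [
            "repeat_run_results",
            "variance_measures",
            "acceptance_thresholds",
            "change_records",
            "drift_checks",
        ]
        return [name for name in required if not getattr(self, name)]


pack = DependabilityEvidence(
    run_id="clinic-letters-v1",  # illustrative identifier
    repeat_run_results=[{"case": "A", "omissions": 0}, {"case": "B", "omissions": 1}],
    acceptance_thresholds={"max_omissions_per_letter": 0},
    change_records=[{"item": "prompt template", "version": "2"}],
)
print(pack.missing_items())
```

In this sample pack, variance measures and drift checks have not yet been supplied, so a reviewer would see at a glance that the Dependability judgement cannot yet be fully supported.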

By making these signals explicit, the item improves governance readiness: the organisation can justify deployment, limit the workflow to lower-risk cases, or halt use pending remediation. Dependability therefore supports a better operational decision than either enthusiasm or caution alone.

Detailed link to RAIDT

Dependability links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be based on evidence about actual use, not broad statements about AI systems in the abstract.
Second, it attaches that judgement to the run, meaning one configured use for one task, at one time, in one organisational context.
Third, it feeds directly into the evidence pack and the five-pillar score profile by showing whether repeated use and controlled change remain within acceptable bounds.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning because unstable behaviour becomes visible, discussable, and actionable.

Dependability → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

This chain matters because RAIDT does not treat dependable performance as an informal impression. It treats it as something to be evidenced, scored, reviewed, and improved over time.

Link to the five RAIDT pillars

Responsibility

Dependability supports Responsibility because an organisation cannot responsibly deploy a generative AI workflow if its behaviour is unstable under normal use or foreseeable change.

Example evidence / implication:

Auditability

Dependability supports Auditability by requiring records that allow reviewers to inspect repeat runs, compare outputs, and understand whether changes were tested and approved.

Example evidence / implication:

Interpretability

Dependability interacts with Interpretability because unexplained shifts in behaviour are harder to trust and harder to govern. Interpretable patterns can help determine whether variation is acceptable or symptomatic of a deeper issue.

Example evidence / implication:

Dependability

Dependability is the pillar that directly measures stability, consistency, controlled change, and resilience under expected operating conditions.

Example evidence / implication:

Traceability

Dependability depends on Traceability because instability cannot be diagnosed or governed unless outputs can be linked back to prompts, versions, inputs, users, and workflow conditions.

Example evidence / implication:

Dependability is the strongest direct link here, but the pillar only becomes robust when the other four pillars provide the surrounding governance conditions that make stability evidence meaningful.

Why this item is more than a generic concept

In general AI governance, dependability may simply mean that a system is reliable in a broad, everyday sense: it usually works, appears stable, or has passed a technical evaluation. In RAIDT, Dependability is more precise. It asks whether a specific run can be shown, through evidence, to behave consistently enough for a defined organisational purpose.

The RAIDT meaning is more operational because it is tied to run-level evidence, evidence packs, scoring, and review procedures. Instead of treating dependability as a vague virtue, RAIDT turns it into a governable property with observable indicators, thresholds, and consequences for approval, monitoring, and redesign.

Common misunderstanding

Misunderstanding

Dependability means proving that the model will always produce correct outputs.

Correction

Dependability does not prove perfection, and RAIDT does not require impossible guarantees. Instead, it asks whether behaviour is stable and well-controlled enough for the intended context, with known limits and monitoring in place. For example, a student-support chatbot may still make occasional mistakes, yet be rated as dependable for its intended use if it shows low variance on common queries, has clear escalation routes for uncertain cases, and is revalidated after prompt changes. The correction is therefore practical: Dependability is about evidence-backed operational confidence, not absolute certainty.

Boundary and limitation

Dependability does not prove that a system is normatively justified, legally compliant, fair in all respects, or accurate in every case. A stable system can still be biased, misleading, or poorly aligned with organisational values. Equally, a highly interpretable system may still be operationally fragile.

The concept also depends on the quality of the testing regime. If repeat runs are too narrow, if thresholds are poorly defined, or if meaningful sources of variation are ignored, a Dependability judgement may give false confidence. RAIDT handles this limitation by embedding Dependability within a broader evidence pack and score profile, so that stability claims are assessed alongside Responsibility, Auditability, Interpretability, and Traceability rather than in isolation.

Implementation levels

Manual implementation

A researcher or small team can apply Dependability manually by running repeated task trials, recording output variation, noting failure patterns, and documenting any prompt or model changes in a simple evidence log. Manual review can still be rigorous if the run definition, sample selection, and acceptance thresholds are explicit.
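
A simple evidence log of this kind can be kept as a CSV file. The sketch below assumes a hypothetical column layout (`trial`, `input_ref`, `prompt_version`, `model_version`, `outcome`, `notes`) and invented sample rows. It writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io

# Hypothetical column layout for a manual evidence log.
FIELDS = ["trial", "input_ref", "prompt_version", "model_version", "outcome", "notes"]


def write_evidence_log(rows, stream):
    """Write trial records to any text stream in CSV form."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def failure_rate(rows):
    """Share of trials whose outcome was not accepted by the reviewer."""
    failures = sum(1 for row in rows if row["outcome"] != "accept")
    return failures / len(rows)


# Invented sample trials; version labels are placeholders.
trials = [
    {"trial": 1, "input_ref": "doc-01", "prompt_version": "p1",
     "model_version": "m1", "outcome": "accept", "notes": ""},
    {"trial": 2, "input_ref": "doc-01", "prompt_version": "p1",
     "model_version": "m1", "outcome": "accept", "notes": ""},
    {"trial": 3, "input_ref": "doc-02", "prompt_version": "p1",
     "model_version": "m1", "outcome": "reject", "notes": "omitted a key clause"},
    {"trial": 4, "input_ref": "doc-02", "prompt_version": "p2",
     "model_version": "m1", "outcome": "accept", "notes": "after prompt change"},
]

buffer = io.StringIO()
write_evidence_log(trials, buffer)
log_text = buffer.getvalue()
print(log_text)
print("failure rate:", failure_rate(trials))
```

Recording the prompt and model version on every row is the design choice that matters here: it lets a reviewer relate a failure pattern to a specific change without any tooling beyond a spreadsheet.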

Semi-automated implementation

Semi-automated implementation adds structured templates, metadata capture, scoring sheets, and periodic checks that compare current outputs with previously accepted baselines. This reduces reviewer burden and improves consistency across teams without requiring a full governance platform.
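
A periodic baseline check can be sketched with the Python standard library. The example assumes that textual similarity is an acceptable first-pass signal, which it often is not for meaning-level changes, so flagged cases would still go to a human reviewer. The function name `compare_to_baseline`, the 0.8 similarity floor, and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher


def compare_to_baseline(baseline, current, min_similarity=0.8):
    """Flag cases whose current output diverges from the accepted baseline.

    Similarity is a character-level ratio between 0 and 1. It is a crude
    proxy: it detects wording drift, not whether the meaning changed.
    """
    flagged = []
    for case_id, accepted_text in baseline.items():
        new_text = current.get(case_id, "")  # a missing output counts as divergent
        similarity = SequenceMatcher(None, accepted_text, new_text).ratio()
        if similarity < min_similarity:
            flagged.append((case_id, round(similarity, 2)))
    return flagged


# Invented sample outputs for two cases.
baseline = {
    "case-1": "Supplier A meets all mandatory criteria.",
    "case-2": "Supplier B failed the financial check.",
}
current = {
    "case-1": "Supplier A meets all mandatory criteria.",
    "case-2": "No issues were found.",
}
flagged = compare_to_baseline(baseline, current)
print(flagged)
```

Here only the second case is flagged, because its current output no longer resembles the previously accepted summary.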

Fully automated implementation

At scale, Dependability can be implemented through orchestration layers, logging systems, dashboards, and governance pipelines that automatically capture run metadata, schedule repeat tests, compare outputs against thresholds, flag drift, and route exceptions for review. In this form, RAIDT becomes a continuous oversight mechanism rather than a one-off assessment exercise.
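
The threshold comparison and drift flagging described above can be illustrated with a rolling-window check. This is a minimal sketch, not a description of any particular governance platform: the class name `DriftMonitor`, the window size, the tolerance, and the sample scores are assumptions.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Hypothetical monitor comparing recent run scores with an accepted baseline."""

    def __init__(self, baseline_mean, tolerance=0.05, window=5):
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the latest scores
        self.exceptions = []  # items routed for human review

    def record(self, run_id, score):
        """Record one run score and return True if drift is flagged."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough runs yet to judge drift
        gap = abs(mean(self.recent) - self.baseline_mean)
        drifted = gap > self.tolerance
        if drifted:
            self.exceptions.append({"run_id": run_id, "gap": round(gap, 3)})
        return drifted


monitor = DriftMonitor(baseline_mean=0.90, tolerance=0.05, window=3)
scores = [0.91, 0.89, 0.90, 0.80, 0.78, 0.76]  # invented scores that degrade over time
flags = [monitor.record(f"run-{i}", s) for i, s in enumerate(scores, start=1)]
print(flags)
print(monitor.exceptions)
```

With these sample scores the monitor stays quiet while recent runs match the baseline and starts flagging once the rolling mean moves outside the tolerance. The flagged runs accumulate in `exceptions`, which stands in for whatever review queue an organisation actually uses.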

Practical use in the RAIDT project

Within the RAIDT project, Dependability helps explain why run-level evidence is necessary in the first place: organisational users need to know not only what a system can do, but whether it does so consistently enough to support accountable practice. In Paper 08 Foundations, the concept strengthens the theoretical case that governance should be anchored in situated runs rather than abstract model claims. In Paper 09 Empirical Validation, it provides a basis for measuring score variation, repeatability, calibration, and reviewer confidence. In Paper 10 Policy Pathways, it supports arguments for operational oversight, threshold-based assurance, and post-deployment review.

The item is also useful in sector playbooks, evidence pack design, scoring rubric development, and governance interventions. For supervision meetings and viva defence, it helps answer a recurring challenge: why RAIDT needs a separate pillar for stability and controlled change instead of assuming that documentation alone is enough.

Key audience questions to prepare for

Q1. How is Dependability different from simple accuracy?

Accuracy concerns whether outputs are correct against some criterion. Dependability is broader: it concerns whether acceptable performance is stable across repeat use, foreseeable variation, and change over time. A run can be accurate on a sample yet still be operationally undependable.

Q2. Why score Dependability separately if repeat runs already exist elsewhere in the evidence pack?

Because the existence of evidence is not the same as the governance judgement derived from it. RAIDT separates evidence collection from pillar scoring so reviewers can see how strongly the evidence supports operational confidence.

Q3. Can a system be dependable in one context and not in another?

Yes. RAIDT is run-level by design. A model may be dependable for low-risk drafting in one workflow and undependable for high-stakes decision support elsewhere because tasks, inputs, oversight, and tolerance for variation differ.

Q4. Does high Dependability eliminate the need for human oversight?

No. High Dependability may justify more confident use, but oversight still depends on risk, domain, and organisational responsibility. Stability supports governance; it does not replace it.

Q5. What is the strongest practical sign of poor Dependability?

A widening gap between expected and observed behaviour across repeated use, especially when updates or context changes occur without clear explanation, revalidation, or documented response.

Suggested citation concepts to support this item
Short explanation for presentation

Dependability in RAIDT means asking whether a specific generative AI run behaves stably and safely enough for repeated organisational use. It is not a claim that the model is always correct, and it is not just a synonym for technical reliability. Instead, it is a run-level governance judgement based on evidence such as repeat runs, variance measures, drift monitoring, and change-control records. This matters because organisations often approve AI workflows after a small number of successful examples, even though operational use depends on consistency over time and under changing conditions. By making Dependability a scored pillar, RAIDT turns stability from an informal impression into something reviewable, contestable, and auditable. That helps supervisors, reviewers, and practitioners explain why a system should scale, stay limited, or be reworked before further use.

One-line takeaway

Dependability is the RAIDT pillar that judges whether a run remains stable, controlled, and usable in practice because governance readiness depends on evidence of consistent behaviour, not isolated success.

Related items in RAIDT pillars and scoring
Mentioned in reference-paper summaries (5)

Paper summaries live in Port/93-References/pdf_summaries/. Each file listed below contains the key term at least once.

Anchored questions