S5.11 - Repeat runs

flowchart LR
    A[Single good output can mislead] --> B[RAIDT<br/>Run-level evidence framework]
    A2[GenAI outputs may vary across runs] --> B
    A3[One-off demos do not prove dependability] --> B
    H[Healthcare discharge summaries] --> C
    I[Finance compliance drafting] --> C
    J[Public-service support] --> C
    K[Enterprise knowledge work] --> C
    B --> C[[Repeat runs<br/>Bounded variation across similar configured uses]]
    C --> D[Evidence pack<br/>Comparative run evidence]
    C --> E[Score profile<br/>Especially Dependability]
    C --> F[Reviewer reconstruction]
    C --> G[Governance readiness<br/>Reviewability, contestability, audit readiness]
    D --> G
    E --> G
    F --> G

Star S5 - RAIDT Pillars and Scoring

Star context: This item sits inside the RAIDT scoring logic by showing how repeated execution turns an apparently successful GenAI output into evidence about consistency, bounded variation, and governance reliability across the five pillars.

Academic picture

Definition / background

Repeat runs are multiple executions of the same or near-equivalent GenAI task configuration so that variation in outputs, performance, reasoning traces, or downstream consequences can be examined systematically. In governance terms, the purpose is not repetition for its own sake; it is to determine whether a claimed capability is robust enough to support organisational use, human oversight, and score assignment.

Conceptually, repeat runs emerge from a simple problem in generative AI: stochastic systems can produce materially different outputs even when the user believes the task is "the same". A single favourable run may therefore conceal instability, brittleness, prompt sensitivity, or context dependence. Repeat runs address this by turning one-off impressions into comparative run-level evidence.

Within RAIDT, repeat runs belong most clearly to the scoring and dependability logic of the framework, but they also affect the other four pillars. They help show whether observed behaviour is bounded, reconstructable, explainable, and governable. This matters because RAIDT does not ask organisations to assert that a system is reliable in principle; it asks them to show, through documented runs, what happened, under what conditions, and with what degree of repeatability.

Repeat runs are related to but not identical with benchmarking, validation, or replication. Benchmarking often compares systems on a standardised test set. Validation assesses whether a system meets a requirement. Replication usually concerns reproducing a study or experiment. Repeat runs in RAIDT are narrower and more operational: they test whether a particular configured use, in a particular context, produces evidence that is sufficiently stable to justify governance confidence at the run level.

Why this concept matters

Without repeat runs, organisations can mistake isolated success for dependable performance. That creates a governance gap: senior stakeholders may approve a workflow because they have seen one convincing answer, while hidden variability remains unmeasured. In practice, this can lead to inconsistent outputs, uneven user experience, weak escalation thresholds, and overconfident score profiles.

Repeat runs solve a practical assurance problem. They reveal whether outputs remain within acceptable bounds when the same task is executed again, whether small configuration changes produce disproportionate effects, and whether the organisation can explain why a result should be trusted. This helps distinguish a system that is genuinely governable from one that is merely impressive in demonstrations.

For organisations using GenAI, the concept matters because governance decisions are rarely made at the level of model theory alone. They are made at the level of actual work: drafting, summarising, classifying, recommending, retrieving, and assisting decisions. Repeat runs provide a disciplined way to test those work practices and to connect evidence directly to RAIDT evidence packs and pillar scores.

Key idea: Repeat runs matter because they turn single-use claims about GenAI performance into inspectable evidence about stability, variability, and governance readiness.

What this item enables

It enables a run-level assessment of whether output quality is stable enough for organisational use.
It enables the detection of hidden variability that a single successful run would not reveal.
It enables more defensible scoring for Dependability, supported by evidence rather than impression.
It enables reviewers to compare repeated executions and identify prompt sensitivity, contextual drift, or failure patterns.
It enables stronger evidence packs by attaching variability findings to a concrete task configuration.
It enables governance interventions such as tighter prompt templates, reviewer checkpoints, escalation thresholds, or use-case restrictions.

Practical example / likely audience question

Audience question

If a GenAI system already produced a good answer once, why should we repeat the run before making a governance judgement?

Answer

The concern behind this question is the assumption that one good output is representative of normal system behaviour. In deterministic software, that assumption is often reasonable. In generative AI, it is much weaker because output formation can vary with model stochasticity, hidden system updates, retrieval changes, prompt wording, latency conditions, or user interaction differences.

The direct answer is that repeat runs test whether the first success was typical or accidental. If five closely matched runs all remain within acceptable quality and risk boundaries, governance confidence is stronger. If one run is excellent, two are mediocre, and two introduce unsupported claims, then the governance conclusion changes substantially. The item therefore protects RAIDT from overclaiming on the basis of isolated evidence.
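
The five-run comparison above can be sketched as a simple tally. This is a minimal illustration under stated assumptions, not part of RAIDT itself: the verdict labels, the `summarise_runs` helper, the choice to treat "mediocre" as outside acceptable bounds, and the rule that every run must be within bounds are all hypothetical and would be set by the organisation's own reviewer criteria.

```python
from collections import Counter

# Hypothetical reviewer verdict labels; RAIDT does not prescribe these.
ACCEPTABLE = {"excellent", "acceptable"}


def summarise_runs(verdicts, required_share=1.0):
    """Tally reviewer verdicts across a set of repeat runs.

    required_share is an assumed threshold: the fraction of runs that must
    fall within acceptable bounds before the set counts as stable.
    """
    if not verdicts:
        raise ValueError("at least one run verdict is required")
    counts = Counter(verdicts)
    within = sum(n for label, n in counts.items() if label in ACCEPTABLE)
    share = within / len(verdicts)
    return {
        "runs": len(verdicts),
        "counts": dict(counts),
        "share_within_bounds": share,
        "stable": share >= required_share,
    }


# The mixed case from the text: one excellent, two mediocre, two unsupported.
mixed = summarise_runs(
    ["excellent", "mediocre", "mediocre", "unsupported_claim", "unsupported_claim"]
)
# Five closely matched runs that all stay within bounds.
consistent = summarise_runs(["acceptable"] * 5)

print(mixed["share_within_bounds"], mixed["stable"])
print(consistent["share_within_bounds"], consistent["stable"])
```

A tally like this does not replace reviewer judgement; it only makes the spread of verdicts visible so that it can be attached to the evidence pack.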

A practical example is a policy summarisation task for internal compliance staff. One run may produce a concise and accurate summary. Repeating the run across the same source policy, with controlled prompt and environment settings, may show that later outputs omit critical exceptions or add invented obligations. RAIDT handles this better than a generic AI governance approach because it ties the repeated evidence to a specific run configuration, documents the variation, and reflects that evidence in the pillar score profile rather than treating the first good result as sufficient proof.
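
One way to make the omission check in this example concrete is to compare each repeated output against a list of items the reviewer expects to see. The sketch below is an assumption-laden illustration: the `missing_items` helper, the required items, and the three outputs are invented for the example, and a substring match is only a crude way to flag candidates for human review. It does not detect invented obligations, which still need a reviewer.

```python
def missing_items(output, required_items):
    """Return the required items that do not appear in a run output.

    Case-insensitive substring matching is a stand-in for reviewer
    judgement; it flags possible omissions, nothing more.
    """
    text = output.lower()
    return [item for item in required_items if item.lower() not in text]


# Hypothetical exceptions that the source policy is assumed to contain.
required = ["contractors", "emergency access"]

# Invented outputs standing in for three repeat runs of the same task.
run_outputs = {
    "run-1": "Staff must use approved devices. Contractors are exempt. "
             "Emergency access is allowed for 24 hours.",
    "run-2": "Staff must use approved devices. Contractors are exempt.",
    "run-3": "Staff must use approved devices at all times.",
}

omissions = {
    run_id: missing_items(text, required)
    for run_id, text in run_outputs.items()
}

for run_id, missing in omissions.items():
    print(run_id, "omits:", missing or "nothing")
```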

Practical example in RAIDT terms

Consider a healthcare administration use case in which a GenAI assistant drafts discharge-summary explanations for patients in plain language. The run-level issue is that the same clinical source note may lead to different emphasis, omissions, or risk language across repeated executions, even when the prompt template appears unchanged.

In RAIDT terms, the organisation would preserve the task definition, prompt template, model version, input record characteristics, reviewer criteria, and the outputs from several repeat runs. The evidence needed would include run timestamps, configuration settings, output comparisons, reviewer judgements, and notes on whether each output remained within acceptable safety and clarity bounds.
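
The items listed above could be held in a small record per run. The following is a sketch only: the `RepeatRunRecord` class, its field names, and the `same_configuration` helper are hypothetical and are not defined by RAIDT. The sample values, including the model version string, are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RepeatRunRecord:
    """One execution within a repeat-run set (illustrative fields only)."""
    run_id: str
    timestamp: datetime
    task_definition: str
    prompt_template: str
    model_version: str
    input_record_ref: str
    settings: dict
    output: str
    reviewer_judgement: str
    within_bounds: bool
    notes: str = ""


def same_configuration(records):
    """True if all records share task, template, model version, input and settings."""
    keys = {
        (r.task_definition, r.prompt_template, r.model_version,
         r.input_record_ref, tuple(sorted(r.settings.items())))
        for r in records
    }
    return len(keys) <= 1


def make_run(run_id, output, within_bounds, model_version="model-v1"):
    # Placeholder values; a real record would come from the run log.
    return RepeatRunRecord(
        run_id=run_id,
        timestamp=datetime.now(timezone.utc),
        task_definition="Plain-language discharge explanation",
        prompt_template="discharge-explainer-template",
        model_version=model_version,
        input_record_ref="source-note-001",
        settings={"temperature": 0.2},
        output=output,
        reviewer_judgement="clear" if within_bounds else "omits follow-up advice",
        within_bounds=within_bounds,
    )


runs = [make_run("r1", "Explanation A", True), make_run("r2", "Explanation B", False)]
print(same_configuration(runs))
print([r.run_id for r in runs if not r.within_bounds])
```

Checking that the configuration really is shared before comparing outputs helps separate variation in the system from variation introduced by a changed prompt, model version, or setting.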

Dependability is the pillar most directly affected, because consistency across runs is exactly what is being examined. Responsibility is also involved, because patient-facing communication creates accountability for harm or confusion. Auditability and Traceability matter because reviewers must be able to reconstruct which repeat runs were performed and what differences emerged. Interpretability matters where the team needs to explain why certain variations occurred or why they remain tolerable.

This improves governance readiness because the organisation can show not merely that the assistant once produced a suitable discharge explanation, but that repeated use under defined conditions behaves within a managed and reviewable envelope. That is the kind of claim a supervisor, auditor, or safety committee can inspect.

Detailed link to RAIDT

Repeat runs link to RAIDT in four ways.

First, they support RAIDT's core idea that GenAI governance should be grounded in evidence from actual uses rather than principles stated in the abstract.
Second, they strengthen the run as the unit of governance by testing whether one configured use behaves consistently when executed more than once.
Third, they enrich both the evidence pack and the score profile by supplying comparative evidence about stability, drift, failure modes, and acceptable bounds.
Fourth, they improve reviewability, contestability, audit readiness, and organisational learning because repeated outcomes can be reconstructed, challenged, and used to refine controls.

Repeat runs → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

Link to the five RAIDT pillars

Responsibility

Repeat runs support Responsibility by showing whether the organisation has taken reasonable steps to check that a use case does not rely on an isolated good outcome before it affects people, decisions, or services.