S10.05 - Repeated runs

flowchart LR
    A[Single-run evaluation hides variance] --> B[RAIDT - run-level evidence framework]
    B --> C[[Repeated runs]]
    H[Healthcare, finance, law, public services, cybersecurity] --> C
    C --> D[Run-level evidence pack]
    C --> E[RAIDT score profile]
    C --> I[Reviewer reconstruction]
    D --> F[Reviewability and contestability]
    E --> G[Governance readiness]
    I --> G

Star S10 - Empirical Programme, Domains and Sector Playbooks

Star context: Shows how RAIDT is tested, calibrated and applied across domains and sector-specific playbooks by repeating comparable runs so that variation becomes visible, reviewable and governable rather than hidden behind a single impressive output.

Academic picture

Definition / background

Repeated runs means deliberately executing the same or closely controlled generative AI task more than once so that output variation can be inspected, documented and assessed. In RAIDT, this is not a generic testing habit but a run-level governance mechanism. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. Repeating that run, or a tightly comparable version of it, exposes whether the system behaves consistently enough for organisational use.

Conceptually, the idea draws on repeated measurement, reliability checking and empirical robustness assessment. However, RAIDT adapts those ideas for governance rather than for model science alone. The aim is not simply to say whether a model performs well on average. The aim is to show whether claims about responsible use can be supported by evidence across multiple executions of the same practical task.

This matters because a single successful output can be misleading. One run may look accurate, interpretable and policy-compliant, while another run under the same apparent conditions may omit key facts, change tone, introduce unsupported content or fail to follow instructions. Repeated runs therefore help separate a one-off good result from a pattern that can support reviewable governance claims.

Within RAIDT, repeated runs connect directly to the run-level evidence pack and the five-pillar score profile. The evidence pack records what happened across runs, including prompts, settings, timestamps, outputs, assessments and reviewer notes. The score profile then uses that evidence to inform judgements across Responsibility, Auditability, Interpretability, Dependability and Traceability. Repeated runs belong inside RAIDT because they convert variability from a hidden property of GenAI into explicit evidence for governance.
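As a minimal sketch of how an evidence pack might hold repeated runs, the Python below models one record per run using the fields named above (prompt, settings, timestamp, output, assessment, reviewer notes). The class and field names (`RunRecord`, `EvidencePack`, `task_id`, `run_id`) are illustrative assumptions, not part of any published RAIDT schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunRecord:
    """One configured use of a GenAI system (hypothetical schema)."""
    run_id: str
    prompt: str
    settings: dict            # e.g. model name, temperature
    timestamp: datetime
    output: str
    assessment: str = ""      # assessor's judgement of this run
    reviewer_notes: str = ""


@dataclass
class EvidencePack:
    """Groups repeated runs of the same task (hypothetical schema)."""
    task_id: str
    runs: list[RunRecord] = field(default_factory=list)

    def add(self, record: RunRecord) -> None:
        self.runs.append(record)

    def distinct_outputs(self) -> int:
        # Number of different output texts across the repeated runs.
        return len({r.output for r in self.runs})


# Usage: three runs of the same prompt, two distinct outputs.
pack = EvidencePack(task_id="compliance-summary")
now = datetime.now(timezone.utc)
for run_id, text in [("run-1", "A"), ("run-2", "B"), ("run-3", "A")]:
    pack.add(RunRecord(run_id, "Summarise the policy.", {"temperature": 0.2}, now, text))
```

Keeping every run in one structure, rather than only the best output, is what lets a reviewer later inspect the spread of outputs and not a single example.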

Repeated runs are also distinct from adjacent concepts. They are not the same as benchmarking across unrelated tasks, broad replication across institutions, or stress testing under deliberately changing conditions. Their specific value is controlled repetition around a defined run so that stability, variance and failure modes can be understood at the level where governance decisions are actually made.

Why this concept matters

Repeated runs solve a basic but often neglected governance problem: organisational actors frequently evaluate generative AI through isolated demonstrations. That creates false confidence because the audience sees one output but not the spread of possible outputs. When repeated runs are missing, governance discussions drift back towards assertions such as "the model usually does this" or "we tested it and it looked fine" without evidential grounding.

In organisational use, this matters because decisions about adoption, oversight, escalation and human review often depend on whether a system behaves with acceptable consistency. Repeated runs make that question operational. They reveal whether good performance is robust, fragile, or highly context-sensitive. They also help prevent confusion between prompt engineering success in one instance and dependable behaviour across comparable uses.

For RAIDT, the concept is central to moving from principles to operational governance. Principles can say that systems should be safe, fair, explainable or auditable. Repeated runs help show whether those expectations survive contact with actual use. They therefore support contestability, audit readiness and continuous improvement because organisations can inspect not just one outcome, but the pattern of outcomes.

Key idea: repeated runs matter because governance claims about GenAI should rest on observable patterns of behaviour, not on one-off outputs.

What this item enables

Reveals run-to-run variance that would be invisible in a single demonstration.
Supports more defensible Dependability judgements by showing whether outputs are stable enough for the task.
Helps distinguish systematic design problems from stochastic model variation.
Provides comparative material for reviewer reconstruction and quality assurance.
Strengthens evidence packs with multiple outputs, assessor notes and variance observations.
Improves score-profile credibility because ratings can be justified against repeated evidence rather than isolated examples.
Supports escalation rules when variance exceeds an acceptable threshold for the domain.
Enables organisational learning about which prompts, settings or safeguards produce more reliable behaviour.
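The escalation point in the list above can be sketched as a simple check, assuming textual similarity is an acceptable stand-in for variance. That is a crude proxy: two outputs can share most of their wording and still differ on a critical fact. The function names and the `min_similarity` default of 0.8 are placeholders for illustration; the acceptable threshold is a domain decision, not a value RAIDT prescribes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarities(outputs):
    """Similarity ratio in [0, 1] for every pair of run outputs."""
    return [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]


def needs_escalation(outputs, min_similarity=0.8):
    """True when any pair of outputs falls below the agreed threshold."""
    sims = pairwise_similarities(outputs)
    # With fewer than two runs there is nothing to compare.
    return bool(sims) and min(sims) < min_similarity
```

Using the minimum pairwise similarity rather than the average means a single outlying run is enough to trigger review, which suits settings where one poor output matters.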

Practical example / likely audience question

Audience question

Why not single runs?

Answer

GenAI behaviour can vary, so repeated evidence is necessary for stability claims.

The concern behind the question is usually efficiency: if one output looks correct, why spend time repeating the task? The problem is that one apparently strong answer can conceal instability. A model may summarise a policy correctly on one run and omit a critical exception on the next. If governance relies on the first run alone, the organisation is not assessing dependable use; it is assessing a lucky instance.

A practical example is a GenAI assistant drafting a compliance summary for internal staff. One run may produce a clear, policy-aligned explanation, while later runs introduce ambiguity about escalation thresholds or reporting duties. RAIDT handles this better than a generic AI governance approach because it requires the organisation to treat each run as evidence and then compare runs systematically. Instead of saying only that the tool was "tested", RAIDT can show how many times it was run, what changed, what stayed stable, and how that evidence affected the score profile.
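A rough illustration of reporting "how many times it was run, what changed, what stayed stable" is given below. It compares outputs sentence by sentence, which is an assumption made for simplicity: the splitter is naive, and exact-match comparison treats any rewording as a change. The function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: breaks on whitespace that follows ., ! or ?
    return {s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()}


def compare_runs(outputs):
    """Summarise the run count, sentences in every run, and sentences in only some."""
    sentence_sets = [split_sentences(o) for o in outputs]
    stable = set.intersection(*sentence_sets) if sentence_sets else set()
    seen = set.union(*sentence_sets) if sentence_sets else set()
    return {
        "run_count": len(outputs),
        "stable": sorted(stable),
        "changed": sorted(seen - stable),
    }


summary = compare_runs([
    "Report within 24 hours. Escalate to the compliance lead.",
    "Report within 24 hours. Escalate if needed.",
])
```

Here the reporting duty appears in both runs, while the escalation wording differs, which is the kind of difference a reviewer would then examine.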

Practical example in RAIDT terms

In healthcare administration, imagine a GenAI system used to draft discharge-information summaries for patients with diabetes. The use case appears simple: the clinician provides structured notes and the system generates a patient-friendly summary. The run-level issue is that repeated runs with the same notes and prompt template may differ in how clearly they present medication timing, warning signs and follow-up instructions.

Under RAIDT, the evidence needed would include the prompt template, model and configuration details, timestamps, all repeated outputs, reviewer annotations, and a record of which omissions or ambiguities appeared across runs. The most affected pillars would be Dependability, Auditability and Traceability, with Responsibility also implicated because the organisation must decide what level of variation is acceptable for patient communication.

Repeated runs improve governance readiness here by showing whether the system is consistently safe enough for supervised use, whether certain failure patterns recur, and whether extra controls are needed before deployment. A single strong output might support optimism; repeated runs support a defensible governance judgement.
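The record of omissions described for the discharge-summary example could be tallied along the following lines. This sketch assumes that human reviewers, not the code, judge whether each element is clearly presented in a run; the code only counts their annotations. `REQUIRED_ELEMENTS` and `omission_counts` are hypothetical names, and the three elements are taken from the example above.

```python
REQUIRED_ELEMENTS = ("medication timing", "warning signs", "follow-up instructions")


def omission_counts(reviewer_marks):
    """Count, per required element, the runs where it was not marked as clear.

    reviewer_marks maps a run id to the set of elements a reviewer
    judged to be clearly presented in that run's output.
    """
    counts = {element: 0 for element in REQUIRED_ELEMENTS}
    for present in reviewer_marks.values():
        for element in REQUIRED_ELEMENTS:
            if element not in present:
                counts[element] += 1
    return counts


marks = {
    "run-1": {"medication timing", "warning signs", "follow-up instructions"},
    "run-2": {"medication timing", "warning signs"},
    "run-3": {"medication timing"},
}
tally = omission_counts(marks)
```

A tally like this shows whether an omission is a one-off or a recurring failure pattern, which is the distinction the governance judgement depends on.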

Detailed link to RAIDT

The concept of repeated runs links to RAIDT in four ways.

First, it connects to RAIDT's core idea that governance should be anchored in evidence from actual uses of generative AI rather than in abstract compliance statements.
Second, it links directly to the run because repetition only has value when each run is defined, comparable and documented at the level of task, configuration, time and context.
Third, it strengthens both the evidence pack and the score profile by supplying comparative material about stability, variance, failure modes and reviewer judgement.
Fourth, it supports reviewability, contestability, audit readiness and organisational learning because repeated runs allow others to inspect not only what happened once, but what tends to happen across comparable executions.

Repeated runs -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

Link to the five RAIDT pillars

Responsibility

Repeated runs support Responsibility by forcing explicit decisions about test design, acceptable variance, escalation thresholds and human oversight. They make it harder for a team to rely on informal impressions or selective examples when claiming a tool is ready for use.