S10.05 - Repeated runs
flowchart LR
A[Single-run evaluation hides variance] --> B[RAIDT - run-level evidence framework]
B --> C[[Repeated runs]]
H[Healthcare, finance, law, public services, cybersecurity] --> C
C --> D[Run-level evidence pack]
C --> E[RAIDT score profile]
C --> I[Reviewer reconstruction]
D --> F[Reviewability and contestability]
E --> G[Governance readiness]
I --> G
Star S10 - Empirical Programme, Domains and Sector Playbooks
Star context: Shows how RAIDT is tested, calibrated and applied across domains and sector-specific playbooks by repeating comparable runs so that variation becomes visible, reviewable and governable rather than hidden behind a single impressive output.
Academic picture
Definition / background
Repeated runs means deliberately executing the same generative AI (GenAI) task, or a closely controlled variant of it, more than once so that output variation can be inspected, documented and assessed. In RAIDT, this is not a generic testing habit but a run-level governance mechanism. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. Repeating that run, or a tightly comparable version of it, exposes whether the system behaves consistently enough for organisational use.
Conceptually, the idea draws on repeated measurement, reliability checking and empirical robustness assessment. However, RAIDT adapts those ideas for governance rather than for model science alone. The aim is not simply to say whether a model performs well on average. The aim is to show whether claims about responsible use can be supported by evidence across multiple executions of the same practical task.
This matters because a single successful output can be misleading. One run may look accurate, interpretable and policy-compliant, while another run under the same apparent conditions may omit key facts, change tone, introduce unsupported content or fail to follow instructions. Repeated runs therefore help separate a one-off good result from a pattern that can support reviewable governance claims.
Within RAIDT, repeated runs connect directly to the run-level evidence pack and the five-pillar score profile. The evidence pack records what happened across runs, including prompts, settings, timestamps, outputs, assessments and reviewer notes. The score profile then uses that evidence to inform judgements across Responsibility, Auditability, Interpretability, Dependability and Traceability. Repeated runs belong inside RAIDT because they convert variability from a hidden property of GenAI into explicit evidence for governance.
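The fields named above can be pictured as one record per run. The following is a minimal Python sketch under stated assumptions, not a RAIDT-defined schema: the class names `RunRecord` and `EvidencePack`, every field name, and the example texts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """One execution of a defined task (hypothetical schema, not a RAIDT standard)."""
    run_id: str
    prompt: str
    settings: dict            # e.g. model name, temperature
    timestamp: datetime
    output: str
    assessment: str = ""      # reviewer's judgement for this run
    reviewer_notes: str = ""


@dataclass
class EvidencePack:
    """Collects the repeated runs of a single task so they can be compared."""
    task: str
    runs: list = field(default_factory=list)

    def add(self, record: RunRecord) -> None:
        self.runs.append(record)

    def distinct_outputs(self) -> int:
        # Crude first indicator of variation: how many different texts appeared.
        return len({r.output for r in self.runs})


# Invented example: three repeats of the same prompt and settings.
pack = EvidencePack(task="compliance summary")
texts = ["Report within 24 hours.", "Report within 24 hours.", "Report promptly."]
for i, text in enumerate(texts, start=1):
    pack.add(RunRecord(f"run-{i}", "Summarise the policy.", {"temperature": 0.7},
                       datetime.now(timezone.utc), text))

print(pack.distinct_outputs())  # 2
```

Counting distinct output strings is only a starting point; reviewer assessments would still be needed to judge whether the differences matter for the task.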
Repeated runs are also distinct from adjacent concepts. They are not the same as benchmarking across unrelated tasks, broad replication across institutions, or stress testing under deliberately changing conditions. Their specific value is controlled repetition around a defined run so that stability, variance and failure modes can be understood at the level where governance decisions are actually made.
Why this concept matters
Repeated runs solve a basic but often neglected governance problem: organisational actors frequently evaluate generative AI through isolated demonstrations. That creates false confidence because the audience sees one output but not the spread of possible outputs. When repeated runs are missing, governance discussions drift back towards assertions such as "the model usually does this" or "we tested it and it looked fine" without evidential grounding.
In organisational use, this matters because decisions about adoption, oversight, escalation and human review often depend on whether a system behaves with acceptable consistency. Repeated runs make that question operational. They reveal whether good performance is robust, fragile, or highly context-sensitive. They also help prevent confusion between prompt engineering success in one instance and dependable behaviour across comparable uses.
For RAIDT, the concept is central to moving from principles to operational governance. Principles can say that systems should be safe, fair, explainable or auditable. Repeated runs help show whether those expectations survive contact with actual use. They therefore support contestability, audit readiness and continuous improvement because organisations can inspect not just one outcome, but the pattern of outcomes.
Key idea: repeated runs matter because governance claims about GenAI should rest on observable patterns of behaviour, not on one-off outputs.
What this item enables
- Reveals run-to-run variance that would be invisible in a single demonstration.
- Supports more defensible Dependability judgements by showing whether outputs are stable enough for the task.
- Helps distinguish systematic design problems from stochastic model variation.
- Provides comparative material for reviewer reconstruction and quality assurance.
- Strengthens evidence packs with multiple outputs, assessor notes and variance observations.
- Improves score-profile credibility because ratings can be justified against repeated evidence rather than isolated examples.
- Supports escalation rules when variance exceeds an acceptable threshold for the domain.
- Enables organisational learning about which prompts, settings or safeguards produce more reliable behaviour.
Practical example / likely audience question
Audience question
Why not single runs?
Answer
GenAI behaviour can vary, so repeated evidence is necessary for stability claims.
The concern behind the question is usually efficiency: if one output looks correct, why spend time repeating the task? The problem is that one apparently strong answer can conceal instability. A model may summarise a policy correctly on one run and omit a critical exception on the next. If governance relies on the first run alone, the organisation is not assessing dependable use; it is assessing a lucky instance.
A practical example is a GenAI assistant drafting a compliance summary for internal staff. One run may produce a clear, policy-aligned explanation, while later runs introduce ambiguity about escalation thresholds or reporting duties. RAIDT handles this better than a generic AI governance approach because it requires the organisation to treat each run as evidence and then compare runs systematically. Instead of saying only that the tool was "tested", RAIDT can show how many times it was run, what changed, what stayed stable, and how that evidence affected the score profile.
Practical example in RAIDT terms
In healthcare administration, imagine a GenAI system used to draft discharge-information summaries for patients with diabetes. The use case appears simple: the clinician provides structured notes and the system generates a patient-friendly summary. The run-level issue is that repeated runs with the same notes and prompt template may differ in how clearly they present medication timing, warning signs and follow-up instructions.
Under RAIDT, the evidence needed would include the prompt template, model and configuration details, timestamps, all repeated outputs, reviewer annotations, and a record of which omissions or ambiguities appeared across runs. The most affected pillars would be Dependability, Auditability and Traceability, with Responsibility also implicated because the organisation must decide what level of variation is acceptable for patient communication.
Repeated runs improve governance readiness here by showing whether the system is consistently safe enough for supervised use, whether certain failure patterns recur, and whether extra controls are needed before deployment. A single strong output might support optimism; repeated runs support a defensible governance judgement.
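The record of "which omissions or ambiguities appeared across runs" could be tallied from reviewer annotations along the following lines. This is a sketch only: the function name `omission_counts`, the annotation format and the three sample runs are assumptions invented for illustration, and the judgement of whether an element is clearly presented is assumed to come from a human reviewer rather than from the code.

```python
REQUIRED_ELEMENTS = ("medication timing", "warning signs", "follow-up instructions")


def omission_counts(annotations):
    """Count, per required element, the runs in which it was missing or unclear.

    `annotations` holds one dict per run, mapping element name -> True when
    the reviewer judged that element to be clearly presented.
    """
    counts = {element: 0 for element in REQUIRED_ELEMENTS}
    for run in annotations:
        for element in REQUIRED_ELEMENTS:
            if not run.get(element, False):
                counts[element] += 1
    return counts


# Invented reviewer annotations for three repeated runs.
annotations = [
    {"medication timing": True, "warning signs": True, "follow-up instructions": True},
    {"medication timing": True, "warning signs": False, "follow-up instructions": True},
    {"medication timing": False, "warning signs": False, "follow-up instructions": True},
]

print(omission_counts(annotations))
# {'medication timing': 1, 'warning signs': 2, 'follow-up instructions': 0}
```

A recurring omission, such as warning signs in two of the three sample runs, is the kind of pattern a single run could not reveal.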
Detailed link to RAIDT
The concept of repeated runs links to RAIDT in four ways.
First, it connects to RAIDT's core idea that governance should be anchored in evidence from actual uses of generative AI rather than in abstract compliance statements.
Second, it links directly to the run because repetition only has value when each run is defined, comparable and documented at the level of task, configuration, time and context.
Third, it strengthens both the evidence pack and the score profile by supplying comparative material about stability, variance, failure modes and reviewer judgement.
Fourth, it supports reviewability, contestability, audit readiness and organisational learning because repeated runs allow others to inspect not only what happened once, but what tends to happen across comparable executions.
Repeated runs → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness
Link to the five RAIDT pillars
Responsibility
Repeated runs support Responsibility by forcing explicit decisions about test design, acceptable variance, escalation thresholds and human oversight. They make it harder for a team to rely on informal impressions or selective examples when claiming a tool is ready for use.
Example evidence / implication:
- Documented rationale for how many repeats were conducted for the task.
- Clear governance rule for when high variance triggers further review or restricted use.
Auditability
Repeated runs strongly reinforce Auditability because auditors and reviewers can inspect a set of comparable outputs instead of a single curated example. This makes it easier to see how the judgement was reached and whether the evidence base is adequate.
Example evidence / implication:
- Stored outputs from each repeated run with reviewer comments.
- Comparison table showing stable behaviours, variable behaviours and notable failures.
Interpretability
Repeated runs support Interpretability by showing which aspects of system behaviour remain understandable across attempts and which aspects drift. This does not guarantee full model interpretability, but it helps interpret observed behaviour at the task level.
Example evidence / implication:
- Notes identifying recurring patterns in structure, omissions or phrasing across runs.
- Explanation of which output elements remain stable enough to support user understanding.
Dependability
Dependability is the pillar most directly affected. Repeated runs reveal whether apparently good performance is repeatable, fragile or erratic. Without repetition, dependability claims are weak because they rest on isolated outcomes.
Example evidence / implication:
- Frequency of key errors, omissions or policy deviations across repeated runs.
- Assessment of whether variation stays within an acceptable range for the use case.
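The two evidence items above can be expressed as a small calculation. The sketch below assumes that a reviewer has already flagged each run as containing a key error or not; the function names, the sample flags and the 5% limit are hypothetical, since the source leaves the acceptable range to the domain.

```python
def error_rate(run_has_error):
    """Share of repeated runs in which a reviewer recorded a key error."""
    if not run_has_error:
        raise ValueError("at least one run is required")
    return sum(run_has_error) / len(run_has_error)


def within_tolerance(run_has_error, max_error_rate):
    """True when the observed rate does not exceed the agreed limit."""
    return error_rate(run_has_error) <= max_error_rate


# Invented flags for ten repeats: True means a key error was recorded.
flags = [False, False, True, False, False, False, False, True, False, False]

print(error_rate(flags))               # 0.2
print(within_tolerance(flags, 0.05))   # False
```

In this invented case two errors in ten runs exceed the assumed limit, which would be a reason to trigger whatever escalation rule the organisation has documented.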
Traceability
Repeated runs strengthen Traceability by linking each output to a documented run, timestamp, configuration and assessment trail. This matters when an organisation needs to reconstruct what happened and why governance decisions were made.
Example evidence / implication:
- Unique identifiers for each repeated run and its associated evidence.
- Versioned record of prompt, model settings and review outcomes across repeats.
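One possible way to give each repeat a unique identifier while tying it to a versioned configuration is to hash the prompt and settings. This is an illustrative sketch, not a RAIDT convention: `config_fingerprint`, `run_id` and the identifier format are assumptions.

```python
import hashlib
import json


def config_fingerprint(prompt, settings):
    """Short stable hash of prompt and settings, so identical configurations match."""
    canonical = json.dumps({"prompt": prompt, "settings": settings}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_id(task_id, fingerprint, repeat_index):
    """Readable identifier: task, configuration fingerprint, repeat number."""
    return f"{task_id}-{fingerprint}-r{repeat_index:02d}"


# Invented task label and settings.
fp = config_fingerprint("Summarise the policy.", {"model": "example-model", "temperature": 0.7})
ids = [run_id("T01", fp, i) for i in range(1, 4)]
print(ids)
```

Because the fingerprint changes whenever the prompt or any setting changes, repeats that share a fingerprint can be treated as comparable, while a new fingerprint signals a new configuration rather than ordinary run-to-run variation.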
Repeated runs have the strongest direct effect on Dependability, Auditability and Traceability, while Responsibility and Interpretability are supported through the structure they impose on review and explanation.
Why this item is more than a generic concept
In general AI governance, repeated runs may simply mean "test the system more than once". In RAIDT, repeated runs have a narrower and more operational meaning: they are controlled repetitions tied to the run as the unit of governance, captured in evidence packs, and used to justify score-profile judgements. The RAIDT meaning is therefore more actionable because it specifies what is repeated, what is recorded, how the evidence is reviewed, and how the findings affect governance readiness.
Common misunderstanding
Misunderstanding
If a team performs repeated runs, it has proven the system is reliable.
Correction
Repeated runs do not prove universal reliability. They show how the system behaves across the repeated conditions that were actually tested. For example, if a public-service chatbot produces consistent answers across ten repeated benefit-eligibility prompts, that supports a local dependability claim for those tested conditions. It does not prove the same level of reliability for all claimant profiles, all policy changes or all future model versions. RAIDT handles this by treating repeated runs as bounded evidence that informs, but does not replace, ongoing review and scoped governance judgement.
Boundary and limitation
Repeated runs do not remove the need for domain expertise, human judgement or broader evaluation. They do not prove causal reasons for variation, and they do not guarantee that future outputs will remain stable after model updates, policy changes or context shifts. Repetition can also create a false sense of security if the repeated task is too narrow, too easy or insufficiently representative of real work.
In RAIDT, the limitation is handled by treating repeated runs as one component of a larger evidence strategy. The organisation still needs clear task scoping, reviewer criteria, escalation pathways, and a transparent account of what the repeated evidence does and does not support. Repeated runs strengthen governance; they do not replace it.
Implementation levels
Manual implementation
A researcher or small team can run the same task several times, save each output, and compare them in a simple review sheet. Even a spreadsheet or structured note can record run identifiers, timestamps, prompt versions, observed differences and assessor judgements.
Semi-automated implementation
Templates, metadata forms and lightweight dashboards can make repeated runs easier to conduct consistently. A team might use standardised run cards, automatic timestamp capture, structured comparison tables and pre-defined scoring prompts to support repeatable review.
Fully automated implementation
At scale, a platform or orchestration layer can trigger repeated runs automatically, log model and prompt metadata, compute variance indicators, route anomalies for review, and feed the resulting evidence directly into RAIDT evidence packs and governance dashboards. In this form, repeated runs become part of an operational governance pipeline rather than an occasional manual check.
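The steps described above (trigger repeats, log metadata, compute a variance indicator, route anomalies) can be sketched in a few functions. Everything here is an assumption for illustration: `fake_generate` stands in for a real model call, exact-match agreement is a deliberately crude indicator that would be too strict for most free-text tasks, and the 0.9 limit is invented.

```python
import itertools
from datetime import datetime, timezone


def run_repeats(generate, prompt, settings, repeats):
    """Call `generate` several times with the same inputs and log each run."""
    log = []
    for i in range(1, repeats + 1):
        log.append({
            "repeat": i,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prompt": prompt,
            "settings": dict(settings),
            "output": generate(prompt, settings),
        })
    return log


def modal_agreement(log):
    """Variance indicator: share of runs whose output equals the most common output."""
    outputs = [entry["output"] for entry in log]
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)


def needs_review(log, min_agreement):
    """Route the batch to a human reviewer when agreement falls below the limit."""
    return modal_agreement(log) < min_agreement


# Hypothetical stand-in for a model call: cycles through canned answers.
canned = itertools.cycle(["Escalate within 24 hours.",
                          "Escalate within 24 hours.",
                          "Escalate promptly."])


def fake_generate(prompt, settings):
    return next(canned)


log = run_repeats(fake_generate, "Summarise the escalation policy.",
                  {"temperature": 0.7}, repeats=6)
print(needs_review(log, min_agreement=0.9))  # True
```

In a real pipeline the logged entries would feed the evidence pack, and the review decision would follow the organisation's own escalation rule rather than a fixed number in code.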
Practical use in the RAIDT project
Within the RAIDT project, repeated runs are especially important for explaining why the framework treats the run as the unit of governance. In Paper 08 Foundations, the concept helps justify why evidence should be attached to concrete uses rather than to abstract model descriptions. In Paper 09 Empirical Validation, it supports the empirical logic for comparing outputs across comparable runs rather than relying on one illustrative example. In Paper 10 Policy Pathways, it helps translate governance principles into practical expectations about evidence, review and assurance.
The concept is also useful across sector playbooks because acceptable variance is domain-specific. A productivity assistant may tolerate stylistic variation, while healthcare, law or public-service uses may require much tighter consistency and stronger escalation. For supervisor explanation, viva defence and journal positioning, repeated runs provide a clear answer to the question of how RAIDT turns variability into a governable object rather than ignoring it.
Key audience questions to prepare for
Q1. How many repeated runs are enough?
There is no universal number. The right number depends on task criticality, observed variance, domain risk and review burden. RAIDT's contribution is to make the rationale explicit and evidential rather than arbitrary.
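One way to make such a rationale explicit is to state what a clean batch of repeats can and cannot rule out. The sketch below uses a standard binomial argument and is not a RAIDT rule: it assumes the runs are independent and share a fixed failure probability, which real GenAI use may not satisfy, and the function name is hypothetical.

```python
def failure_rate_upper_bound(n_runs, confidence=0.95):
    """Upper bound on the true failure rate after n_runs runs with zero failures.

    If the failure probability is p, the chance of seeing no failures in
    n independent runs is (1 - p) ** n. Solving (1 - p) ** n = 1 - confidence
    for p gives the bound returned here.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    return 1 - (1 - confidence) ** (1 / n_runs)


for n in (5, 10, 30, 100):
    print(n, round(failure_rate_upper_bound(n), 3))
```

Under these assumptions, ten clean runs are still compatible with a failure rate of roughly one in four, which illustrates why higher-risk tasks may justify more repeats.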
Q2. Are repeated runs too expensive for routine governance?
They can be costly if used indiscriminately, but the cost should be compared with the governance risk of acting on misleading single-run evidence. Repetition can also be targeted at higher-risk tasks, new configurations and disputed use cases.
Q3. Does repeating the same prompt create an artificial test?
It can if the task is unrealistically narrow. That is why repeated runs should be paired with realistic scenarios and clear documentation of scope. The aim is not to mimic all future use, but to expose variance within a defined governance unit.
Q4. What if the model changes between runs?
Then the repeated runs must be interpreted with that change in view. Traceability becomes critical: version and timestamp records are needed so reviewers can distinguish normal variance from configuration or model drift.
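If each run record carries a model version, a reviewer can check whether a set of repeats spans a change before comparing outputs. The following sketch assumes a hypothetical `model_version` field and invented version labels; it shows the grouping step only and does not itself decide whether a difference is drift.

```python
from collections import defaultdict


def group_by_version(runs):
    """Group run outputs by the recorded model version."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["model_version"]].append(run["output"])
    return dict(groups)


def spans_multiple_versions(runs):
    """True when the repeats were not all produced by the same recorded version."""
    return len({run["model_version"] for run in runs}) > 1


# Invented records: the version label changes after the second repeat.
runs = [
    {"model_version": "v1", "output": "Report within 24 hours."},
    {"model_version": "v1", "output": "Report within 24 hours."},
    {"model_version": "v2", "output": "Report promptly."},
]

print(spans_multiple_versions(runs))  # True
print(group_by_version(runs))
```

Comparing outputs within each group first, and only then across groups, keeps ordinary run-to-run variation separate from differences that coincide with a version change.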
Q5. How does this improve governance rather than just evaluation?
It improves governance because the repeated evidence informs accountability, auditability, escalation and deployment decisions. RAIDT uses repeated runs not just to measure behaviour, but to justify whether organisational use is reviewable and defensible.
Suggested citation concepts to support this item
- repeated measurement reliability in AI evaluation
- output variance in large language models
- stochasticity and consistency in generative AI systems
- robustness assessment for language-model applications
- dependable AI and repeated testing methods
- auditability of generative AI outputs
- traceability and reproducibility in AI governance
- human oversight thresholds for variable AI outputs
- empirical validation methods for organisational AI use
- run-level governance evidence for generative AI
Short explanation for presentation
Repeated runs means testing the same GenAI task more than once so that output variability becomes visible and governable. In RAIDT, this matters because a single successful answer can create false confidence about system quality. By repeating a defined run and recording each output, RAIDT builds evidence about stability, inconsistency, failure modes and reviewability. That evidence then feeds the run-level evidence pack and supports the five-pillar score profile, especially Dependability, Auditability and Traceability. The concept is important for supervision and viva discussion because it shows how RAIDT moves beyond principle-based governance. Instead of saying that a system is trustworthy in general, RAIDT asks whether repeated evidence from comparable uses is strong enough to justify organisational confidence, oversight and deployment decisions.
One-line takeaway
Repeated runs is the disciplined repetition of a defined GenAI task so RAIDT can judge governance readiness from patterns of evidence rather than from a single output.
Related items in empirical programme, domains and sector playbooks
- S10.01 - Empirical programme
- S10.02 - 14 domains
- S10.03 - 20 scenarios per domain
- S10.04 - 6 configurations
- S10.06 - Governance readiness as outcome
- S10.07 - Healthcare
- S10.08 - Finance
- S10.09 - Law and public services
- S10.10 - Cybersecurity
- S10.11 - Education
- S10.12 - Environment
- S10.13 - Crisis and emergency response
- S10.14 - Supply chain
- S10.15 - Ageing calibration
Anchored questions
- Audience question: Why not single runs? Answer: GenAI behaviour can vary, so repeated evidence is necessary for stability claims.