S5.11 - Repeat_runs

S5.11 - Repeat runs

flowchart LR
    A[Single good output can mislead] --> B[RAIDT<br/>Run-level evidence framework]
    A2[GenAI outputs may vary across runs] --> B
    A3[One-off demos do not prove dependability] --> B
    H[Healthcare discharge summaries] --> C
    I[Finance compliance drafting] --> C
    J[Public-service support] --> C
    K[Enterprise knowledge work] --> C
    B --> C[[Repeat runs<br/>Bounded variation across similar configured uses]]
    C --> D[Evidence pack<br/>Comparative run evidence]
    C --> E[Score profile<br/>Especially Dependability]
    C --> F[Reviewer reconstruction]
    C --> G[Governance readiness<br/>Reviewability, contestability, audit readiness]
    D --> G
    E --> G
    F --> G

Star S5 - RAIDT Pillars and Scoring

Star context: This item sits inside the RAIDT scoring logic by showing how repeated execution turns an apparently successful GenAI output into evidence about consistency, bounded variation, and governance reliability across the five pillars.


Academic picture
Definition / background

Repeat runs are multiple executions of the same or near-equivalent GenAI task configuration so that variation in outputs, performance, reasoning traces, or downstream consequences can be examined systematically. In governance terms, the purpose is not repetition for its own sake; it is to determine whether a claimed capability is robust enough to support organisational use, human oversight, and score assignment.

Conceptually, repeat runs emerge from a simple problem in generative AI: stochastic systems can produce materially different outputs even when the user believes the task is "the same". A single favourable run may therefore conceal instability, brittleness, prompt sensitivity, or context dependence. Repeat runs address this by turning one-off impressions into comparative run-level evidence.

Within RAIDT, repeat runs belong most clearly to the scoring and dependability logic of the framework, but they also affect the other four pillars. They help show whether observed behaviour is bounded, reconstructable, explainable, and governable. This matters because RAIDT does not ask organisations to assert that a system is reliable in principle; it asks them to show, through documented runs, what happened, under what conditions, and with what degree of repeatability.

Repeat runs are related to, but not identical to, benchmarking, validation, or replication. Benchmarking often compares systems on a standardised test set. Validation assesses whether a system meets a requirement. Replication usually concerns reproducing a study or experiment. Repeat runs in RAIDT are narrower and more operational: they test whether a particular configured use, in a particular context, produces evidence that is sufficiently stable to justify governance confidence at the run level.

Why this concept matters

Without repeat runs, organisations can mistake isolated success for dependable performance. That creates a governance gap: senior stakeholders may approve a workflow because they have seen one convincing answer, while hidden variability remains unmeasured. In practice, this can lead to inconsistent outputs, uneven user experience, weak escalation thresholds, and overconfident score profiles.

Repeat runs solve a practical assurance problem. They reveal whether outputs remain within acceptable bounds when the same task is executed again, whether small configuration changes produce disproportionate effects, and whether the organisation can explain why a result should be trusted. This helps distinguish a system that is genuinely governable from one that is merely impressive in demonstrations.

For organisations using GenAI, the concept matters because governance decisions are rarely made at the level of model theory alone. They are made at the level of actual work: drafting, summarising, classifying, recommending, retrieving, and assisting decisions. Repeat runs provide a disciplined way to test those work practices and to connect evidence directly to RAIDT evidence packs and pillar scores.

Key idea: Repeat runs matter because they turn single-use claims about GenAI performance into inspectable evidence about stability, variability, and governance readiness.

What this item enables
Practical example / likely audience question

Audience question

If a GenAI system already produced a good answer once, why should we repeat the run before making a governance judgement?

Answer

The concern behind this question is the assumption that one good output is representative of normal system behaviour. In deterministic software, that assumption is often reasonable. In generative AI, it is much weaker because output formation can vary with model stochasticity, hidden system updates, retrieval changes, prompt wording, latency conditions, or user interaction differences.

The direct answer is that repeat runs test whether the first success was typical or accidental. If five closely matched runs all remain within acceptable quality and risk boundaries, governance confidence is stronger. If one run is excellent, two are mediocre, and two introduce unsupported claims, then the governance conclusion changes substantially. The item therefore protects RAIDT from overclaiming on the basis of isolated evidence.

A practical example is a policy summarisation task for internal compliance staff. One run may produce a concise and accurate summary. Repeating the run across the same source policy, with controlled prompt and environment settings, may show that later outputs omit critical exceptions or add invented obligations. RAIDT handles this better than a generic AI governance approach because it ties the repeated evidence to a specific run configuration, documents the variation, and reflects that evidence in the pillar score profile rather than treating the first good result as sufficient proof.

Practical example in RAIDT terms

Consider a healthcare administration use case in which a GenAI assistant drafts discharge-summary explanations for patients in plain language. The run-level issue is that the same clinical source note may lead to different emphasis, omissions, or risk language across repeated executions, even when the prompt template appears unchanged.

In RAIDT terms, the organisation would preserve the task definition, prompt template, model version, input record characteristics, reviewer criteria, and the outputs from several repeat runs. The evidence needed would include run timestamps, configuration settings, output comparisons, reviewer judgements, and notes on whether each output remained within acceptable safety and clarity bounds.
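A minimal sketch of how one such repeat-run record might be captured, assuming Python dataclasses. The class name, field names, and placeholder values are hypothetical illustrations of the items listed above, not a schema defined by RAIDT.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class RepeatRunRecord:
    """One repeated execution of a configured use (hypothetical field names)."""
    run_id: str
    task_definition: str
    prompt_template: str
    model_version: str
    settings: dict            # e.g. temperature; contents are an assumption
    timestamp: str            # ISO 8601, UTC
    output_text: str
    reviewer_judgement: str   # free-text note from the reviewer
    within_bounds: bool       # reviewer's safety-and-clarity verdict


def new_record(run_id, output_text, judgement, within_bounds):
    """Build a record for an illustrative discharge-explanation task."""
    return RepeatRunRecord(
        run_id=run_id,
        task_definition="Plain-language discharge explanation",
        prompt_template="discharge_v1",    # placeholder identifier
        model_version="example-model-1",   # placeholder, not a real model
        settings={"temperature": 0.2},
        timestamp=datetime.now(timezone.utc).isoformat(),
        output_text=output_text,
        reviewer_judgement=judgement,
        within_bounds=within_bounds,
    )


record = new_record(
    "run-001", "Take your tablets twice a day.", "Clear; no omissions.", True
)
# Serialise so the record can be stored alongside the other runs.
evidence_json = json.dumps(asdict(record), indent=2)
```

Keeping every repeated run in the same shape is what makes later comparison and reviewer reconstruction practical; the exact fields would depend on the organisation's own evidence pack conventions.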

Dependability is the pillar most strongly affected, because consistency across runs is directly under examination. Responsibility is also involved, because patient-facing communication creates accountability for harm or confusion. Auditability and Traceability matter because reviewers must be able to reconstruct which repeated runs were performed and what differences emerged. Interpretability matters where the team needs to explain why certain variations occurred or why they remain tolerable.

This improves governance readiness because the organisation can show not merely that the assistant once produced a suitable discharge explanation, but that repeated use under defined conditions behaves within a managed and reviewable envelope. That is the kind of claim a supervisor, auditor, or safety committee can inspect.

Detailed link to RAIDT

Repeat runs link to RAIDT in four ways.

First, they support RAIDT's core idea that GenAI governance should be grounded in evidence from actual uses rather than principles stated in the abstract.
Second, they strengthen the run as the unit of governance by testing whether one configured use behaves consistently when executed more than once.
Third, they enrich both the evidence pack and the score profile by supplying comparative evidence about stability, drift, failure modes, and acceptable bounds.
Fourth, they improve reviewability, contestability, audit readiness, and organisational learning because repeated outcomes can be reconstructed, challenged, and used to refine controls.

Repeat runs -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

Link to the five RAIDT pillars

Responsibility

Repeat runs support Responsibility by showing whether the organisation has taken reasonable steps to check that a use case does not rely on an isolated good outcome before it affects people, decisions, or services.

Example evidence / implication:

Auditability

Repeat runs support Auditability because they create a comparable evidence trail. Auditors and reviewers can inspect not only the chosen output but also the spread of results across repeated executions.

Example evidence / implication:

Interpretability

Repeat runs support Interpretability by helping teams explain variation. Where outputs differ, the organisation can investigate which prompt elements, retrieval states, settings, or contextual factors appear to influence behaviour.

Example evidence / implication:

Dependability

Repeat runs are especially important for Dependability because they provide direct evidence about stability, bounded performance, and operational confidence. This is the pillar most strongly affected by the concept.

Example evidence / implication:

Traceability

Repeat runs support Traceability by ensuring that each repeated execution can be linked back to its configuration, context, timing, and review outcome. Without that link, repetition loses governance value.

Example evidence / implication:

Repeat runs most strongly affect Dependability, but their governance value depends on support from Auditability and Traceability.

Why this item is more than a generic concept

In general AI governance, repeat runs may simply mean testing a system several times to see whether it behaves similarly. In RAIDT, the concept is more operational and more accountable. It is not just repeated testing in the abstract; it is repeated execution of a defined run so that variation becomes evidence that can be stored, reviewed, scored, and contested.

The RAIDT meaning is therefore stronger than a generic robustness claim. It connects repetition to the evidence pack, to pillar scoring, and to governance judgement. That move matters because it allows organisations to explain not only that they tested the system more than once, but how those repeated runs affected approval, restriction, monitoring, and continuous improvement.

Common misunderstanding

Misunderstanding

If repeated runs produce outputs that are not identical, the system has failed and cannot be governed.

Correction

Governance does not require all outputs to be identical. Many GenAI uses legitimately allow stylistic or surface variation. The real question is whether the variation stays within acceptable substantive bounds for the task. For example, three customer-service draft replies may differ in wording while still preserving the same correct policy position, required escalation language, and tone constraints. Repeat runs therefore test bounded acceptability, not strict textual sameness.
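The difference between textual sameness and bounded acceptability can be sketched as follows. This is an illustration only: the draft replies, the required and forbidden phrases, and the use of simple substring matching are all assumptions made for the example, and real tolerance checks would normally involve reviewer judgement.

```python
def within_bounds(output, required, forbidden):
    """Return True if every required element appears and no forbidden one does.

    Case-insensitive substring matching is a simplifying assumption;
    it stands in for whatever review criteria the organisation defines.
    """
    text = output.lower()
    has_required = all(phrase.lower() in text for phrase in required)
    has_forbidden = any(phrase.lower() in text for phrase in forbidden)
    return has_required and not has_forbidden


# Three illustrative customer-service drafts (invented for this sketch).
drafts = [
    "Thanks for reaching out. Refunds are available within 30 days. "
    "If you need more help, we can escalate to a supervisor.",
    "We're sorry for the trouble! You can get a refund within 30 days, "
    "and we are happy to escalate to a supervisor on request.",
    "Hello, refunds are available within 90 days of purchase.",
]

required = ["within 30 days", "escalate to a supervisor"]
forbidden = ["within 90 days"]

# Strict sameness: are all drafts textually identical?
identical = len(set(drafts)) == 1

# Bounded acceptability: does each draft keep the substantive content?
verdicts = [within_bounds(d, required, forbidden) for d in drafts]
```

Here no two drafts are identical, yet the first two stay within bounds because they keep the policy position and the escalation language. The third fails because it changes the substance, not because its wording differs.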

Boundary and limitation

Repeat runs do not prove that a system is universally reliable across all users, all contexts, or future model updates. They only provide evidence about the conditions that were actually repeated. A well-behaved repeated run set may still fail when inputs shift, policies change, retrieval sources drift, or the organisation moves into a higher-risk domain.

Repeat runs also do not replace broader validation, human review, or impact assessment. A system may be consistent yet consistently wrong, biased, or unsuitable for the task. RAIDT handles this limitation by locating repeat runs within a broader evidence structure: they contribute to pillar scoring, but they do not override other forms of evidence about responsibility, explanation quality, or trace completeness.

The method works best when the organisation has defined what counts as a comparable run, what variation is acceptable, and what action follows when repeated outputs diverge. Without those conditions, repetition can create data without producing governance clarity.

Implementation levels

Manual implementation

A researcher or small team can run the same prompt-template task several times, save each output, and compare them against a simple review rubric. Manual notes can record which differences are trivial, which are substantive, and whether the task remains acceptable for use.

Semi-automated implementation

A semi-automated approach can use structured templates, metadata forms, and comparison tables that automatically capture run identifiers, timestamps, model settings, and reviewer outcomes. This reduces inconsistency in documentation and makes repeated-run evidence easier to include in the RAIDT evidence pack.
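One way such a comparison table might be produced is sketched below, using only the Python standard library. The column names and example values are hypothetical; the point is that identifiers and timestamps are filled in automatically while the reviewer supplies only the outcome.

```python
import csv
import io
import uuid
from datetime import datetime, timezone

# Hypothetical column set for a repeat-run comparison table.
FIELDS = ["run_id", "timestamp", "model", "temperature", "reviewer_outcome"]


def capture_row(model, temperature, reviewer_outcome):
    """Generate the run identifier and timestamp; the reviewer gives the outcome."""
    return {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "model": model,
        "temperature": temperature,
        "reviewer_outcome": reviewer_outcome,
    }


def to_csv(rows):
    """Render the captured rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    capture_row("example-model-1", 0.2, "acceptable"),
    capture_row("example-model-1", 0.2, "substantive omission"),
]
table = to_csv(rows)
```

The resulting text can be saved as a file and attached to the evidence pack, so every repeated run is documented in the same columns.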

Fully automated implementation

At scale, a platform or orchestration layer can trigger repeated executions automatically, log prompt and model metadata, compare outputs against predefined criteria, and feed stability findings into dashboards or scoring workflows. In a mature governance pipeline, repeated-run results can generate alerts, lower confidence scores, or require escalation before deployment or continued use.
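The automated loop described here can be sketched in a few lines. Everything named in the example is an assumption: `fake_generate` is a deterministic stand-in for a model call, `mentions_exception` stands in for predefined criteria, and the 0.9 pass-rate threshold is an arbitrary illustration rather than a RAIDT requirement.

```python
def run_repeats(generate, check, n_runs, min_pass_rate):
    """Execute `generate` n_runs times, check each output, and flag escalation."""
    results = []
    for i in range(n_runs):
        output = generate(i)
        results.append({"run": i, "output": output, "passed": check(output)})
    pass_rate = sum(r["passed"] for r in results) / n_runs
    return {
        "results": results,
        "pass_rate": pass_rate,
        "escalate": pass_rate < min_pass_rate,  # below threshold: needs review
    }


def fake_generate(run_index):
    """Stand-in for a model call: every third run omits the exception clause."""
    summary = "Staff must report gifts over the stated limit."
    if run_index % 3 != 2:
        summary += " Exception: gifts from immediate family."
    return summary


def mentions_exception(output):
    """Illustrative criterion: the summary must keep the exception clause."""
    return "Exception:" in output


report = run_repeats(fake_generate, mentions_exception, n_runs=6, min_pass_rate=0.9)
```

In this sketch four of six runs pass, so the pass rate falls below the threshold and the `escalate` flag is set. In a real pipeline that flag would feed whatever alerting or scoring workflow the organisation already uses.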

Practical use in the RAIDT project

In the RAIDT project, repeat runs are useful in at least four connected ways. In Paper 08 Foundations, they help justify why run-level evidence is needed for generative systems whose behaviour cannot be responsibly characterised through one-off examples. In Paper 09 Empirical Validation, they provide a practical mechanism for examining whether scoring decisions are supported by observable variability patterns rather than subjective impression.

In Paper 10 Policy Pathways, repeat runs can be translated into governance guidance for organisations that need clear thresholds for acceptable consistency, escalation, and review. They are also relevant to sector playbooks because different domains will define acceptable variation differently: healthcare may emphasise omission risk, finance may emphasise compliance fidelity, and education may emphasise factual stability and pedagogic clarity.

For the evidence pack and scoring rubric, repeat runs supply concrete material that strengthens reviewer explanations and viva defence. They help answer a common supervisory challenge: how RAIDT moves from broad governance aspiration to auditable method. They are also useful in journal positioning because they show that RAIDT operationalises governance through inspectable evidence rather than normative rhetoric alone.

Key audience questions to prepare for

Q1. How many repeat runs are enough to make a governance judgement?

There is no universal number. The appropriate count depends on task risk, output variability, and the cost of error. RAIDT's contribution is not to impose a fixed number for all cases, but to require a documented rationale for why the chosen level of repetition is sufficient for the use being governed.
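One possible ingredient of such a rationale is a simple statistical bound, offered here as an illustration rather than a RAIDT rule. It assumes that runs are independent, that each run is judged simply acceptable or not, and that all $n$ runs were acceptable. If the true per-run failure probability is $p$, the chance of seeing no failures in $n$ runs is $(1-p)^n$, which gives a one-sided 95% upper bound on $p$:

```latex
(1 - p)^n \ge 0.05
\quad\Longrightarrow\quad
p \le 1 - 0.05^{1/n}

% Worked values:
% n = 5:   p \le 1 - 0.05^{1/5}  \approx 0.45
% n = 30:  p \le 1 - 0.05^{1/30} \approx 0.095
```

On these assumptions, five clean runs are still compatible with a failure rate of up to roughly 45%, while thirty clean runs narrow that to roughly 10%. This is why a higher-risk task may justify substantially more repetition than a low-risk one.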

Q2. Are repeat runs just another name for benchmarking?

No. Benchmarking usually evaluates performance against a standard task set. Repeat runs in RAIDT examine whether a specific configured use remains acceptably stable across repeated executions in its actual organisational context.

Q3. What if outputs vary but all of them are acceptable?

That can still support governance readiness. The critical issue is whether the variation stays within defined substantive bounds. RAIDT therefore distinguishes harmful instability from acceptable diversity of expression.

Q4. Can repeat runs replace human review?

No. Repeat runs strengthen the evidence base, but human judgement is still needed to define tolerances, interpret divergence, and decide whether a use should be approved, restricted, or redesigned.

Q5. Why does this belong in scoring rather than only in testing?

Because repeated-run evidence affects governance confidence. If a use is unstable across repetitions, that should influence the score profile, especially Dependability, rather than remaining an isolated technical observation.

Suggested citation concepts to support this item
Short explanation for presentation

Repeat runs are a simple but important part of RAIDT because they test whether one apparently good GenAI result is actually representative of system behaviour. In generative systems, a single successful answer can be misleading if later runs under similar conditions produce omissions, contradictions, or unsafe variation. RAIDT treats the run as the unit of governance, so repeating the run creates evidence about stability at exactly the level where organisations make decisions about use, review, and accountability. That evidence can then be placed into the evidence pack and reflected in the five-pillar score profile, especially Dependability. The value of repeat runs is therefore not just technical testing; it is governance strengthening through documented, reviewable, and contestable evidence.

One-line takeaway

Repeat runs are repeated executions of a defined GenAI use that make variability visible; RAIDT turns those repeated outcomes into run-level evidence for scoring and governance readiness.

Related items in RAIDT pillars and scoring
Anchored questions

No separate anchored questions section was present in the original note.
