S3.07 - Replayability

flowchart LR
    A["Traditional limitation<br/>Policies, model docs, and generic logs do not recreate one concrete run"] --> B["RAIDT<br/>Run-level evidence framework"]
    H["Practical run fields<br/>prompt, inputs, settings, timestamps, outputs, reviewer notes, repeat-run tests"] --> C[["Replayability<br/>evidential revisiting of a run"]]
    B --> C
    C --> D[Evidence pack]
    C --> E[RAIDT score profile]
    C --> I[Reviewability and contestability]
    D --> F[Reviewer reconstruction]
    E --> G[Governance readiness and organisational learning]
    I --> J[Audit readiness and policy alignment]

Star S3 - Run-Level Evidence Logic

Star context: Explains the proof-object logic inside RAIDT by showing how a governed run can be revisited, re-examined, and, where appropriate, re-run under recorded conditions so that evidence supports reconstruction, comparison, and challenge.

Academic picture

Definition / background

Replayability is the ability to revisit a specific GenAI run by re-running it, approximating it, or evidentially reconstructing it under recorded conditions. In RAIDT, the concept sits inside run-level evidence logic because governance depends on whether another reviewer can return to the original event and examine how it happened. For deterministic software, replayability may imply an exact rerun. For generative AI, that standard is often unrealistic because outputs can vary across time, providers, model versions, sampling conditions, and surrounding context. RAIDT therefore treats replayability as an evidence-based governance capability rather than as a narrow technical guarantee of identical output.

Conceptually, replayability is closely related to reconstructability and comparability, but it is not identical to either. Reconstructability asks whether the original run can be rebuilt in evidential terms. Comparability asks whether runs can be meaningfully assessed against one another. Replayability adds the operational question of whether the run can be revisited through a bounded rerun, a controlled approximation, or a structured repeat-run test that reveals whether the original event stands up to scrutiny.

This matters in GenAI governance because many disputes arise after the moment of use. A reviewer may need to test whether an output depended on a fragile prompt, whether a later model update changes the result, whether a human review step corrected a defect, or whether an organisational control would produce a better outcome if the run were attempted again. Replayability therefore supports reviewability, challenge, and learning from incidents.

Within RAIDT, replayability belongs to the logic of the evidence pack and score profile. If a run cannot be replayed or credibly approximated from the evidence retained, the evidence pack is weaker and several pillar judgements become harder to justify. If a run can be replayed or reconstructively revisited, governance claims become more defensible because reviewers can inspect process stability, control effectiveness, and the conditions under which the run occurred.

Why this concept matters

Replayability solves a practical governance problem: organisations may record that a GenAI tool was used, yet still be unable to revisit the event in a way that supports supervision, audit, contestability, or improvement. Without replayability, a problematic output can quickly become a matter of assertion, memory, or anecdote. With replayability, the organisation has a way to test what happened and whether the same controls would hold under renewed scrutiny.

The concept also prevents confusion between reproducibility in scientific experimentation and replayability in operational governance. RAIDT does not demand that every GenAI output be reproduced bit for bit. It demands that the run be documented well enough to support a meaningful revisit. That distinction is essential in generative AI, where nondeterminism is common but accountability still requires evidence.

If replayability is missing, several risks appear: weak post hoc review, shallow auditability, poor organisational learning, inability to challenge questionable outputs, and over-reliance on generic supplier assurances. RAIDT uses replayability to move governance away from one-off claims and towards inspectable operational evidence.

Key idea: Replayability matters because RAIDT needs each governed run to be revisitable through evidence, not lost once the original output has been produced.

What this item enables

A reviewer to revisit a disputed or important GenAI run under documented conditions.
Controlled repeat-run testing to check whether outcomes are stable, fragile, or highly context-sensitive.
Stronger reconstruction of how prompts, inputs, settings, and human actions shaped the output.
Better evidence packs because claims about a run can be checked rather than merely asserted.
More defensible score profiles, especially for Auditability, Dependability, and Traceability.
Organisational learning from failure, near misses, edge cases, and workflow redesign.
More credible challenge and contestability when stakeholders ask whether a run was appropriate or reliable.
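
A controlled repeat-run test of the kind listed above can be sketched in a few lines of Python. This is a minimal illustration, not a RAIDT-prescribed method: the function names, the use of surface text similarity, and the thresholds (0.9 and 0.5) are all assumptions chosen for the example, and a real review would probably compare substance rather than wording.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Surface text similarity in [0, 1]; a crude stand-in for a substantive comparison."""
    return SequenceMatcher(None, a, b).ratio()


def classify_stability(original: str, reruns: list[str],
                       stable_at: float = 0.9,
                       fragile_below: float = 0.5) -> dict:
    """Compare the original output with repeat-run outputs.

    The thresholds are hypothetical and would need calibrating per use case.
    """
    if not reruns:
        raise ValueError("at least one rerun output is required")
    scores = [similarity(original, rerun) for rerun in reruns]
    average = mean(scores)
    if average >= stable_at:
        label = "stable"
    elif average < fragile_below:
        label = "fragile"
    else:
        label = "mixed"  # neither clearly stable nor clearly fragile
    return {"scores": scores, "mean": average, "label": label}
```

The returned scores can be kept in the evidence pack alongside the rerun outputs, so that a later reviewer can see how the stability judgement was reached rather than only its label.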

Practical example / likely audience question

Audience question

Does RAIDT require exact deterministic replay of a GenAI run for replayability to count?

Answer

The concern behind this question is that generative AI systems often do not behave like conventional deterministic software. If exact replay were the standard, many real organisational uses of GenAI would fail automatically, even where governance evidence is otherwise strong. The direct answer is therefore no: RAIDT does not require exact output identity in every case.

Instead, RAIDT asks whether the organisation can revisit the run in a governance-relevant way. That may involve rerunning the same prompt and inputs with the same settings, approximating the original conditions as closely as possible, recording any known differences such as model version drift, and assessing whether the new result materially supports or challenges the original decision. The issue is whether the run remains reviewable, not whether the system behaves like a fixed calculator.
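
Recording known differences between the original and the replay conditions could look like the following sketch. The field names (`model_version`, `temperature`) and both function names are hypothetical; the point is only that a replay should state which recorded conditions it matched and which it could not.

```python
def condition_differences(recorded: dict, replay: dict) -> dict:
    """Return {field: (recorded_value, replay_value)} for every field that differs.

    A field present on only one side is reported with None for the other side.
    """
    fields = sorted(set(recorded) | set(replay))
    return {
        name: (recorded.get(name), replay.get(name))
        for name in fields
        if recorded.get(name) != replay.get(name)
    }


def replay_mode(differences: dict) -> str:
    """Describe the replay honestly: a matched rerun or an approximation."""
    if not differences:
        return "rerun under recorded conditions"
    return "approximation with known differences"


# Illustrative values only: the model version has drifted since the original run.
original = {"model_version": "v1", "temperature": 0.2}
current = {"model_version": "v2", "temperature": 0.2}
drift = condition_differences(original, current)
print(replay_mode(drift), drift)
```

Keeping the differences next to the replay result lets the reviewer weigh the new output against the original decision with the drift in view.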

For example, if a public-sector analyst used a GenAI tool to draft a briefing note, a replayable run would preserve the prompt, source documents, system settings, timestamp, model or provider identifier, and reviewer edits. A later rerun might not generate identical wording, but it can still reveal whether the briefing logic is stable, whether sensitive details were handled properly, and whether the review control was adequate. RAIDT handles this better than a generic AI governance approach because it ties replayability to run-level evidence, evidence packs, and pillar-based assessment rather than to a vague aspiration for reproducibility.
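
The fields preserved in the briefing-note example could be held in a simple run record. The sketch below is an assumption-laden illustration: the class name, field names, and the SHA-256 fingerprint are not part of RAIDT. The fingerprint is added only to show one way a reviewer might confirm that the record being revisited is the record that was originally stored.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    """Hypothetical run-level record mirroring the fields named in the example."""
    prompt: str
    source_documents: tuple[str, ...]   # identifiers or paths of the inputs
    settings: dict                      # system settings in force for the run
    timestamp: str                      # stored as an ISO 8601 string
    model_id: str                       # model or provider identifier
    reviewer_edits: tuple[str, ...] = ()

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, so equal records hash equally."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint separately from the record gives a cheap tamper check: if the record is later altered, recomputing the fingerprint produces a different value.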

Practical example in RAIDT terms

Consider a finance setting in which an anti-money-laundering analyst uses a GenAI assistant to draft a suspicious activity case summary from transaction notes, internal policy guidance, and investigator comments. The GenAI use case is efficient and operationally plausible, but the run-level issue is whether the summary can later be revisited if the case is escalated, challenged, or sampled during compliance review.

The evidence needed for replayability includes the task definition, the prompt template, the transaction notes supplied, any masked or redacted source materials, the model and version identifier, relevant settings, the generated summary, the analyst's edits, the final submitted narrative, timestamps, and the reviewer or approver record. If a regulator or internal assurance team later asks why certain risk indicators were emphasised, the organisation must be able to revisit the run rather than rely on recollection.
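
Whether a retained evidence pack covers the items listed above can be checked mechanically before any review begins. The sketch below assumes the pack is a plain dictionary and uses hypothetical key names derived from the list; masked or redacted source materials are left out of the required set because the text treats them as present only in some cases.

```python
# Hypothetical key names derived from the evidence list in the text.
REQUIRED_EVIDENCE = (
    "task_definition",
    "prompt_template",
    "transaction_notes",
    "model_version",
    "settings",
    "generated_summary",
    "analyst_edits",
    "final_narrative",
    "timestamps",
    "approver_record",
)


def missing_evidence(pack: dict) -> list:
    """List required items that are absent or recorded as None.

    An empty value (for example, no analyst edits) counts as recorded,
    because "nothing was changed" is itself evidence.
    """
    return [key for key in REQUIRED_EVIDENCE if pack.get(key) is None]
```

An empty result does not show that the run was sound, only that the material needed to revisit it has been retained.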

The RAIDT pillars affected are clear. Responsibility is affected because a named role must remain accountable for checking the summary. Auditability is affected because a later reviewer must be able to inspect what happened. Interpretability is affected because the reasoning path from prompt and source notes to narrative output needs to be intelligible enough for review. Dependability is affected because repeated or similar cases should not produce erratic governance quality. Traceability is affected because the run must be linked to source inputs, timestamps, edits, and downstream submission. Replayability improves governance readiness here by making the case reviewable after the event, not merely complete at the moment of drafting.

Detailed link to RAIDT

Replayability links to RAIDT in four ways.

First, it supports the core RAIDT idea that governance should rest on evidence from actual use events, not only on policy statements or model-level descriptions.

Second, it strengthens the run as the unit of governance because a run that cannot be revisited is harder to inspect, challenge, or learn from.

Third, it increases the value of the evidence pack and the defensibility of the RAIDT score profile by showing whether the retained evidence is sufficient for meaningful replay, approximation, or repeat-run testing.

Fourth, it supports reviewability, contestability, audit readiness, and organisational learning because replayable runs can be revisited when incidents, queries, or governance reviews arise.

Replayability → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

Link to the five RAIDT pillars

Responsibility

Replayability supports Responsibility by ensuring that the organisational actors involved in a run remain visible when the run is revisited later.