C0.03 - Run-level evidence
flowchart LR
A["Traditional governance artefacts: model cards, policy, supplier assurances"] --> B["RAIDT: run-level evidence framework"]
H["Practical run fields: prompt, inputs, settings, outputs, review notes"] --> C[["Run-level evidence: reconstructable proof of one run"]]
B --> C
C --> D[Evidence pack]
C --> E[RAIDT score profile]
D --> F[Reviewer reconstruction and contestability]
E --> G[Governance readiness and organisational learning]
Star C0 - RAIDT Core, Definition, Values, Claims and Innovation
Star context: Defines the project identity of RAIDT by showing that responsible governance of GenAI in organisational work depends on evidence from the level of the individual run, not only from model descriptions or high-level policy claims.
Definition / background
Run-level evidence is the recorded proof needed to reconstruct, review, and evaluate one specific use of a generative AI system. In RAIDT, this means evidence tied to a single run: one configured use of GenAI for a defined task, at a particular time, in a particular organisational context. The concept matters because many governance artefacts describe systems in general terms, whereas governance failures, disputes, and improvements usually arise from what happened in a specific instance of use.
Conceptually, run-level evidence sits between raw technical logging and broad governance documentation. It is more specific than a policy, model card, or risk register, because it captures the actual circumstances of use rather than only the intended design or declared controls. At the same time, it is more governance-relevant than a narrow system log because it can include contextual, procedural, and human-review information needed for organisational accountability.
Within RAIDT, run-level evidence is foundational. It provides the material from which a run-level evidence pack can be assembled and from which a five-pillar RAIDT score profile can be justified. Without this evidence layer, scoring risks becoming impressionistic, governance claims remain difficult to test, and post hoc review becomes weak or incomplete.
This item therefore belongs in RAIDT Core because it defines the evidential basis of the whole framework. RAIDT does not primarily ask whether an organisation has a principle, nor whether a model provider has published a general description. It asks whether the organisation can show, for one real run, what happened, under what conditions, with what trace, and with what basis for review.
Why this concept matters
Run-level evidence solves a central governance problem in generative AI: organisations often know that they should govern AI use, but they lack a reliable unit of proof for examining an actual use event. When a questionable output appears, when a decision must be justified, or when a reviewer asks how a result was produced, abstract policy language is insufficient. A governance framework needs evidence that is granular enough to support reconstruction.
The concept also prevents a common confusion between system-level assurance and use-level accountability. A model may be documented and approved at a high level, yet still be used badly, inconsistently, or inappropriately in a particular run. Run-level evidence makes it possible to distinguish between what the system is said to be capable of and what was actually done with it in a real organisational setting.
If run-level evidence is missing, several risks follow: weak auditability, superficial assurance, poor contestability, limited learning from incidents, and difficulty defending practice to supervisors, regulators, clients, or internal governance bodies. RAIDT uses this concept to move governance from broad principles to operational scrutiny.
Key idea: Run-level evidence matters because responsible GenAI governance depends on being able to inspect one real use event rather than relying only on general descriptions or policy assertions.
What this item captures
- The specific task, purpose, and context of one GenAI run.
- The configured conditions of use, including relevant system settings, prompts, inputs, and constraints.
- The output or outputs generated during that run.
- Human actions around the run, such as review, editing, approval, escalation, or override.
- The trace needed for later reconstruction, explanation, challenge, or audit.
- The evidential basis for scoring the run across Responsibility, Auditability, Interpretability, Dependability, and Traceability.
- The link between an individual use event and wider organisational governance readiness.
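The items above can be read as the fields of one structured record. The Python sketch below is illustrative only: RAIDT as described here does not prescribe a schema, and every class and field name (`RunRecord`, `ReviewAction`, `task_purpose`, `missing_fields`, and so on) is an assumption introduced for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewAction:
    """One human action around the run (hypothetical structure)."""
    actor_role: str  # e.g. "clinician"
    action: str      # e.g. "edited", "approved", "escalated"
    note: str = ""


@dataclass
class RunRecord:
    """Evidence for one GenAI run. All field names are illustrative assumptions."""
    run_id: str
    task_purpose: str
    context: str
    prompt: str
    inputs: list[str]
    settings: dict[str, str]
    outputs: list[str]
    review_actions: list[ReviewAction] = field(default_factory=list)
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def missing_fields(self) -> list[str]:
        """Return the names of core fields that are empty, so gaps are visible."""
        core = {
            "task_purpose": self.task_purpose,
            "prompt": self.prompt,
            "outputs": self.outputs,
            "review_actions": self.review_actions,
        }
        return [name for name, value in core.items() if not value]


record = RunRecord(
    run_id="run-001",
    task_purpose="Draft patient follow-up letter",
    context="Outpatient clinic",
    prompt="Summarise the consultation notes as a letter.",
    inputs=["consultation-notes.txt"],
    settings={"tool_version": "unspecified"},
    outputs=["draft-letter-v1"],
)
print(record.missing_fields())  # ['review_actions']
```

The record is created without any review action, so the check reports that the human-review part of the evidence is still missing.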
Practical example / likely audience question
Audience question
Why is run-level evidence needed if an organisation already has model documentation, AI policy, and standard operating procedures?
Answer
The concern behind this question is understandable: if governance artefacts already exist, why add another layer? The direct answer is that model documentation and policy documents usually describe the system or the organisation in general, whereas run-level evidence shows what occurred in one actual use event. Those are not interchangeable forms of assurance.
For example, a hospital may have an approved policy for using a GenAI drafting assistant and may rely on a vendor's technical documentation. Yet if a discharge-summary draft contains a misleading statement, the key governance question is not only whether the model was approved in principle. The question is what prompt was used, what patient information was supplied, what output was produced, who reviewed it, what changes were made, and whether the run met organisational safeguards. That requires run-level evidence.
RAIDT handles this better than a generic AI governance approach because it treats the run as the unit of governance. Instead of stopping at principles or provider claims, it asks whether the organisation can reconstruct and assess the exact event under review. This makes governance more operational, more reviewable, and more useful for learning and accountability.
Practical example in RAIDT terms
Consider a healthcare setting in which a clinician uses a GenAI system to draft a patient follow-up letter after an outpatient consultation. The GenAI use case is legitimate and time-saving, but the run-level issue is whether the generated letter accurately reflects the consultation, protects sensitive information, and was appropriately reviewed before being sent.
The evidence needed includes the task definition, the prompt template, any source notes used as input, the model or tool version, the generated draft, the clinician's edits, the final approved version, and a record of whether the output was checked against the patient record. Each pillar is affected:
- Responsibility, because the organisation must show who was accountable for checking the draft.
- Auditability, because a reviewer must be able to reconstruct the run.
- Interpretability, because reviewers need to understand how the draft emerged from the prompt and source material.
- Dependability, because output quality and process reliability matter in patient communication.
- Traceability, because the run must be linked to time, actor, and artefacts.
In governance-readiness terms, run-level evidence improves the organisation's position because it allows a disputed output to be examined as a concrete case rather than as an anecdote. It supports internal assurance, supervisory review, incident analysis, and practical refinement of workflow controls.
Detailed link to RAIDT
Run-level evidence links to RAIDT in four ways.
First, it gives operational form to the RAIDT core idea that governance should be based on evidence from actual organisational use, not only on high-level claims.
Second, it is inseparable from the concept of the run, because the run is the unit of governance and run-level evidence is the proof attached to that unit.
Third, it provides the raw material for the run-level evidence pack and the justification for the RAIDT score profile across the five pillars.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by making individual GenAI events reconstructable and examinable.
Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness
Link to the five RAIDT pillars
Responsibility
Run-level evidence supports Responsibility by showing who initiated, reviewed, approved, or relied upon a GenAI run, and under what organisational purpose or authority.
Example evidence / implication:
- Named role or function attached to the run and its review step.
- Record of whether human sign-off or escalation was required.
Auditability
This item has a particularly strong effect on Auditability because it determines whether the run can be reconstructed by another person after the event.
Example evidence / implication:
- Preserved prompt, input context, output, timestamps, and review notes.
- Sufficient documentation for an internal auditor or supervisor to follow the sequence of events.
Interpretability
Run-level evidence supports Interpretability by documenting enough context to explain how an output emerged in practice, even when the internal model remains only partially interpretable.
Example evidence / implication:
- Prompt wording and relevant task instructions captured alongside the output.
- Reviewer notes explaining why the output was accepted, edited, or rejected.
Dependability
This item supports Dependability by allowing repeated scrutiny of whether a process produces stable, safe, and usable outcomes across comparable runs.
Example evidence / implication:
- Comparison of expected versus observed output quality in the run.
- Record of detected errors, corrections, or failure modes.
Traceability
Run-level evidence is also central to Traceability because it links the run to time, actor, tool configuration, source material, and downstream use.
Example evidence / implication:
- Timestamped record connecting the run to the relevant task and artefacts.
- Clear chain from source inputs to generated output and reviewed final outcome.
Run-level evidence affects all five pillars, but it is especially fundamental for Auditability and Traceability because those pillars are weakened immediately if the run cannot be reconstructed.
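One way to make the pillar links concrete is to check, for a given run, which expected evidence items are absent under each pillar. The sketch below is a minimal illustration: the `PILLAR_EVIDENCE` mapping and its item names are assumptions loosely based on the example evidence listed above, not an official RAIDT rubric.

```python
# Hypothetical mapping from evidence items to the pillars they support.
PILLAR_EVIDENCE = {
    "Responsibility": {"actor_role", "sign_off"},
    "Auditability": {"prompt", "inputs", "output", "timestamp", "review_notes"},
    "Interpretability": {"prompt", "task_instructions", "review_notes"},
    "Dependability": {"quality_check", "error_log"},
    "Traceability": {"timestamp", "actor_role", "tool_config", "inputs", "final_outcome"},
}


def pillar_gaps(captured: set[str]) -> dict[str, list[str]]:
    """For each pillar, list the expected evidence items that were not captured."""
    return {
        pillar: sorted(expected - captured)
        for pillar, expected in PILLAR_EVIDENCE.items()
    }


captured = {"prompt", "inputs", "output", "timestamp", "actor_role"}
for pillar, gaps in pillar_gaps(captured).items():
    print(f"{pillar}: missing {gaps if gaps else 'nothing'}")
```

A gap list of this kind does not produce a score. It only shows where a score for a pillar would lack supporting evidence.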
Why this item is more than a generic concept
In general AI governance, evidence may mean any document, assurance statement, benchmark result, or policy artefact used to support a governance claim. In RAIDT, run-level evidence has a narrower and more operational meaning: it is the structured proof associated with one actual run of GenAI in organisational work.
The RAIDT meaning is more operational because it is tied to concrete use events, to the creation of an evidence pack, to scoring across the five pillars, and to governance readiness. It is therefore not merely a documentation idea. It is an evidential mechanism for examining how governance performs in practice.
Common misunderstanding
Misunderstanding
Run-level evidence is just another name for technical logging.
Correction
Technical logging may form part of run-level evidence, but it is not the whole of it. RAIDT requires evidence that is meaningful for governance review, not only for system diagnostics. For example, a system log may show that an API call occurred at a certain time, but it may not show the organisational purpose of the run, the prompt rationale, the human reviewer, the decision to approve or reject the output, or the downstream consequence of using it. Run-level evidence therefore combines technical trace with procedural and contextual information needed for accountable oversight.
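The difference can be sketched in code. In the following Python example a technical log entry is wrapped together with governance context, and any missing context is reported as a gap. The function name `to_run_evidence` and all dictionary keys are hypothetical, chosen to mirror the items named in the correction above.

```python
# Governance fields a system log usually does not carry (illustrative names).
REQUIRED_CONTEXT = [
    "purpose",
    "prompt_rationale",
    "reviewer",
    "decision",
    "downstream_use",
]


def to_run_evidence(log_entry: dict, governance_context: dict) -> dict:
    """Combine a technical log entry with governance context into one record.

    Missing context is stored as None and listed under "gaps".
    """
    evidence = {"technical_trace": dict(log_entry)}
    for key in REQUIRED_CONTEXT:
        evidence[key] = governance_context.get(key)
    evidence["gaps"] = [k for k in REQUIRED_CONTEXT if evidence[k] is None]
    return evidence


# Illustrative log entry; the timestamp is a made-up placeholder.
log = {"event": "api_call", "timestamp": "2025-01-01T09:00:00Z"}

log_only = to_run_evidence(log, {})
print(log_only["gaps"])  # all five governance fields are missing

with_context = to_run_evidence(
    log,
    {
        "purpose": "Draft follow-up letter",
        "prompt_rationale": "Standard letter template",
        "reviewer": "clinician",
        "decision": "approved",
        "downstream_use": "sent to patient",
    },
)
print(with_context["gaps"])  # []
```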
Boundary and limitation
Run-level evidence does not prove that a model is universally safe, fair, or reliable across all contexts. It also does not replace broader governance tools such as model evaluation, procurement checks, legal compliance review, staff training, or organisational policy. Its strength is granularity, but that granularity can become burdensome if capture requirements are unrealistic or poorly designed.
The concept works best when evidence capture is proportionate, structured, and aligned to the task risk. If organisations expect exhaustive capture for every trivial use, adoption may be resisted and data quality may degrade. RAIDT handles this limitation by treating run-level evidence as purposeful governance evidence rather than indiscriminate data accumulation. The goal is sufficient reconstruction and reviewability, not maximal logging for its own sake.
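Proportionate capture can be expressed as a rule that ties required fields to task risk. The sketch below assumes three risk tiers and invented field lists. Both are placeholders that an organisation would replace with its own policy, and neither is specified by RAIDT.

```python
# Hypothetical risk tiers and the fields each tier requires.
REQUIRED_BY_TIER = {
    "low": ["task_purpose", "output"],
    "medium": ["task_purpose", "prompt", "output", "reviewer"],
    "high": [
        "task_purpose",
        "prompt",
        "inputs",
        "output",
        "reviewer",
        "review_notes",
        "tool_version",
    ],
}


def unmet_requirements(tier: str, record: dict) -> list[str]:
    """Return the required fields that are absent or empty for the given tier."""
    return [name for name in REQUIRED_BY_TIER[tier] if not record.get(name)]


record = {"task_purpose": "Summarise meeting notes", "output": "summary text"}
print(unmet_requirements("low", record))   # []
print(unmet_requirements("high", record))  # fields still needed for a high-risk task
```

The same record is sufficient for a low-risk task and insufficient for a high-risk one, which is the point of proportionate capture.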
Implementation levels
Manual implementation
A researcher or small team can apply run-level evidence manually by keeping a structured record for each important GenAI run. This may include a template for task purpose, prompt, input materials, output, reviewer comments, decision outcome, and lessons learned.
Semi-automated implementation
Semi-automated implementation can use templates, forms, metadata capture, and workflow checkpoints to reduce burden. For example, a drafting tool might automatically attach timestamps, user identifiers, and prompt fields while still requiring a human reviewer to complete contextual notes and approval status.
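A minimal sketch of such a checkpoint follows. It assumes a tool that fills in the timestamp, user identifier, and prompt automatically, and a review step that refuses to close the record until a human supplies contextual notes and an approval status. The function names and the set of allowed statuses are hypothetical.

```python
from datetime import datetime, timezone

ALLOWED_STATUSES = {"approved", "rejected", "escalated"}  # illustrative values


def start_run(user_id: str, prompt: str) -> dict:
    """Capture the fields a tool could attach automatically."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "prompt": prompt,
        "context_notes": None,    # to be completed by a human reviewer
        "approval_status": None,  # to be completed by a human reviewer
    }


def complete_review(run: dict, context_notes: str, approval_status: str) -> dict:
    """Workflow checkpoint: require the human-supplied fields before closing."""
    if not context_notes.strip():
        raise ValueError("context notes are required before the run can be closed")
    if approval_status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown approval status: {approval_status!r}")
    return {**run, "context_notes": context_notes, "approval_status": approval_status}


run = start_run(user_id="user-17", prompt="Draft a follow-up letter.")
closed = complete_review(run, "Checked against source notes.", "approved")
print(closed["approval_status"])  # approved
```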
Fully automated implementation
At scale, a platform, wrapper, orchestration layer, or governance pipeline can capture run metadata, version details, artefacts, review actions, and scoring inputs automatically. A dashboard or evidence service can then assemble these into evidence packs, support audit queries, and feed governance readiness reporting across teams or functions.
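The following Python sketch shows the shape such an evidence service might take: items are captured per run, assembled into a pack on request, and queried for audit purposes. `EvidenceStore` is a hypothetical in-memory stand-in written for this example. A real deployment would need persistent storage, access control, and integration with the organisation's own tooling.

```python
from collections import defaultdict


class EvidenceStore:
    """In-memory stand-in for an evidence service (hypothetical design)."""

    def __init__(self) -> None:
        self._events: dict[str, list[dict]] = defaultdict(list)

    def capture(self, run_id: str, kind: str, payload: str) -> None:
        """Record one captured item, e.g. metadata, an artefact, or a review action."""
        self._events[run_id].append({"kind": kind, "payload": payload})

    def assemble_pack(self, run_id: str) -> dict[str, list[str]]:
        """Group everything captured for one run by kind."""
        pack: dict[str, list[str]] = defaultdict(list)
        for event in self._events.get(run_id, []):
            pack[event["kind"]].append(event["payload"])
        return dict(pack)

    def runs_missing(self, kind: str) -> list[str]:
        """Audit query: which runs have no captured item of the given kind?"""
        return sorted(
            run_id
            for run_id, events in self._events.items()
            if all(event["kind"] != kind for event in events)
        )


store = EvidenceStore()
store.capture("run-001", "version", "tool version unspecified")
store.capture("run-001", "review", "approved by clinician")
store.capture("run-002", "version", "tool version unspecified")

print(store.assemble_pack("run-001"))
print(store.runs_missing("review"))  # ['run-002']
```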
Practical use in the RAIDT project
Within the RAIDT project, this item is especially useful for Paper 08 Foundations because it clarifies the framework's central claim that governance should attach to the run rather than only to the model or policy layer. It also matters for Paper 09 Empirical Validation because empirical testing of RAIDT depends on whether meaningful run-level evidence can actually be captured, reviewed, and scored in practice.
For Paper 10 Policy Pathways, run-level evidence provides a bridge between conceptual governance language and implementable organisational controls. It is also relevant to sector playbooks because the exact composition of evidence will vary across domains even though the RAIDT logic remains stable. In the evidence pack and scoring rubric, this item provides the factual substrate. In influence methods and governance interventions, it helps explain why RAIDT offers a practical route from abstract assurance to operational accountability.
For supervision meetings, viva defence, and journal positioning, this concept is valuable because it succinctly answers a hard question: what exactly is the evidence base of RAIDT? The answer is that RAIDT is grounded in the reconstructable record of one actual GenAI use event.
Key audience questions to prepare for
Q1. Is run-level evidence only useful in high-risk sectors?
No. It is most visibly necessary in high-risk contexts, but the concept is useful across organisational settings because any context can produce disputes, errors, or review needs. What changes is the depth of evidence required, not the basic value of the concept.
Q2. Does run-level evidence create too much administrative burden?
It can if implemented poorly. RAIDT addresses this by encouraging proportionate capture aligned to task significance and governance need. The purpose is not exhaustive bureaucracy but adequate reconstruction and review.
Q3. How is run-level evidence different from an audit trail?
An audit trail is often narrower and more technical. Run-level evidence can include audit-trail elements, but it also includes contextual, procedural, and evaluative material needed for meaningful governance judgement.
Q4. Why not evaluate governance only at the system or model level?
Because many practical governance failures arise from how a system is used in context. System-level evaluation remains necessary, but it does not eliminate the need to examine concrete use events.
Q5. What makes this concept distinctive in RAIDT?
RAIDT makes run-level evidence a formal governance unit connected to evidence packs, five-pillar scoring, and governance readiness. That makes it more actionable than a general call for better documentation.
Suggested citation concepts to support this item
- AI audit trails in organisational decision-making
- Generative AI governance evidence and accountability
- Human oversight and documentation in AI-assisted workflows
- Model cards versus use-case level governance
- Sociotechnical logging and traceability in AI systems
- Operational accountability for generative AI deployments
- AI incident review and post hoc reconstruction
- Evidence-based AI governance in regulated sectors
- Documentation practices for human-AI collaboration
- Organisational assurance mechanisms for generative AI use
Short explanation for presentation
Run-level evidence is the core evidential idea inside RAIDT. It refers to the recorded proof needed to reconstruct and review one specific use of a generative AI system in organisational work. This matters because most existing governance artefacts describe models, suppliers, or policies in general terms, whereas real governance questions arise in concrete episodes of use. RAIDT therefore treats the run as the unit of governance and asks whether that run can be examined after the fact. If the answer is yes, the organisation can build an evidence pack, justify a five-pillar score profile, and improve governance readiness through review and learning. If the answer is no, governance remains largely declarative. In that sense, run-level evidence is what makes RAIDT operational rather than merely aspirational.
One-line takeaway
Run-level evidence is the reconstructable proof of one GenAI use event, and RAIDT makes that individual run the practical unit of governance.