S8.06 - Post-run review
flowchart LR
A1[Ad hoc retrospective checks]
A2[Weak reconstruction of completed runs]
A3[Policy claims without case-level evidence]
B[RAIDT - run-level evidence framework]
C[[Post-run review]]
D1[Evidence pack strengthened]
D2[Score profile checked or revised]
D3[Reviewer reconstruction]
D4[Corrective action and learning]
E[Governance move: reviewability, contestability, audit readiness]
F1[Reviewer forms]
F2[Monitoring dashboard flags]
F3[Escalation workflow]
F4[Public services and enterprise use cases]
A1 --> B
A2 --> B
A3 --> B
B --> C
C --> D1
C --> D2
C --> D3
C --> D4
D1 --> E
D2 --> E
D3 --> E
D4 --> E
F1 --> C
F2 --> C
F3 --> C
F4 --> C
Star S8 - Implementation and Operations
Star context: Shows how RAIDT can be adopted manually, semi-automatically or through orchestration, and how it becomes part of real governance routines through structured review after real runs have taken place.
Academic picture
Definition / background
Post-run review is the structured examination of a completed generative AI run after the system has produced an output and the run can be reconstructed from available evidence. In RAIDT, it is typically applied to sampled, flagged, high-risk, or contested runs rather than carried out identically for every run. The purpose is to determine whether the run was appropriately configured, sufficiently evidenced, responsibly used, and suitably documented for governance, learning, and possible challenge.
Conceptually, post-run review sits between simple monitoring and full incident investigation. Monitoring watches patterns over time; incident investigation responds to a known failure or harm; post-run review examines a particular completed run in enough depth to judge evidence quality, contextual appropriateness, user impact, and next actions. It therefore functions as an operational governance checkpoint rather than as a purely technical evaluation step.
This matters in generative AI governance because many important issues only become visible after a run has occurred: weak prompts, missing provenance, insufficient human oversight, overconfident interpretation of an output, or a mismatch between policy expectations and operational reality. Without post-run review, organisations may claim assurance in the abstract while lacking a defensible account of what actually happened in a specific case.
Inside RAIDT, post-run review belongs naturally because RAIDT treats the run as the unit of governance. The review draws on run-level evidence, tests the completeness of the evidence pack, and may confirm or challenge the run's five-pillar score profile across Responsibility, Auditability, Interpretability, Dependability, and Traceability. In that sense, post-run review is one of the main ways RAIDT turns evidence into reviewability and reviewability into governance readiness.
Why this concept matters
Post-run review solves a practical governance problem: organisations need a credible way to look back at a completed GenAI use episode and decide whether it was acceptable, well-evidenced, and organisationally defensible. It prevents governance from stopping at high-level principles or pre-deployment intentions.
It also avoids a common confusion. A system may be deployed with policies, model cards, and general guidance, yet still produce runs that cannot later be reconstructed or justified. Post-run review makes the question more concrete: what happened in this run, what evidence supports the account, and what should happen next?
If post-run review is missing, several risks follow. Errors may be noticed but not formally learned from. High-risk uses may proceed without retrospective scrutiny. Weak evidence practices remain hidden. Score profiles may become performative rather than evidential. Most seriously, the organisation may be unable to respond convincingly when a supervisor, auditor, regulator, or affected user asks how a specific output was produced and reviewed.
For organisations using GenAI in real work, post-run review is therefore a bridge from aspiration to operation. It helps move governance from "we intend to be responsible" toward "we can show how this run was reviewed, what the evidence showed, and what we changed as a result".
Key idea: Post-run review matters because it turns a completed GenAI run into a reviewable governance object rather than an unexamined historical event.
What this item enables
- Structured retrospective examination of completed runs, especially sampled, flagged, high-risk, or contested cases.
- Checking whether the run-level evidence pack is complete enough for reconstruction and justified review.
- Testing whether provisional pillar scores remain credible when examined against real evidence.
- Identifying user-impact issues, policy deviations, prompt weaknesses, documentation gaps, or unsafe workarounds.
- Triggering corrective action, escalation, retraining, workflow redesign, or tighter gating where needed.
- Producing organisational learning that can improve future runs rather than treating each problematic run as isolated.
Practical example / likely audience question
Audience question
Is every GenAI run supposed to be reviewed by an expert after it happens?
Answer
The concern behind this question is usually one of feasibility. If governance appears to require expert review of every output, the process looks too expensive, too slow, and too intrusive for real organisational use. That concern is reasonable, but it misstates what post-run review means in RAIDT.
The direct answer is no. RAIDT does not assume that every run receives the same depth of retrospective expert scrutiny. Instead, post-run review is typically risk-based and evidence-based. Some runs are sampled routinely; some are reviewed because monitoring or user feedback flags them; some are reviewed because they involve sensitive domains, unusual prompts, important decisions, or downstream consequences.
A practical example is a public-sector team using GenAI to draft responses to citizen enquiries. Most routine, low-risk drafts may only require lightweight logging and periodic sample review. By contrast, a run that generated a potentially misleading eligibility statement, relied on outdated policy text, or was challenged by a caseworker would merit deeper post-run review. RAIDT handles this better than a generic AI governance approach because it does not rely on vague assurances that "oversight exists"; it specifies what run-level evidence should be available, how the run can be reconstructed, how the pillar profile can be checked, and what governance action follows from the review.
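The risk-based and evidence-based selection described in this answer can be sketched in a few lines. This is a minimal illustration rather than a RAIDT-specified mechanism: the `Run` fields, the `review_reasons` function, and the 5% default sample rate are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical, minimal record of a completed run (illustrative fields only)."""
    run_id: str
    flagged: bool = False           # raised by monitoring or user feedback
    contested: bool = False         # challenged by a user, e.g. a caseworker
    sensitive_domain: bool = False  # e.g. eligibility statements


def review_reasons(run, sample_rate=0.05, rng=None):
    """Return the reasons a run is selected for post-run review.

    An empty list means the run is not selected. The 5% default sample
    rate is an assumed value, not a RAIDT requirement.
    """
    if rng is None:
        rng = random.Random()
    reasons = []
    if run.flagged:
        reasons.append("flagged")
    if run.contested:
        reasons.append("contested")
    if run.sensitive_domain:
        reasons.append("sensitive_domain")
    # Routine sampling applies only to runs with no other trigger.
    if not reasons and rng.random() < sample_rate:
        reasons.append("routine_sample")
    return reasons
```

Returning the reasons, rather than a bare yes or no, keeps a record of why the run was sampled or flagged, which is itself part of the audit trail.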
Practical example in RAIDT terms
Consider a local authority using a GenAI assistant to draft housing support letters for caseworkers. One completed run is flagged because the draft letter appears to imply that an applicant is ineligible for support when the underlying policy is more nuanced.
The run-level issue is not simply that the output may be wrong. The deeper governance question is whether the run can be reconstructed and judged properly. Reviewers need the prompt, the model or service version, any retrieved policy material, time and context of use, user edits, the final issued text if applicable, and any reviewer notes or policy checks associated with that run.
In RAIDT terms, the evidence pack for that run should allow the reviewer to ask: Was the user relying on approved guidance? Was the output interpretable enough for a caseworker to challenge it? Was the system dependable in this task context? Can the reasoning chain from prompt to final decision be traced? The most affected pillars are likely to be Responsibility, Dependability, and Traceability, with Auditability also central because the organisation must show how the review took place.
Post-run review improves governance readiness here by converting a potentially problematic draft into a documented learning event. The organisation can update guidance, adjust prompts, refine escalation rules, and record why this kind of run will receive closer attention in future.
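The evidence items listed for the housing letter example lend themselves to a simple completeness check before a reviewer starts work. The sketch below is illustrative only: the field names and the split between always-required and where-applicable items are assumptions made for the example, not a RAIDT-defined schema.

```python
# Hypothetical field names for a run-level evidence pack.
REQUIRED_FIELDS = ("prompt", "model_version", "time_of_use", "context_of_use")
# Items that may legitimately be absent for some runs (assumed split).
WHERE_APPLICABLE_FIELDS = (
    "retrieved_policy_material",
    "user_edits",
    "final_issued_text",
    "reviewer_notes",
)


def _present(pack, field):
    """A field counts as present if it exists and is not None or empty text."""
    value = pack.get(field)
    return value is not None and value != ""


def check_evidence_pack(pack):
    """Report which evidence items are missing from a run's evidence pack."""
    missing = [f for f in REQUIRED_FIELDS if not _present(pack, f)]
    absent = [f for f in WHERE_APPLICABLE_FIELDS if not _present(pack, f)]
    return {
        "required_complete": not missing,
        "missing_required": missing,
        "absent_where_applicable": absent,
    }
```

A result with `required_complete` set to `False` would be recorded as a finding about evidence capture rather than silently skipped, since the reviewer cannot reconstruct the run without those items.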
Detailed link to RAIDT
Post-run review links to RAIDT in four ways.
First, it supports RAIDT's core idea that governance should attach to actual runs rather than only to abstract system descriptions or policy statements.
Second, it depends on the run being reconstructable from run-level evidence, including context, configuration, inputs, outputs, and review metadata.
Third, it tests and strengthens the quality of the evidence pack and may confirm, challenge, or refine the run's five-pillar score profile.
Fourth, it advances reviewability, contestability, audit readiness, and organisational learning by making completed runs open to disciplined retrospective scrutiny.
Post-run review -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness
This chain matters because post-run review is one of the clearest moments where RAIDT becomes operational. A run is no longer only generated; it becomes examinable, discussable, and governable.
Link to the five RAIDT pillars
Responsibility
Post-run review strengthens Responsibility by checking whether appropriate human judgement, role clarity, and escalation expectations were actually present in the run, not merely described in policy.
Example evidence / implication:
- Reviewer notes showing whether the human user accepted, challenged, or modified the model output.
- Records of whether the run followed approved workflows, role permissions, or domain-specific supervision requirements.
Auditability
This item has a particularly strong effect on Auditability because the review itself requires a clear evidential trail and a method for reconstructing what happened.
Example evidence / implication:
- Review forms, timestamps, and decision records showing who reviewed the run and on what basis.
- Documentation of why the run was sampled or flagged and what governance action followed.
Interpretability
Post-run review informs Interpretability by revealing whether the output was understandable enough for a human reviewer to assess meaning, limits, and appropriateness in context.
Example evidence / implication:
- Notes showing whether the output's claims, structure, or assumptions could be interpreted by the reviewer.
- Evidence that ambiguous, overconfident, or opaque outputs required additional checking or could not be safely relied upon.
Dependability
Post-run review helps assess whether the run performed consistently and acceptably for the intended task, especially when the output may affect real organisational work.
Example evidence / implication:
- Comparison of the output against approved sources, expected quality thresholds, or downstream task requirements.
- Identification of recurring failure modes that suggest the process is not dependable enough without tighter controls.
Traceability
This item also strongly affects Traceability because the review depends on linking a concrete output back to the prompt, context, model configuration, evidence sources, and reviewer actions.
Example evidence / implication:
- Logged prompt, model version, retrieval context, and output history for the reviewed run.
- Cross-links from the reviewed run to policy documents, tickets, reviewer comments, and corrective actions.
Post-run review touches all five pillars, but it is especially consequential for Auditability and Traceability because weak evidence makes serious retrospective review impossible.
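Confirming or revising a score profile after review can be illustrated as a comparison between the provisional profile and the reviewer's profile. This is a sketch under stated assumptions: the source does not define a scoring scale here, so the integer scores used in the usage note are purely hypothetical, as is the `revised_pillars` function name.

```python
PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)


def revised_pillars(provisional, reviewed):
    """Return {pillar: (provisional_score, reviewed_score)} for changed pillars.

    Both arguments are dicts keyed by pillar name. The scoring scale is
    not fixed by this sketch; any comparable values will work.
    """
    for name, profile in (("provisional", provisional), ("reviewed", reviewed)):
        missing = [p for p in PILLARS if p not in profile]
        if missing:
            raise ValueError(f"{name} profile is missing pillars: {missing}")
    return {
        p: (provisional[p], reviewed[p])
        for p in PILLARS
        if provisional[p] != reviewed[p]
    }
```

For instance, if a reviewer lowers Traceability from 3 to 1 because the retrieval context was not logged, the function returns only that pillar, giving the review record a compact statement of what the evidence changed.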
Why this item is more than a generic concept
In general AI governance, post-hoc review may simply mean looking back at an output after something has gone wrong. It is often informal, selective, and weakly evidenced. The review may depend on recollection, screenshots, or broad system-level reporting rather than on a reconstructable record of a specific use episode.
In RAIDT, post-run review is more operational. It refers to the retrospective examination of a specific run as a bounded governance object, using run-level evidence, an evidence pack, and a pillar-based assessment structure. The review is therefore not just reflective; it is evidential, comparable across runs, and usable for assurance, challenge, and improvement.
That is the key difference. RAIDT does not treat review as a vague management practice. It ties review to the evidence required to justify claims about governance quality.
Common misunderstanding
Misunderstanding
Post-run review is just another name for monitoring, or else it is only used when something has already gone badly wrong.
Correction
Monitoring and post-run review are related but not identical. Monitoring usually observes patterns, rates, and signals across many runs over time. Post-run review examines one completed run in depth to determine what happened, whether the evidence is sufficient, whether the output was acceptable, and whether further action is required.
For example, a dashboard might show that a summarisation tool has rising user override rates. That is monitoring. A reviewer then selects a flagged run, inspects its prompt, output, source material, and reviewer notes, and determines that the model systematically omits caveats in high-stakes cases. That is post-run review. In RAIDT, the distinction matters because governance needs both broad signals and case-level reconstruction.
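The difference between the two activities can be shown with a small sketch: an aggregate override rate stands in for the dashboard signal, and a list of individual run identifiers stands in for the cases handed to a reviewer. The record layout (`run_id`, `overridden`) and the threshold parameter are hypothetical and chosen only for illustration.

```python
def override_rate(runs):
    """Monitoring view: share of runs where the user overrode the output."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r["overridden"]) / len(runs)


def candidates_for_review(runs, threshold):
    """Case-level view: if the aggregate rate exceeds the threshold,
    return the ids of the overridden runs so each can be reviewed in depth."""
    if override_rate(runs) <= threshold:
        return []
    return [r["run_id"] for r in runs if r["overridden"]]
```

The first function answers "is something changing across runs?"; the second only identifies which runs to open. The review itself still depends on a person inspecting the prompt, output, source material, and notes for each selected run.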
Boundary and limitation
Post-run review does not prove that a run was harmless, fair, or correct in every substantive sense. It is only as good as the evidence retained, the judgement applied, and the review criteria used. If logging is weak, if context is missing, or if reviewers lack appropriate domain knowledge, the review may be incomplete or misleading.
It also does not replace pre-deployment evaluation, real-time controls, monitoring, or corrective action. A retrospective review can identify problems and support learning, but it cannot prevent every problematic run from occurring in the first place. Nor can a sampling strategy guarantee that every rare but serious issue will be seen.
RAIDT handles these limitations by placing post-run review within a wider governance chain: runs are evidenced, some are gated, many are monitored, selected cases are reviewed, and findings feed into corrective action and implementation changes. In other words, post-run review is necessary but not sufficient.
Implementation levels
Manual implementation
A researcher or small team can apply post-run review manually by selecting a sample of completed runs, gathering the prompt and output records, checking them against a reviewer form or rubric, and recording whether evidence completeness, pillar implications, or corrective actions need attention. This is feasible even with simple document folders and spreadsheets, provided the run can still be reconstructed.
Semi-automated implementation
Semi-automated implementation adds structured metadata, templates, and review triggers. For example, runs may be automatically tagged by task type, risk level, or exception condition; reviewer forms may pre-populate key fields; and dashboards may surface runs that deserve attention. Human judgement remains central, but the evidential assembly and triage become more efficient.
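As a rough illustration of pre-populated reviewer forms, the sketch below copies run metadata into a form and leaves the judgement fields empty for the human reviewer. All field names, including `exception_condition` and the `"routine_sample"` default, are assumptions for the example rather than a prescribed RAIDT form.

```python
def prefill_review_form(run_metadata):
    """Build a reviewer form from run metadata, leaving judgement fields blank."""
    return {
        # Fields filled automatically from the run record.
        "run_id": run_metadata["run_id"],
        "task_type": run_metadata.get("task_type", "unknown"),
        "risk_level": run_metadata.get("risk_level", "unknown"),
        "trigger": run_metadata.get("exception_condition") or "routine_sample",
        # Fields reserved for the human reviewer.
        "reviewer": None,
        "evidence_complete": None,
        "decision": None,
        "notes": "",
    }
```

Keeping the reviewer fields empty by default makes it visible when a form has been generated but not yet reviewed.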
Fully automated implementation
At scale, a platform, wrapper, orchestration layer, or governance pipeline can implement post-run review by logging run metadata automatically, routing flagged runs into queues, generating draft evidence packs, attaching provisional score profiles, and recording reviewer decisions in a searchable audit trail. Full automation should support review work, not remove meaningful human scrutiny where the consequences justify it.
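A heavily simplified sketch of the queue-and-audit-trail idea follows. It is not a description of any particular platform: the `ReviewPipeline` class, its method names, and the in-memory storage are hypothetical, and a real deployment would need persistent, access-controlled storage.

```python
from collections import deque
from datetime import datetime, timezone


class ReviewPipeline:
    """Illustrative in-memory review queue with an append-only audit trail."""

    def __init__(self):
        self.queue = deque()   # run ids waiting for human review
        self.audit_trail = []  # recorded reviewer decisions

    def log_run(self, run_id, flagged):
        """Route a completed run into the review queue if it was flagged."""
        if flagged:
            self.queue.append(run_id)

    def record_decision(self, reviewer, decision):
        """Record a human reviewer's decision on the next queued run."""
        if not self.queue:
            raise LookupError("no runs are waiting for review")
        run_id = self.queue.popleft()
        entry = {
            "run_id": run_id,
            "reviewer": reviewer,
            "decision": decision,
            "reviewed_at": datetime.now(timezone.utc).isoformat(),
        }
        self.audit_trail.append(entry)
        return entry
```

The automation here only routes and records. The decision string is always supplied by a named reviewer, which reflects the point that automation should support review work rather than replace it.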
Practical use in the RAIDT project
Within the RAIDT project, post-run review is useful as the operational bridge between framework design and empirical governance practice. In Paper 08 Foundations, it helps explain how a run becomes reviewable after generation and why governance should not stop at principles or pre-use controls. In Paper 09 Empirical Validation, it provides a concrete mechanism for checking whether the evidence pack and pillar scores remain credible when reviewers inspect real cases. In Paper 10 Policy Pathways, it supports claims about audit readiness, accountability, and institutional uptake because it shows how organisations can operationalise retrospective scrutiny without pretending that every run needs the same level of oversight.
It is also relevant to sector playbooks, reviewer forms, the evidence pack design, and the scoring rubric. For viva defence and supervisor explanation, this item is particularly useful because it answers the practical question, "What happens after the model has already produced an output?" RAIDT's answer is that the run remains governable: it can be reconstructed, reviewed, challenged, and used for organisational learning.
Key audience questions to prepare for
Q1. If RAIDT is run-level, does post-run review apply to every run?
No. RAIDT is run-level in what it governs, not necessarily in requiring identical review effort for every run. Review depth can be risk-based, sampled, or triggered, provided the organisation can justify the criteria and retain the necessary evidence.
Q2. How is post-run review different from a normal quality assurance check?
A normal quality check may only judge output quality. Post-run review in RAIDT examines the broader governance status of the run: evidence completeness, reviewability, role responsibility, pillar implications, and whether corrective action or escalation is required.
Q3. Why not rely on monitoring dashboards alone?
Dashboards reveal patterns, but they rarely explain a specific case in enough depth for contestability or audit. Post-run review provides the case-level reconstruction needed when a particular run must be justified, challenged, or learned from.
Q4. What makes post-run review practical rather than burdensome?
It becomes practical when tied to sampling rules, risk triggers, structured evidence capture, and review templates. RAIDT makes the process more efficient by defining what a run is and what evidence should already exist when review is needed.
Q5. What if a run cannot be reconstructed well enough to review?
That is itself a governance finding. It suggests a weakness in evidence capture, traceability, or workflow design. In RAIDT, failure to reconstruct a run is not a minor inconvenience; it is evidence of reduced governance readiness.
Suggested citation concepts to support this item
- post-deployment AI assurance
- retrospective review of AI-supported decisions
- human oversight in generative AI operations
- audit trails for large language model use
- case-based review in algorithmic accountability
- AI incident review and organisational learning
- risk-based sampling for AI governance
- operational governance of generative AI in public services
- traceability and reviewability in sociotechnical AI systems
- assurance evidence for responsible AI deployment
Short explanation for presentation
Post-run review is the part of RAIDT that asks what happens after a generative AI run has already occurred. Rather than assuming that governance ends once an output is produced, RAIDT treats the completed run as something that can still be examined, challenged, and learned from. A reviewer can inspect whether the evidence pack is complete, whether the output was appropriate in context, whether the pillar scores remain justified, and whether corrective action is needed. This is important because many governance failures only become visible after use, especially in high-risk or contested cases. In practice, RAIDT does not require every run to receive identical expert scrutiny. Instead, it supports sampled, flagged, and risk-based review so that organisations can move from broad policy claims toward reviewable, auditable evidence about what actually happened in specific runs.
One-line takeaway
Post-run review is the structured retrospective examination of a completed run, and it matters because RAIDT makes governance depend on reviewable run-level evidence rather than on assertion alone.