S8.11 - Reproducibility pack

flowchart LR
    A["Background problem:<br/>outputs kept without enough run context"] --> B["RAIDT:<br/>run-level evidence framework"]
    B --> C[["Reproducibility pack:<br/>artefacts to reconstruct a specific run"]]
    H["Healthcare, enterprise reporting,<br/>academic writing, public services"] --> C
    I["Prompts, logs, hashes,<br/>scoring sheets, data references,<br/>scripts and repository pointers"] --> C
    C --> D[Evidence pack quality]
    C --> E[Score profile credibility]
    C --> F["Reviewer reconstruction<br/>and contestability"]
    D --> G[Governance readiness]
    E --> G
    F --> G
    C --> J["Organisational learning<br/>and corrective action"]

Star S8 - Implementation and Operations

Star context: Shows how RAIDT is embedded in day-to-day implementation and review practice, including how each run can be reconstructed, checked, challenged, and improved through operational evidence.

Academic picture

Definition / background

A reproducibility pack is the organised bundle of artefacts that allows a RAIDT run to be reconstructed, inspected, and, where feasible, repeated. In practical terms, it includes the evidence required to understand how a specific generative AI run produced a given output under a defined configuration, task, and context.
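One way to make "an organised bundle of artefacts" concrete is a manifest that records a content hash for each artefact, so a later reviewer can confirm that the pack they inspect is the pack that was stored. The sketch below is illustrative only: the function names `build_manifest` and `verify_manifest`, the artefact names, and the choice of SHA-256 are assumptions, not part of RAIDT as described here.

```python
import hashlib
import json


def build_manifest(artefacts: dict[str, bytes]) -> dict[str, str]:
    """Map each artefact name to the SHA-256 hex digest of its content."""
    return {
        name: hashlib.sha256(content).hexdigest()
        for name, content in sorted(artefacts.items())
    }


def verify_manifest(artefacts: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return names that are missing or whose content no longer matches the manifest."""
    problems = []
    for name, expected in manifest.items():
        content = artefacts.get(name)
        if content is None or hashlib.sha256(content).hexdigest() != expected:
            problems.append(name)
    return problems


# Hypothetical artefacts for one run (names and contents are illustrative only).
pack = {
    "prompt.txt": b"Summarise the attached notes for clinician review.",
    "output.txt": b"Draft summary text.",
}
manifest = build_manifest(pack)
print(json.dumps(manifest, indent=2))
```

An empty list from `verify_manifest(pack, manifest)` indicates that every listed artefact is present and unchanged. A real pack would more likely hash files on disk, which this sketch leaves out to stay self-contained.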

Conceptually, the idea draws on established expectations from computational research, audit trails, quality assurance, and scientific reproducibility. In those traditions, a result is stronger when another reviewer can inspect the process that generated it rather than relying only on a final claim. RAIDT adapts this logic to generative AI governance by treating the run, not the model in the abstract, as the meaningful unit of review.

This matters because GenAI systems are highly context-sensitive. A change in prompt wording, retrieval context, model version, temperature setting, document set, user role, or post-processing step can materially alter the output. A reproducibility pack therefore differs from a generic project archive or a simple log dump: it is more focused than a general project repository and more governance-oriented than purely technical debugging records.
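The context-sensitive settings listed above can be captured as a structured record, which makes it possible to tell at a glance whether two runs were configured identically. The following is a minimal sketch under stated assumptions: the class `RunRecord`, its field names, and the fingerprinting approach are hypothetical and chosen only to mirror the factors named in the paragraph.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of the settings that can change a run's output."""
    prompt: str
    model_version: str
    temperature: float
    user_role: str
    document_ids: tuple[str, ...] = ()
    post_processing: tuple[str, ...] = ()

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, so equal records hash equally."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; the model name and document ID are placeholders.
run_a = RunRecord(
    prompt="Summarise the notes.",
    model_version="example-model-1",
    temperature=0.2,
    user_role="clinician",
    document_ids=("doc-001",),
)
run_b = replace(run_a, temperature=0.7)  # same run, one setting changed

print(run_a.fingerprint() == run_b.fingerprint())
```

Because the fingerprint covers every recorded field, a change to any one of them (here, the temperature) yields a different value. The fingerprint identifies a configuration; it does not by itself guarantee that the model will reproduce the same output.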

Within RAIDT, the reproducibility pack belongs inside the wider run-level evidence pack. It supports the interpretation of the five-pillar score profile by showing what was actually done in a particular run, what evidence exists for scoring decisions, and whether a reviewer could retrace the process behind a claim. It therefore connects directly to run-level evidence, evidence-pack completeness, and the credibility of governance assessments.

Why this concept matters

The reproducibility pack solves a common governance problem: organisations often retain outputs but not the full context needed to understand how those outputs were generated. Without that context, review becomes weak, challenges become harder to resolve, and improvement efforts depend on memory rather than evidence.

It also prevents a recurrent confusion in AI governance. Many governance discussions speak about transparency, documentation, or accountability at a high level, but do not specify what artefacts must be available for a concrete run. The reproducibility pack turns those broad aspirations into a practical set of inspectable materials.

If this item is missing, the risks are immediate. Reviewers may be unable to reconstruct a result, identify whether a failure arose from prompt design or model behaviour, test whether a different configuration would have changed the outcome, or defend the reliability of tables and claims in a paper, report, or operational decision process. In organisational settings, that weakens contestability, slows remediation, and undermines confidence in governance claims.

For RAIDT, the reproducibility pack is one of the mechanisms that moves governance from principle to operation. It provides the documentary basis for reviewability, supports evidence-led scoring, and helps ensure that governance is tied to what actually happened in a run rather than what stakeholders assume happened.

Key idea: A reproducibility pack matters because RAIDT can only support credible governance when a specific run can be reconstructed from evidence rather than defended through assertion alone.

What this item enables

Reconstruction of a specific GenAI run, including its task, context, configuration, and outputs.
Inspection of prompts, inputs, model settings, versions, retrieval sources, and post-processing steps.
Verification that tables, summaries, recommendations, or classifications can be traced back to identifiable run evidence.
Comparison between repeated runs to see whether observed differences reflect model instability, context drift, or operator choices.
Review of whether scoring decisions in the RAIDT profile are supported by concrete artefacts.
Stronger handover between researchers, supervisors, reviewers, auditors, and operational teams.
Organisational learning through retained evidence that supports corrective action and process improvement.
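The comparison of repeated runs mentioned in the list above can be sketched as a field-by-field difference between two recorded configurations. This is an illustrative helper, not a RAIDT-defined procedure: the function name `diff_runs` and the field names are assumptions.

```python
def diff_runs(run_a: dict, run_b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    keys = sorted(set(run_a) | set(run_b))
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }


# Hypothetical configurations for two runs of the same task.
first_run = {"prompt_id": "summary-v1", "model_version": "example-model-1", "temperature": 0.2}
second_run = {"prompt_id": "summary-v1", "model_version": "example-model-1", "temperature": 0.7}

changes = diff_runs(first_run, second_run)
print(changes)
```

If the recorded configurations are identical but the outputs differ, that points towards model instability. If a field differs, the reviewer has a concrete candidate explanation, such as an operator choice or a drifted context, to investigate first.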

Practical example / likely audience question

Audience question

Why is reproducibility part of governance?

Answer

The concern behind this question is usually that reproducibility sounds like a purely technical or academic matter, rather than a governance requirement. The direct answer is that governance depends on the ability to inspect and challenge how a result was produced. If a reviewer cannot reconstruct the steps behind a run, then oversight is limited to trusting a conclusion rather than examining the process that generated it.

In practice, consider a paper that reports a set of GenAI-assisted coding, summarisation, or classification results. If the author can only provide the final outputs, reviewers cannot tell whether the result depended on a fragile prompt, an undocumented model update, a selective choice of examples, or an unrecorded human correction step. A reproducibility pack addresses that problem by preserving the operational record behind the claim.

RAIDT handles this better than a generic AI governance approach because it ties reproducibility to a specific run and places it inside a broader evidence framework. Rather than saying only that systems should be documented, RAIDT asks whether this run can be reviewed, reconstructed, scored, and defended with evidence.

Practical example in RAIDT terms

In healthcare administration, a hospital team uses a generative AI system to draft discharge-summary explanations from structured patient notes for internal clinician review. One run produces a concise summary that appears useful, but a clinician later questions whether a medication instruction was omitted because of the model, the prompt wording, or a retrieval issue.

In RAIDT terms, the run-level issue is not simply whether the model is generally capable. The issue is whether this specific run can be reconstructed and assessed. The required evidence includes the prompt template, patient-note input boundaries, model and version, inference settings, timestamp, retrieval context, generated output, human edits, reviewer comments, and any scoring sheet used to assess risk and adequacy.
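The evidence listed in this example lends itself to a simple completeness check before review begins. The sketch below assumes the pack is held as a dictionary keyed by evidence type; the key names and the function `missing_evidence` are hypothetical, and treating every item as mandatory is an assumption made for illustration.

```python
# Keys mirror the evidence items named in the paragraph above (names are illustrative).
REQUIRED_EVIDENCE = (
    "prompt_template",
    "input_boundaries",
    "model_and_version",
    "inference_settings",
    "timestamp",
    "retrieval_context",
    "generated_output",
    "human_edits",
    "reviewer_comments",
    "scoring_sheet",
)


def missing_evidence(pack: dict) -> list[str]:
    """List required items that are absent or empty in the pack."""
    return [key for key in REQUIRED_EVIDENCE if pack.get(key) in (None, "", [], {})]


# Hypothetical pack for the discharge-summary run, with two items not yet collected.
discharge_run_pack = {
    "prompt_template": "discharge-summary-template",
    "input_boundaries": "structured notes only",
    "model_and_version": "example-model-1",
    "inference_settings": {"temperature": 0.2},
    "timestamp": "recorded at run time",
    "generated_output": "Draft discharge summary text.",
    "human_edits": "none",
    "scoring_sheet": "scoring-sheet-reference",
}

gaps = missing_evidence(discharge_run_pack)
print(gaps)
```

Recording "none" explicitly for human edits, rather than leaving the field empty, lets a reviewer distinguish "no edits were made" from "edits were not recorded".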

The pillars most affected are Auditability, Dependability, and Traceability, with Responsibility and Interpretability also implicated. The reproducibility pack improves governance readiness by allowing the hospital to review exactly what happened, identify where the omission arose, justify any corrective action, and demonstrate that operational governance is based on inspectable evidence rather than informal recollection.

Detailed link to RAIDT

The reproducibility pack links to RAIDT in four ways.

First, it reinforces RAIDT's core idea that governance should focus on evidence from a specific run rather than broad claims about a model or vendor.
Second, it operationalises the run as the unit of review by preserving the artefacts needed to revisit how that run was configured, executed, and interpreted.
Third, it strengthens both the evidence pack and the score profile because reviewers can examine the basis for ratings rather than accepting them as unsupported judgements.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by making runs inspectable after the fact.
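The third link, that ratings should be examinable rather than unsupported, can be sketched as a check that each pillar score cites at least one artefact actually present in the pack. The names `PillarScore` and `unsupported_scores`, the integer scores, and the artefact names are assumptions for illustration; the source does not define a scoring scale or a data model.

```python
from dataclasses import dataclass, field


@dataclass
class PillarScore:
    pillar: str
    score: int  # hypothetical scale; the source does not define one
    evidence_refs: list[str] = field(default_factory=list)


def unsupported_scores(profile: list[PillarScore], pack_items: set[str]) -> list[str]:
    """Return pillars whose score cites no artefact that exists in the pack."""
    return [
        entry.pillar
        for entry in profile
        if not any(ref in pack_items for ref in entry.evidence_refs)
    ]


# Illustrative profile: one score cites nothing, one cites an artefact not in the pack.
pack_items = {"prompt.txt", "run_log.json", "reviewer_notes.txt"}
profile = [
    PillarScore("Responsibility", 3, ["reviewer_notes.txt"]),
    PillarScore("Auditability", 2, ["run_log.json"]),
    PillarScore("Interpretability", 2, []),
    PillarScore("Dependability", 3, ["repeat_runs.csv"]),
    PillarScore("Traceability", 3, ["prompt.txt", "run_log.json"]),
]

flagged = unsupported_scores(profile, pack_items)
print(flagged)
```

A flagged pillar does not mean the score is wrong, only that a reviewer cannot yet check it against the pack.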

Reproducibility pack → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

This chain matters because reproducibility is the bridge between generating evidence and being able to use that evidence credibly in supervision, operational review, policy alignment, and audit contexts.

Link to the five RAIDT pillars

Responsibility

The reproducibility pack supports Responsibility by clarifying who configured the run, who reviewed it, and what procedural safeguards were or were not applied. It makes it easier to assign ownership for decisions and corrections.