Q094 — How does the empirical programme compare governance readiness across runs?

← RAIDT · Star S10 - Empirical Programme, Domains and Sector Playbooks · primary item: S10.01 · Empirical programme

Controlled variation makes run-level governance differences visible, scorable, and discussable.

Appears in sources

qa_deck_100#slide 96 · Empirical programme, calibration, procurement, and assurance

Answer

The empirical programme compares governance readiness across runs by standardising what is captured, what is scored, and what is held constant. For every run, RAIDT records a run-level evidence pack containing identifiers and timestamps, prompt or template versions, model and tool settings, retrieved context where relevant, outputs, and recorded checks. That evidence pack is the scored object. Reviewers then apply the same rubric across the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability), using the anchors 1=missing / 3=partial / 5=audit-ready. Because the same scoring logic is applied to each run, runs produced in different domains or with different influence configurations can be compared on a common governance basis.
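A minimal Python sketch of the evidence pack and rubric described above. The pillar names and the 1/3/5 anchors come from the text. The EvidencePack class, its field names, and the score_run helper are hypothetical illustrations, not RAIDT's actual schema. It also assumes that intermediate scores of 2 and 4 are permitted between the anchors.

```python
from dataclasses import dataclass, field
from statistics import mean

# Pillar names and anchors come from the text; everything else is illustrative.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")
ANCHORS = {1: "missing", 3: "partial", 5: "audit-ready"}


@dataclass
class EvidencePack:
    """Hypothetical run-level evidence pack; field names are assumptions."""
    run_id: str
    timestamp: str
    domain: str
    configuration: str
    prompt_version: str
    model_settings: dict
    output: str
    retrieved_context: list = field(default_factory=list)
    recorded_checks: list = field(default_factory=list)


def score_run(pack, scores):
    """Validate one reviewer's pillar scores and return the run's profile."""
    missing = [p for p in PILLARS if p not in scores]
    if missing:
        raise ValueError(f"unscored pillars: {missing}")
    for pillar in PILLARS:
        # Assumption: integer scores 1..5, with 2 and 4 allowed between anchors.
        if scores[pillar] not in range(1, 6):
            raise ValueError(f"{pillar} score must be an integer from 1 to 5")
    profile = {p: scores[p] for p in PILLARS}
    return {
        "run_id": pack.run_id,
        "domain": pack.domain,
        "configuration": pack.configuration,
        "profile": profile,
        # Summary indicator only; the five-pillar profile is kept alongside it.
        "composite_mean": mean(list(profile.values())),
    }
```

Because every run passes through the same score_run check, a profile from one domain or configuration can be set beside a profile from another without re-interpreting the scale.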

The comparison is strengthened by repetition and aggregation. In the empirical paper, each case is repeated 10 to 12 times to surface run-to-run variance, since Dependability concerns stability under repeated execution as well as average performance. Results are then aggregated by configuration and domain, retaining the five-pillar score profile rather than collapsing everything into a single headline number. The composite mean serves as a summary indicator, but RAIDT keeps the profile visible because trade-offs matter: a run may be highly interpretable yet weakly auditable, or safer in tone yet less traceable if refusal logic is not logged. Governance readiness is therefore compared across runs through common evidence fields, stable rubric anchors, repeat-run testing, and profile-based interpretation.
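The aggregation step can be sketched as follows. This is an illustration under assumptions, not the empirical paper's procedure: the aggregate function, the dictionary layout of a scored run, and the use of population standard deviation as the spread measure are all hypothetical choices.

```python
from collections import defaultdict
from statistics import mean, pstdev

PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


def aggregate(scored_runs):
    """Group scored runs by (configuration, domain) and keep the pillar profile.

    Each scored run is assumed to be a dict with "configuration", "domain",
    and a "profile" mapping every pillar to a 1..5 score.
    """
    groups = defaultdict(list)
    for run in scored_runs:
        groups[(run["configuration"], run["domain"])].append(run["profile"])

    summary = {}
    for key, profiles in groups.items():
        pillar_mean = {p: mean([pr[p] for pr in profiles]) for p in PILLARS}
        # Spread across repeats; pstdev is an assumed choice of measure.
        pillar_spread = {p: pstdev([pr[p] for pr in profiles]) for p in PILLARS}
        summary[key] = {
            "n_runs": len(profiles),
            "pillar_mean": pillar_mean,
            "pillar_spread": pillar_spread,
            # With equal pillar weights this equals the mean of per-run composites.
            "composite_mean": mean(list(pillar_mean.values())),
        }
    return summary
```

Returning pillar_mean and pillar_spread next to composite_mean keeps trade-offs visible: two configurations with the same composite can still differ sharply on Auditability or Traceability.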

Practical example

In the healthcare playbook, repeated clinical summarisation runs illustrate the comparison logic clearly. One condition may use a prompt-only setup, while another uses structured prompting plus retrieval and stronger logging. Runs from both conditions are scored against the same five pillars. If the second condition stores prompt versions, output hashes, reviewer rationale, and retrieval pointers, it can score much higher on Auditability and Traceability even when the outputs of both conditions look similarly readable.

The same repeated-run design also exposes Dependability. If identical healthcare cases generate noticeably different summaries under a weakly constrained configuration, the evidence pack reveals that instability. A more tightly instrumented configuration can then be compared not only for better provenance, but also for lower variance across repeats.
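One way to make that instability measurable is to compare the stored outputs of repeated runs pairwise. The sketch below is an assumption for illustration only: the RAIDT sources quoted here do not name a similarity measure, and the mean_pairwise_similarity helper is hypothetical. It uses the character-level ratio from Python's standard difflib module, which captures surface wording rather than clinical meaning.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def mean_pairwise_similarity(outputs):
    """Average SequenceMatcher ratio over every pair of repeated outputs.

    Returns a value between 0 and 1; lower values mean the repeats diverge more.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two repeated outputs to compare")
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(outputs, 2)]
    return mean(ratios)
```

Computed per case and per configuration, a figure like this can sit beside the reviewer's Dependability score as supporting evidence; it does not replace the rubric judgement.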

Sources in RAIDT papers

09-RAIDT_Empirical_M_V50.docx
21-RAIDT_Sector_Playbook_Healthcare_V2