Q144 — Why are calibration, reviewer consistency, and repeat runs necessary?

← RAIDT · Star S5 - RAIDT Pillars and Scoring · primary item: S5.10 · Calibration

Appears in sources

integrated_82#Q3.18

Answer

Calibration, reviewer consistency, and repeat runs are necessary because RAIDT is designed to measure governance readiness at run level, not merely to reward fluent outputs. RAIDT treats the run as the unit of governance, so the scored object is the run-level evidence pack, and the judgement must be anchored to inspectable artefacts rather than to reviewer impression. The scoring appendix makes this explicit by requiring evidence pointers and common anchors, summarised as 1 = missing, 3 = partial, 5 = audit-ready. Without calibration, different reviewers may interpret the same evidence threshold differently, and the resulting score profile would reflect scorer drift rather than the actual condition of the run. Calibration therefore creates shared exemplars, decision rules, and a stable interpretation of the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability).
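One way to make scorer drift visible during a calibration session is to have several reviewers score the same exemplar evidence pack and flag the pillars where their scores diverge. The sketch below is illustrative only: the function name, the integer 1 to 5 scale, the one-point tolerance, and the reviewer scores are all assumptions, not something specified in the RAIDT papers.

```python
from typing import Dict, List

# The five RAIDT pillars named in the answer above.
PILLARS = [
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
]


def calibration_gaps(
    scores_by_reviewer: Dict[str, Dict[str, int]],
    tolerance: int = 1,  # assumed threshold, not taken from RAIDT
) -> List[str]:
    """Return the pillars where reviewers scoring the SAME evidence pack
    differ by more than `tolerance` points."""
    flagged = []
    for pillar in PILLARS:
        values = [scores[pillar] for scores in scores_by_reviewer.values()]
        if max(values) - min(values) > tolerance:
            flagged.append(pillar)
    return flagged


# Hypothetical scores for one shared exemplar (assumed 1 to 5 integer scale).
exemplar_scores = {
    "reviewer_a": {"Responsibility": 3, "Auditability": 5,
                   "Interpretability": 4, "Dependability": 3,
                   "Traceability": 3},
    "reviewer_b": {"Responsibility": 3, "Auditability": 3,
                   "Interpretability": 3, "Dependability": 3,
                   "Traceability": 3},
}

print(calibration_gaps(exemplar_scores))  # ['Auditability']
```

A flagged pillar is a prompt for discussion against the shared exemplars and decision rules, not a verdict on which reviewer is right.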

Reviewer consistency is also necessary because RAIDT compares runs, scenarios, and configurations, and treats influence methods as governance interventions. If reviewers do not apply the same rubric in the same way, apparent differences between prompt structures, retrieval settings, alignment layers, or oversight workflows may be artefacts of inconsistent scoring rather than real governance effects.

Repeat runs are necessary for a different reason: dependability cannot be inferred from a single successful output. The foundations paper argues for variance-aware repeat-run testing because generative systems can change materially across repeated executions under near-identical conditions. Repeat runs therefore expose instability that single-run review can hide. This makes dependability judgements more defensible and lets organisations detect weak controls, unreliable provenance, or fragile oversight arrangements before contested use.
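The variance-aware idea can be sketched with a few lines of Python. The example summarises repeat-run scores for two configurations and asks whether the gap between their means is larger than the run-to-run spread. The function names, the scores, and the "gap larger than spread" rule are assumptions chosen for illustration; the rule is a crude heuristic, not a statistical test and not a procedure taken from the RAIDT papers.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarise_runs(scores: List[float]) -> Dict[str, float]:
    """Summarise one configuration's scores across repeat runs."""
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # population standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def difference_exceeds_noise(config_a: List[float],
                             config_b: List[float]) -> bool:
    """Crude check (an assumption, not a RAIDT rule): is the gap between
    mean scores larger than the bigger of the two run-to-run spreads?"""
    a, b = summarise_runs(config_a), summarise_runs(config_b)
    gap = abs(a["mean"] - b["mean"])
    noise = max(a["spread"], b["spread"])
    return gap > noise


# Hypothetical Dependability scores from five repeat runs per configuration.
structured_prompt = [4, 4, 5, 4, 4]
free_form_prompt = [2, 5, 3, 5, 2]

print(summarise_runs(structured_prompt))
print(summarise_runs(free_form_prompt))
print(difference_exceeds_noise(structured_prompt, free_form_prompt))  # False
```

Here the structured prompt has the higher mean, but the free-form configuration varies so much between runs that the gap sits inside the noise. A single run of each could have suggested either configuration was better.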

Practical example

Consider a hospital using a GenAI assistant to draft discharge summaries. One team reviews a run generated with a structured prompt, retrieval augmentation, and clinician sign-off. If reviewers are not calibrated, one may award high Auditability because the output looks clear, while another may mark it lower because the retrieval snapshot hash is missing. RAIDT requires both reviewers to use the same worked examples and evidence pointers so the score depends on the run-level evidence pack, not on writing style.

Repeat runs then test whether the same configuration remains dependable. If the same case is run several times and the uncertainty statement, cited guidance, or escalation flag changes materially, the organisation has evidence that the configuration is unstable. That matters directly for patient safety, because the system may appear acceptable in one instance yet fail to support consistent, reviewable use in routine practice.
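A repeat-run check for this example could be sketched as follows. Each run is reduced to a few recorded fields, and the code reports any field whose most common value does not appear in every run. The field names, values, and the exact-match comparison are assumptions made for illustration; deciding whether a change is material would still need clinical and governance judgement.

```python
from collections import Counter
from typing import Dict, List


def unstable_fields(
    runs: List[Dict[str, str]],
    min_agreement: float = 1.0,  # assumed: require identical values in all runs
) -> Dict[str, float]:
    """Return each field whose most common value appears in fewer than
    `min_agreement` of the runs, mapped to its agreement rate."""
    if not runs:
        return {}
    result = {}
    for field in runs[0].keys():
        counts = Counter(run[field] for run in runs)
        agreement = counts.most_common(1)[0][1] / len(runs)
        if agreement < min_agreement:
            result[field] = agreement
    return result


# Hypothetical records from four repeat runs of the same discharge case.
runs = [
    {"uncertainty_statement": "present", "cited_guidance": "guideline_A",
     "escalation_flag": "yes"},
    {"uncertainty_statement": "present", "cited_guidance": "guideline_A",
     "escalation_flag": "yes"},
    {"uncertainty_statement": "present", "cited_guidance": "guideline_B",
     "escalation_flag": "no"},
    {"uncertainty_statement": "present", "cited_guidance": "guideline_A",
     "escalation_flag": "yes"},
]

print(unstable_fields(runs))
# {'cited_guidance': 0.75, 'escalation_flag': 0.75}
```

In this made-up data the uncertainty statement is stable, while the cited guidance and the escalation flag each change in one run out of four, which is the kind of evidence the paragraph above describes.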

Sources in RAIDT papers

08-RAIDT_Foundations_M_V50
00-RAIDT_Scoring_v1
13-RAIDT-Evidence-Review_M_v10