S5.10 - Calibration


flowchart LR
    A["Background problem:<br/>inconsistent reviewer interpretation<br/>anchor ambiguity<br/>uneven evidence thresholds"] --> B["RAIDT:<br/>run-level evidence framework"]
    B --> C[["Calibration:<br/>aligning reviewers around anchors,<br/>evidence expectations, and adjudication"]]
    H["Practical fields and tools:<br/>healthcare, finance, public services,<br/>training sets, rubric templates, dashboards"] --> C
    C --> D["Evidence pack review consistency"]
    C --> E["RAIDT score profile defensibility"]
    C --> F["Reviewability and contestability"]
    D --> G["Governance readiness and organisational learning"]
    E --> G
    F --> G

← Star S5 - RAIDT Pillars and Scoring

Star context: Defines the five governance dimensions and how scoring becomes consistent, reviewable, and decision-useful without collapsing important trade-offs into unsupported judgement.

Academic picture

Definition / background

Calibration is the structured process of aligning reviewers around how RAIDT scoring anchors should be interpreted, what kinds of evidence count, how borderline cases should be handled, and when escalation or adjudication is required. In simple terms, it reduces avoidable variation in scoring by making reviewers compare the same kinds of run-level evidence against the same practical standards.

Conceptually, calibration sits between a rubric and its real-world use. A scoring framework can appear clear on paper yet still produce inconsistent outcomes if different reviewers read anchor language differently or apply different assumptions about what counts as sufficient evidence. Calibration addresses that gap by using examples, worked cases, reviewer discussion, decision rules, and evidence pointers so that the scoring process becomes more stable and reviewable.

In GenAI governance, calibration matters because many governance claims are made through mixed evidence, partial documentation, and context-sensitive judgement. RAIDT does not assume that responsible use can be inferred from policy text alone. It asks reviewers to examine a specific run, its configuration, its task context, and the evidence assembled around it. Calibration therefore helps ensure that a RAIDT score reflects the run and the rubric rather than the personal habits of whichever reviewer happened to assess it.

Calibration is related to, but different from, concepts such as training, reviewer consistency, and inter-rater reliability. Training gives reviewers the basic method; consistency is the desired outcome; inter-rater reliability is one way of evaluating agreement. Calibration is the operational process that helps produce those outcomes by aligning interpretation before and during scoring. Within RAIDT, it belongs naturally alongside run-level evidence packs, scoring anchors, repeat runs, and evidence-based scoring because it is what turns a theoretical rubric into a defensible review practice.
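As an illustration of how inter-rater reliability can be used to check whether calibration is working, the sketch below computes Cohen's kappa for two reviewers who scored the same runs. This is one common agreement statistic, not a measure prescribed by RAIDT; the function name, the low/medium/high labels, and the sample scores are hypothetical.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two reviewers scoring the same runs (labels treated as nominal)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both reviewers must score the same non-empty set of runs")
    n = len(scores_a)
    # Observed agreement: share of runs where both reviewers gave the same label.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement expected from each reviewer's own label frequencies.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical Auditability scores from two reviewers on the same six runs.
reviewer_1 = ["low", "medium", "medium", "high", "high", "low"]
reviewer_2 = ["low", "medium", "high", "high", "medium", "low"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

A value near 1 suggests reviewers are reading the anchors in the same way; a value near 0 suggests agreement is no better than chance and that the anchors or worked examples may need another calibration round. Because kappa as written treats labels as unordered, a weighted variant may suit ordered anchor levels better.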

Why this concept matters

Without calibration, a governance framework can look rigorous while still producing unstable judgements. One reviewer may treat a partial audit trail as acceptable, another may score it as inadequate, and a third may compensate for missing evidence with background trust in the team. When that happens, the score profile becomes less meaningful because it partly reflects reviewer variation rather than the actual quality of the run-level evidence.

Calibration addresses this by making scoring assumptions explicit. It helps reviewers distinguish between low evidence and low performance, between absent documentation and genuine non-compliance, and between acceptable contextual variation and unjustified discretion. For organisations using GenAI, this matters because governance scores often feed consequential downstream decisions: they may influence deployment approval, remediation priorities, assurance claims, procurement decisions, or policy responses.
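One way to keep low evidence separate from low performance is to record them as two distinct fields rather than folding both into a single number. The sketch below is a minimal illustration under that assumption; the class names, the 1 to 5 anchor scale, and the status labels are hypothetical and not part of RAIDT itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EvidenceStatus(Enum):
    SUFFICIENT = "sufficient"
    PARTIAL = "partial"
    ABSENT = "absent"


@dataclass(frozen=True)
class PillarFinding:
    pillar: str
    evidence: EvidenceStatus
    # Hypothetical 1-5 anchor level; None means the pillar could not be assessed.
    performance: Optional[int]


def describe(finding: PillarFinding) -> str:
    """Report evidence status and performance separately, never merged into one score."""
    if finding.performance is None:
        return f"{finding.pillar}: not assessable, evidence {finding.evidence.value}"
    return f"{finding.pillar}: level {finding.performance}, evidence {finding.evidence.value}"


# Absent documentation is recorded as "not assessable", not as a failing level.
undocumented = PillarFinding("Traceability", EvidenceStatus.ABSENT, None)
# A well-evidenced but weak result is recorded as a genuine low level.
weak_result = PillarFinding("Dependability", EvidenceStatus.SUFFICIENT, 2)
```

Keeping the two fields apart means a reviewer cannot silently convert missing documentation into a low score, or background trust into a high one.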

In RAIDT specifically, calibration helps move governance from principles to operational judgement. It reduces the risk that evidence-based scoring becomes assertion-based scoring in practice. It also supports contestability because a challenged score can be traced back to shared anchors, examples, and adjudication logic rather than defended as a purely individual opinion.

Key idea: Calibration matters because RAIDT scoring is only governance-ready when different reviewers can interpret the same run-level evidence in a sufficiently aligned and defensible way.

What this item enables

Alignment of reviewers around shared interpretations of RAIDT scoring anchors.
More consistent treatment of evidence completeness, quality, and relevance.
Clearer handling of ambiguous, borderline, or mixed-quality runs.
Better separation between judgement that is justified and judgement that is merely intuitive.
More credible comparison across runs, teams, time periods, and deployment contexts.
Stronger adjudication pathways for high-risk or disputed assessments.
Improved confidence that the RAIDT score profile reflects the run rather than reviewer drift.

Practical example / likely audience question

Audience question

Who scores RAIDT, and how do you stop the result from becoming just one reviewer's opinion?

Answer

The concern behind this question is that governance scoring can become subjective if it depends too heavily on expert judgement. The direct answer is that RAIDT scoring can be performed by trained reviewers, but those reviewers should not operate in isolation from calibration. Subject-matter experts may help design anchors, clarify domain-specific evidence expectations, and adjudicate difficult or high-risk cases, while trained reviewers apply the rubric to the evidence pack in a structured way.

For example, two reviewers may assess the same run and agree that the system output looks useful, yet disagree on whether the documentation is sufficient to support a strong Auditability or Traceability score. Calibration resolves this not by eliminating judgement, but by aligning what counts as acceptable evidence, what threshold distinguishes medium from high performance, and when missing material should trigger a lower score or escalation.

RAIDT is better placed to handle this than generic AI governance approaches because it ties the discussion to a concrete run, a defined evidence pack, and named scoring pillars. Instead of asking whether a system is responsible in the abstract, reviewers assess whether a specific use instance is sufficiently evidenced and governable under shared standards.
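The disagreement case described in this answer can be expressed as a simple decision rule agreed during calibration. The sketch below is illustrative only: the function name, the integer scale, the one-level tolerance, and the choice to accept the lower of two close scores are all assumptions, not rules defined by RAIDT.

```python
def review_outcome(score_a, score_b, missing_evidence, tolerance=1):
    """Return (outcome, agreed_score) for one pillar scored by two reviewers.

    score_a, score_b: hypothetical integer anchor levels from each reviewer.
    missing_evidence: names of expected evidence items not found in the pack.
    tolerance: largest score gap accepted without adjudication (assumed value).
    """
    if missing_evidence:
        # Missing material is escalated rather than scored around.
        return ("escalate", None)
    if abs(score_a - score_b) > tolerance:
        # A wide gap signals that the reviewers read the anchor differently.
        return ("adjudicate", None)
    # Close scores are accepted conservatively at the lower level (an assumption).
    return ("accept", min(score_a, score_b))


close_scores = review_outcome(4, 3, [])
wide_gap = review_outcome(5, 2, [])
incomplete_pack = review_outcome(4, 4, ["edit history"])
```

Writing the rule down, in whatever form an organisation prefers, is what allows a challenged score to be traced back to shared adjudication logic instead of an individual reviewer's preference.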

Practical example in RAIDT terms

Consider a healthcare organisation using a generative AI assistant to draft outpatient follow-up letters after clinician consultations. One run concerns a cardiology clinic letter generated for a patient with medication changes and a planned diagnostic review. The run-level issue is not simply whether the output reads well, but whether the organisation can evidence how the letter was produced, reviewed, corrected, and approved in that specific instance.

The evidence pack for this run might include the task definition, prompt or workflow configuration, model and version metadata, human review notes, edit history, sign-off records, and any policy constraints applied to clinical use. Calibration is needed because one reviewer might view clinician sign-off as sufficient evidence of dependability, while another might require clearer traceability of prompt configuration and post-generation edits before giving a strong score.

The most affected RAIDT pillars are Auditability, Dependability, and Traceability, with Responsibility and Interpretability also implicated. Calibration improves governance readiness here by ensuring that reviewers use the same thresholds for what counts as adequate clinical oversight, sufficient reconstruction detail, and acceptable evidence of safe operational use. That makes the resulting score profile more credible for internal assurance, quality improvement, and external scrutiny.
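The evidence pack items listed in this example can be checked for completeness before scoring begins, so that reviewers start from the same view of what is present and what is missing. The sketch below assumes a pack represented as a dictionary; the key names and the decision to treat every listed item as required are hypothetical, since the example says only that a pack might include them.

```python
# Items drawn from the example above; treating all of them as required is an assumption.
EXPECTED_ITEMS = (
    "task_definition",
    "prompt_or_workflow_config",
    "model_version_metadata",
    "human_review_notes",
    "edit_history",
    "sign_off_records",
    "policy_constraints",
)


def missing_items(evidence_pack):
    """List expected items that are absent or empty in the pack, in a fixed order."""
    return [item for item in EXPECTED_ITEMS if not evidence_pack.get(item)]


# Hypothetical pack for the cardiology letter run: sign-off exists, edit history does not.
cardiology_run = {
    "task_definition": "Draft outpatient follow-up letter",
    "prompt_or_workflow_config": "",
    "model_version_metadata": "recorded",
    "human_review_notes": "clinician comments attached",
    "sign_off_records": "clinician sign-off recorded",
    "policy_constraints": "clinical use policy applied",
}
gaps = missing_items(cardiology_run)
```

In this hypothetical pack the check surfaces the same gap the two reviewers disagreed about: sign-off is present, but the prompt configuration and edit history needed for reconstruction are not.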

Detailed link to RAIDT

Calibration links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be grounded in evidence rather than principle-only claims.
Second, it operates at the level of the run by aligning how reviewers interpret evidence from one configured use in one context.
Third, it stabilises the translation from evidence pack to score profile by making anchor interpretation more consistent.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning because scoring decisions can be explained through shared standards rather than reviewer intuition alone.

Calibration → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

In this chain, calibration is the human-alignment mechanism that helps the rest of the RAIDT process function reliably. Without it, even a well-designed evidence pack and rubric may yield unstable decisions; with it, the score profile becomes more defensible as a governance artefact.

Link to the five RAIDT pillars

Responsibility

Calibration helps reviewers apply responsibility-related criteria consistently, especially where accountability, role clarity, or human oversight must be inferred from multiple pieces of evidence rather than a single formal document.