S5.10 - Calibration
flowchart LR
A[Background problem:
inconsistent reviewer interpretation
anchor ambiguity
uneven evidence thresholds] --> B[RAIDT:
run-level evidence framework]
B --> C[[Calibration:
aligning reviewers around anchors,
evidence expectations, and adjudication]]
H[Practical fields and tools:
healthcare, finance, public services,
training sets, rubric templates, dashboards] --> C
C --> D[Evidence pack review consistency]
C --> E[RAIDT score profile defensibility]
C --> F[Reviewability and contestability]
D --> G[Governance readiness and organisational learning]
E --> G
F --> G
← Star S5 - RAIDT Pillars and Scoring
Star context: Defines the five governance dimensions and how scoring becomes consistent, reviewable, and decision-useful without collapsing important trade-offs into unsupported judgement.
Academic picture
Definition / background
Calibration is the structured process of aligning reviewers around how RAIDT scoring anchors should be interpreted, what kinds of evidence count, how borderline cases should be handled, and when escalation or adjudication is required. In simple terms, it reduces avoidable variation in scoring by making reviewers compare the same kinds of run-level evidence against the same practical standards.
Conceptually, calibration sits between a rubric and its real-world use. A scoring framework can appear clear on paper yet still produce inconsistent outcomes if different reviewers read anchor language differently or apply different assumptions about what counts as sufficient evidence. Calibration addresses that gap by using examples, worked cases, reviewer discussion, decision rules, and evidence pointers so that the scoring process becomes more stable and reviewable.
In GenAI governance, calibration matters because many governance claims are made through mixed evidence, partial documentation, and context-sensitive judgement. RAIDT does not assume that responsible use can be inferred from policy text alone. It asks reviewers to examine a specific run, its configuration, its task context, and the evidence assembled around it. Calibration therefore helps ensure that a RAIDT score reflects the run and the rubric rather than the personal habits of whichever reviewer happened to assess it.
Calibration is related to, but different from, concepts such as training, reviewer consistency, and inter-rater reliability. Training gives reviewers the basic method; consistency is the desired outcome; inter-rater reliability is one way of evaluating agreement. Calibration is the operational process that helps produce those outcomes by aligning interpretation before and during scoring. Within RAIDT, it belongs naturally alongside run-level evidence packs, scoring anchors, repeat runs, and evidence-based scoring because it is what turns a theoretical rubric into a defensible review practice.
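To make the distinction concrete, inter-rater reliability can be computed after the fact from two reviewers' scores, whereas calibration is the work done to improve that number. The sketch below uses Cohen's kappa, one common agreement statistic. RAIDT does not prescribe this statistic, and the three-level scale ("low", "medium", "high") and the sample scores are illustrative assumptions, not project data.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two reviewers who scored the same runs."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both reviewers must score the same non-empty set of runs")
    n = len(scores_a)
    # Observed agreement: share of runs where both reviewers gave the same label.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement, estimated from each reviewer's own label frequencies.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label for every run
    return (observed - expected) / (1 - expected)

# Illustrative (assumed) scores for one pillar across six runs.
reviewer_a = ["high", "medium", "medium", "low", "high", "medium"]
reviewer_b = ["high", "medium", "low", "low", "medium", "medium"]
kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3))
```

A low value after a calibration round would point to anchors or evidence thresholds that still need discussion; it would not, by itself, say which reviewer is right.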
Why this concept matters
Without calibration, a governance framework can look rigorous while still producing unstable judgements. One reviewer may treat a partial audit trail as acceptable, another may score it as inadequate, and a third may compensate for missing evidence with background trust in the team. When that happens, the score profile becomes less meaningful because it partly reflects reviewer variation rather than the actual quality of the run-level evidence.
Calibration solves this by making scoring assumptions explicit. It helps reviewers distinguish between low evidence and low performance, between absent documentation and genuine non-compliance, and between acceptable contextual variation and unjustified discretion. For organisations using GenAI, this matters because scores feed into consequential downstream decisions: they may influence deployment approval, remediation priorities, assurance claims, procurement decisions, or policy responses.
In RAIDT specifically, calibration helps move governance from principles to operational judgement. It reduces the risk that evidence-based scoring becomes assertion-based scoring in practice. It also supports contestability because a challenged score can be traced back to shared anchors, examples, and adjudication logic rather than defended as a purely individual opinion.
Key idea: Calibration matters because RAIDT scoring is only governance-ready when different reviewers can interpret the same run-level evidence in a sufficiently aligned and defensible way.
What this item enables
- Alignment of reviewers around shared interpretations of RAIDT scoring anchors.
- More consistent treatment of evidence completeness, quality, and relevance.
- Clearer handling of ambiguous, borderline, or mixed-quality runs.
- Better separation between judgement that is justified and judgement that is merely intuitive.
- More credible comparison across runs, teams, time periods, and deployment contexts.
- Stronger adjudication pathways for high-risk or disputed assessments.
- Improved confidence that the RAIDT score profile reflects the run rather than reviewer drift.
Practical example / likely audience question
Audience question
Who scores RAIDT, and how do you stop the result from becoming just one reviewer's opinion?
Answer
The concern behind this question is that governance scoring can become subjective if it depends too heavily on expert judgement. The direct answer is that RAIDT scoring is performed by trained reviewers, but those reviewers should not score without first being calibrated against shared anchors and examples. Subject-matter experts may help design anchors, clarify domain-specific evidence expectations, and adjudicate difficult or high-risk cases, while trained reviewers apply the rubric to the evidence pack in a structured way.
For example, two reviewers may assess the same run and agree that the system output looks useful, yet disagree on whether the documentation is sufficient to support a strong Auditability or Traceability score. Calibration resolves this not by eliminating judgement, but by aligning what counts as acceptable evidence, what threshold distinguishes medium from high performance, and when missing material should trigger a lower score or escalation.
RAIDT handles this better than generic AI governance approaches because it ties the discussion to a concrete run, a defined evidence pack, and named scoring pillars. Instead of asking whether a system is responsible in the abstract, reviewers assess whether a specific use instance is sufficiently evidenced and governable under shared standards.
Practical example in RAIDT terms
Consider a healthcare organisation using a generative AI assistant to draft outpatient follow-up letters after clinician consultations. One run concerns a cardiology clinic letter generated for a patient with medication changes and a planned diagnostic review. The run-level issue is not simply whether the output reads well, but whether the organisation can evidence how the letter was produced, reviewed, corrected, and approved in that specific instance.
The evidence pack for this run might include the task definition, prompt or workflow configuration, model and version metadata, human review notes, edit history, sign-off records, and any policy constraints applied to clinical use. Calibration is needed because one reviewer might view clinician sign-off as sufficient evidence of dependability, while another might require clearer traceability of prompt configuration and post-generation edits before giving a strong score.
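One way to give reviewers a common starting point is to represent the evidence pack as a fixed structure and list what is absent before any scoring begins. The sketch below is a minimal illustration: the field names simply mirror the items listed above, and the class name, helper name, and placeholder values are hypothetical rather than part of any RAIDT specification.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class EvidencePack:
    """Hypothetical container for the evidence items named in the example."""
    task_definition: Optional[str] = None
    prompt_or_workflow_config: Optional[str] = None
    model_version_metadata: Optional[str] = None
    human_review_notes: Optional[str] = None
    edit_history: Optional[str] = None
    sign_off_record: Optional[str] = None
    policy_constraints: Optional[str] = None

def missing_items(pack):
    """Return the names of evidence items that are empty or absent."""
    return [f.name for f in fields(pack) if not getattr(pack, f.name)]

# Placeholder values only; a real pack would point to the actual artefacts.
pack = EvidencePack(
    task_definition="Draft outpatient follow-up letter",
    model_version_metadata="recorded",
    human_review_notes="recorded",
    sign_off_record="clinician sign-off present",
)
gaps = missing_items(pack)
print(gaps)
```

With the gaps made explicit, the calibration question becomes narrower: whether clinician sign-off alone can support a strong score when the prompt configuration and edit history are missing.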
The most affected RAIDT pillars are Auditability, Dependability, and Traceability, with Responsibility and Interpretability also implicated. Calibration improves governance readiness here by ensuring that reviewers use the same thresholds for what counts as adequate clinical oversight, sufficient reconstruction detail, and acceptable evidence of safe operational use. That makes the resulting score profile more credible for internal assurance, quality improvement, and external scrutiny.
Detailed link to RAIDT
Calibration links to RAIDT in four ways.
First, it supports RAIDT's core idea that governance should be grounded in evidence rather than principle-only claims.
Second, it operates at the level of the run by aligning how reviewers interpret evidence from one configured use in one context.
Third, it stabilises the translation from evidence pack to score profile by making anchor interpretation more consistent.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning because scoring decisions can be explained through shared standards rather than reviewer intuition alone.
Calibration → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness
In this chain, calibration is the human-alignment mechanism that helps the rest of the RAIDT process function reliably. Without it, even a well-designed evidence pack and rubric may yield unstable decisions; with it, the score profile becomes more defensible as a governance artefact.
Link to the five RAIDT pillars
Responsibility
Calibration helps reviewers apply responsibility-related criteria consistently, especially where accountability, role clarity, or human oversight must be inferred from multiple pieces of evidence rather than a single formal document.
Example evidence / implication:
- Reviewers use shared expectations for what counts as adequate human oversight in a run.
- Escalation rules are aligned for cases where roles, approvals, or decision rights appear unclear.
Auditability
Calibration strongly affects Auditability because reviewers must agree on what level of documentation and reconstructability is sufficient before a run can be considered audit-ready.
Example evidence / implication:
- Reviewers use the same threshold for acceptable logging, review notes, and decision documentation.
- Borderline cases are adjudicated using worked examples rather than ad hoc interpretation.
Interpretability
Calibration supports Interpretability by aligning how reviewers judge whether the reasoning, prompt structure, model behaviour, or explanatory material is understandable enough for the relevant audience.
Example evidence / implication:
- Reviewers share examples of what counts as minimally sufficient explanation for non-technical stakeholders.
- Anchor notes distinguish between output plausibility and genuine interpretive transparency.
Dependability
Calibration is important for Dependability because reliability, robustness, and safe operational use are easily overestimated when reviewers rely on confidence or anecdote instead of shared evidence standards.
Example evidence / implication:
- Reviewers align on what evidence of repeatability, review quality, or fallback handling is needed.
- Missing validation material leads to consistent score consequences across comparable runs.
Traceability
Calibration strongly affects Traceability because this pillar depends on consistent judgement about whether a run can be reconstructed from the available metadata, artefacts, and decision trail.
Example evidence / implication:
- Reviewers use common criteria for prompt, version, timestamp, and workflow trace sufficiency.
- The score reflects evidence completeness rather than a reviewer's personal tolerance for missing provenance detail.
Calibration affects all five pillars, but it is especially consequential for Auditability, Dependability, and Traceability because those pillars are most vulnerable to inconsistent interpretation of evidence sufficiency.
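A shared decision rule is one way to make "consistent score consequences" operational. The sketch below caps a provisional pillar score when required evidence is missing. The evidence names echo the examples given for Auditability and Traceability above, but the three-level scale, the cap level, and the function name are assumptions made for illustration; an actual rubric would set its own.

```python
# Assumed ordinal scale and cap; neither is defined by the source text.
LEVELS = ["low", "medium", "high"]
CAP_WHEN_MISSING = "medium"

# Evidence each pillar needs before a score above the cap is allowed (illustrative).
REQUIRED_EVIDENCE = {
    "Auditability": {"logging", "review_notes", "decision_documentation"},
    "Traceability": {"prompt", "model_version", "timestamp", "workflow_trace"},
}

def apply_evidence_cap(pillar, provisional, available):
    """Return (final_score, missing_evidence) under the shared cap rule."""
    missing = REQUIRED_EVIDENCE.get(pillar, set()) - set(available)
    if not missing:
        return provisional, []
    # Keep whichever is lower on the scale: the provisional score or the cap.
    capped = min(provisional, CAP_WHEN_MISSING, key=LEVELS.index)
    return capped, sorted(missing)

result = apply_evidence_cap("Traceability", "high", {"prompt", "model_version"})
print(result)
```

Because the rule is written down, two reviewers facing the same gap reach the same capped score, and a challenged score can be traced to the named missing items.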
Why this item is more than a generic concept
In general AI governance, calibration may simply mean aligning assessors, tuning measurement instruments, or improving consistency in evaluation. In RAIDT, calibration has a more operational meaning: it is the disciplined alignment of reviewers around run-level evidence, scoring anchors, and decision thresholds for a specific governance framework.
That matters because RAIDT does not treat scoring as a generic maturity exercise. It treats scoring as a review of a concrete run and its evidence pack. Calibration is therefore not an optional quality-improvement extra; it is part of what makes RAIDT usable as a practical governance method rather than a high-level principles checklist.
Common misunderstanding
Misunderstanding
Calibration means forcing reviewers to agree on everything or removing human judgement from RAIDT scoring.
Correction
Calibration does not require artificial unanimity, and it does not replace expert judgement. It structures judgement so that disagreement becomes interpretable rather than arbitrary. For example, calibrated reviewers may still disagree about a difficult healthcare run, but they can explain that disagreement in terms of shared anchors, missing evidence, or domain-specific risk thresholds. That is much more valuable than unexplained variation because it supports adjudication, learning, and framework refinement.
Boundary and limitation
Calibration does not prove that a GenAI system is safe, fair, lawful, or effective. It also does not guarantee perfect agreement between reviewers, especially in novel, high-risk, or weakly documented cases. If the rubric is poorly designed, the evidence pack is sparse, or the organisational context changes quickly, calibration can only improve consistency within those limits.
Calibration also requires maintenance. Reviewer alignment can drift over time, particularly when new use cases, models, domains, or policies are introduced. RAIDT handles this limitation by pairing calibration with scoring anchors, repeat runs, evidence-based scoring, and escalation or adjudication pathways. In other words, calibration is a necessary governance support, but not a substitute for good evidence, good rubric design, or periodic review.
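Drift of this kind can be monitored by periodically re-scoring a small set of anchor runs and comparing each reviewer's scores with the adjudicated reference scores. The sketch below is one possible approach, not a RAIDT requirement: the run identifiers, period labels, scores, and the 0.75 agreement floor are all hypothetical.

```python
def agreement_rate(reviewer_scores, reference_scores):
    """Share of anchor runs where the reviewer matches the adjudicated score."""
    matches = sum(
        reviewer_scores.get(run) == score for run, score in reference_scores.items()
    )
    return matches / len(reference_scores)

def drifting_periods(history, reference_scores, floor=0.75):
    """Return the periods in which agreement fell below the chosen floor."""
    return [
        period
        for period, scores in history.items()
        if agreement_rate(scores, reference_scores) < floor
    ]

# Adjudicated scores for five anchor runs (illustrative values).
reference = {"run-1": "high", "run-2": "medium", "run-3": "low",
             "run-4": "medium", "run-5": "high"}

# One reviewer's re-scoring of the same anchor runs in three periods.
history = {
    "period-1": dict(reference),
    "period-2": {**reference, "run-3": "medium"},
    "period-3": {**reference, "run-3": "medium", "run-5": "medium"},
}
flagged = drifting_periods(history, reference)
print(flagged)
```

A flagged period is a prompt for a refresher calibration session, not evidence that the reviewer is wrong; the anchors themselves may need revising after new models or policies are introduced.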
Implementation levels
Manual implementation
A researcher or small team can apply calibration manually by running joint review sessions, discussing sample runs, comparing provisional scores, documenting why anchors were interpreted in particular ways, and recording decision rules for future scoring.
Semi-automated implementation
Semi-automated calibration can be supported through structured templates, annotated scoring rubrics, evidence checklists, adjudication logs, and dashboards that show where reviewers diverge. This makes alignment work easier to maintain and reuse across repeated assessments.
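The divergence view behind such a dashboard can be very simple. The sketch below flags pillars where the spread between reviewers' scores for one run exceeds a tolerance. The numeric 1 to 5 scale, the tolerance of one point, and the reviewer identifiers are assumptions for illustration only.

```python
PILLARS = ["Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability"]

def divergence_by_pillar(scores_by_reviewer, tolerance=1):
    """Return {pillar: spread} for pillars where max - min exceeds the tolerance."""
    flagged = {}
    for pillar in PILLARS:
        values = [scores[pillar] for scores in scores_by_reviewer.values()]
        spread = max(values) - min(values)
        if spread > tolerance:
            flagged[pillar] = spread
    return flagged

# Illustrative scores from three reviewers for a single run (assumed 1-5 scale).
run_scores = {
    "r1": {"Responsibility": 4, "Auditability": 4, "Interpretability": 3,
           "Dependability": 4, "Traceability": 2},
    "r2": {"Responsibility": 4, "Auditability": 2, "Interpretability": 3,
           "Dependability": 3, "Traceability": 4},
    "r3": {"Responsibility": 3, "Auditability": 3, "Interpretability": 3,
           "Dependability": 4, "Traceability": 4},
}
flagged_pillars = divergence_by_pillar(run_scores)
print(flagged_pillars)
```

Flagged pillars are natural agenda items for the next joint review session, since they show where anchor language is being read differently.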
Fully automated implementation
At scale, a governance platform or orchestration layer can support calibration by versioning rubrics, surfacing comparable historical cases, flagging reviewer divergence patterns, enforcing required metadata fields, routing disputed runs for adjudication, and generating audit-ready records of how scoring standards were applied.
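As a rough sketch of what such a platform might do for a single scored run, the function below enforces required metadata and routes disputed or high-risk runs to adjudication, returning a short reason that can be stored as part of the audit record. The field names, route labels, and threshold are hypothetical; the source does not specify them.

```python
# Hypothetical required fields; the source only says metadata fields are enforced.
REQUIRED_METADATA = ("run_id", "rubric_version", "reviewer_ids")

def route_run(record, spread_threshold=1):
    """Decide the next step for a scored run and return an audit-ready note."""
    missing = [key for key in REQUIRED_METADATA if not record.get(key)]
    if missing:
        return {"route": "return_for_metadata",
                "reason": f"missing: {', '.join(missing)}"}
    if record.get("high_risk", False):
        return {"route": "adjudication", "reason": "run flagged as high risk"}
    if record.get("max_spread", 0) > spread_threshold:
        return {"route": "adjudication",
                "reason": "reviewer divergence above threshold"}
    return {"route": "accept", "reason": "metadata complete and reviewers aligned"}

# Illustrative record: complete metadata, but reviewers diverged by two points.
disputed = {"run_id": "run-017", "rubric_version": "v1",
            "reviewer_ids": ["r1", "r2"], "max_spread": 2}
decision = route_run(disputed)
print(decision)
```

Checking metadata before divergence is a deliberate ordering in this sketch: a run that cannot be tied to a rubric version cannot be adjudicated against shared standards in the first place.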
Practical use in the RAIDT project
Within the RAIDT project, calibration is useful in several connected ways. In Paper 08 Foundations, it helps explain why a scoring framework needs a disciplined review process rather than relying on abstract principles or isolated expert opinion. In Paper 09 Empirical Validation, it provides a basis for discussing reviewer consistency, score stability, and the practical credibility of the RAIDT method. In Paper 10 Policy Pathways, calibration helps show how organisations can operationalise governance so that assurance processes remain comparable across teams and over time.
It is also useful for sector playbooks because each domain can specify examples of adequate evidence without abandoning the shared RAIDT logic. In the evidence pack and scoring rubric, calibration clarifies how anchors should be interpreted in practice. In influence methods and governance interventions, it provides a concrete answer to sceptical audiences who ask how evidence-based scoring avoids collapsing into expert subjectivity. For supervision, viva defence, and journal positioning, calibration helps articulate that RAIDT is not just conceptually neat; it is methodologically disciplined.
Key audience questions to prepare for
Q1. Why is calibration necessary if the rubric is already written down?
A written rubric reduces ambiguity, but it does not remove it. Calibration is needed because reviewers still interpret examples, thresholds, and missing evidence differently unless those interpretations are aligned through practice and discussion.
Q2. Is calibration the same as inter-rater reliability?
No. Inter-rater reliability is an outcome metric or evaluation lens for agreement. Calibration is the process used to improve alignment in how reviewers interpret and apply the rubric.
Q3. Does calibration make RAIDT too resource-intensive for organisations?
Not necessarily. Manual calibration can begin with a small number of sample runs and clear anchor notes. The resource cost is often justified because it prevents unstable scores, poor assurance decisions, and repeated disputes later.
Q4. What happens when calibrated reviewers still disagree?
That disagreement becomes more valuable because it can be traced to a specific ambiguity, evidence gap, or risk threshold. RAIDT can then use adjudication, escalation, or rubric refinement rather than treating disagreement as noise.
Q5. Why is calibration especially important for GenAI governance?
GenAI use is often context-sensitive, rapidly changing, and supported by incomplete documentation. Calibration helps ensure that reviewers assess those runs against shared evidence expectations instead of relying on inconsistent personal standards.
Suggested citation concepts to support this item
- reviewer calibration in audit and assurance practice
- inter-rater reliability in governance assessment
- rubric calibration in qualitative evaluation
- evidence-based scoring methods in AI governance
- human factors in model evaluation and oversight
- consistency and adjudication in high-stakes review
- sociotechnical governance of generative AI systems
- measurement validity and scoring alignment
- organisational assurance for AI deployment
- operationalising responsible AI through review processes
Short explanation for presentation
Calibration is the process that aligns reviewers on how to interpret RAIDT scoring anchors and evidence requirements. It matters because a governance framework can appear rigorous on paper but still produce unstable scores if different reviewers apply different standards to the same run. In RAIDT, calibration makes the move from run-level evidence to a five-pillar score profile more defensible by reducing avoidable variation in judgement. It does not eliminate expertise or disagreement; instead, it makes scoring more transparent, contestable, and repeatable. That is important for supervision, policy discussion, and organisational adoption because it shows that RAIDT is not only a conceptual framework but also a practical review method that can support audit readiness and continuous improvement.
One-line takeaway
Calibration is the structured alignment of reviewer judgement, and it is needed because RAIDT depends on consistent interpretation of run-level evidence to produce defensible governance scores.
Related items in RAIDT pillars and scoring
Mentioned in reference-paper summaries (5)
Paper summaries live in Port/93-References/pdf_summaries/. Each file listed below contains the key term at least once.
- REF-026__Crisan-2022.md
- REF-028__D'Amour-2020.md
- REF-059__Karpukhin-2020.md
- REF-099__Schro-2022.md
- REF-102__Sendak-2020.md