Q014 — What must be true before RAIDT scoring is meaningful?

Star S5 - RAIDT Pillars and Scoring · primary item: S5.13 · Evidence-based scoring

Scoring is valid only when evidence is concrete, comparable, and tied to one run.

Answer

For RAIDT scoring to be meaningful, the organisation must first treat the run as the unit of governance. The scored object is therefore the run-level evidence pack for one configured use in context, not the model in general and not the fluency of the generated prose. The papers are explicit that governance claims become credible only when a specific run can be reconstructed and judged against its stated task, risk, and oversight arrangements. If that basic unit is missing, scoring collapses into impressionistic judgement about style or apparent usefulness rather than evidence-based governance.

A second precondition is evidential completeness anchored to inspectable artefacts. The run-level evidence pack must contain enough information to reconstruct what happened: identifiers for prompts and templates, model deployment and parameters, retrieval snapshot identifiers and hashes where retrieval was used, output integrity hashes, checks performed, and documented human review or escalation. Narrative summaries do not meet this threshold unless their key claims are anchored to logs, repositories, or other stable identifiers. In RAIDT terms, scoring is meaningful only when an independent reviewer can inspect evidence pointers rather than trust after-the-fact description.
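The completeness check described above can be sketched in Python. This is a minimal illustration, not a schema from the RAIDT papers: the field names (`prompt_id`, `template_id`, `retrieval_used`, and so on) and the function `missing_evidence` are hypothetical, chosen only to mirror the kinds of evidence listed in the paragraph.

```python
from typing import Any, Mapping

# Hypothetical field names; the source lists kinds of evidence, not a schema.
REQUIRED_FIELDS = (
    "prompt_id",
    "template_id",
    "model_deployment",
    "model_parameters",
    "output_hash",
    "checks_performed",
    "oversight_record",
)

# Assumed to be required only when the run actually used retrieval.
RETRIEVAL_FIELDS = ("retrieval_snapshot_id", "retrieval_hash")


def missing_evidence(pack: Mapping[str, Any]) -> list[str]:
    """Return the names of evidence pointers that are absent or empty."""
    required = list(REQUIRED_FIELDS)
    if pack.get("retrieval_used", False):
        required.extend(RETRIEVAL_FIELDS)
    # An empty string, list, or dict counts as missing evidence.
    return [name for name in required if not pack.get(name)]
```

Under these assumptions, a pack would be passed to scoring only when `missing_evidence(pack)` returns an empty list. A narrative summary with no identifiers fails the check however well it is written.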

A third precondition is disciplined measurement. Reviewers need shared anchors (1 = missing, 3 = partial, 5 = audit-ready), calibration through worked examples, and conservative treatment of high-stakes runs. The resulting score profile across the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability) measures governance readiness; it does not, by itself, measure factual correctness or legal compliance. In short, RAIDT scoring becomes meaningful only when a run is evidentially reconstructable, contextually bounded, and scored through a common rubric rather than through opinion.
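One way to picture the shared rubric is a small Python sketch. It rests on two assumptions that the source does not state: scores are integers from 1 to 5, with the anchors at 1, 3, and 5, and "conservative treatment" is read as keeping the lowest score any reviewer gave for each pillar. The function names are hypothetical.

```python
PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)

# Shared anchors from the rubric; 2 and 4 are assumed to sit between them.
ANCHORS = {1: "missing", 3: "partial", 5: "audit-ready"}


def validate_profile(scores: dict[str, int]) -> dict[str, int]:
    """Check that every pillar has an integer score between 1 and 5."""
    absent = [p for p in PILLARS if p not in scores]
    if absent:
        raise ValueError(f"unscored pillars: {absent}")
    for pillar in PILLARS:
        value = scores[pillar]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{pillar}: score {value!r} is outside 1-5")
    return {p: scores[p] for p in PILLARS}


def conservative_profile(profiles: list[dict[str, int]]) -> dict[str, int]:
    """Per pillar, keep the lowest score given by any reviewer."""
    if not profiles:
        raise ValueError("at least one reviewer profile is required")
    checked = [validate_profile(p) for p in profiles]
    return {p: min(c[p] for c in checked) for p in PILLARS}
```

The result stays a five-value profile rather than a single number, in keeping with the text's point that the profile describes governance readiness pillar by pillar.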

Practical example

Consider a public-service eligibility advice run. A caseworker asks a GenAI assistant to draft guidance for a claimant, and the answer cites a benefits rule. If the team keeps only the polished response, RAIDT scoring is not meaningful: reviewers cannot tell which policy version was retrieved, which prompt template was active, or whether a supervisor approved use of the draft. A persuasive answer could still mask a weak governance process.

Scoring becomes meaningful once the same run is stored as a run-level evidence pack: run ID and timestamp, prompt version, model deployment ID, retrieval snapshot of the exact rule text with hashes, output hash, and oversight record showing whether the caseworker accepted or escalated the advice. At that point, low or high scores are defensible because they are attached to reconstructable evidence rather than to writing quality alone.
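The evidence pack for this example could be recorded roughly as follows. This is a sketch under stated assumptions: the class `EvidencePack`, its field names, and the choice of SHA-256 are illustrative, since the source mentions hashes without naming an algorithm or a storage format.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    """Hash text with SHA-256 (an assumed choice of algorithm)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EvidencePack:
    """Hypothetical record for one run; field names are illustrative."""
    run_id: str
    timestamp: str
    prompt_version: str
    model_deployment_id: str
    retrieval_hash: str  # hash of the exact rule text that was retrieved
    output_hash: str     # hash of the generated draft
    oversight: str       # "accepted" or "escalated"


def build_pack(run_id: str, timestamp: str, prompt_version: str,
               model_deployment_id: str, rule_text: str,
               output_text: str, oversight: str) -> EvidencePack:
    """Assemble the pack, hashing the rule text and the output."""
    if oversight not in ("accepted", "escalated"):
        raise ValueError("oversight must be 'accepted' or 'escalated'")
    return EvidencePack(
        run_id=run_id,
        timestamp=timestamp,
        prompt_version=prompt_version,
        model_deployment_id=model_deployment_id,
        retrieval_hash=sha256_hex(rule_text),
        output_hash=sha256_hex(output_text),
        oversight=oversight,
    )


def output_matches(pack: EvidencePack, candidate_text: str) -> bool:
    """Check whether a draft is the one recorded for this run."""
    return pack.output_hash == sha256_hex(candidate_text)
```

A reviewer holding the stored draft can call `output_matches` to confirm it is the text the run produced, so the score attaches to that run and not to a later edited copy.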
