
Q143 — How does scoring work in RAIDT and what do low, basic, and strong scores mean?

← RAIDT · Star S5 - RAIDT Pillars and Scoring · primary item: S5.06 · Scoring anchors

Appears in sources

integrated_82#Q3.17

Answer

Scoring in RAIDT is a structured assessment of governance readiness at run level. The scored object is not the model in the abstract and not the generated text alone; it is the run-level evidence pack derived from the run record. Reviewers assess the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability) on a 1-5 scale, justify each judgement with evidence pointers, and retain the resulting score profile rather than collapsing everything into a single opaque number. A composite mean can be reported, but the papers emphasise that the profile is primary because governance trade-offs are often the substantive finding.
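As a rough illustration of the score profile described above, the following Python sketch records one 1-5 score per pillar together with its evidence pointers, and reports the composite mean alongside the profile rather than in place of it. The class names, field names, and run identifier are hypothetical and are not taken from the RAIDT papers.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# The five RAIDT pillars named in the text.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


@dataclass(frozen=True)
class PillarScore:
    """One reviewer judgement: a pillar, a 1-5 score, and evidence pointers."""
    pillar: str
    score: int
    evidence: Tuple[str, ...] = ()  # hypothetical pointers into the evidence pack

    def __post_init__(self):
        if self.pillar not in PILLARS:
            raise ValueError(f"unknown pillar: {self.pillar}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")


@dataclass
class ScoreProfile:
    """Per-pillar scores for a single run; the profile itself is retained."""
    run_id: str
    scores: Dict[str, PillarScore] = field(default_factory=dict)

    def add(self, pillar_score: PillarScore) -> None:
        self.scores[pillar_score.pillar] = pillar_score

    def is_complete(self) -> bool:
        return set(self.scores) == set(PILLARS)

    def composite_mean(self) -> float:
        # Secondary summary only; it hides the trade-offs the profile shows.
        if not self.is_complete():
            raise ValueError("all five pillars must be scored first")
        return sum(s.score for s in self.scores.values()) / len(self.scores)
```

A profile of 4, 2, 3, 2, 4 has a composite mean of 3.0, yet the two pillars scored 2 are the substantive finding, which is why the profile is kept.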

Operationally, scoring can be manual in early pilots, partially automated where logs can verify objective evidence fields, or more highly automated for evidence-completeness checks and repeat-run stability tests. A low score corresponds to the lower end of the anchor scale: missing evidence, failed pillar intent, or a non-reconstructable run. A basic score corresponds to the middle anchor, where evidence is present but partial and may be tolerable only in lower-risk settings. A strong score corresponds to the upper end of the scale, especially 5, where the evidence pack supports reconstruction, review, and justified use for the stated task. RAIDT therefore uses scoring to show how well governance has been operationalised, not to certify factual correctness or legal compliance. In practice, low scores trigger concrete governance improvements, while strong scores indicate that the run is evidentially ready for scrutiny.
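The low, basic, and strong bands can be sketched as a simple lookup. The cut-offs below (1-2 low, 3 basic, 4-5 strong) are an assumption made for illustration: the text only places low at the lower end, basic at the middle anchor, and strong at the upper end, so the exact boundaries should be checked against the scoring anchors. The function names are hypothetical.

```python
from typing import Dict, List


def band(score: int) -> str:
    """Map a 1-5 pillar score to a band (assumed cut-offs, see lead-in)."""
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer between 1 and 5")
    if score <= 2:
        return "low"      # missing evidence or non-reconstructable run
    if score == 3:
        return "basic"    # evidence present but partial
    return "strong"       # evidence supports reconstruction and review


def improvement_targets(profile: Dict[str, int]) -> List[str]:
    """Return the pillars whose low score should trigger governance work."""
    return [pillar for pillar, score in profile.items() if band(score) == "low"]
```

For example, a profile with Auditability at 2 and Dependability at 1 would list those two pillars as improvement targets, while pillars at 3 or above would not appear.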

Practical example

In a cybersecurity workflow, a GenAI tool may draft an incident-triage summary. A low score means the organisation cannot show the prompt version, system configuration, checks performed, or whether repeat runs under fixed settings produce stable outputs. The summary may still sound competent, but Dependability and Auditability remain weak because the run cannot be scrutinised properly.

A basic score means the team has captured some of that evidence, perhaps the prompt and output plus a reviewer note, but stability testing or provenance remains incomplete. A strong score requires more: logged identifiers, output integrity records, documented checks, and evidence that repeat-run behaviour is sufficiently stable for the workflow. The resulting score profile then tells the organisation whether the tool is governable enough for operational use or whether escalation is needed.
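The two automatable checks in this example, evidence completeness and repeat-run stability, might look roughly like the sketch below. The required field names are hypothetical stand-ins for the prompt version, system configuration, documented checks, and output integrity record mentioned above. Exact-match agreement between hashed outputs is an assumed and deliberately strict stability measure; free-text summaries may need a looser comparison in practice.

```python
import hashlib
from collections import Counter
from typing import Dict, List

# Hypothetical field names for the evidence a run record should carry.
REQUIRED_FIELDS = ("prompt_version", "system_config",
                   "checks_performed", "output_hash")


def missing_evidence(run_record: Dict[str, object]) -> List[str]:
    """List required evidence fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not run_record.get(name)]


def output_digest(text: str) -> str:
    """SHA-256 digest of an output, usable as an integrity record."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def repeat_run_agreement(outputs: List[str]) -> float:
    """Share of repeat runs that match the most common output exactly."""
    if not outputs:
        raise ValueError("at least one output is required")
    counts = Counter(output_digest(o) for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)
```

A run record with only a prompt version and an output hash would report the system configuration and checks as missing, which matches the basic-score situation described above. What agreement level counts as sufficiently stable is a workflow decision and is not fixed by this sketch.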

Sources in RAIDT papers

00-RAIDT_Scoring_v1
08-RAIDT_Foundations_M_V50