Q281 — What is in the evidence pack, and how are the five pillars scored?

Appears in sources

workshop_table17#tag-band S4–S5 · 65–95 min

Answer

The RAIDT run-level evidence pack is the evidential bundle for one governed run. Across the papers, its minimum contents include the purpose and stakes of the use case; the run identifier and timestamp; the prompt or template and its version; the model, provider, and version, or the deployment ID; decoding settings; tool configuration or tool-call traces; any user-supplied inputs; retrieval settings; retrieved passage or document identifiers with hashes; the produced output with an integrity hash; recorded checks; and the human oversight actions taken, such as review, approval, edits, or escalation. Where full text cannot be retained openly, the papers still require immutable identifiers and hashes so that later verification remains possible.
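The minimum contents listed above can be pictured as a single record. The following is a minimal sketch, not a schema taken from the RAIDT papers: the class name EvidencePack, every field name, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class EvidencePack:
    # Hypothetical field names mirroring the minimum contents described above.
    purpose_and_stakes: str
    run_id: str
    timestamp: str
    prompt_template: str
    prompt_version: str
    model_id: str             # model/provider/version or deployment ID
    decoding_settings: dict
    tool_traces: list
    user_inputs: list
    retrieval_settings: dict
    retrieved_hashes: dict    # document identifier -> hash of the retrieved passage
    output_hash: str          # integrity hash of the produced output
    checks: list
    oversight_actions: list   # e.g. review, approval, edits, escalation


def pack_fingerprint(pack: EvidencePack) -> str:
    """Hash the whole pack in a stable key order so later tampering is detectable."""
    return sha256_hex(json.dumps(asdict(pack), sort_keys=True))
```

Storing hashes rather than full text is what allows the pack to stay verifiable when the underlying content cannot be retained openly: anyone holding the original passage can recompute its digest and compare it with the recorded one.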

The five pillars are then scored from this pack rather than from the apparent quality of the answer. Responsibility is scored from evidence of constraints, safety checks, uncertainty communication, and oversight decisions. Auditability is scored from reconstructability and record integrity. Interpretability is scored from explanation structure, bounded reasoning formats, and explicit limits statements. Dependability is scored from evidence about behavioural stability, especially repeat-run summaries, dispersion, perturbation tests, and monitoring. Traceability is scored from provenance: preserved retrieval snapshots, source identifiers, tool-chain records, prompt provenance, and versioned alignment or adapter artefacts. The scoring logic is anchored, not impressionistic: each pillar is rated against fixed anchors, where 1 = missing, 3 = partial, and 5 = audit-ready.
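Anchored scoring can be sketched as a small record that refuses a score without supporting evidence. This is an illustration under stated assumptions, not the RAIDT scoring implementation: the names PillarScore and anchor_label are hypothetical, scores 2 and 4 are assumed to be permitted as values between the anchors, and the rule that any score above 1 needs at least one evidence pointer is one possible reading of the evidence-based requirement.

```python
from dataclasses import dataclass

PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)

# Anchors as stated in the text: 1 = missing, 3 = partial, 5 = audit-ready.
ANCHORS = {1: "missing", 3: "partial", 5: "audit-ready"}


@dataclass(frozen=True)
class PillarScore:
    pillar: str
    score: int
    evidence_pointers: tuple  # hypothetical: names of evidence pack fields

    def __post_init__(self):
        if self.pillar not in PILLARS:
            raise ValueError(f"unknown pillar: {self.pillar}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        # Assumption: a score above 'missing' must cite evidence in the pack.
        if self.score > 1 and not self.evidence_pointers:
            raise ValueError("scores above 1 need at least one evidence pointer")


def anchor_label(score: int) -> str:
    """Return the anchor wording for a score, or a marker for 2 and 4."""
    return ANCHORS.get(score, "between anchors")
```

For example, PillarScore("Auditability", 5, ("run_id", "output_hash")) is accepted, while PillarScore("Auditability", 3, ()) raises a ValueError because no evidence is cited.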

RAIDT may report a composite mean, but the substantive result is the score profile across the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability). High scores therefore signal strong governance evidence for that run, not a general certification of the model and not a guarantee of factual correctness or legal compliance. The method is evidence-based because every score should be justified by evidence pointers inside the run-level evidence pack.
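The difference between the composite mean and the score profile can be shown with a short sketch. The function names and both example profiles are made up for illustration; they are not scores from the RAIDT papers.

```python
PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)


def composite_mean(profile: dict) -> float:
    """Unweighted mean over the five pillars; every pillar must be scored."""
    missing = [p for p in PILLARS if p not in profile]
    if missing:
        raise ValueError(f"unscored pillars: {missing}")
    return sum(profile[p] for p in PILLARS) / len(PILLARS)


def weakest_pillar(profile: dict) -> str:
    """Return the lowest-scoring pillar (first in pillar order on ties)."""
    return min(PILLARS, key=lambda p: profile[p])


# Two illustrative profiles with the same composite mean but different shapes.
uneven = {"Responsibility": 5, "Auditability": 5, "Interpretability": 5,
          "Dependability": 1, "Traceability": 4}
even = {p: 4 for p in PILLARS}

print(composite_mean(uneven), weakest_pillar(uneven))  # 4.0 Dependability
print(composite_mean(even))                            # 4.0
```

Both profiles average 4.0, yet the first has no dependability evidence at all. That gap is visible only in the profile, which is why the profile rather than the mean carries the substantive result.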

Practical example

In the hospital discharge-summary example, the evidence pack would contain the discharge-summary prompt template and version, the model deployment ID, the retrieval snapshot for the internal guideline excerpt, hashes for the prompt and output, and the clinician's review decision before the summary entered the patient record. It would also show the structured output requirement and the explicit uncertainty statement.

Scoring follows directly from those artefacts. Responsibility improves if the clinician review and limitation statement are present. Auditability reaches a high level only if the run can be reconstructed from identifiers and hashes. Interpretability is strengthened by the fixed schema and uncertainty section. Dependability remains lower unless repeat-run evidence is also available, which is why the worked paper example scores that pillar more cautiously. Traceability depends on preserving the retrieved guideline snapshot rather than merely citing a source in the text.
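Repeat-run evidence for Dependability could be summarised along the following lines. This is only a sketch: the function name repeat_run_summary and the exact-match agreement measure are assumptions chosen for simplicity, not the dispersion measure defined in the RAIDT papers.

```python
import hashlib
from collections import Counter


def repeat_run_summary(outputs: list) -> dict:
    """Summarise how often repeated runs of the same prompt agree exactly.

    Outputs are compared by SHA-256 digest, so the summary can be stored
    in an evidence pack without retaining the full text of every run.
    """
    if not outputs:
        raise ValueError("at least one output is required")
    digests = [hashlib.sha256(o.encode("utf-8")).hexdigest() for o in outputs]
    counts = Counter(digests)
    modal_count = counts.most_common(1)[0][1]
    return {
        "runs": len(outputs),
        "distinct_outputs": len(counts),
        # Share of runs that produced the most common output.
        "modal_agreement": modal_count / len(outputs),
    }


print(repeat_run_summary(["a", "a", "b", "a"]))
# {'runs': 4, 'distinct_outputs': 2, 'modal_agreement': 0.75}
```

Exact matching is deliberately crude, since it treats a one-character difference as a different output. A real assessment would probably compare structured fields or use a similarity measure, but even this summary gives the scorer something concrete to point to.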

Sources in RAIDT papers

08-RAIDT_Foundations_M_V50
00-RAIDT_Scoring_v1
13-RAIDT-Evidence-Review_M_v10