
Q159 — How do the empirical programme and the sector playbooks make RAIDT testable?


Answer

The empirical programme makes RAIDT testable by converting governance from a general aspiration into an evidence-centred measurement design. Rather than judging a model in the abstract, RAIDT defines the run as the unit of governance and scores the run-level evidence pack for each material use. This means that the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability) are not treated as slogans; they become observable outcomes in a score profile grounded in captured prompts, configurations, retrieved context, outputs, checks, and review notes. In the design science programme summarised across the foundations and academic-logic papers, influence configurations are compared through repeated runs, variance-aware testing, and reviewer calibration. That structure allows claims about evidence completeness, retrieval grounding, alignment, and stacked configurations to be examined empirically rather than defended narratively.

Sector playbooks make the same logic testable under domain constraints rather than only in generic scenarios. The programme explicitly moves from cross-domain evaluation to sector calibration and playbook development, while keeping the same evidence grammar, the same score profile, and the same scoring anchors (1 = missing, 3 = partial, 5 = audit-ready). This matters because RAIDT treats influence methods as governance interventions: healthcare, finance, public services, education, and cybersecurity can each test whether a given intervention actually improves governance readiness when the required evidence is present. Playbooks therefore do not dilute the theory; they stress-test its boundary conditions by showing whether a common run-level evidence pack and scoring logic remain workable when sector duties, escalation rules, provenance requirements, and oversight thresholds become more demanding.
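One way to picture a playbook layering sector requirements on top of a common evidence grammar is as an extra set of required fields checked against the same pack. The sketch below is illustrative only: the field names, the `PLAYBOOK_EXTRAS` table, and the `missing_evidence` helper are assumptions and do not come from the RAIDT papers. The healthcare extras mirror the expectations named in the practical example.

```python
# Hypothetical field names. The common set mirrors the run-level evidence
# pack described in this answer; the retrieval snapshot hash is left out
# because it only applies when retrieval is used.
COMMON_REQUIRED = frozenset({
    "prompt_template_id",
    "model_deployment_id",
    "decoding_settings",
    "output",
    "output_hash",
    "safety_check",
    "human_oversight",
})

# Assumed sector extras; only healthcare is filled in, from the example.
PLAYBOOK_EXTRAS = {
    "healthcare": frozenset({
        "uncertainty_wording",
        "red_flag_escalation",
        "review_responsibility",
    }),
}


def missing_evidence(pack: dict, sector: str) -> list:
    """Return the sorted names of required fields that are absent or empty."""
    required = COMMON_REQUIRED | PLAYBOOK_EXTRAS.get(sector, frozenset())
    return sorted(k for k in required if pack.get(k) in (None, ""))
```

A pack that satisfies the common set can still fail a sector check, which is the sense in which a playbook tightens the test without changing the underlying evidence grammar.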

Practical example

In a healthcare note-summarisation workflow, a clinician uses GenAI to draft a summary for a high-risk presentation. Under RAIDT, the run-level evidence pack records the prompt template ID, the model deployment ID, the decoding settings, the hash of any retrieval snapshot, the output, the output hash, and the safety-check and human-oversight flags. The healthcare playbook then adds sector-specific expectations such as uncertainty wording, red-flag escalation, and explicit review responsibility.
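The evidence pack described above can be sketched as a small record type. This is a minimal illustration, not the schema used in the RAIDT papers: the class name, field names, the choice of SHA-256, and the `build_pack` and `verify_output` helpers are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class EvidencePack:
    # Hypothetical field names mirroring the items listed in the text.
    prompt_template_id: str
    model_deployment_id: str
    decoding_settings: dict
    retrieval_snapshot_hash: Optional[str]  # None when no retrieval was used
    output: str
    output_hash: str
    safety_check_passed: bool
    human_oversight: bool


def build_pack(prompt_template_id: str, model_deployment_id: str,
               decoding_settings: dict, output: str,
               safety_check_passed: bool, human_oversight: bool,
               retrieved_context: Optional[str] = None) -> EvidencePack:
    """Assemble a pack, hashing the output and any retrieved context."""
    snapshot_hash = None
    if retrieved_context is not None:
        snapshot_hash = sha256_hex(retrieved_context)
    return EvidencePack(
        prompt_template_id=prompt_template_id,
        model_deployment_id=model_deployment_id,
        decoding_settings=dict(decoding_settings),
        retrieval_snapshot_hash=snapshot_hash,
        output=output,
        output_hash=sha256_hex(output),
        safety_check_passed=safety_check_passed,
        human_oversight=human_oversight,
    )


def verify_output(pack: EvidencePack) -> bool:
    """Check that the stored output still matches its recorded hash."""
    return sha256_hex(pack.output) == pack.output_hash
```

Storing a hash next to the output lets a later reviewer confirm that the text under review is the text the run actually produced, which is one practical reading of a reconstructable run.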

This makes the task testable in two ways. First, repeated runs can compare baseline prompting, structured prompting, retrieval augmentation, alignment, or stacked configurations using the same score profile. Second, reviewers can apply the same anchors (1 = missing, 3 = partial, 5 = audit-ready) to check whether the run is genuinely reconstructable. The academic-logic paper reports that, in a worked healthcare calibration, Auditability and Traceability scores can move from around 2 under uninstrumented prompting to roughly 4.8 to 5.0 when prompts, retrieval snapshots, adapter versions, and review logs are properly captured.
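The repeated-run comparison can be sketched as a per-pillar mean and spread for each configuration, followed by a difference in means. The function names and the use of a sample standard deviation as the spread measure are assumptions for illustration; the RAIDT papers may use a different statistical procedure, and any scores passed in are whatever reviewers recorded, not values supplied by this sketch.

```python
from statistics import mean, stdev

PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)


def score_profile(runs: list) -> dict:
    """Summarise repeated runs as pillar -> (mean score, sample std dev).

    Each run is a dict mapping every pillar name to a reviewer score
    on the 1-to-5 anchor scale.
    """
    profile = {}
    for pillar in PILLARS:
        scores = [run[pillar] for run in runs]
        # stdev needs at least two data points; report 0.0 for a single run.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        profile[pillar] = (mean(scores), spread)
    return profile


def compare(baseline_runs: list, candidate_runs: list) -> dict:
    """Return pillar -> (mean difference, baseline spread, candidate spread)."""
    base = score_profile(baseline_runs)
    cand = score_profile(candidate_runs)
    return {
        p: (cand[p][0] - base[p][0], base[p][1], cand[p][1])
        for p in PILLARS
    }
```

Reporting the spread alongside the difference keeps a single favourable run from being read as an improvement; a gain that is small relative to the run-to-run spread would call for more runs before drawing a conclusion.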
