Q157 — What did the empirical programme test?

← RAIDT · Star S10 - Empirical Programme, Domains and Sector Playbooks · primary item: S10.01 · Empirical programme

Appears in sources

integrated_82#Q4.6

Answer

The empirical programme tested a theory-led set of propositions about how specific influence methods alter governance readiness when runs are properly evidenced. It tested whether RAG improves Traceability and Auditability when retrieval snapshots are stored; whether complete run-level evidence packs improve Auditability because runs become reconstructable; whether LoRA/PEFT improves Dependability by reducing instability across repeats; whether structured prompting improves Interpretability by forcing clearer structure and uncertainty signalling; whether RLHF-type alignment improves Responsibility by encouraging safer, policy-aligned behaviour; and whether stacked configurations produce the strongest composite outcomes by combining behavioural control with richer evidence capture.

It also tested a broader explanatory claim about what drives RAIDT outcomes: that no single method maximises all five pillars at once, and that observed differences are explained more by configuration and evidence practices than by model vendor choice within the tested capability band. To do this, the study evaluated 280 scenario-configuration cases across fourteen domains, repeated each case 10 to 12 times, captured a run-level evidence pack for every execution, and scored each run on the five pillars using stable anchors. The empirical programme therefore tested both the RAIDT rubric as a practical measurement instrument and the substantive proposition that influence methods, treated as governance interventions, change what can be audited, interpreted, depended upon, and traced in organisational GenAI use.
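The design described above implies a simple bookkeeping structure: each case is run several times, each run is scored on five pillars, and the repeats are summarised per case. The following is a minimal sketch of that arithmetic, not the study's actual tooling. The function names, the dictionary layout, and the use of a numeric score per pillar are all illustrative assumptions; the source states only that runs were scored "using stable anchors" and does not give the scale.

```python
from statistics import mean, pstdev

# Pillar names as used in the text above.
PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)

# Figures taken from the text: 280 cases, each repeated 10 to 12 times.
CASES = 280
MIN_REPEATS, MAX_REPEATS = 10, 12


def run_count_bounds(cases=CASES, lo=MIN_REPEATS, hi=MAX_REPEATS):
    """Return the (minimum, maximum) number of runs implied by the design."""
    return cases * lo, cases * hi


def summarise_case(runs):
    """Summarise repeated runs of one case (hypothetical helper).

    `runs` is a list of dicts mapping pillar name -> numeric score.
    The numeric scale is an assumption made for illustration.
    """
    summary = {}
    for pillar in PILLARS:
        scores = [run[pillar] for run in runs]
        summary[pillar] = {
            "mean": mean(scores),
            # Population standard deviation across repeats, used here as
            # one possible indicator of instability between runs.
            "spread": pstdev(scores),
        }
    return summary


if __name__ == "__main__":
    print(run_count_bounds())
```

Under these assumptions the design implies between 2,800 and 3,360 scored runs in total. Summarising per case rather than per run keeps the repeat-to-repeat spread visible, which matters for a pillar such as Dependability that is defined in terms of instability across repeats.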

Practical example

A clear illustration is the public-service eligibility scenario associated with RAG. The programme tested whether retrieval actually improves governance, not merely whether it produces more convincing advice. If a run retrieves policy clauses but fails to preserve the retrieval snapshot, the advice may sound grounded while remaining hard to reconstruct or contest. In RAIDT terms, that should limit Auditability and Traceability.

By contrast, when the run-level evidence pack stores the exact passages, identifiers, and versions used, an auditor can reconstruct the basis of the advice and check whether the output matched the approved rule set. The test is therefore about governable evidence, not just answer quality.
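The contrast between the two runs can be sketched as a data structure plus two checks. This is an illustrative sketch only: the class names, field names, and helper functions are hypothetical and do not come from the RAIDT papers. It assumes an evidence pack records, for each retrieved passage, an identifier, a version, and the exact text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetrievedPassage:
    """One stored passage in a retrieval snapshot (hypothetical schema)."""
    doc_id: str
    version: str
    text: str


@dataclass
class EvidencePack:
    """Run-level evidence pack (hypothetical schema)."""
    run_id: str
    output: str
    retrieval_snapshot: list = field(default_factory=list)


def is_reconstructable(pack):
    """True if the snapshot was stored and every passage is fully identified."""
    if not pack.retrieval_snapshot:
        return False
    return all(p.doc_id and p.version and p.text for p in pack.retrieval_snapshot)


def uses_only_approved(pack, approved):
    """True if every stored passage comes from an approved (doc_id, version) pair."""
    return all((p.doc_id, p.version) in approved for p in pack.retrieval_snapshot)


if __name__ == "__main__":
    # Example identifiers and text below are invented for illustration.
    approved = {("eligibility-policy", "v3")}
    stored = EvidencePack(
        run_id="run-001",
        output="Applicant appears eligible under clause 4.",
        retrieval_snapshot=[
            RetrievedPassage("eligibility-policy", "v3", "Clause 4: ...")
        ],
    )
    missing = EvidencePack(run_id="run-002", output="Applicant appears eligible.")
    print(is_reconstructable(stored), uses_only_approved(stored, approved))
    print(is_reconstructable(missing))
```

The two outputs in this example may read equally well, but only the first run gives an auditor something to check against the approved rule set. The second run would be the case that, in RAIDT terms, should limit Auditability and Traceability.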

Sources in RAIDT papers

09-RAIDT_Empirical_M_V50.docx