Q158 — What were the main empirical findings?

← RAIDT · Star S10 - Empirical Programme, Domains and Sector Playbooks · primary item: S10.06 · Governance readiness as outcome

Appears in sources

integrated_82#Q4.7

Answer

The main empirical finding was that governance readiness can be measured as an outcome in its own right when the run is treated as the unit of governance and assessed through a run-level evidence pack rather than through output fluency alone. In the cross-domain validation, 280 scenario-configuration cases across fourteen domains were repeated 10-12 times, and each run was scored against the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability) using the anchors 1 = missing, 3 = partial, 5 = audit-ready. Across this programme, baseline prompting often produced plausible text, but its score profile was typically weakened by missing versioning, absent source linkage, and limited reconstructability. This is the core empirical contribution: a run could appear competent at task level while still being weakly governable.
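As an illustration of what a run-level score profile might look like in practice, the following is a minimal sketch rather than the RAIDT scoring instrument itself. The class name `RunScore`, the helper `anchor_label`, the "weak pillar" threshold of 3, the acceptance of intermediate scores 2 and 4, and the example scores are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The five pillars and the three scoring anchors named in the answer above.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")
ANCHORS = {1: "missing", 3: "partial", 5: "audit-ready"}


def anchor_label(value):
    """Return the anchor wording for a score, if the score sits on an anchor."""
    return ANCHORS.get(value, "between anchors")


@dataclass(frozen=True)
class RunScore:
    """Hypothetical container for one run's pillar scores."""
    run_id: str
    scores: dict  # pillar name -> integer score

    def __post_init__(self):
        if set(self.scores) != set(PILLARS):
            raise ValueError("scores must cover exactly the five pillars")
        for pillar, value in self.scores.items():
            # Assumption: 2 and 4 are allowed as values between the anchors.
            if not 1 <= value <= 5:
                raise ValueError(f"{pillar} score {value} is outside the 1-5 scale")

    def weak_pillars(self, threshold=3):
        """Pillars scoring below the (assumed) threshold, in pillar order."""
        return [p for p in PILLARS if self.scores[p] < threshold]


# Illustrative scores only, not data from the study: a fluent baseline run
# with no versioning and no source linkage.
baseline = RunScore("run-001", {
    "Responsibility": 3, "Auditability": 1, "Interpretability": 3,
    "Dependability": 3, "Traceability": 1,
})
print(baseline.weak_pillars())
```

A profile of this shape makes the distinction in the finding visible: the run is not scored on whether the text reads well, only on whether each pillar is evidenced.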

A second finding was that influence methods, treated as governance interventions, shifted different pillars in patterned ways. Structured prompting improved interpretability and, to a lesser extent, responsibility, but did not repair audit gaps on its own. RAG improved auditability and traceability when retrieval snapshots were stored. LoRA/PEFT improved dependability and reduced dispersion, provided adapter lineage and training provenance were logged. RLHF-type conditions strengthened responsibility, especially in safety-sensitive settings, but could weaken interpretability or traceability when refusal logic was opaque. The strongest overall outcome came from stacked configurations, which reduced trade-offs and produced the highest composite readiness. In the ageing-society healthcare instantiation, this pattern was especially clear: the baseline composite score was 3.2, while fully instrumented configurations reached 5.0, and auditability and traceability rose from around 2.0 under baseline prompting to 4.8-5.0 when evidence capture was complete. Empirically, then, governance readiness depended more on evidencing and configuration discipline than on model branding.
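To make the comparison across configurations concrete, the sketch below aggregates repeated runs of one configuration into per-pillar means, per-pillar dispersion, and a composite. It is a sketch under stated assumptions: the source does not say how the composite is calculated, so an unweighted mean of pillar means is assumed here, dispersion is taken as the population standard deviation, and the function name `summarise` and the run scores are hypothetical.

```python
from statistics import mean, pstdev

PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


def summarise(runs):
    """Summarise repeated runs of a single configuration.

    runs: list of dicts mapping pillar name -> score for one run each.
    Returns (pillar_means, pillar_spread, composite).
    """
    pillar_means = {p: mean([r[p] for r in runs]) for p in PILLARS}
    # Dispersion per pillar across repetitions (population standard deviation).
    pillar_spread = {p: pstdev([r[p] for r in runs]) for p in PILLARS}
    # Assumption: composite readiness is the unweighted mean of pillar means.
    composite = mean(list(pillar_means.values()))
    return pillar_means, pillar_spread, composite


# Hypothetical scores for two repetitions of a baseline-prompting case.
baseline_runs = [
    {"Responsibility": 3, "Auditability": 2, "Interpretability": 4,
     "Dependability": 3, "Traceability": 2},
    {"Responsibility": 3, "Auditability": 2, "Interpretability": 4,
     "Dependability": 5, "Traceability": 2},
]

means, spread, composite = summarise(baseline_runs)
print(means, spread, composite)
```

Reporting the spread alongside the mean matters for claims such as "reduced dispersion": two configurations can share a mean Dependability score while differing in how consistently they reach it.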

Practical example

In a healthcare note-summarisation workflow, a hospital might first deploy a prompt-only assistant to draft a discharge summary from a clinical note. The text may read well, yet the governance team cannot properly review it because the run-level evidence pack contains only the prompt and output, with no retrieval snapshot, no template identifier, and no adapter or alignment version. Under RAIDT, the score profile would therefore show weak Auditability and Traceability even if clinicians judged the prose acceptable.

If the same workflow is rebuilt with a structured template, a stored retrieval snapshot of clinical guidance, versioned PEFT/LoRA artefacts, and logged reviewer checks, the run becomes materially more governable. A safety committee can reconstruct what evidence was supplied, what configuration was active, and why uncertainty or escalation language appeared. That is the practical significance of the findings: the organisation is not merely asking whether the model wrote a fluent summary, but whether the run was evidenced enough to be reviewed, challenged, and governed as an audit-ready event.
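The contrast between the two versions of the workflow can be sketched as a completeness check on the run-level evidence pack. This is an illustrative sketch only: the class `EvidencePack`, its field names, the function `missing_evidence`, and all identifier values are hypothetical and are not a schema taken from the RAIDT papers.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvidencePack:
    """Hypothetical record of what was captured for one run."""
    prompt: str
    output: str
    template_id: Optional[str] = None
    retrieval_snapshot_id: Optional[str] = None
    adapter_version: Optional[str] = None
    alignment_version: Optional[str] = None
    reviewer_checks: tuple = ()


def missing_evidence(pack):
    """Names of evidence fields that are empty or absent, in field order."""
    return [f.name for f in fields(pack) if not getattr(pack, f.name)]


# Prompt-only deployment: only the prompt and the output were stored.
prompt_only = EvidencePack(
    prompt="Summarise this clinical note for discharge.",
    output="Draft discharge summary text.",
)

# Rebuilt workflow: every identifier below is a made-up placeholder.
rebuilt = EvidencePack(
    prompt="Summarise this clinical note for discharge.",
    output="Draft discharge summary text.",
    template_id="discharge-template-v3",
    retrieval_snapshot_id="guidance-snapshot-0001",
    adapter_version="lora-adapter-1.2",
    alignment_version="alignment-config-4",
    reviewer_checks=("clinician sign-off",),
)

print(missing_evidence(prompt_only))
print(missing_evidence(rebuilt))
```

A check like this does not judge the summary's prose at all; it only reports whether a reviewer would have the material needed to reconstruct the run.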

Sources in RAIDT papers

09-RAIDT_Empirical_M_V50
20-RAIDT_AgeingSoc_M_V50