Q093 — What is the empirical programme testing in RAIDT?

It tests whether run-level evidence and scoring can distinguish stronger from weaker governed GenAI use.

Appears in sources

qa_deck_100#slide 95 · Empirical programme, calibration, procurement, and assurance

Answer

The empirical programme in RAIDT is testing two linked claims. First, it tests a measurement claim: that governance readiness can be observed from a run-level evidence pack rather than inferred from policy statements, model cards, or vendor assurances. For that reason, RAIDT treats the run as the unit of governance and scores each run across the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability). Each run receives a score profile using the anchors 1=missing / 3=partial / 5=audit-ready, with a composite mean reported only as a secondary summary. The core question is whether those scores remain meaningful and comparable across domains, scenarios, influence configurations, and repeated executions.
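As an illustration only, the score profile described above can be sketched as a small data structure. The class name `RunScore`, the run identifier, and the example scores are hypothetical and not taken from the RAIDT papers; the sketch also assumes that intermediate integer scores (2 and 4) are permitted between the stated anchors.

```python
from dataclasses import dataclass
from statistics import mean

# The five RAIDT pillars, in the order named in the text.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")

# Anchors stated in the text; scores 2 and 4 are an assumption of this sketch.
ANCHORS = {1: "missing", 3: "partial", 5: "audit-ready"}


@dataclass(frozen=True)
class RunScore:
    """Hypothetical container for one scored run (the unit of governance)."""
    run_id: str
    scores: dict  # pillar name -> integer score in 1..5

    def __post_init__(self):
        missing = set(PILLARS) - set(self.scores)
        if missing:
            raise ValueError(f"unscored pillars: {sorted(missing)}")
        for pillar, value in self.scores.items():
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{pillar}: score must be an integer 1..5")

    def profile(self):
        """Primary output: the per-pillar score profile."""
        return tuple(self.scores[p] for p in PILLARS)

    def composite(self):
        """Secondary summary only: the mean of the five pillar scores."""
        return mean(self.profile())


# Illustrative (made-up) scores for a single run.
run = RunScore("run-001", {
    "Responsibility": 3,
    "Auditability": 1,
    "Interpretability": 3,
    "Dependability": 5,
    "Traceability": 1,
})
print(run.profile(), run.composite())
```

Keeping `profile()` as the primary output and `composite()` as a separate method mirrors the text's point that the mean is a secondary summary: two runs with the same mean can have very different pillar profiles.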

Second, the programme tests a governance mechanism claim: that influence methods, treated as governance interventions, change what can be reconstructed, reviewed, and contested after a run has occurred. The study therefore varies baseline prompting, structured prompting, retrieval-augmented generation (RAG), LoRA/PEFT, RLHF-type alignment, and stacked configurations while keeping the scenario logic and scoring rubric stable. It asks whether these configurations shift pillar-level governance readiness in interpretable ways: for example, whether RAG raises Auditability and Traceability when retrieval snapshots are stored, whether LoRA/PEFT improves Dependability by reducing dispersion across repeated executions, whether RLHF-type alignment raises Responsibility, and whether stacked configurations reduce cross-pillar trade-offs. In RAIDT terms, the empirical programme is not primarily testing raw model capability; it is testing whether governance readiness becomes measurable from recorded run evidence across organisational settings.
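One way to picture this comparison is a per-pillar summary over repeated runs of each configuration. The sketch below is an assumption about how such a comparison could be computed, not the analysis used in the RAIDT papers: the function names are hypothetical, the scores are invented for illustration, and population standard deviation is used here as one possible dispersion measure.

```python
from statistics import mean, pstdev

PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


def pillar_summary(runs):
    """Return pillar -> (mean score, dispersion) over repeated runs.

    `runs` is a list of dicts mapping pillar name to score. Dispersion is
    the population standard deviation (an assumption of this sketch).
    """
    summary = {}
    for pillar in PILLARS:
        values = [run[pillar] for run in runs]
        summary[pillar] = (mean(values), pstdev(values))
    return summary


def pillar_shift(baseline_runs, treated_runs):
    """Return pillar -> change in mean score relative to the baseline."""
    base = pillar_summary(baseline_runs)
    treated = pillar_summary(treated_runs)
    return {p: treated[p][0] - base[p][0] for p in PILLARS}


# Invented scores: two repeated runs per configuration, same scenario and rubric.
baseline = [
    {"Responsibility": 3, "Auditability": 1, "Interpretability": 3,
     "Dependability": 3, "Traceability": 1},
    {"Responsibility": 3, "Auditability": 1, "Interpretability": 3,
     "Dependability": 1, "Traceability": 1},
]
rag_with_snapshots = [
    {"Responsibility": 3, "Auditability": 3, "Interpretability": 3,
     "Dependability": 3, "Traceability": 5},
    {"Responsibility": 3, "Auditability": 5, "Interpretability": 3,
     "Dependability": 3, "Traceability": 5},
]

shift = pillar_shift(baseline, rag_with_snapshots)
for pillar in PILLARS:
    print(f"{pillar}: {shift[pillar]:+}")
```

Reporting the shift per pillar rather than as a single number keeps cross-pillar trade-offs visible, which is what the mechanism claim needs in order to be tested.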

Practical example

A concrete example is healthcare note summarisation. A clinic can run the same triage note through different configurations and compare the resulting score profiles. A prompt-only run may look clinically fluent, but if the organisation keeps only the final text, the run-level evidence pack is thin and the run scores weakly on Auditability and Traceability. A more governed run adds a structured prompt, logged safety constraints, retrieval snapshots where supporting material is used, versioned adapters where fine-tuning is used, and reviewer checks.

The empirical programme then asks whether the higher RAIDT profile comes from genuinely better evidencing rather than from better prose alone. In the healthcare materials, the decisive issue is not whether the summary sounds polished, but whether escalation triggers, uncertainty statements, versions, and review steps are visible enough for later reconstruction and challenge.
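The contrast between the prompt-only run and the more governed run can be made concrete with a simple completeness check on the evidence pack. This is a minimal sketch under stated assumptions: the `EvidencePack` fields and the `missing_artefacts` helper are hypothetical names chosen for illustration, and the check only covers the artefacts listed in the example above, not the full RAIDT rubric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidencePack:
    """Hypothetical run-level evidence pack; field names are illustrative."""
    final_text: str
    structured_prompt: Optional[str] = None
    safety_constraints_log: Optional[str] = None
    retrieval_used: bool = False
    retrieval_snapshot: Optional[str] = None
    fine_tuning_used: bool = False
    adapter_version: Optional[str] = None
    reviewer_check: Optional[str] = None


def missing_artefacts(pack):
    """List artefacts that a later reviewer would need but cannot find."""
    missing = []
    if not pack.structured_prompt:
        missing.append("structured_prompt")
    if not pack.safety_constraints_log:
        missing.append("safety_constraints_log")
    # Conditional artefacts: only required when the technique was used.
    if pack.retrieval_used and not pack.retrieval_snapshot:
        missing.append("retrieval_snapshot")
    if pack.fine_tuning_used and not pack.adapter_version:
        missing.append("adapter_version")
    if not pack.reviewer_check:
        missing.append("reviewer_check")
    return missing


# Prompt-only run: only the final text was kept.
prompt_only = EvidencePack(final_text="Summary of triage note")

# More governed run: placeholder values stand in for stored artefacts.
governed = EvidencePack(
    final_text="Summary of triage note",
    structured_prompt="stored prompt template",
    safety_constraints_log="logged constraints",
    retrieval_used=True,
    retrieval_snapshot="stored retrieval snapshot",
    reviewer_check="reviewer sign-off record",
)

print(missing_artefacts(prompt_only))
print(missing_artefacts(governed))
```

A check like this says nothing about how fluent the summary is; it only records whether the artefacts needed for later reconstruction and challenge were retained, which is the distinction the example draws.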

Sources in RAIDT papers

09-RAIDT_Empirical_M_V50.docx
20-RAIDT_AgeingSoc_M_V50