Empirical Programme, Domains and Sector Playbooks

flowchart LR
    A[Governance gap in GenAI use] --> B[RAIDT run-level governance]
    B --> C[Star S10 empirical programme]
    C --> D[Domains and scenarios]
    C --> E[Configurations and repeats]
    D --> F[Evidence packs]
    E --> F
    F --> G[Five-pillar score profile]
    G --> H[Governance readiness]
    H --> I[Sector playbooks and policy]

<- Circle 3 - Academic, adoption and boundary layer

Ring: Evidence, validation and sector adoption star

Function

Defines how RAIDT is empirically tested, comparatively assessed, and translated into domain-sensitive sector playbooks. This star turns RAIDT from a conceptual framework into an evidence-bearing programme for governing concrete GenAI runs in organisational work.

Role in the project

This star sits at the junction of empirical validation, sector application, and implementation design. It shows how RAIDT moves from foundational claims about the run as the unit of governance to testable study designs, measurable outcomes, and practical adoption guidance. In project terms, S10 links foundational theory, empirical validation, evidence packs, scoring, and policy translation.

It therefore contributes to evidence, implementation, scoring, governance interventions, and adoption strategy.

Items in this star (16)
Main message

RAIDT treats the run as the unit of governance for generative AI in organisational work. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. It includes the prompt or instruction, the model and tool configuration, retrieved context where RAG is used, the output produced, and the human or automated checks applied before action is taken. This matters because risk does not arise only from the model in the abstract. It arises when a model is used for a real organisational purpose with particular stakes, constraints, users, and consequences.

Star S10 explains how that claim is tested. It is the part of the RAIDT project that asks what kind of empirical programme is needed if run-level governance is to be more than a plausible idea. In other words, S10 is the project's validation architecture. It sets out how RAIDT can be examined across domains, scenarios, and configurations so that the framework's claims are supported by evidence rather than assertion.

The governance problem is clear. Many responsible AI discussions remain too general to guide day-to-day use. Organisations may endorse accountability, transparency, safety, or fairness, but still struggle to show how those principles apply to one concrete use of a chatbot, drafting assistant, retrieval system, or decision-support tool. At the same time, technical evaluations often focus on model benchmarks, prompt performance, RLHF-style alignment controls, or parameter-efficient fine-tuning such as LoRA and related PEFT approaches. Those issues matter, but they do not by themselves show whether a specific organisational use is governable.

RAIDT responds by shifting attention to the run. A run-level evidence pack captures what was asked, how the system was configured, what contextual material was retrieved, what output was generated, and what checks were applied. The five-pillar score profile then evaluates that run through Responsibility, Auditability, Interpretability, Dependability, and Traceability. S10 matters because it shows how these two outputs are validated across varied conditions rather than assumed to work everywhere in the same way.
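The contents of a run-level evidence pack can be pictured as a small record type. The sketch below is illustrative only: the class name `EvidencePack`, its field names, and the example values are assumptions made for this note, not a schema defined by RAIDT.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvidencePack:
    """One run-level record. Field names are illustrative, not RAIDT's schema."""

    prompt: str                     # what was asked
    model_config: dict              # how the system was configured
    retrieved_context: tuple = ()   # retrieved material; empty when RAG is off
    output: str = ""                # what the system generated
    checks: tuple = ()              # human or automated checks applied
    recorded_at: str = field(       # UTC timestamp of when the record was made
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def uses_rag(self) -> bool:
        """True when at least one retrieved item was captured for the run."""
        return len(self.retrieved_context) > 0


# Hypothetical example values, for illustration only.
pack = EvidencePack(
    prompt="Summarise the attached procurement policy for new staff.",
    model_config={"model": "example-model-v1", "temperature": 0.2},
    retrieved_context=("policy_2024.pdf#section-3",),
    output="(generated summary would be stored here)",
    checks=("reviewed by line manager",),
)
```

Keeping the record immutable (`frozen=True`) is a design choice in this sketch: a run that has been captured should not be silently edited afterwards.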

An empirical programme is necessary because governance claims are comparative. It is not enough to demonstrate one successful run. The project must examine what happens when the same task is repeated, when prompts are varied, when RAG is introduced or removed, when tool settings change, when human review becomes stronger or weaker, and when the organisational domain changes. S10 therefore treats empirical diversity as a requirement of good governance research, not as an optional add-on.

The item structure in this star makes that logic explicit. Fourteen domains with twenty scenarios each give 280 scenarios, which creates breadth. Six configurations create controlled variation in model, prompt, retrieval, and checking conditions. Repeated runs allow the project to observe stability, drift, inconsistency, and failure patterns over time. Governance readiness is treated as an outcome, which is important because RAIDT is not simply a record-keeping exercise. The framework should show whether a run can be justified, reviewed, challenged, and traced sufficiently for organisational use.
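The size of this design can be worked out directly. The sketch below assumes a fully crossed design (every scenario run under every configuration) and uses a placeholder of three repeats per condition; the number of repeats is not stated in this note, so `REPEATS` is an assumption to be replaced with the study's real value.

```python
from itertools import product

N_DOMAINS = 14
SCENARIOS_PER_DOMAIN = 20
N_CONFIGURATIONS = 6
REPEATS = 3  # assumption: the note does not specify the number of repeats


def design_cells(repeats: int = REPEATS):
    """Yield one dict per planned run in a fully crossed design."""
    for domain, scenario, config, repeat in product(
        range(1, N_DOMAINS + 1),
        range(1, SCENARIOS_PER_DOMAIN + 1),
        range(1, N_CONFIGURATIONS + 1),
        range(1, repeats + 1),
    ):
        yield {
            "domain": domain,
            "scenario": scenario,
            "configuration": config,
            "repeat": repeat,
        }


# 14 x 20 = 280 scenarios; 280 x 6 = 1680 distinct conditions.
unique_conditions = N_DOMAINS * SCENARIOS_PER_DOMAIN * N_CONFIGURATIONS
# With the placeholder of 3 repeats: 1680 x 3 = 5040 planned runs.
total_runs = sum(1 for _ in design_cells())
```

Even with a modest number of repeats the programme involves thousands of runs, which is one reason the evidence pack needs to be captured in a consistent, machine-readable way.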

Sector playbooks translate these findings into practice. A playbook is not just a checklist for one industry. It is a structured interpretation layer that shows how the same RAIDT logic needs to be applied differently when the stakes, evidence requirements, and regulatory pressures change. In healthcare, evidence quality, provenance of retrieved guidance, and escalation rules may be central. In finance, audit trails, control ownership, and retention requirements may dominate. In law and public services, contestability and procedural fairness become especially important. In crisis and emergency response, time pressure and incomplete information mean that uncertainty has to be governed explicitly rather than hidden.

This is where the five pillars become operational. Responsibility concerns who initiated, reviewed, or approved the run and whether accountability is clear. Auditability concerns whether another person can reconstruct what happened after the event. Interpretability concerns whether the basis, assumptions, and limits of the output can be understood well enough for responsible use. Dependability concerns whether similar runs behave consistently enough for the intended organisational purpose. Traceability concerns whether the lineage of prompt, model, context, tool use, output, and subsequent action can be followed. S10 gives the empirical basis for asking whether those pillars are robust across settings rather than merely desirable in theory.
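A five-pillar score profile can be sketched as a record with one score per pillar. The 0 to 4 scale, the class name `PillarProfile`, and the `weakest` helper are assumptions made for illustration; this note does not define RAIDT's actual scoring scale or rubric.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PillarProfile:
    """Scores for one run. The 0-4 scale is an assumption, not RAIDT's rubric."""

    responsibility: int
    auditability: int
    interpretability: int
    dependability: int
    traceability: int

    def __post_init__(self):
        # Reject scores outside the assumed scale.
        for name, value in asdict(self).items():
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be between 0 and 4, got {value}")

    def weakest(self) -> list[str]:
        """Return the pillar(s) with the lowest score, in pillar order."""
        scores = asdict(self)
        low = min(scores.values())
        return [name for name, value in scores.items() if value == low]


# Hypothetical scores for a single run.
profile = PillarProfile(
    responsibility=3,
    auditability=2,
    interpretability=4,
    dependability=2,
    traceability=3,
)
weak_pillars = profile.weakest()
```

The profile is kept as five separate scores rather than collapsed into one number, so that a weak pillar stays visible when runs are compared across domains.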

The star also matters for standards and policy alignment. Instruments such as the EU AI Act, ISO/IEC 42001, and the NIST AI RMF all expect organisations to show evidence of oversight, documentation, accountability, and risk management. Yet these instruments do not themselves provide a run-level evidencing method for everyday generative AI use. RAIDT's empirical programme can therefore make a methodological contribution by showing how broad governance expectations are instantiated in concrete runs, evidence packs, and score profiles. That is one reason S10 is central to the policy pathway in Paper 10.

S10 is equally important for the project's treatment of uncertainty. Generative AI is probabilistic, prompt-sensitive, and often unstable across model versions, retrieval states, or operational contexts. Managerial uncertainty is therefore not a side issue; it is built into the governance problem itself. By comparing repeated runs across domains and configurations, the empirical programme can identify where uncertainty is tolerable, where stronger controls are required, and where use should be restricted or redesigned.

There are limits. No empirical programme can prove universal validity across all future models, domains, or regulatory conditions. Sector playbooks may need regular updating as tools and expectations evolve. Scenario design can also bias what appears governable. For that reason, S10 should be read as a structured validation strategy rather than a claim that one study settles the governance question. Its value lies in cumulative evidence: well-defined runs, transparent evidence packs, defensible scoring, and careful sector interpretation.

For supervisors, this star demonstrates that RAIDT is not only a conceptual framework. It contains a credible route from foundations to empirical validation, and from validation to implementation and policy. In short, S10 explains how RAIDT earns its claims.

Key questions and answers

Q1. What is the empirical programme in RAIDT?

Answer:
The empirical programme is the structured research design used to test whether RAIDT works across multiple domains, scenarios, and configurations. It specifies how runs are sampled, repeated, compared, and evaluated so that the framework can be judged with evidence rather than rhetoric.

Practical example:
A researcher runs the same drafting task in healthcare, finance, and education, with and without RAG, and compares the resulting evidence packs and pillar scores.

Link to RAIDT:
This is the process through which the run-level evidence pack and the five-pillar profile become empirical instruments.

Q2. Why does RAIDT need multi-domain testing?

Answer:
Governance quality is context-dependent. A configuration that is acceptable for low-stakes educational support may be unsuitable for legal advice or clinical support. Multi-domain testing shows what remains stable in RAIDT and what must be adapted through sector playbooks.

Practical example:
A summarisation assistant may look dependable in internal training but become unacceptable when used for citizen advice because the burden of explanation and contestability is much higher.

Link to RAIDT:
Cross-domain testing helps calibrate evidence requirements, score thresholds, and governance interventions.

Q3. Why are repeated runs important?

Answer:
Repeated runs reveal consistency, volatility, and hidden failure modes. One good output does not show that a system is dependable for organisational use.

Practical example:
A procurement support prompt is run ten times over several weeks. Citation quality and confidence language vary noticeably, showing a level of instability that a single demonstration would miss.

Link to RAIDT:
Repeated runs directly strengthen the Dependability pillar and make the evidence pack more credible.
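The procurement example can be made concrete with a simple spread calculation over repeated runs. The ten scores below are hypothetical, the 0 to 4 rating scale is an assumption, and the `max_spread` threshold of 0.5 is an arbitrary placeholder rather than a RAIDT value.

```python
from statistics import mean, pstdev


def dependability_summary(scores, max_spread=0.5):
    """Summarise how much a rated property varies across repeated runs.

    max_spread is a placeholder threshold, not a value defined by RAIDT.
    """
    spread = pstdev(scores)  # population standard deviation of the runs
    return {
        "mean": mean(scores),
        "spread": spread,
        "range": max(scores) - min(scores),
        "stable": spread <= max_spread,
    }


# Hypothetical citation-quality ratings (0-4) for ten runs of one prompt.
citation_quality = [4, 4, 2, 4, 3, 1, 4, 4, 2, 4]
summary = dependability_summary(citation_quality)
```

In this illustration the mean looks acceptable, but the spread is well above the placeholder threshold, which is the pattern a single demonstration would hide.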

Q4. What problem do sector playbooks solve?

Answer:
Sector playbooks solve the translation problem between general governance principles and domain-specific implementation. They show how the same RAIDT framework should be applied when stakes, actors, and rules differ.

Practical example:
A finance playbook may require stronger logging and formal sign-off than an education playbook because of audit and compliance expectations.

Link to RAIDT:
Playbooks preserve the common run-level structure while adapting controls and thresholds to sector conditions.
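One way to picture a playbook is as a set of sector overrides applied to a shared baseline of controls. The control names and values below are hypothetical and only mirror the finance versus education contrast in the example above; they are not taken from an actual RAIDT playbook.

```python
# Hypothetical baseline controls shared by every sector.
BASE_CONTROLS = {
    "logging": "standard",
    "sign_off": "informal",
}

# Hypothetical sector overrides; an empty dict means "use the baseline".
PLAYBOOK_OVERRIDES = {
    "finance": {"logging": "full", "sign_off": "formal"},
    "education": {},
}


def controls_for(sector: str) -> dict:
    """Return the baseline controls with the sector's overrides applied."""
    if sector not in PLAYBOOK_OVERRIDES:
        raise KeyError(f"no playbook defined for sector: {sector}")
    return {**BASE_CONTROLS, **PLAYBOOK_OVERRIDES[sector]}


finance_controls = controls_for("finance")
education_controls = controls_for("education")
```

The set of control fields stays the same in every sector; only the required strength of each control changes.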

Q5. How does this star connect to the run-level evidence pack?

Answer:
S10 defines the empirical conditions under which evidence packs are gathered and compared. It clarifies what should be captured for each run and how those records support validation, audit, and governance review.

Practical example:
For each crisis-response run, the evidence pack stores the prompt, model version, retrieved advisories, output, reviewer notes, timestamps, and action taken.

Link to RAIDT:
Without S10, the evidence pack is only a design idea. With S10, it becomes a repeatable empirical instrument.

Q6. How does S10 support RAIDT scoring?

Answer:
Scoring becomes useful only when it is calibrated against variation in task type, domain risk, and execution conditions. S10 provides the comparative evidence needed to judge whether a score is stable, sensitive, and decision-relevant.

Practical example:
RAG may improve Traceability in public services because sources are logged, but reduce Interpretability if the retrieved material is opaque to frontline users.

Link to RAIDT:
This star ensures that scoring is tied to governance decisions and interventions rather than cosmetic measurement.

Q7. What is governance readiness as an outcome?

Answer:
Governance readiness means that a run is sufficiently evidenced, understandable, reviewable, and controllable for organisational deployment. It shifts attention from output quality alone to practical governability.

Practical example:
A model may draft fluent HR guidance, but if no reviewer is assigned and no retrieval sources are documented, the run is not governance-ready.

Link to RAIDT:
Governance readiness is a downstream interpretation of the evidence pack and the five-pillar profile.
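The HR guidance example can be sketched as a simple gap check on a run record. This covers only the evidence side of readiness and ignores pillar scores; the function names and dictionary keys are assumptions made for illustration, not RAIDT's readiness criteria.

```python
def readiness_gaps(run: dict) -> list[str]:
    """List the evidence gaps that would block a run from being governance-ready.

    The keys checked here are illustrative, not RAIDT's actual criteria.
    """
    gaps = []
    if not run.get("reviewer"):
        gaps.append("no reviewer assigned")
    if run.get("uses_rag") and not run.get("retrieval_sources"):
        gaps.append("retrieval sources not documented")
    return gaps


def is_governance_ready(run: dict) -> bool:
    """A run is treated as ready only when no gaps are found."""
    return not readiness_gaps(run)


# Hypothetical run: fluent output, but nothing recorded about review or sources.
hr_run = {
    "output": "(fluent HR guidance draft)",
    "uses_rag": True,
    "reviewer": None,
    "retrieval_sources": [],
}
gaps = readiness_gaps(hr_run)
```

Returning the list of gaps, rather than only a yes or no answer, makes it clear which governance intervention is needed before the run can be used.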

Q8. Why is this star important for supervisors?

Answer:
Supervisors need to see that RAIDT has a coherent empirical pathway, not just a persuasive concept. S10 shows what will be tested, what counts as evidence, and how findings inform later papers and practical outputs.

Practical example:
During supervision, this star allows the researcher to explain sample design, outcome logic, and sector translation without collapsing into either abstract theory or tool-specific detail.

Link to RAIDT:
S10 is the bridge connecting foundational theory, empirical validation, evidence packs, scoring, and policy translation.

Conclusion

This star explains how RAIDT moves from theory into evidence. The project's core claim is that the run, rather than the model in the abstract, is the right unit of governance for organisational generative AI use. S10 shows how that claim can be tested. It sets out an empirical programme that varies domains, scenarios, and configurations, and repeats runs, so that the project can assess whether RAIDT's evidence pack and five-pillar score profile actually improve governance readiness.

Its importance is twofold. First, it gives Paper 09 a clear validation pathway by showing what counts as comparative evidence and how readiness can be treated as an outcome. Second, it gives Paper 10 a route into policy and sector application through playbooks for healthcare, finance, law, public services, cybersecurity, education, environment, crisis response, supply chains, and ageing-related contexts. This means S10 is the bridge between conceptual foundations and practical adoption.

Slides
Slide 1 - why this star matters

Purpose:
Frame S10 as the empirical bridge between RAIDT theory and practical adoption.

Key message:
S10 explains how RAIDT is tested, evidenced, and made usable across sectors.

Slide content:

  • RAIDT governs the run, not only the model
  • Empirical testing is needed to support that claim
  • S10 links validation, scoring, and sector application
  • It turns theory into an evidence programme

Speaker note:
Position S10 as the part of the project that shows how the run-level claim is tested across real organisational uses rather than assumed.

Visual idea:
Bridge diagram linking foundations, empirical validation, and sector adoption.

Link to RAIDT:
This slide situates S10 between the run-level evidence pack, the five-pillar score profile, and the adoption pathway.

Citation support to mention if asked:
Responsible AI governance literature and Information Systems work on technology-in-use.

Slide 2 - the governance problem

Purpose:
Explain the gap that motivates the empirical programme.

Key message:
High-level AI principles and technical benchmarks do not by themselves show whether a specific GenAI use is governable.

Slide content:

  • Principles are often too abstract for daily practice
  • Benchmarks miss organisational context
  • GenAI risk appears at point of use
  • RAIDT addresses governance at run level

Speaker note:
Emphasise that organisations govern concrete uses under time pressure, uncertainty, and domain constraints. That is why run-level evidence matters.

Visual idea:
Comparison graphic: abstract principles vs technical benchmarks vs run-level governance.

Link to RAIDT:
This slide motivates why evidence packs and pillar scores must be tied to specific runs.

Citation support to mention if asked:
Responsible AI critiques, socio-technical governance research, and GenAI risk studies.

Slide 3 - design of the empirical programme

Purpose:
Show how S10 operationalises RAIDT into a research design.

Key message:
RAIDT is validated through structured comparison across domains, scenarios, configurations, and repeated runs.

Slide content:

  • 14 domains
  • 20 scenarios per domain
  • 6 configurations
  • Repeated runs with governance readiness as outcome

Speaker note:
Breadth comes from domains and scenarios. Controlled variation comes from configurations. Repetition reveals instability and edge cases. The outcome is not only output quality but governability.

Visual idea:
Matrix showing domains x scenarios x configurations x repeats.

Link to RAIDT:
This is the empirical structure that generates comparable evidence packs and score profiles.

Citation support to mention if asked:
Comparative case design and repeated-measures evaluation literature.

Slide 4 - repeated runs and governance readiness

Purpose:
Explain why one successful demonstration is insufficient.

Key message:
Dependable governance requires evidence about consistency, variation, and uncertainty over time.

Slide content:

  • GenAI outputs can vary across runs
  • Prompt and RAG changes alter outcomes
  • Repetition exposes instability
  • Governance readiness depends on evidence, not fluency alone

Speaker note:
Connect uncertainty to managerial decision-making. If similar runs behave differently, governance needs stronger checks, narrower boundaries, or escalation rules.

Visual idea:
Simple plot showing score variation across repeated runs.

Link to RAIDT:
Repeated runs strengthen the Dependability pillar and improve the credibility of the evidence pack.

Citation support to mention if asked:
Studies on prompt sensitivity, uncertainty, model drift, and human oversight.

Slide 5 - sector playbooks

Purpose:
Explain how the same RAIDT logic is translated across sectors.

Key message:
Sector playbooks adapt RAIDT to different risk profiles, evidence demands, and regulatory pressures without losing the run-level structure.

Slide content:

  • Same framework, different sector emphasis
  • Healthcare: provenance and escalation
  • Finance: audit trails and controls
  • Public services and crisis response: contestability and uncertainty

Speaker note:
Stress that the playbook idea prevents RAIDT from becoming either too generic or too fragmented. The framework stays stable, but implementation guidance changes with domain conditions.

Visual idea:
Hub-and-spoke diagram with RAIDT at the centre and sector playbooks around it.

Link to RAIDT:
Sector playbooks are the adoption layer that translates run-level governance into practical organisational use.

Citation support to mention if asked:
Domain governance materials and sector-specific responsible AI guidance.

Slide 6 - policy, standards, and uncertainty

Purpose:
Show why S10 matters beyond the empirical study itself.

Key message:
S10 provides a practical route for aligning run-level governance with standards, regulation, and uncertainty management.

Slide content:

  • Supports documentation and oversight
  • Relevant to EU AI Act expectations
  • Can inform ISO/IEC 42001 and NIST AI RMF practices
  • Makes uncertainty visible at run level

Speaker note:
Clarify that RAIDT is not a substitute for regulation or standards. Its value is that it gives organisations a concrete evidencing method that policy frameworks can recognise.

Visual idea:
Standards mapping table linking evidence fields and pillars to governance requirements.

Link to RAIDT:
This slide supports Paper 10 and shows how evidence packs can serve assurance and policy translation.

Citation support to mention if asked:
EU AI Act materials, ISO/IEC 42001 documentation, NIST AI RMF guidance, and uncertainty literature.

Slide 7 - limits and project contribution

Purpose:
Close by clarifying what S10 does and does not claim.

Key message:
S10 does not prove universal validity, but it gives RAIDT a credible path from theory to evidence, scoring, and sector application.

Slide content:

  • Not universal or once-for-all validation
  • Sector playbooks will need updating
  • Scoring still requires judgement
  • Strong contribution: empirical pathway for RAIDT

Speaker note:
End by making the contribution precise. S10 does not settle all governance questions, but it gives the project a defensible empirical pathway that supervisors, reviewers, and practitioners can evaluate.

Visual idea:
Contribution map showing foundations -> validation -> policy -> sector use.

Link to RAIDT:
This slide summarises why S10 is essential to Papers 08, 09, and 10 and to the overall credibility of the project.

Citation support to mention if asked:
Methodology literature on scope conditions, external validity, and cumulative evidence.
