Empirical Programme, Domains and Sector Playbooks

#raidt/S10

flowchart LR
    A[Governance gap in GenAI use] --> B[RAIDT run-level governance]
    B --> C[Star S10 empirical programme]
    C --> D[Domains and scenarios]
    C --> E[Configurations and repeats]
    D --> F[Evidence packs]
    E --> F
    F --> G[Five-pillar score profile]
    G --> H[Governance readiness]
    H --> I[Sector playbooks and policy]

Circle: 3 - Academic, adoption and boundary layer

Ring: Evidence, validation and sector adoption star

Function

Defines how RAIDT is empirically tested, comparatively assessed, and translated into domain-sensitive sector playbooks. This star turns RAIDT from a conceptual framework into an evidence-bearing programme for governing concrete GenAI runs in organisational work.

Role in the project

This star sits at the junction of empirical validation, sector application, and implementation design. It shows how RAIDT moves from foundational claims about the run as the unit of governance to testable study designs, measurable outcomes, and practical adoption guidance. In project terms, S10 links:

Paper 08, where the conceptual and methodological foundations of run-level governance are established.
Paper 09, where the framework, evidence fields, and scores are empirically validated across scenarios and repeated runs.
Paper 10, where findings are translated into policy pathways, standards alignment, and sector-facing guidance.

It therefore contributes to evidence, implementation, scoring, governance interventions, and adoption strategy.

Main questions answered by this star

What does an empirical programme for RAIDT look like in practice?
Why does RAIDT need multi-domain and multi-scenario testing rather than a single proof-of-concept?
What governance problem is solved by comparing repeated runs across domains, configurations, and organisational contexts?
What evidence would show that RAIDT improves governance readiness rather than merely documenting AI use?
How do domains, scenarios, model configurations, and repeated runs feed into the run-level evidence pack?
How do empirical findings inform the five RAIDT pillars of Responsibility, Auditability, Interpretability, Dependability, and Traceability?
How do sector playbooks convert empirical findings into usable guidance for different forms of organisational work?
How does this star help supervisors judge whether RAIDT is methodologically coherent, empirically credible, and policy-relevant?

Workshop discussion prompts

10-20 min - Which design choices are necessary to show that RAIDT governs runs rather than merely describing them?
20-40 min - How should domains, scenarios, RAG use, model configurations, and human checks be varied so that the evidence base is credible across sectors?
40-60 min - What would count as persuasive evidence for supervisors, reviewers, and practitioners that RAIDT improves governance readiness, supports scoring, and informs sector playbooks?

Items in this star (16)

Main message

RAIDT treats the run as the unit of governance for generative AI in organisational work. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. It includes the prompt or instruction, the model and tool configuration, retrieved context where RAG is used, the output produced, and the human or automated checks applied before action is taken. This matters because risk does not arise only from the model in the abstract. It arises when a model is used for a real organisational purpose with particular stakes, constraints, users, and consequences.

Star S10 explains how that claim is tested. It is the part of the RAIDT project that asks what kind of empirical programme is needed if run-level governance is to be more than a plausible idea. In other words, S10 is the project's validation architecture. It sets out how RAIDT can be examined across domains, scenarios, and configurations so that the framework's claims are supported by evidence rather than assertion.

The governance problem is clear. Many responsible AI discussions remain too general to guide day-to-day use. Organisations may endorse accountability, transparency, safety, or fairness, but still struggle to show how those principles apply to one concrete use of a chatbot, drafting assistant, retrieval system, or decision-support tool. At the same time, technical evaluations often focus on model benchmarks, prompt performance, RLHF-style alignment controls, or parameter-efficient fine-tuning (PEFT) methods such as LoRA. Those issues matter, but they do not by themselves show whether a specific organisational use is governable.

RAIDT responds by shifting attention to the run. A run-level evidence pack captures what was asked, how the system was configured, what contextual material was retrieved, what output was generated, and what checks were applied. The five-pillar score profile then evaluates that run through Responsibility, Auditability, Interpretability, Dependability, and Traceability. S10 matters because it shows how these two outputs are validated across varied conditions rather than assumed to work everywhere in the same way.

An empirical programme is necessary because governance claims are comparative. It is not enough to demonstrate one successful run. The project must examine what happens when the same task is repeated, when prompts are varied, when RAG is introduced or removed, when tool settings change, when human review becomes stronger or weaker, and when the organisational domain changes. S10 therefore treats empirical diversity as a requirement of good governance research, not as an optional add-on.

The item structure in this star makes that logic explicit. Fourteen domains and twenty scenarios per domain create breadth. Six configurations create controlled variation in model, prompt, retrieval, and checking conditions. Repeated runs allow the project to observe stability, drift, inconsistency, and failure patterns over time. Governance readiness is treated as an outcome, which is important because RAIDT is not simply a record-keeping exercise. The framework should show whether a run can be justified, reviewed, challenged, and traced sufficiently for organisational use.
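The size of this design can be made concrete with a short enumeration. The sketch below is illustrative only: the counts come from this note, but the labels (`D01`, `SC01`, `C1`) are placeholders, and the number of repeats per cell is not stated here, so it is left as a parameter. The value `repeats=3` is an arbitrary assumption.

```python
from itertools import product

# Counts taken from this note; the labels themselves are placeholders.
N_DOMAINS = 14
N_SCENARIOS_PER_DOMAIN = 20
N_CONFIGURATIONS = 6


def design_cells():
    """Return every (domain, scenario, configuration) cell of the design."""
    domains = [f"D{d:02d}" for d in range(1, N_DOMAINS + 1)]
    scenarios = [f"SC{s:02d}" for s in range(1, N_SCENARIOS_PER_DOMAIN + 1)]
    configs = [f"C{c}" for c in range(1, N_CONFIGURATIONS + 1)]
    return list(product(domains, scenarios, configs))


def planned_runs(repeats):
    """Total runs if every cell is repeated `repeats` times (assumed uniform)."""
    return len(design_cells()) * repeats


cells = design_cells()
print(len(cells))        # 1680 cells = 14 * 20 * 6
print(planned_runs(3))   # 5040 runs under the assumed repeats=3
```

Even a small number of repeats multiplies the run count quickly, which is one reason the evidence fields for each run need to be captured consistently.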

Sector playbooks translate these findings into practice. A playbook is not just a checklist for one industry. It is a structured interpretation layer that shows how the same RAIDT logic needs to be applied differently when the stakes, evidence requirements, and regulatory pressures change. In healthcare, evidence quality, provenance of retrieved guidance, and escalation rules may be central. In finance, audit trails, control ownership, and retention requirements may dominate. In law and public services, contestability and procedural fairness become especially important. In crisis and emergency response, time pressure and incomplete information mean that uncertainty has to be governed explicitly rather than hidden.

This is where the five pillars become operational. Responsibility concerns who initiated, reviewed, or approved the run and whether accountability is clear. Auditability concerns whether another person can reconstruct what happened after the event. Interpretability concerns whether the basis, assumptions, and limits of the output can be understood well enough for responsible use. Dependability concerns whether similar runs behave consistently enough for the intended organisational purpose. Traceability concerns whether the lineage of prompt, model, context, tool use, output, and subsequent action can be followed. S10 gives the empirical basis for asking whether those pillars are robust across settings rather than merely desirable in theory.

The star also matters for standards and policy alignment. Instruments such as the EU AI Act, ISO/IEC 42001, and the NIST AI RMF all expect organisations to show evidence of oversight, documentation, accountability, and risk management. Yet these instruments do not themselves provide a run-level evidencing method for everyday generative AI use. RAIDT's empirical programme can therefore make a methodological contribution by showing how broad governance expectations are instantiated in concrete runs, evidence packs, and score profiles. That is one reason S10 is central to the policy pathway in Paper 10.

S10 is equally important for the project's treatment of uncertainty. Generative AI is probabilistic, prompt-sensitive, and often unstable across model versions, retrieval states, or operational contexts. Managerial uncertainty is therefore not a side issue; it is built into the governance problem itself. By comparing repeated runs across domains and configurations, the empirical programme can identify where uncertainty is tolerable, where stronger controls are required, and where use should be restricted or redesigned.

There are limits. No empirical programme can prove universal validity across all future models, domains, or regulatory conditions. Sector playbooks may need regular updating as tools and expectations evolve. Scenario design can also bias what appears governable. For that reason, S10 should be read as a structured validation strategy rather than a claim that one study settles the governance question. Its value lies in cumulative evidence: well-defined runs, transparent evidence packs, defensible scoring, and careful sector interpretation.

For supervisors, this star demonstrates that RAIDT is not only a conceptual framework. It contains a credible route from foundations to empirical validation, and from validation to implementation and policy. In short, S10 explains how RAIDT earns its claims.

Key questions and answers

Q1. What is the empirical programme in RAIDT?

Answer:
The empirical programme is the structured research design used to test whether RAIDT works across multiple domains, scenarios, and configurations. It specifies how runs are sampled, repeated, compared, and evaluated so that the framework can be judged with evidence rather than rhetoric.

Practical example:
A researcher runs the same drafting task in healthcare, finance, and education, with and without RAG, and compares the resulting evidence packs and pillar scores.

Link to RAIDT:
This is the process through which the run-level evidence pack and the five-pillar profile become empirical instruments.

Q2. Why does RAIDT need multi-domain testing?

Answer:
Governance quality is context-dependent. A configuration that is acceptable for low-stakes educational support may be unsuitable for legal advice or clinical support. Multi-domain testing shows what remains stable in RAIDT and what must be adapted through sector playbooks.

Practical example:
A summarisation assistant may look dependable in internal training but become unacceptable when used for citizen advice because the burden of explanation and contestability is much higher.

Link to RAIDT:
Cross-domain testing helps calibrate evidence requirements, score thresholds, and governance interventions.

Q3. Why are repeated runs important?

Answer:
Repeated runs reveal consistency, volatility, and hidden failure modes. One good output does not show that a system is dependable for organisational use.

Practical example:
A procurement support prompt is run ten times over several weeks. Citation quality and confidence language vary noticeably, showing a level of instability that a single demonstration would miss.

Link to RAIDT:
Repeated runs provide the evidence base for the Dependability pillar and make the evidence pack more credible.
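A minimal sketch of how variation across repeated runs might be summarised is given here. It assumes each run receives a numeric score on a 0 to 1 scale and that stability is judged by the population standard deviation against a chosen limit. The scale, the `max_spread` limit, and the scores are all invented for illustration and are not RAIDT results or the project's scoring method.

```python
from statistics import mean, pstdev


def dependability_summary(scores, max_spread):
    """Summarise repeated-run scores for one design cell.

    scores: per-run scores on a hypothetical 0-1 scale.
    max_spread: largest population standard deviation treated as stable.
    """
    if len(scores) < 2:
        raise ValueError("need at least two repeated runs to assess spread")
    spread = pstdev(scores)
    return {
        "n_runs": len(scores),
        "mean": mean(scores),
        "spread": spread,
        "range": max(scores) - min(scores),
        "stable": spread <= max_spread,
    }


# Invented citation-quality scores from ten repeats of one prompt.
runs = [0.9, 0.9, 0.5, 0.9, 0.4, 0.9, 0.8, 0.9, 0.3, 0.9]
summary = dependability_summary(runs, max_spread=0.1)
print(summary["stable"])  # False: the spread exceeds the assumed limit
```

Six of the ten invented runs score 0.9, so a single demonstration would most likely look strong. The instability only shows up in the spread across repeats.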

Q4. What problem do sector playbooks solve?

Answer:
Sector playbooks solve the translation problem between general governance principles and domain-specific implementation. They show how the same RAIDT framework should be applied when stakes, actors, and rules differ.

Practical example:
A finance playbook may require stronger logging and formal sign-off than an education playbook because of audit and compliance expectations.

Link to RAIDT:
Playbooks preserve the common run-level structure while adapting controls and thresholds to sector conditions.
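One way to picture "common structure, adapted thresholds" is a shared baseline of controls with sector-level overrides. The sketch below is an assumption about how a playbook could be represented, not a RAIDT specification. The control names and every threshold value are hypothetical.

```python
# Shared run-level controls; every value here is an illustrative assumption.
BASE_CONTROLS = {
    "require_reviewer_sign_off": False,
    "require_source_logging": True,
    "min_dependability": 0.6,
}

# Sector playbooks override only what differs from the shared baseline.
PLAYBOOK_OVERRIDES = {
    "education": {},
    "finance": {"require_reviewer_sign_off": True, "min_dependability": 0.8},
}


def controls_for(sector):
    """Return the baseline controls with the sector's overrides applied."""
    if sector not in PLAYBOOK_OVERRIDES:
        raise KeyError(f"no playbook defined for sector: {sector}")
    return {**BASE_CONTROLS, **PLAYBOOK_OVERRIDES[sector]}


finance = controls_for("finance")
education = controls_for("education")
print(finance["require_reviewer_sign_off"])    # True
print(education["require_reviewer_sign_off"])  # False
```

Keeping the overrides separate from the baseline makes it visible which controls are sector-specific and which belong to the common run-level structure.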

Q5. How does this star connect to the run-level evidence pack?

Answer:
S10 defines the empirical conditions under which evidence packs are gathered and compared. It clarifies what should be captured for each run and how those records support validation, audit, and governance review.

Practical example:
For each crisis-response run, the evidence pack stores the prompt, model version, retrieved advisories, output, reviewer notes, timestamps, and action taken.

Link to RAIDT:
Without S10, the evidence pack is only a design idea. With S10, it becomes a repeatable empirical instrument.

Q6. How does S10 support RAIDT scoring?

Answer:
Scoring becomes useful only when it is calibrated against variation in task type, domain risk, and execution conditions. S10 provides the comparative evidence needed to judge whether a score is stable, sensitive, and decision-relevant.

Practical example:
RAG may improve Traceability in public services because sources are logged, but reduce Interpretability if the retrieved material is opaque to frontline users.

Link to RAIDT:
This star ensures that scoring is tied to governance decisions and interventions rather than cosmetic measurement.
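The trade-off in the practical example can be expressed as a per-pillar difference between two conditions. The sketch below assumes pillar scores on a 0 to 1 scale. The numbers are invented to mirror the example and are not empirical findings.

```python
PILLARS = (
    "Responsibility", "Auditability", "Interpretability",
    "Dependability", "Traceability",
)


def profile_delta(baseline, variant):
    """Per-pillar change from the baseline profile to the variant profile."""
    return {p: round(variant[p] - baseline[p], 3) for p in PILLARS}


# Invented score profiles for the same task without and with RAG.
without_rag = {"Responsibility": 0.7, "Auditability": 0.6,
               "Interpretability": 0.7, "Dependability": 0.6,
               "Traceability": 0.4}
with_rag = {"Responsibility": 0.7, "Auditability": 0.6,
            "Interpretability": 0.5, "Dependability": 0.6,
            "Traceability": 0.8}

delta = profile_delta(without_rag, with_rag)
improved = [p for p in PILLARS if delta[p] > 0]
worsened = [p for p in PILLARS if delta[p] < 0]
print(improved)  # ['Traceability']
print(worsened)  # ['Interpretability']
```

Reporting the profile as five separate deltas, rather than one aggregate number, keeps this kind of trade-off visible to whoever makes the governance decision.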

Q7. What is governance readiness as an outcome?

Answer:
Governance readiness means that a run is sufficiently evidenced, understandable, reviewable, and controllable for organisational deployment. It shifts attention from output quality alone to practical governability.

Practical example:
A model may draft fluent HR guidance, but if no reviewer is assigned and no retrieval sources are documented, the run is not governance-ready.

Link to RAIDT:
Governance readiness is a downstream interpretation of the evidence pack and the five-pillar profile.
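The HR guidance example can be written as a simple gap check over a run record. This is a sketch under stated assumptions: the dictionary keys and the three rules are hypothetical, and a real readiness assessment would also draw on the pillar scores and reviewer judgement.

```python
def readiness_gaps(run):
    """List reasons a run record is not governance-ready (empty = none found)."""
    gaps = []
    if not run.get("reviewer"):
        gaps.append("no reviewer assigned")
    if run.get("rag_used") and not run.get("retrieved_sources"):
        gaps.append("retrieval sources not documented")
    if not run.get("checks_applied"):
        gaps.append("no human or automated checks recorded")
    return gaps


# Hypothetical record: fluent output, but key evidence is missing.
hr_run = {
    "task": "draft HR guidance",
    "rag_used": True,
    "retrieved_sources": [],
    "reviewer": None,
    "checks_applied": ["format check"],
}
print(readiness_gaps(hr_run))
# ['no reviewer assigned', 'retrieval sources not documented']
```

Output quality does not appear in the check at all, which reflects the point that readiness concerns governability rather than fluency.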

Q8. Why is this star important for supervisors?

Answer:
Supervisors need to see that RAIDT has a coherent empirical pathway, not just a persuasive concept. S10 shows what will be tested, what counts as evidence, and how findings inform later papers and practical outputs.

Practical example:
During supervision, this star allows the researcher to explain sample design, outcome logic, and sector translation without collapsing into either abstract theory or tool-specific detail.

Link to RAIDT:
S10 is the bridge connecting foundational theory, empirical validation, evidence packs, scoring, and policy translation.

Practical examples

Healthcare discharge drafting: A clinician-support run uses a prompt template, a selected model, and RAG over hospital guidance. The playbook requires provenance logging, reviewer identity, escalation thresholds, and explicit limits on autonomous use.
Financial compliance checking: A compliance analyst uses GenAI to summarise a regulatory circular. Repeated runs reveal differences in omission rates across configurations, leading to stronger Dependability thresholds and review checkpoints.
Public-sector case support: A caseworker uses GenAI to draft a citizen-facing explanation. RAIDT captures prompt wording, source documents, redaction controls, and reviewer sign-off to support Auditability and contestability.
Cybersecurity incident triage: A security team uses a model to classify alerts and draft response notes. The evidence pack records tool configuration, retrieved threat intelligence, confidence markers, and post hoc correction data.

Evidence needed / what to capture

Unique run identifier, timestamp, organisational setting, and task purpose.
User role, reviewer role, approver role, and accountability ownership.
Prompt text, template version, and any prompt engineering changes.
Model name, provider, version, settings, and tool configuration.
RAG use, retrieved sources, retrieval timestamps, and provenance metadata.
Output produced, output format, confidence cues, and uncertainty warnings.
Human checks, automated checks, escalation decisions, and acceptance or rejection status.
Repetition number, comparison condition, and scenario identifier.
Pillar scores for Responsibility, Auditability, Interpretability, Dependability, and Traceability.
Governance readiness assessment, rationale, and recommended intervention.
Standards or policy mapping where relevant.
Post-run outcomes such as corrections, incidents, and downstream impact.
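The capture list above could be held in a single record type per run. The following is a minimal sketch, not a RAIDT schema: the field names, types, and example values are assumptions chosen to mirror the list, and several listed items (for example standards mapping and post-run outcomes) are omitted for brevity.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List, Optional

PILLARS = (
    "Responsibility", "Auditability", "Interpretability",
    "Dependability", "Traceability",
)


@dataclass
class EvidencePack:
    """One record per run; field names are illustrative assumptions."""
    # Identification and comparison metadata
    run_id: str
    timestamp: datetime
    organisational_setting: str
    task_purpose: str
    scenario_id: str
    comparison_condition: str
    repetition: int
    # Roles and configuration
    user_role: str
    prompt_text: str
    template_version: str
    model_name: str
    model_version: str
    # Fields that may be empty for a given run
    reviewer_role: Optional[str] = None
    approver_role: Optional[str] = None
    retrieved_sources: List[str] = field(default_factory=list)
    output_text: str = ""
    checks_applied: List[str] = field(default_factory=list)
    accepted: Optional[bool] = None
    pillar_scores: Dict[str, float] = field(default_factory=dict)
    readiness_rationale: str = ""

    def unscored_pillars(self) -> List[str]:
        """Pillars that still lack a score for this run."""
        return [p for p in PILLARS if p not in self.pillar_scores]


# Placeholder values only; nothing here describes a real run.
pack = EvidencePack(
    run_id="run-0001",
    timestamp=datetime(2025, 1, 1, tzinfo=timezone.utc),
    organisational_setting="placeholder setting",
    task_purpose="placeholder task",
    scenario_id="SC01",
    comparison_condition="C1",
    repetition=1,
    user_role="analyst",
    prompt_text="placeholder prompt",
    template_version="v1",
    model_name="placeholder-model",
    model_version="0.0",
    pillar_scores={"Traceability": 0.8},
)
print(pack.unscored_pillars())
record = asdict(pack)  # plain dict for storage or cross-run comparison
```

Making the identification and configuration fields required, while leaving review and scoring fields optional, lets an incomplete run still be recorded and flagged rather than silently dropped.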

Link to RAIDT project

Paper 08: foundations and methodological pathways - S10 operationalises the claim that the run is the right unit of governance.
Paper 09: empirical validation - This star provides the structure for testing reliability, variation, score calibration, and governance readiness across repeated runs.
Paper 10: policy pathways - Findings from S10 can be translated into standards-aligned and sector-specific policy recommendations.
Sector playbooks - The domains and scenarios in this star provide the raw material for practical playbooks.
RAIDT scoring - Cross-domain evidence is needed to justify score thresholds, trade-offs, and intervention logic.
RAIDT evidence pack - S10 defines what evidence must be captured consistently so that runs can be compared and reviewed.
RAIDT governance interventions - The star identifies where stronger prompts, tighter RAG controls, additional review, or policy escalation are required.

Citation ideas to support this note

Responsible AI governance literature on accountability, transparency, contestability, and oversight.
Information Systems research on socio-technical governance and technology-in-use.
Empirical methods literature on comparative case design, repeated measures, and scenario evaluation.
Generative AI studies on prompt sensitivity, hallucination, retrieval quality, and human-AI oversight.
Standards and policy materials related to the EU AI Act, ISO/IEC 42001, and the NIST AI RMF.
Domain-specific governance sources for healthcare, finance, public administration, cybersecurity, education, and crisis management.

Boundaries and limitations

This star does not claim universal validation across every model, sector, or future regulatory condition.
Good documentation does not automatically mean good governance; evidence quality and decision quality remain distinct.
S10 does not replace domain regulation, professional judgement, or sector-specific assurance processes.
Scoring is not mechanically objective; it still depends on defensible criteria, calibration, and reviewer judgement.
Sector playbooks will need revision as models, tools, and legal expectations evolve.

Conclusion

This star explains how RAIDT moves from theory into evidence. The project's core claim is that the run, rather than the model in the abstract, is the right unit of governance for organisational generative AI use. S10 shows how that claim can be tested. It sets out an empirical programme that varies domains, scenarios, configurations, and repeated runs so that the project can assess whether RAIDT's evidence pack and five-pillar score profile actually improve governance readiness.

Its importance is twofold. First, it gives Paper 09 a clear validation pathway by showing what counts as comparative evidence and how readiness can be treated as an outcome. Second, it gives Paper 10 a route into policy and sector application through playbooks for domains including healthcare, finance, law, public services, cybersecurity, education, environment, crisis response, supply chains, and ageing-related contexts. This means S10 is the bridge between conceptual foundations and practical adoption.

Slides

Slide 1 - why this star matters

Purpose:
Frame S10 as the empirical bridge between RAIDT theory and practical adoption.

Key message:
S10 explains how RAIDT is tested, evidenced, and made usable across sectors.

Slide content:

RAIDT governs the run, not only the model
Empirical testing is needed to support that claim
S10 links validation, scoring, and sector application
It turns theory into an evidence programme

Speaker note:
Position S10 as the part of the project that shows how the run-level claim is tested across real organisational uses rather than assumed.

Visual idea:
Bridge diagram linking foundations, empirical validation, and sector adoption.

Link to RAIDT:
This slide situates S10 between the run-level evidence pack, the five-pillar score profile, and the adoption pathway.

Citation support to mention if asked:
Responsible AI governance literature and Information Systems work on technology-in-use.

Slide 2 - the governance problem

Purpose:
Explain the gap that motivates the empirical programme.

Key message:
High-level AI principles and technical benchmarks do not by themselves show whether a specific GenAI use is governable.

Slide content:

Principles are often too abstract for daily practice
Benchmarks miss organisational context
GenAI risk appears at point of use
RAIDT addresses governance at run level

Speaker note:
Emphasise that organisations govern concrete uses under time pressure, uncertainty, and domain constraints. That is why run-level evidence matters.

Visual idea:
Comparison graphic: abstract principles vs technical benchmarks vs run-level governance.

Link to RAIDT:
This slide motivates why evidence packs and pillar scores must be tied to specific runs.

Citation support to mention if asked:
Responsible AI critiques, socio-technical governance research, and GenAI risk studies.

Slide 3 - design of the empirical programme

Purpose:
Show how S10 operationalises RAIDT into a research design.

Key message:
RAIDT is validated through structured comparison across domains, scenarios, configurations, and repeated runs.

Slide content:

14 domains
20 scenarios per domain
6 configurations
Repeated runs with governance readiness as outcome

Speaker note:
Breadth comes from domains and scenarios. Controlled variation comes from configurations. Repetition reveals instability and edge cases. The outcome is not only output quality but governability.

Visual idea:
Matrix showing domains x scenarios x configurations x repeats.

Link to RAIDT:
This is the empirical structure that generates comparable evidence packs and score profiles.

Citation support to mention if asked:
Comparative case design and repeated-measures evaluation literature.

Slide 4 - repeated runs and governance readiness

Purpose:
Explain why one successful demonstration is insufficient.

Key message:
Dependable governance requires evidence about consistency, variation, and uncertainty over time.

Slide content:

GenAI outputs can vary across runs
Prompt and RAG changes alter outcomes
Repetition exposes instability
Governance readiness depends on evidence, not fluency alone

Speaker note:
Connect uncertainty to managerial decision-making. If similar runs behave differently, governance needs stronger checks, narrower boundaries, or escalation rules.

Visual idea:
Simple plot showing score variation across repeated runs.

Link to RAIDT:
Repeated runs provide the evidence base for the Dependability pillar and improve the credibility of the evidence pack.

Citation support to mention if asked:
Studies on prompt sensitivity, uncertainty, model drift, and human oversight.

Slide 5 - sector playbooks

Purpose:
Explain how the same RAIDT logic is translated across sectors.

Key message:
Sector playbooks adapt RAIDT to different risk profiles, evidence demands, and regulatory pressures without losing the run-level structure.

Slide content:

Same framework, different sector emphasis
Healthcare: provenance and escalation
Finance: audit trails and controls
Public services and crisis response: contestability and uncertainty

Speaker note:
Stress that the playbook idea prevents RAIDT from becoming either too generic or too fragmented. The framework stays stable, but implementation guidance changes with domain conditions.

Visual idea:
Hub-and-spoke diagram with RAIDT at the centre and sector playbooks around it.

Link to RAIDT:
Sector playbooks are the adoption layer that translates run-level governance into practical organisational use.

Citation support to mention if asked:
Domain governance materials and sector-specific responsible AI guidance.

Slide 6 - policy, standards, and uncertainty

Purpose:
Show why S10 matters beyond the empirical study itself.

Key message:
S10 provides a practical route for aligning run-level governance with standards, regulation, and uncertainty management.

Slide content:

Supports documentation and oversight
Relevant to EU AI Act expectations
Can inform ISO/IEC 42001 and NIST AI RMF practices
Makes uncertainty visible at run level

Speaker note:
Clarify that RAIDT is not a substitute for regulation or standards. Its value is that it gives organisations a concrete evidencing method that policy frameworks can recognise.

Visual idea:
Standards mapping table linking evidence fields and pillars to governance requirements.

Link to RAIDT:
This slide supports Paper 10 and shows how evidence packs can serve assurance and policy translation.

Citation support to mention if asked:
EU AI Act materials, ISO/IEC 42001 documentation, NIST AI RMF guidance, and uncertainty literature.

Slide 7 - limits and project contribution

Purpose:
Close by clarifying what S10 does and does not claim.

Key message:
S10 does not prove universal validity, but it gives RAIDT a credible path from theory to evidence, scoring, and sector application.

Slide content:

Not universal or once-for-all validation
Sector playbooks will need updating
Scoring still requires judgement
Strong contribution: empirical pathway for RAIDT

Speaker note:
End by making the contribution precise. S10 does not settle all governance questions, but it gives the project a defensible empirical pathway that supervisors, reviewers, and practitioners can evaluate.

Visual idea:
Contribution map showing foundations -> validation -> policy -> sector use.

Link to RAIDT:
This slide summarises why S10 is essential to Papers 08, 09, and 10 and to the overall credibility of the project.

Citation support to mention if asked:
Methodology literature on scope conditions, external validity, and cumulative evidence.