S10.01 - Empirical programme

flowchart LR
    A["Principle-heavy AI governance<br/>Fragmented case evidence<br/>Weak comparability"] --> B["RAIDT<br/>Run-level evidence framework"]
    H[Healthcare]
    I[Finance]
    J[Law and public services]
    K[Cybersecurity]
    L[Education]
    M[Supply chain]
    H --> C
    I --> C
    J --> C
    K --> C
    L --> C
    M --> C
    B --> C[["Empirical programme<br/>Structured testing across domains<br/>scenarios and configurations"]]
    C --> D[Run-level evidence]
    D --> E[Evidence pack]
    D --> F[Five-pillar score profile]
    E --> G["Reviewer reconstruction<br/>Organisational learning<br/>Governance readiness"]
    F --> G
    C --> N["Evidence over assertion<br/>Reviewability<br/>Contestability<br/>Audit readiness"]

Star S10 - Empirical Programme, Domains and Sector Playbooks

Star context: Explains how RAIDT is tested, calibrated, and applied across domains, scenarios, and sector playbooks so that governance claims are supported by structured run-level evidence rather than abstract principle alone.


Academic picture
Definition / background

The empirical programme is the organised body of testing through which RAIDT examines whether governance readiness can be measured from run-level evidence across different domains, scenarios, and influence configurations. Rather than assuming that responsible governance is present because an organisation has a policy, a model card, or a high-level assurance statement, the empirical programme asks whether governance quality can be demonstrated repeatedly in actual configured uses of generative AI.

Conceptually, this makes the empirical programme a research and validation layer within RAIDT. RAIDT treats the run as the unit of governance: one configured use of a generative AI system for a specific task, at a specific time, in a specific context. The empirical programme then tests what happens when such runs are compared systematically. It examines whether evidence packs are complete enough, whether score profiles vary in meaningful ways, and whether governance readiness changes when domain conditions, scenarios, and intervention choices change.
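
As a minimal sketch of the run as the unit of governance, the record below holds one configured use together with the two outputs the programme compares: its evidence and its pillar scores. The field names (`run_id`, `domain`, `scenario`, `configuration`), the example values, and the 0 to 1 score scale are illustrative assumptions, not part of the RAIDT specification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The five RAIDT pillars named in this document.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


@dataclass
class Run:
    """One configured use of a generative AI system (hypothetical schema)."""
    run_id: str
    domain: str            # e.g. "healthcare"
    scenario: str          # the specific task being performed
    configuration: str     # label for the prompt/oversight set-up used
    started_at: datetime   # the specific time of the run
    evidence: dict = field(default_factory=dict)   # evidence-pack items
    scores: dict = field(default_factory=dict)     # pillar -> score (assumed 0..1)

    def has_full_profile(self) -> bool:
        """True only if every one of the five pillars has a score."""
        return all(p in self.scores for p in PILLARS)


# Illustrative values only.
run = Run(
    run_id="run-001",
    domain="healthcare",
    scenario="discharge-summary-draft",
    configuration="clinician-review",
    started_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(run.has_full_profile())
```

Keeping domain, scenario, and configuration as separate fields is what would allow runs to be grouped and compared along each of those dimensions later.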

This matters because many governance frameworks stop at principle definition or control catalogues. RAIDT goes further by asking for observable evidence from real runs. The empirical programme is therefore the mechanism that shows whether the framework works beyond a single illustrative example. It links run-level evidence, evidence-pack construction, score profiling across the five pillars, and the broader claim that governance can become reviewable, contestable, and auditable in practice.

The term should not be confused with a generic evaluation study or a narrow benchmark exercise. In RAIDT, the empirical programme is broader than model performance testing and narrower than a fully open-ended research agenda. It is a structured empirical design for examining governance readiness as an outcome, using repeated and comparable run-level evidence.

Why this concept matters

The empirical programme solves a central governance problem: organisations can make responsible AI claims without being able to show how those claims hold across contexts, tasks, and operational variations. Without an empirical programme, RAIDT would risk remaining a persuasive conceptual framework that lacks demonstrated robustness across domains.

It also avoids a common confusion between governance aspiration and governance capability. A team may have documentation, review boards, and internal policies, yet still fail to produce reconstructable evidence for a specific run. By testing across multiple cases, the empirical programme reveals whether governance is actually stable, comparable, and learnable.

If this is missing, organisations face several risks. They may overgeneralise from isolated examples, misread high scores as universally portable, or overlook how governance quality degrades when context, prompting structure, or workflow influence changes. For GenAI deployment in organisational work, these are material risks because the same model can produce very different governance conditions depending on task framing, operator behaviour, domain stakes, and evidence capture discipline.

The empirical programme matters because it moves RAIDT from principles toward operational governance. It creates a defensible basis for calibration, comparison, policy discussion, and sector playbook development.

Key idea: The empirical programme matters because it tests whether RAIDT can produce reliable governance evidence across real runs, not just whether the framework sounds convincing in theory.

What this item enables
Practical example / likely audience question

Audience question

What is the empirical programme actually testing in RAIDT: the quality of the model, the quality of the governance process, or both?

Answer

The underlying concern in this question is that evaluation can easily collapse into model benchmarking, where the emphasis is on accuracy or output quality alone. The direct answer is that the empirical programme primarily tests governance readiness as evidenced at run level, although model behaviour still matters because it affects the evidence available for review.

In practical terms, RAIDT is not asking only whether a generative AI system gave a useful answer. It is asking whether the run can be reconstructed, assessed, challenged, and learned from. For example, two teams may use the same model to support decision drafting. One team records prompt context, human oversight, revision steps, rationale, and escalation thresholds; the other stores only the final output. Both may appear successful on task completion, but the empirical programme would show that their governance readiness differs substantially because the evidence conditions differ.

RAIDT handles this issue better than a generic AI governance approach because it makes the comparison run-specific. Instead of assuming that governance quality is inherited from policy documents or model-level controls, it tests whether the evidence pack and score profile for each run support real reviewability and contestability. That, rather than model quality alone, is what the empirical programme tests.

Practical example in RAIDT terms

Consider a healthcare administrative use case in which a generative AI system drafts patient discharge summaries for clinician review. The run-level issue is not only whether the text reads well, but whether the organisation can reconstruct how the draft was produced, what source material informed it, what instructions were applied, what human corrections were made, and whether the output stayed within approved usage boundaries.

Within the empirical programme, this healthcare scenario can be compared with equivalent runs in finance or public services, and also compared across different prompt structures or oversight configurations. The required evidence would include task framing, system settings, input provenance, output versions, reviewer interventions, exception notes, and reasons for acceptance or rejection. Responsibility is affected because accountability for approval must be clear; Auditability because the run must be reviewable; Interpretability because reviewers need to understand why the draft took its form; Dependability because repeated runs should not degrade unpredictably; and Traceability because the evidence chain must be reconstructable.
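
The required evidence listed above can be treated as a checklist. The sketch below, under stated assumptions, reports which items an evidence pack is missing and gives a simple unweighted completeness fraction. The snake_case keys and the completeness measure are hypothetical choices for illustration; RAIDT may define and weight these items differently.

```python
# Evidence items listed in the healthcare example; the snake_case keys are
# hypothetical names chosen for this sketch.
REQUIRED_EVIDENCE = (
    "task_framing",
    "system_settings",
    "input_provenance",
    "output_versions",
    "reviewer_interventions",
    "exception_notes",
    "acceptance_rationale",
)


def missing_evidence(pack: dict) -> list[str]:
    """Return required items that are absent or empty in an evidence pack."""
    return [key for key in REQUIRED_EVIDENCE if not pack.get(key)]


def completeness(pack: dict) -> float:
    """Fraction of required items present (an assumed, unweighted measure)."""
    return 1 - len(missing_evidence(pack)) / len(REQUIRED_EVIDENCE)


# Two teams using the same model: one records the run, one keeps only output.
team_a = {key: "recorded" for key in REQUIRED_EVIDENCE}
team_b = {"output_versions": ["final draft"]}

print(missing_evidence(team_b))
print(completeness(team_a), round(completeness(team_b), 2))
```

This mirrors the two-team comparison in the audience answer: both packs contain a usable output, but only one supports reconstruction. The check treats an empty value as missing, so a real template would need a way to distinguish "no exceptions occurred" from "exceptions were not recorded".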

By placing such runs inside the empirical programme, RAIDT can show whether governance readiness improves when evidence capture is strengthened or influence is better controlled. The value is not just that one healthcare run looks compliant, but that governance performance becomes comparable, explainable, and improvable across many runs.

Detailed link to RAIDT

The empirical programme links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be evidenced at the level of actual use rather than assumed from high-level principle statements.
Second, it depends on the run as the unit of analysis, because the empirical programme compares configured uses of generative AI across contexts, tasks, and influence conditions.
Third, it tests the practical outputs of RAIDT by examining how evidence packs and five-pillar score profiles vary in completeness, quality, and usefulness.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning by showing whether RAIDT can support defensible governance judgements across repeated cases.

Empirical programme -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

The chain matters because the empirical programme does not sit outside the framework. It is the structured means by which RAIDT demonstrates that run-level evidence can support comparative governance assessment and continuous improvement.

Link to the five RAIDT pillars

Responsibility

The empirical programme tests whether responsibility is consistently allocated and evidenced across different runs, not merely declared in policy.

Example evidence / implication:

Auditability

This pillar is strongly affected because the empirical programme depends on whether runs can be reconstructed, compared, and inspected after the fact.

Example evidence / implication:

Interpretability

The empirical programme examines whether score differences can be meaningfully explained rather than treated as opaque numbers.

Example evidence / implication:

Dependability

This pillar is also strongly affected because repeated runs across domains and scenarios reveal whether governance quality is stable or fragile.

Example evidence / implication:

Traceability

Traceability underpins the entire empirical programme because comparisons are credible only if the evidence chain for each run is preserved.

Example evidence / implication:

The empirical programme touches all five pillars, but it is especially dependent on Auditability, Dependability, and Traceability because comparative testing fails if runs cannot be reconstructed consistently.
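
One way to make the Dependability point concrete is to measure how much each pillar score moves across repeated runs of the same scenario and configuration. The following is a sketch only: the scores are invented to exercise the code, and the 0 to 1 scale and the `max_spread` tolerance are assumptions rather than RAIDT-defined values.

```python
from statistics import mean, pstdev

# Hypothetical pillar scores (assumed 0..1 scale) from three repeated runs
# of the same scenario and configuration.
repeated_runs = [
    {"Auditability": 0.8, "Dependability": 0.7, "Traceability": 0.9},
    {"Auditability": 0.8, "Dependability": 0.4, "Traceability": 0.9},
    {"Auditability": 0.8, "Dependability": 0.9, "Traceability": 0.9},
]


def pillar_stability(runs: list[dict]) -> dict:
    """Return (mean, population standard deviation) per pillar."""
    pillars = list(runs[0].keys())
    return {p: (mean([r[p] for r in runs]), pstdev([r[p] for r in runs]))
            for p in pillars}


def fragile_pillars(runs: list[dict], max_spread: float = 0.1) -> list[str]:
    """Pillars whose spread across repeats exceeds an assumed tolerance."""
    return [p for p, (_, spread) in pillar_stability(runs).items()
            if spread > max_spread]


print(fragile_pillars(repeated_runs))
```

A pillar flagged here would not necessarily indicate poor governance; it would indicate that a single run's score for that pillar should not be generalised without further repeats.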

Why this item is more than a generic concept

In general AI governance, an empirical programme may simply mean that some cases, pilots, or evaluations were conducted. In RAIDT, it means a structured empirical design that tests governance readiness using run-level evidence, comparable evidence packs, and five-pillar score profiles.

The RAIDT meaning is more operational because it does not treat evidence as a narrative afterthought. Evidence is built into the method of comparison itself. This makes the empirical programme useful not only for academic validation, but also for supervisor discussion, audit preparation, policy translation, and sector-specific governance design.

Common misunderstanding

Misunderstanding

The empirical programme is just a large collection of examples showing that RAIDT can be used in different sectors.

Correction

That is too weak. The empirical programme is not merely illustrative coverage; it is a comparative testing structure. For example, if RAIDT is applied in healthcare, finance, and education, the point is not simply to show breadth. The point is to test whether governance readiness can be measured across these domains in a way that remains comparable, evidence-based, and sensitive to configuration differences. Breadth matters, but only because it supports empirical examination of how governance behaves under variation.

Boundary and limitation

The empirical programme does not prove that a system is ethically good, legally compliant in every jurisdiction, or safe in every possible circumstance. It also does not replace domain expertise, regulatory judgement, or substantive evaluation of model performance. A well-structured empirical programme can show patterns in governance readiness, but it cannot by itself settle all normative disputes about acceptable AI use.

Its effectiveness also depends on evidence quality. If run records are incomplete, if scenarios are weakly designed, or if scoring is inconsistently applied, then the empirical programme may produce misleading comparisons. RAIDT handles this limitation by tying the empirical programme back to explicit evidence-pack construction, repeated runs, cross-domain testing, and transparent score interpretation rather than relying on a single headline result.
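
Inconsistent scoring is one of the limitations that can be checked mechanically. As a hedged illustration, the function below compares two reviewers' scores for the same run and reports the pillars where they diverge by more than a tolerance. The reviewer scores, the 0 to 1 scale, and the tolerance of 0.2 are all hypothetical.

```python
def scoring_disagreements(reviewer_a: dict, reviewer_b: dict,
                          tolerance: float = 0.2) -> dict:
    """Pillars where two reviewers' scores differ by more than `tolerance`.

    Scores are assumed to be on a 0..1 scale; the tolerance is illustrative.
    Only pillars scored by both reviewers are compared.
    """
    shared = reviewer_a.keys() & reviewer_b.keys()
    return {p: round(abs(reviewer_a[p] - reviewer_b[p]), 2)
            for p in sorted(shared)
            if abs(reviewer_a[p] - reviewer_b[p]) > tolerance}


# Hypothetical scores from two reviewers assessing the same run.
first = {"Responsibility": 0.8, "Auditability": 0.6, "Traceability": 0.9}
second = {"Responsibility": 0.7, "Auditability": 0.2, "Traceability": 0.9}

print(scoring_disagreements(first, second))
```

A disagreement of this kind would be a prompt to revisit the rubric or the evidence pack before the run is used in any cross-domain comparison.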

Implementation levels

Manual implementation

A researcher or small team can implement the empirical programme manually by selecting a set of domains and scenarios, running comparable GenAI tasks, collecting run-level evidence in a structured template, and reviewing each case against the five RAIDT pillars. This is suitable for early-stage validation and conceptual demonstration.

Semi-automated implementation

A semi-automated implementation adds templates, metadata capture forms, review rubrics, and structured dashboards so that evidence packs and score profiles can be compared more efficiently across scenarios. This supports larger studies and makes cross-run analysis more reliable.
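
A small part of such a semi-automated set-up could be a script that rolls scored runs up into one score profile per domain. The sketch below assumes each run is a dictionary with a `domain` label and a `scores` mapping; that layout and the scores themselves are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored runs; domains come from this document, scores are
# invented purely to exercise the code.
runs = [
    {"domain": "healthcare", "scores": {"Auditability": 0.8, "Traceability": 0.6}},
    {"domain": "healthcare", "scores": {"Auditability": 0.6, "Traceability": 0.8}},
    {"domain": "finance", "scores": {"Auditability": 0.5, "Traceability": 0.5}},
]


def profile_by_domain(runs: list[dict]) -> dict:
    """Average each pillar score within each domain."""
    grouped = defaultdict(lambda: defaultdict(list))
    for run in runs:
        for pillar, score in run["scores"].items():
            grouped[run["domain"]][pillar].append(score)
    return {domain: {pillar: round(mean(values), 2)
                     for pillar, values in pillars.items()}
            for domain, pillars in grouped.items()}


for domain, profile in profile_by_domain(runs).items():
    print(domain, profile)
```

The same grouping could be keyed on scenario or configuration instead of domain, which is the kind of cross-run comparison a dashboard would surface.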

Fully automated implementation

At scale, the empirical programme can be implemented through a governance pipeline in which orchestration layers, wrappers, logging systems, and review dashboards automatically capture run metadata, preserve evidence chains, generate draft evidence packs, and surface comparative score patterns for oversight teams. In that form, the empirical programme becomes a continuous governance learning system rather than a one-off research exercise.
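
The wrapper idea can be sketched as a decorator that records metadata every time a generation function is called. This is an assumption-laden illustration: `capture_run`, `RUN_LOG`, and `draft_summary` are hypothetical names, the model call is a placeholder that returns a fixed string, and an in-memory list stands in for a durable evidence store.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

RUN_LOG: list[dict] = []   # stand-in for a durable evidence store


def capture_run(domain: str, scenario: str):
    """Decorator that records run metadata around a generation call."""
    def decorator(generate):
        @functools.wraps(generate)
        def wrapper(prompt: str, **settings):
            output = generate(prompt, **settings)
            RUN_LOG.append({
                "domain": domain,
                "scenario": scenario,
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "settings": settings,
                "prompt": prompt,
                "output": output,
                # Digest of the output, so a later change to the stored
                # output can be detected by recomputing the hash.
                "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
            })
            return output
        return wrapper
    return decorator


@capture_run(domain="healthcare", scenario="discharge-summary-draft")
def draft_summary(prompt: str, **settings) -> str:
    # Placeholder for a real model call; returns a fixed string here.
    return f"DRAFT: {prompt}"


draft_summary("Summarise admission notes", temperature=0.2)
print(json.dumps(RUN_LOG[-1], indent=2))
```

Capturing at the call site means evidence is recorded whether or not the operator remembers to do so, which is the practical difference between this level and the manual one. Human review steps and acceptance decisions would still need their own capture points.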

Practical use in the RAIDT project

Within the RAIDT project, this item is central to explaining how the framework moves from foundations to validation and then to policy and sector application. In a Paper 08 Foundations framing, the empirical programme helps clarify why RAIDT needs a run-level method rather than a principle-only model. In a Paper 09 Empirical Validation framing, it provides the structure through which domains, scenarios, configurations, repeated runs, and governance readiness are actually examined. In a Paper 10 Policy Pathways framing, it helps show policymakers and organisational stakeholders why evidence-based governance claims are more credible than broad AI assurance rhetoric.

It is also important for sector playbooks because each playbook should be grounded in observed evidence patterns rather than generic sector advice. For supervision, viva defence, and journal positioning, the empirical programme is the answer to the question, "How do you know RAIDT works beyond a single example?" It shows that the project has a method for comparative testing, calibration, and learning.

Key audience questions to prepare for

Q1. Is the empirical programme evaluating model performance or governance readiness?

It is primarily evaluating governance readiness at run level. Model performance may affect the evidence available, but the central question is whether the run can be evidenced, reviewed, compared, and improved.

Q2. Why is cross-domain testing necessary?

Cross-domain testing shows whether RAIDT is robust beyond a single context. Governance claims are stronger when the framework performs across varied tasks, stakes, and operational conditions.

Q3. What is being compared across runs?

The empirical programme compares evidence completeness, pillar-level score patterns, reviewer reconstructability, and the effects of scenario or configuration changes on governance readiness.

Q4. Why not rely on policy documents and organisational controls alone?

Because those do not show whether a specific GenAI use was actually governable in practice. RAIDT requires run-level evidence so that governance quality can be assessed in operational context.

Q5. What would count as a weak empirical programme in RAIDT?

A weak empirical programme would rely on too few scenarios, poorly structured evidence capture, inconsistent scoring, or no meaningful variation across domains and configurations. That would make the governance conclusions difficult to defend.
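
Several of these weaknesses can be screened for before any scoring takes place. The check below is a rough sketch: the minimum counts are arbitrary illustrative thresholds, the run layout and scenario names are hypothetical, and inconsistent scoring is left out because it needs reviewer data rather than design data.

```python
def programme_weaknesses(runs: list[dict], min_scenarios: int = 3,
                         min_domains: int = 2,
                         min_configurations: int = 2) -> list[str]:
    """List design weaknesses; thresholds are illustrative assumptions."""
    issues = []
    if len({r["scenario"] for r in runs}) < min_scenarios:
        issues.append("too few scenarios")
    if len({r["domain"] for r in runs}) < min_domains:
        issues.append("no variation across domains")
    if len({r["configuration"] for r in runs}) < min_configurations:
        issues.append("no variation across configurations")
    if any(not r.get("evidence") for r in runs):
        issues.append("runs without captured evidence")
    return issues


# Hypothetical programme: one domain, one configuration, one missing record.
runs = [
    {"domain": "healthcare", "scenario": "discharge-summary",
     "configuration": "baseline", "evidence": {"task_framing": "recorded"}},
    {"domain": "healthcare", "scenario": "referral-letter",
     "configuration": "baseline", "evidence": {}},
]

print(programme_weaknesses(runs))
```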

Suggested citation concepts to support this item
Short explanation for presentation

The empirical programme is the part of RAIDT that tests whether governance readiness can actually be measured from run-level evidence across different contexts. Rather than assuming that governance is strong because an organisation has policies or assurance statements, RAIDT compares real configured uses of generative AI across domains, scenarios, and influence conditions. This allows the framework to examine whether evidence packs are complete, whether score profiles are interpretable, and whether governance quality remains stable under variation. In practical terms, the empirical programme is what makes RAIDT defensible as more than a conceptual model. It supports calibration, comparison, sector playbooks, and stronger claims about reviewability, contestability, audit readiness, and organisational learning.

One-line takeaway

The empirical programme is the structured testing layer that makes RAIDT credible because it shows whether run-level evidence can support comparative governance readiness in practice.
