S10.01 - Empirical programme

flowchart LR
    A["Principle-heavy AI governance<br/>Fragmented case evidence<br/>Weak comparability"] --> B["RAIDT<br/>Run-level evidence framework"]
    H[Healthcare]
    I[Finance]
    J[Law and public services]
    K[Cybersecurity]
    L[Education]
    M[Supply chain]
    H --> C
    I --> C
    J --> C
    K --> C
    L --> C
    M --> C
    B --> C[["Empirical programme<br/>Structured testing across domains<br/>scenarios and configurations"]]
    C --> D[Run-level evidence]
    D --> E[Evidence pack]
    D --> F[Five-pillar score profile]
    E --> G["Reviewer reconstruction<br/>Organisational learning<br/>Governance readiness"]
    F --> G
    C --> N["Evidence over assertion<br/>Reviewability<br/>Contestability<br/>Audit readiness"]

Star S10 - Empirical Programme, Domains and Sector Playbooks

Star context: Explains how RAIDT is tested, calibrated, and applied across domains, scenarios, and sector playbooks so that governance claims are supported by structured run-level evidence rather than abstract principle alone.

Academic picture

Definition / background

The empirical programme is the organised body of testing through which RAIDT examines whether governance readiness can be measured from run-level evidence across different domains, scenarios, and influence configurations. Rather than assuming that responsible governance is present because an organisation has a policy, a model card, or a high-level assurance statement, the empirical programme asks whether governance quality can be demonstrated repeatedly in actual configured uses of generative AI.

Conceptually, this makes the empirical programme a research and validation layer within RAIDT. RAIDT treats the run as the unit of governance: one configured use of a generative AI system for a specific task, at a specific time, in a specific context. The empirical programme then tests what happens when such runs are compared systematically. It examines whether evidence packs are complete enough, whether score profiles vary in meaningful ways, and whether governance readiness changes when domain conditions, scenarios, and intervention choices change.
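One way to picture the run as a unit of analysis is as a small record that carries its own context. The sketch below is illustrative only: RAIDT as described here does not prescribe a schema, so the class name `Run`, its field names, and the helper `comparison_key` are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Run:
    """One configured use of a generative AI system (hypothetical schema)."""
    run_id: str
    domain: str           # e.g. "healthcare" or "finance"
    scenario: str         # the specific task being performed
    configuration: str    # label for the influence configuration under test
    started_at: datetime  # the specific time of the run
    evidence: dict[str, str] = field(default_factory=dict)  # item name -> stored reference


def comparison_key(run: Run) -> tuple[str, str, str]:
    """Runs that share this key were produced under the same conditions."""
    return (run.domain, run.scenario, run.configuration)


# Illustrative values only; none of these identifiers come from RAIDT.
run_a = Run("r-001", "healthcare", "discharge-summary", "clinician-review",
            datetime(2024, 1, 1, tzinfo=timezone.utc))
run_b = Run("r-002", "healthcare", "discharge-summary", "no-review",
            datetime(2024, 1, 2, tzinfo=timezone.utc))

# The two runs differ only in configuration, so they form a controlled comparison.
same_conditions = comparison_key(run_a) == comparison_key(run_b)
```

Keeping domain, scenario, and configuration as explicit fields is what makes systematic comparison possible: runs can be grouped on any two of them while the third is varied.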

This matters because many governance frameworks stop at principle definition or control catalogues. RAIDT goes further by asking for observable evidence from real runs. The empirical programme is therefore the mechanism that shows whether the framework works beyond a single illustrative example. It links run-level evidence, evidence-pack construction, score profiling across the five pillars, and the broader claim that governance can become reviewable, contestable, and auditable in practice.

The term should not be confused with a generic evaluation study or a narrow benchmark exercise. In RAIDT, the empirical programme is broader than model performance testing and narrower than a fully open-ended research agenda. It is a structured empirical design for examining governance readiness as an outcome, using repeated and comparable run-level evidence.

Why this concept matters

The empirical programme addresses a central governance problem: organisations can make responsible AI claims without being able to show how those claims hold across contexts, tasks, and operational variations. Without an empirical programme, RAIDT would risk remaining a persuasive conceptual framework that lacks demonstrated robustness across domains.

It also avoids a common confusion between governance aspiration and governance capability. A team may have documentation, review boards, and internal policies, yet still fail to produce reconstructable evidence for a specific run. By testing across multiple cases, the empirical programme reveals whether governance is actually stable, comparable, and learnable.

If this is missing, organisations face several risks. They may overgeneralise from isolated examples, misread high scores as universally portable, or overlook how governance quality degrades when context, prompting structure, or workflow influence changes. For GenAI deployment in organisational work, these are material risks because the same model can produce very different governance conditions depending on task framing, operator behaviour, domain stakes, and evidence capture discipline.

The empirical programme matters because it moves RAIDT from principles toward operational governance. It creates a defensible basis for calibration, comparison, policy discussion, and sector playbook development.

Key idea: The empirical programme matters because it tests whether RAIDT can produce reliable governance evidence across real runs, not just whether the framework sounds convincing in theory.

What this item enables

Systematic comparison of run-level governance evidence across domains and scenarios.
Examination of how different influence configurations alter evidence completeness and score outcomes.
Repeated testing of whether RAIDT scores are stable, sensitive, and interpretable rather than arbitrary.
Development of sector playbooks grounded in observed governance patterns rather than generic advice.
Calibration of governance readiness as an empirical outcome rather than a rhetorical claim.
Organisational learning about where responsibility, auditability, interpretability, dependability, and traceability weaken in practice.
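The comparison described in the list above could be sketched as a per-domain summary of five-pillar score profiles. This is a minimal illustration under stated assumptions: the source does not specify a score scale, so a 0 to 1 scale is assumed, the scores are placeholder numbers rather than results, and the function name `summarise_by_domain` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# The five pillars named in the text.
PILLARS = ("responsibility", "auditability", "interpretability",
           "dependability", "traceability")


def summarise_by_domain(profiles):
    """Map each domain to {pillar: (mean score, spread)} across its runs.

    `profiles` is a list of (domain, {pillar: score}) pairs.
    """
    grouped = defaultdict(list)
    for domain, scores in profiles:
        grouped[domain].append(scores)
    summary = {}
    for domain, runs in grouped.items():
        summary[domain] = {
            pillar: (mean([r[pillar] for r in runs]),
                     pstdev([r[pillar] for r in runs]))
            for pillar in PILLARS
        }
    return summary


# Placeholder scores on an assumed 0-1 scale; not real measurements.
profiles = [
    ("healthcare", dict.fromkeys(PILLARS, 0.5)),
    ("healthcare", dict.fromkeys(PILLARS, 1.0)),
    ("finance", dict.fromkeys(PILLARS, 0.75)),
]
summary = summarise_by_domain(profiles)
```

Reporting a spread alongside the mean is a deliberate choice here: a domain whose runs average well but vary widely signals unstable governance, which a mean alone would hide.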

Practical example / likely audience question

Audience question

What is the empirical programme actually testing in RAIDT: the quality of the model, the quality of the governance process, or both?

Answer

The underlying concern in this question is that evaluation can easily collapse into model benchmarking, where the emphasis is on accuracy or output quality alone. The direct answer is that the empirical programme primarily tests governance readiness as evidenced at run level, although model behaviour still matters because it affects the evidence available for review.

In practical terms, RAIDT is not asking only whether a generative AI system gave a useful answer. It is asking whether the run can be reconstructed, assessed, challenged, and learned from. For example, two teams may use the same model to support decision drafting. One team records prompt context, human oversight, revision steps, rationale, and escalation thresholds; the other stores only the final output. Both may appear successful on task completion, but the empirical programme would show that their governance readiness differs substantially because the evidence conditions differ.

RAIDT handles this issue better than a generic AI governance approach because it makes the comparison run-specific. Instead of assuming that governance quality is inherited from policy documents or model-level controls, it tests whether the evidence pack and score profile for each run support real reviewability and contestability. That is what the empirical programme tests.

Practical example in RAIDT terms

Consider a healthcare administrative use case in which a generative AI system drafts patient discharge summaries for clinician review. The run-level issue is not only whether the text reads well, but whether the organisation can reconstruct how the draft was produced, what source material informed it, what instructions were applied, what human corrections were made, and whether the output stayed within approved usage boundaries.

Within the empirical programme, this healthcare scenario can be compared with equivalent runs in finance or public services, and across different prompt structures or oversight configurations. The required evidence would include task framing, system settings, input provenance, output versions, reviewer interventions, exception notes, and reasons for acceptance or rejection. All five pillars are engaged: Responsibility, because accountability for approval must be clear; Auditability, because the run must be reviewable; Interpretability, because reviewers need to understand why the draft took its form; Dependability, because repeated runs should not degrade unpredictably; and Traceability, because the evidence chain must be reconstructable.
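The evidence items listed above lend themselves to a simple completeness check on an evidence pack. The following is a rough sketch, not a RAIDT-defined procedure: the key names in `REQUIRED_EVIDENCE` are hypothetical renderings of the items in the text, and treating every item as mandatory is a simplification (exception notes, for instance, may legitimately be empty when no exception occurred).

```python
# Hypothetical key names for the evidence items named in the text.
REQUIRED_EVIDENCE = (
    "task_framing",
    "system_settings",
    "input_provenance",
    "output_versions",
    "reviewer_interventions",
    "exception_notes",
    "acceptance_rationale",
)


def completeness(evidence_pack: dict) -> tuple[float, list[str]]:
    """Return the fraction of required items present and the missing ones."""
    missing = [item for item in REQUIRED_EVIDENCE if not evidence_pack.get(item)]
    present = len(REQUIRED_EVIDENCE) - len(missing)
    return present / len(REQUIRED_EVIDENCE), missing


# A pack that records every item, and one that keeps only the final output.
full_pack = {item: "recorded" for item in REQUIRED_EVIDENCE}
output_only_pack = {"output_versions": "final draft only"}

full_fraction, full_missing = completeness(full_pack)
partial_fraction, partial_missing = completeness(output_only_pack)
```

Returning the list of missing items rather than a bare score keeps the result reviewable: a reviewer can see which part of the evidence chain cannot be reconstructed, not just that the pack is incomplete.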

By placing such runs inside the empirical programme, RAIDT can show whether governance readiness improves when evidence capture is strengthened or influence is better controlled. The value is not just that one healthcare run looks compliant, but that governance performance becomes comparable, explainable, and improvable across many runs.

Detailed link to RAIDT

The empirical programme links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be evidenced at the level of actual use rather than assumed from high-level principle statements.
Second, it depends on the run as the unit of analysis, because the empirical programme compares configured uses of generative AI across contexts, tasks, and influence conditions.
Third, it tests the practical outputs of RAIDT by examining how evidence packs and five-pillar score profiles vary in completeness, quality, and usefulness.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning by showing whether RAIDT can support defensible governance judgements across repeated cases.

Empirical programme -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

The chain matters because the empirical programme does not sit outside the framework. It is the structured means by which RAIDT demonstrates that run-level evidence can support comparative governance assessment and continuous improvement.

Link to the five RAIDT pillars

Responsibility

The empirical programme tests whether responsibility is consistently allocated and evidenced across different runs, not merely declared in policy.