S10.03 — 20 scenarios per domain
flowchart LR
A["Problem:<br/>too few examples, selective demos,<br/>weak domain coverage"] --> B["RAIDT<br/>run-level evidence framework"]
H["Domain playbooks:<br/>healthcare, finance, law,<br/>education, cyber, supply chain"] --> C[["20 scenarios per domain<br/>structured scenario portfolio"]]
B --> C
C --> D[Run-level evidence packs]
C --> E[RAIDT score profiles]
C --> I["Comparison across configurations<br/>and repeated runs"]
D --> F["Reviewer reconstruction<br/>and contestability"]
E --> G[Governance readiness]
I --> G
Star S10 - Empirical Programme, Domains and Sector Playbooks
Star context: Shows how RAIDT is tested across realistic organisational settings by using a sufficiently broad but still manageable portfolio of domain-specific scenarios. The item sits inside the empirical programme because it connects abstract governance claims to repeated, comparable run-level evidence within sector playbooks.
Academic picture
Definition / background
In RAIDT, 20 scenarios per domain refers to the use of a structured set of twenty realistic task situations within each domain playbook so that GenAI use can be examined under varied but comparable conditions. A scenario is not simply a prompt. It is a designed governance test case that specifies a task context, a likely user purpose, relevant source conditions, and the kinds of failure modes or assurance questions that reviewers need to surface.
The conceptual importance of this item is that it gives the empirical programme a repeatable unit of domain testing. If RAIDT were applied to only one or two examples per sector, the resulting evidence would be too anecdotal to support strong claims about governance readiness. By contrast, a portfolio of twenty scenarios offers breadth across routine, borderline, and more demanding cases while remaining manageable for practical evaluation.
This item also differs from a generic benchmark. A benchmark often aims to compare model performance against a fixed answer key. RAIDT uses scenarios for governance-oriented examination: whether the run can be reconstructed, whether provenance is visible, whether human oversight is meaningful, whether outputs remain dependable across repetitions, and whether the evidence produced is strong enough to support review and contestation.
Within RAIDT, this matters because each scenario generates runs, each run can produce a run-level evidence pack, and patterns across those runs inform the five-pillar score profile. The twenty-scenario structure therefore belongs inside the empirical programme and sector playbooks: it is the mechanism that turns domain variation into comparable evidence rather than leaving governance claims at the level of theory.
Why this concept matters
This concept solves a practical methodological problem for GenAI governance. Organisations often want evidence that a framework works across domains, but they either test too narrowly or test too loosely. Too narrow a test gives false confidence; too loose a test makes comparison impossible. Twenty scenarios per domain provides a middle path: enough variety to expose meaningful differences, but enough structure to support systematic review.
It also avoids confusion between domain adaptation and ad hoc improvisation. Without a scenario set, a sector playbook can become a collection of examples chosen for convenience or rhetorical effect. With a defined scenario portfolio, RAIDT can show that the same governance framework is being challenged by a coherent range of tasks within healthcare, finance, law, education, cybersecurity, and other fields.
If this item is missing, the empirical programme risks becoming selective, fragile, and difficult to defend in supervision, peer review, or organisational scrutiny. A small number of examples cannot credibly support claims about repeatability, failure modes, or readiness for deployment. RAIDT therefore uses scenario portfolios to move from principles and isolated demonstrations towards structured operational evidence.
Key idea: Twenty scenarios per domain matters because RAIDT needs a domain-sensitive but comparable way to produce enough run-level evidence for credible governance assessment.
What this item enables
- A standardised portfolio of realistic task situations within each domain playbook.
- Coverage of routine, ambiguous, edge-case, and risk-sensitive GenAI uses rather than a single showcase example.
- Comparable testing across domains, configurations, and repeated runs.
- More robust evidence packs because multiple runs can be compared within the same domain logic.
- Better-founded RAIDT score profiles because pillar judgements can be based on patterns rather than on isolated impressions.
- Stronger reviewer challenge, contestability, and organisational learning about failure modes.
- Clearer translation from sector playbooks into empirical validation and policy argument.
Practical example / likely audience question
Audience question
Why do you need as many as twenty scenarios per domain rather than one or two representative examples?
Answer
The concern behind this question is whether the design is unnecessarily heavy. The direct answer is that one or two examples may illustrate a concept, but they do not provide enough variation to test governance performance across the kinds of tasks, ambiguities, and failure modes that organisations actually face. A domain such as healthcare or finance contains routine tasks, borderline cases, information-quality problems, conflicting source conditions, and different expectations of oversight. A small sample can easily flatter the framework.
In RAIDT, the purpose of the scenario set is not volume for its own sake. The purpose is structured coverage. Twenty scenarios give enough spread to observe whether governance evidence remains reviewable when the task changes, when provenance becomes uncertain, when human judgement is needed, or when the system becomes overconfident. This is especially important because RAIDT is concerned with run-level evidence, not just abstract compliance statements.
A practical example is a finance playbook in which some scenarios involve straightforward drafting from clean source data, while others involve missing documentation, ambiguous risk signals, or pressure to summarise complex evidence quickly. Generic AI governance might note that the organisation has a policy and an approval process. RAIDT goes further by asking whether those controls still produce reconstructable, scoreable evidence across a wider scenario portfolio. That is why the twenty-scenario design is methodologically defensible rather than arbitrary.
Practical example in RAIDT terms
Consider a finance domain playbook for a team using GenAI to assist with suspicious activity report preparation. One scenario may involve a straightforward case with complete source material. Another may introduce partial provenance, inconsistent transaction notes, and a prompt that could tempt the model to present conclusions too confidently. Across a portfolio of twenty scenarios, the organisation can observe how the same workflow behaves under varied but realistic pressure.
The run-level issue is not merely whether the model can generate plausible text. It is whether each run leaves sufficient evidence to show what data were used, how the prompt framed the task, what uncertainty was present, what the model produced, how the analyst edited or rejected the draft, and whether escalation occurred when required. The evidence needed includes prompt text, source-document references, model and configuration details, draft outputs, reviewer annotations, approval records, and reasons for any override or non-use.
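As an illustration only, the evidence items listed above can be treated as a checklist that every run is tested against. The sketch below is a minimal Python example under stated assumptions: the field names (`prompt_text`, `source_document_refs`, and so on), the `overridden_or_unused` flag, and the sample values are hypothetical and are not taken from a RAIDT template.

```python
# Hypothetical field names mirroring the evidence items named in the text.
REQUIRED_EVIDENCE = (
    "prompt_text",
    "source_document_refs",
    "model_and_configuration",
    "draft_output",
    "reviewer_annotations",
    "approval_record",
)


def missing_evidence(pack: dict) -> list[str]:
    """Return the required evidence items that are absent or empty in one run's pack."""
    return [name for name in REQUIRED_EVIDENCE if not pack.get(name)]


def needs_override_reason(pack: dict) -> bool:
    """Flag an override or non-use that has no recorded reason (assumed rule)."""
    return bool(pack.get("overridden_or_unused")) and not pack.get("override_reason")


# Illustrative run: the analyst rejected the draft but left no notes or reason.
run_pack = {
    "prompt_text": "Draft a report narrative from the attached case notes.",
    "source_document_refs": ["case-notes-017", "txn-export-017"],
    "model_and_configuration": {"model": "example-model", "temperature": 0.2},
    "draft_output": "Draft narrative text",
    "reviewer_annotations": [],
    "approval_record": None,
    "overridden_or_unused": True,
}

print(missing_evidence(run_pack))       # ['reviewer_annotations', 'approval_record']
print(needs_override_reason(run_pack))  # True
```

A check of this kind only shows whether evidence exists; whether the evidence is adequate remains a reviewer judgement.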
This scenario portfolio affects all five RAIDT pillars. Responsibility is engaged because roles and approval thresholds must remain clear across cases. Auditability and Traceability are strengthened when reviewers can reconstruct each scenario run. Interpretability is tested by whether analysts can explain why an output was accepted or challenged. Dependability is tested most strongly, because the point of multiple scenarios is to see whether governance quality holds across changing task conditions. Governance readiness improves when the organisation can show not one successful demonstration, but a body of structured evidence across realistic finance scenarios.
Detailed link to RAIDT
Twenty scenarios per domain links to RAIDT in four ways.
First, it operationalises the RAIDT core idea that GenAI governance should be examined through real use situations rather than through abstract principle statements alone.
Second, it creates a structured portfolio of runs, because each scenario becomes a context in which run-level evidence can be generated, compared, and challenged.
Third, it strengthens the evidence pack and the score profile by ensuring that pillar judgements are informed by patterned evidence across a domain rather than by a single anecdotal case.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning because reviewers can inspect where controls succeed, where they weaken, and how governance interventions should be refined.
20 scenarios per domain → Structured run portfolio → Evidence packs and RAIDT score profiles → Governance readiness
Link to the five RAIDT pillars
Responsibility
This item supports Responsibility by making domain playbooks explicit about who is expected to act, review, escalate, or approve across different kinds of task situations rather than only in ideal cases.
Example evidence / implication:
- Scenario specifications can include expected user role, reviewer role, and escalation threshold.
- Differences between scenarios reveal whether accountability arrangements remain clear under pressure or ambiguity.
Auditability
This item has a strong effect on Auditability because a scenario portfolio makes it possible to compare whether runs are consistently reconstructable across a domain rather than only in a hand-picked example.
Example evidence / implication:
- Reviewers can compare the completeness of prompts, inputs, outputs, and notes across multiple scenarios.
- Missing records become visible as a patterned governance weakness rather than as an isolated oversight.
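One way to make a "patterned weakness" concrete is to count how often the same evidence item is missing across the scenario set. The following Python sketch assumes a hypothetical audit result per scenario run; the scenario identifiers, item names, and the threshold of two runs are illustrative assumptions, not RAIDT definitions.

```python
from collections import Counter

# Hypothetical audit results: scenario id -> evidence items found missing in its run.
missing_by_run = {
    "FIN-01": [],
    "FIN-02": ["reviewer_notes"],
    "FIN-07": ["reviewer_notes", "source_refs"],
    "FIN-08": ["reviewer_notes"],
    "FIN-15": ["source_refs"],
}


def recurring_gaps(missing: dict[str, list[str]], min_runs: int = 2) -> dict[str, int]:
    """Count the runs in which each item is missing and keep items at or above the threshold."""
    counts = Counter(item for items in missing.values() for item in set(items))
    return {item: n for item, n in counts.items() if n >= min_runs}


print(recurring_gaps(missing_by_run))              # {'reviewer_notes': 3, 'source_refs': 2}
print(recurring_gaps(missing_by_run, min_runs=3))  # {'reviewer_notes': 3}
```

An item that is missing once may be an oversight; an item that is missing in several scenarios points to a control that is not working.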
Interpretability
Twenty scenarios per domain supports Interpretability by showing whether users and reviewers can understand why outputs differ across domain situations and whether explanations remain meaningful when tasks become more complex.
Example evidence / implication:
- Scenario variation can reveal when staff rely on unexplained model assertions rather than grounded reasoning.
- Reviewer notes can show whether output acceptance depended on understandable justification or on guesswork.
Dependability
This item most strongly affects Dependability because the point of using many scenarios is to test whether governance performance remains stable, safe, and usable across varied domain conditions.
Example evidence / implication:
- Repeated patterns of inconsistency, overconfidence, or missed risk signals can be detected across the scenario set.
- Strong performance on simple tasks can be distinguished from weak performance on ambiguous or high-stakes tasks.
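The distinction between simple and demanding tasks can be sketched by grouping run scores by scenario tier. In the Python example below, the tier labels, the 0 to 4 scale, and the scores themselves are illustrative assumptions and do not reproduce the RAIDT scoring rubric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results: (scenario_id, tier, dependability score on an assumed 0-4 scale).
results = [
    ("FIN-01", "routine", 4),
    ("FIN-02", "routine", 3),
    ("FIN-07", "ambiguous", 2),
    ("FIN-08", "ambiguous", 1),
    ("FIN-15", "high_stakes", 1),
    ("FIN-16", "high_stakes", 1),
]


def mean_score_by_tier(rows):
    """Average the scores within each scenario tier."""
    by_tier = defaultdict(list)
    for _scenario_id, tier, score in rows:
        by_tier[tier].append(score)
    return {tier: mean(scores) for tier, scores in by_tier.items()}


profile = mean_score_by_tier(results)
weakest = min(profile, key=profile.get)

print(profile)  # routine averages 3.5, ambiguous 1.5, high_stakes 1
print(weakest)  # high_stakes
```

A single average over all six runs would hide the gap between the routine tier and the other two, which is the pattern the scenario portfolio is meant to expose.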
Traceability
It also strongly affects Traceability because each scenario should preserve a clear link between task context, source material, model behaviour, human intervention, and downstream decision use.
Example evidence / implication:
- Scenario identifiers can be connected to timestamps, artefacts, reviewers, and configuration choices.
- Cross-scenario comparison can show whether trace quality deteriorates when workflows become more complex.
This item affects all five pillars, but it is especially important for Dependability, Auditability, and Traceability because those pillars are where the value of structured scenario coverage becomes most visible.
Why this item is more than a generic concept
In general AI governance, scenarios may simply mean examples, use cases, stress tests, or workshop illustrations. In RAIDT, 20 scenarios per domain has a more disciplined meaning: a designed empirical portfolio used to generate run-level evidence that can be reviewed, compared, scored, and used to justify governance claims.
The RAIDT meaning is more operational because the scenario set is not just descriptive. It is tied to concrete runs, to evidence-pack assembly, to five-pillar scoring, and to the judgement of governance readiness. In that sense, the item is not merely about having examples. It is about creating a defensible evidential basis for domain-sensitive governance assessment.
Common misunderstanding
Misunderstanding
Twenty scenarios per domain is just an arbitrary number, or a benchmarking convenience, with no special governance significance.
Correction
The number should be understood as a structured design choice, not as a universal natural constant. The governance significance comes from what the scenario portfolio achieves: enough breadth to surface different failure modes and enough discipline to preserve comparability. For example, if a healthcare playbook contained only two easy scenarios, RAIDT might appear successful while missing provenance failures, ambiguity management problems, or unsafe overconfidence in more difficult cases. The scenario portfolio is therefore a governance mechanism for evidential coverage, not a superficial counting exercise.
Boundary and limitation
Twenty scenarios per domain does not prove exhaustive coverage of a sector, and it does not guarantee that every real-world use condition has been represented. It also does not replace live monitoring, incident review, staff training, procurement checks, or legal and policy analysis. A scenario portfolio is a strong empirical instrument, but it remains a designed sample of governance-relevant situations.
The item also depends on scenario quality. Poorly designed scenarios can be too artificial, too repetitive, or insufficiently aligned to the tasks that matter in practice. In addition, domains evolve, so a static portfolio can lose relevance over time. RAIDT handles these limitations by treating scenario sets as structured but revisable playbook assets, to be updated through empirical learning, repeated runs, and cross-domain comparison rather than treated as permanent truth.
Implementation levels
Manual implementation
A researcher or small team can implement this item by drafting twenty scenario sheets for a domain, each specifying task purpose, context, source conditions, expected risks, and review questions. Runs can then be executed manually and logged in a structured evidence template.
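A scenario sheet of the kind described above could be represented as a small record, with a check that the portfolio is complete before any runs start. This is a minimal Python sketch: the class name `ScenarioSheet`, the helper `portfolio_problems`, the identifier scheme, and the placeholder content are all hypothetical. Only the sheet fields and the portfolio size of twenty come from the text.

```python
from dataclasses import dataclass

PORTFOLIO_SIZE = 20  # from the text: twenty scenarios per domain


@dataclass(frozen=True)
class ScenarioSheet:
    scenario_id: str
    domain: str
    task_purpose: str
    context: str
    source_conditions: str
    expected_risks: tuple[str, ...]
    review_questions: tuple[str, ...]


def portfolio_problems(sheets: list[ScenarioSheet], domain: str) -> list[str]:
    """List structural problems in a domain portfolio; an empty list means none found."""
    problems = []
    if len(sheets) != PORTFOLIO_SIZE:
        problems.append(f"expected {PORTFOLIO_SIZE} scenarios, found {len(sheets)}")
    ids = [s.scenario_id for s in sheets]
    if len(set(ids)) != len(ids):
        problems.append("duplicate scenario identifiers")
    for s in sheets:
        if s.domain != domain:
            problems.append(f"{s.scenario_id}: belongs to domain {s.domain!r}")
        if not s.expected_risks or not s.review_questions:
            problems.append(f"{s.scenario_id}: missing risks or review questions")
    return problems


# Placeholder sheets; real sheets would differ in purpose, context, and risks.
sheets = [
    ScenarioSheet(
        scenario_id=f"FIN-{n:02d}",
        domain="finance",
        task_purpose="Assist with suspicious activity report preparation",
        context="Analyst drafting from case material",
        source_conditions="complete" if n <= 10 else "partial provenance",
        expected_risks=("overconfident conclusions",),
        review_questions=("Can the sources be reconstructed?",),
    )
    for n in range(1, 21)
]

print(portfolio_problems(sheets, "finance"))       # []
print(portfolio_problems(sheets[:18], "finance"))  # ['expected 20 scenarios, found 18']
```

The check covers structure only. Whether the scenarios are realistic and varied enough is a design question that the limitations section addresses.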
Semi-automated implementation
Semi-automated implementation can use templates, tagged scenario libraries, metadata forms, and workflow checklists so that each scenario run automatically captures identifiers, role information, timestamps, and required reviewer prompts while still allowing human judgement.
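The automatic capture step might look like the following Python sketch, in which a helper fills in the run identifier, timestamp, roles, and reviewer prompts while leaving the reviewer's notes empty for a person to complete. The function name `new_run_record`, the field names, and the prompts in `REVIEWER_PROMPTS` are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical checklist shown to the reviewer for every run.
REVIEWER_PROMPTS = (
    "Were all source documents identified?",
    "Was escalation required, and did it occur?",
    "Why was the output accepted, edited, or rejected?",
)


def new_run_record(scenario_id: str, user_role: str, reviewer_role: str, configuration: str) -> dict:
    """Create a run record with automatically captured metadata."""
    return {
        "run_id": str(uuid.uuid4()),                          # unique per run
        "scenario_id": scenario_id,
        "user_role": user_role,
        "reviewer_role": reviewer_role,
        "configuration": configuration,
        "started_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
        "reviewer_prompts": list(REVIEWER_PROMPTS),
        "reviewer_notes": None,                               # left for human judgement
    }


record = new_run_record("FIN-07", "analyst", "compliance lead", "C2")
print(record["scenario_id"], record["started_at"])
```

The split is deliberate: metadata that can be captured mechanically is captured by the template, and the judgement fields stay with the reviewer.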
Fully automated implementation
At scale, a platform or orchestration layer can store scenario libraries, launch runs against multiple configurations, collect run artefacts automatically, and assemble comparative dashboards showing evidence-pack completeness, pillar impacts, and recurring failure modes across the twenty-scenario set.
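To give a sense of scale, an orchestration layer could enumerate the run plan for one domain as every combination of scenario, configuration, and repeat. In the Python sketch below, the scenario count comes from this item and the six configurations from related item S10.04. The number of repeats (three) and all identifiers are assumptions chosen for illustration.

```python
from itertools import product

scenarios = [f"FIN-{n:02d}" for n in range(1, 21)]  # 20 scenarios per domain
configurations = [f"C{n}" for n in range(1, 7)]     # 6 configurations (S10.04)
REPEATS = 3                                         # assumed; not specified in this item

# One planned run per (scenario, configuration, repeat) combination.
run_plan = [
    {"scenario_id": s, "configuration": c, "repeat": r}
    for s, c, r in product(scenarios, configurations, range(1, REPEATS + 1))
]

print(len(run_plan))  # 360
print(run_plan[0])    # {'scenario_id': 'FIN-01', 'configuration': 'C1', 'repeat': 1}
```

Under these assumptions a single domain yields 20 × 6 × 3 = 360 runs, which is why automatic artefact collection matters at this level of implementation.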
Practical use in the RAIDT project
Within the RAIDT project, this item is important for Paper 08 Foundations because it helps justify why empirical governance assessment needs a scenario portfolio rather than isolated examples. It is central to Paper 09 Empirical Validation because the whole validation design depends on consistent testing across domains, configurations, and repeated runs. It is also relevant to Paper 10 Policy Pathways because policymakers and organisational stakeholders will ask how RAIDT scales from abstract governance principles to sector-sensitive implementation.
The item also connects directly to sector playbooks, the evidence pack, and the scoring rubric. A domain playbook becomes more defensible when its claims rest on a structured scenario set. The evidence pack becomes more meaningful when individual runs can be located within that set. The score profile becomes more credible when it reflects patterned strengths and weaknesses across multiple scenarios rather than one-off success stories. For supervision, viva defence, and journal positioning, this item helps explain why RAIDT is an empirical governance programme rather than just a conceptual framework.
Key audience questions to prepare for
Q1. Why twenty rather than ten or fifty?
The value of twenty is methodological balance. It is large enough to create meaningful variation within a domain, but still small enough to remain feasible for repeated, comparative assessment. The exact number is a design choice, but the underlying principle is structured coverage rather than convenience sampling.
Q2. Are the twenty scenarios supposed to be identical across all domains?
No. The number is consistent, but the scenario content is domain-specific. RAIDT uses this design so that healthcare, finance, law, education, and other fields can each be tested through realistic tasks while still allowing comparison at the level of governance logic and evidence quality.
Q3. Is this just another benchmark dataset?
No. A benchmark usually focuses on performance against expected outputs. RAIDT scenarios are governance-oriented test cases designed to reveal reviewability, provenance, oversight quality, dependability, and traceability at the run level.
Q4. What happens if the organisation performs well on some scenarios and badly on others?
That is exactly the point of the design. Mixed performance is informative because it shows where governance controls are robust and where they fail under certain task conditions. RAIDT treats that variation as evidence for improvement, not as an embarrassment to be hidden.
Q5. How does this strengthen governance readiness?
It strengthens governance readiness by replacing selective demonstration with a body of comparable evidence. Reviewers can see whether practices hold across a realistic spread of domain situations, which makes assurance claims more credible and improvement priorities more concrete.
Suggested citation concepts to support this item
- Scenario-based evaluation in AI governance
- Domain-specific stress testing for generative AI
- Benchmarking versus governance-oriented evaluation of AI systems
- Sociotechnical test-case design for human-AI work
- Evidence-based assurance for organisational AI deployment
- Scenario portfolios in safety-critical or regulated digital systems
- Comparative evaluation of GenAI across domains and use contexts
- Human oversight under varied AI task conditions
- Auditability and traceability through structured scenario design
- Empirical validation methods for responsible AI frameworks
Short explanation for presentation
Twenty scenarios per domain is the mechanism RAIDT uses to make domain testing credible rather than anecdotal. Instead of relying on one or two favourable examples, each sector playbook contains a structured portfolio of realistic scenarios that expose different task conditions and governance risks. That matters because RAIDT is a run-level evidence framework: each scenario produces runs, those runs generate evidence packs, and patterns across them inform the five-pillar score profile. The point is not to create a benchmark in the narrow technical sense, but to show whether governance controls remain reviewable, interpretable, dependable, and traceable across a meaningful spread of domain situations. In supervision or viva terms, this item explains how RAIDT turns sector variation into a defensible empirical programme.
One-line takeaway
Twenty scenarios per domain gives each domain playbook a structured scenario portfolio, so that RAIDT has enough varied runs to turn governance claims into comparable run-level evidence.
Related items in empirical programme, domains and sector playbooks
- S10.01 · Empirical programme
- S10.02 · 14 domains
- S10.04 · 6 configurations
- S10.05 · Repeated runs
- S10.06 · Governance readiness as outcome
- S10.07 · Healthcare
- S10.08 · Finance
- S10.09 · Law and public services
- S10.10 · Cybersecurity
- S10.11 · Education
- S10.12 · Environment
- S10.13 · Crisis and emergency response
- S10.14 · Supply chain
- S10.15 · Ageing calibration