Boundaries, Limitations and Future Questions

flowchart LR
    A[GenAI governance uncertainty] --> B[RAIDT run-level framework]
    B --> C[Star S11 scope discipline]
    C --> D[Boundary conditions]
    C --> E[Correctness vs readiness]
    C --> F[Proportional evidence]
    D --> G[Score interpretation]
    E --> G
    F --> H[Governance decisions]
    G --> H
    H --> I[Future research path]


Ring: Control star

Function

Defines the outer limits of RAIDT as a run-level governance framework, clarifies what its evidence and scores can legitimately support, and prevents conceptual or policy overclaiming when RAIDT is used in research, supervision, implementation, and organisational decision-making.

Role in the project

This star plays a boundary-setting role across the whole RAIDT programme. It sits at the intersection of foundations, governance design, empirical validation, and policy translation. In practical terms, it explains where RAIDT is strong, where it is necessarily partial, and which future research questions remain open. That makes it especially important for Paper 08, where methodological and conceptual claims must be carefully scoped; for Paper 09, where empirical testing must distinguish measurable governance readiness from broader claims about system quality; and for Paper 10, where policy relevance depends on showing both alignment potential and remaining gaps.

S11 also helps supervisors and workshop participants understand that RAIDT is not simply another general Responsible AI checklist. It is a run-level evidence framework for governing configured uses of generative AI in organisational work. Its contribution lies in making particular AI uses inspectable, documentable, and governable at the level of the run. This note therefore protects the intellectual coherence of the project by showing what RAIDT can evaluate, what it can only signal indirectly, and what lies outside its immediate scope.

Main message

Any governance framework becomes weaker, not stronger, if it claims to solve more than it can actually demonstrate. This is especially true in the governance of generative AI, where organisations often face pressure to move quickly from experimentation to operational use while regulators, auditors, managers, and staff all ask slightly different questions. Some want to know whether a system is safe. Others want to know whether it is lawful, fair, accurate, explainable, or useful. RAIDT does not answer all of these questions in a total sense. Its value lies in answering a narrower but highly practical question: how can a specific use of a generative AI system in organisational work be governed through evidence captured at the level of the run?

In RAIDT, the run is the unit of governance. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. It includes the prompt or instruction, model and tool configuration, retrieved context where used, output, and human or automated checks. This framing matters because many governance failures are not failures of abstract model design alone. They arise from how a system is configured, what context it receives, which tools it can call, how outputs are reviewed, and whether evidence exists to reconstruct what happened. RAIDT addresses that practical governance gap through two outputs: a run-level evidence pack and a five-pillar RAIDT score profile covering Responsibility, Auditability, Interpretability, Dependability, and Traceability.
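As a minimal sketch of the run definition above, the record below collects the elements the text names (prompt, model and tool configuration, retrieved context, output, checks) alongside a five-pillar profile. The field names, the `Run` and `ScoreProfile` classes, and the 0.0 to 1.0 score scale are illustrative assumptions, not part of RAIDT as specified here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Pillar names come from the text; the 0.0-1.0 scale used below is an assumption.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


@dataclass
class Run:
    """One configured use of a GenAI system (hypothetical field names)."""
    task: str
    context: str
    prompt: str
    model_settings: dict
    tool_settings: dict
    output: str
    checks: list = field(default_factory=list)             # human or automated checks
    retrieved_context: list = field(default_factory=list)  # only where retrieval is used
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class ScoreProfile:
    """Five separate pillar scores, deliberately not collapsed into one total."""
    scores: dict

    def __post_init__(self):
        missing = [p for p in PILLARS if p not in self.scores]
        if missing:
            raise ValueError(f"missing pillar scores: {missing}")


run = Run(
    task="Draft internal policy summary",
    context="Policy team, internal use",
    prompt="Summarise the attached travel policy.",
    model_settings={"model": "example-model", "version": "2024-01"},
    tool_settings={"retrieval": True},
    output="(draft summary text)",
    checks=["reviewer sign-off"],
)
profile = ScoreProfile(scores={p: 0.5 for p in PILLARS})
```

Keeping the profile as five named scores, rather than a single number, mirrors the point that RAIDT does not collapse governance conditions into one metric.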

The purpose of this star is to clarify the boundaries of that contribution. RAIDT is not a universal theory of AI safety, not a proof of correctness, and not a substitute for domain regulation, legal advice, or broader organisational governance. It is a framework for producing structured evidence about a run and using that evidence to assess governance readiness. That distinction between correctness and governance readiness is central. A model may generate a factually correct answer in one instance while still being poorly governed because prompts are not recorded, retrieval sources are unstable, reviewer responsibilities are unclear, or audit trails are missing. Equally, a run may be well documented and well controlled yet still produce an incorrect answer. RAIDT helps organisations see and manage these conditions, but it does not collapse them into one metric.

This boundary is methodologically important for the PhD project. In academic terms, RAIDT contributes to Information Systems governance by proposing a governance object that is concrete enough to inspect yet flexible enough to travel across use cases. It also contributes to Responsible AI debates by shifting attention from broad principles to operationally auditable evidence. However, this does not mean that run-level evidence fully captures ethical quality, legal sufficiency, or social legitimacy. A run can be traceable without being fair. It can be auditable without being optimal. It can be interpretable in workflow terms without revealing the full internal logic of a foundation model. S11 exists to state those limits explicitly.

The star also addresses proportionality. Not every run requires the same intensity of evidence capture or the same governance intervention. A low-risk drafting assistant for internal brainstorming should not necessarily be governed in the same way as a clinical triage assistant, legal advice generator, or public-sector benefits support tool. RAIDT therefore needs a proportionality logic: governance should match task criticality, uncertainty, sensitivity of data, degree of automation, and consequences of error. This makes the evidence pack more practical, because the framework becomes adaptable rather than uniformly burdensome. At the same time, proportionality creates its own challenge. If organisations reduce evidence capture too far in the name of efficiency, the score profile may become weak precisely where assurance is most needed.

Privacy and data protection introduce another important limitation. RAIDT can specify what evidence would ideally be captured, but not all evidence should always be stored in full. Prompts may contain personal data, retrieved documents may be confidential, and tool traces may expose sensitive workflow information. The framework therefore has to distinguish between evidence existence, evidence accessibility, and evidence retention. In some settings, strong traceability may require cryptographic hashes, metadata, controlled access logs, or redacted artefacts rather than unrestricted raw storage. This means the evidence pack is not a simple archive of everything. It is a governed record designed to support accountability without creating avoidable privacy or security risks.
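One way to picture the difference between evidence existence and raw storage is sketched below: the record keeps a hash and a redacted copy of the prompt, and stores the raw text only on explicit request. The function names, the record keys, and the email-only redaction rule are assumptions for illustration; a real deployment would need a much broader redaction policy.

```python
import hashlib
import re

# Illustrative pattern only: it catches simple email addresses, nothing else.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def evidence_entry(prompt: str, store_raw: bool = False) -> dict:
    """Build a prompt record that proves the prompt existed without exposing it in full."""
    entry = {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_length": len(prompt),
        "prompt_redacted": redact(prompt),
    }
    if store_raw:  # raw retention is opt-in, not the default
        entry["prompt_raw"] = prompt
    return entry


entry = evidence_entry("Write to jane.doe@example.com about her contract.")
```

A plain hash of a short or predictable prompt can be guessed by trying candidates, so a keyed hash (for example via Python's `hmac` module) may be a better fit where prompts are low in entropy.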

Metric overreach is a further concern. RAIDT produces scores across five pillars, but those scores should not be misread as direct measures of truth, quality, morality, or legal compliance. They are structured indicators of governance conditions. A high Auditability score suggests that a run can be reconstructed and reviewed. A high Dependability score suggests that controls, checks, and workflow stability are strong. Yet even a strong score profile does not eliminate uncertainty. Generative AI remains probabilistic, context-sensitive, and dependent on changing components such as models, retrieval pipelines, prompt templates, moderation layers, and human review practices. S11 therefore insists that scores are aids to governance judgement, not replacements for it.

This becomes even more important when systems drift. Component drift may arise because a model provider updates a model silently, retrieval corpora change, prompts are revised, tools are added, or reviewers adopt different checking habits. Since RAIDT governs runs rather than merely static systems, it is well placed to detect some forms of drift through evidence capture and comparison over time. But it cannot guarantee that all relevant changes are visible, especially in opaque vendor environments. This is one reason why empirical validation in Paper 09 matters. The project must test whether the evidence pack and score profile remain reliable, usable, and discriminating across real organisational cases.

S11 also frames future questions. Multimodal AI introduces inputs and outputs that are harder to interpret and archive, such as images, audio, video, and mixed-media prompts. Agentic AI creates longer action chains, delegated decisions, dynamic tool use, and more complex accountability boundaries. These developments do not invalidate RAIDT, but they do place pressure on its current design. The run remains a useful governance unit, yet the evidence model may need extension to capture chaining, delegation, autonomy thresholds, and intervention points. Likewise, sector playbooks may be required because acceptable evidence and governance thresholds differ across education, healthcare, finance, government, and professional services.

Overall, this star supports the RAIDT project by making its claims more precise and defensible. It shows that RAIDT should be understood as a practical governance framework for documenting and assessing the quality of control around particular GenAI uses. That is a substantial contribution, but not an unlimited one. By naming its boundaries, RAIDT becomes more credible for supervisors, more useful for organisations, and more adaptable for future research, policy alignment, and sector-specific implementation.

Key questions and answers

Q1. What is the main purpose of boundaries and limitations in RAIDT?

Answer:
The main purpose is to define the legitimate scope of RAIDT's claims. RAIDT is designed to assess the governance readiness of a specific GenAI run through evidence and scoring. It is not designed to prove that every output is correct, lawful, or ethically ideal. By stating this clearly, the framework avoids conceptual inflation and becomes more methodologically robust.

Practical example:
An organisation uses a large language model to draft internal policy summaries. RAIDT can capture the prompt, model version, retrieved sources, reviewer checks, and approval steps. It can show whether the run was governable and reviewable, but it cannot by itself prove that every policy summary is legally flawless.

Link to RAIDT:
This distinction protects the integrity of the run-level evidence pack and ensures that pillar scores are interpreted as governance indicators rather than absolute truth claims.

Q2. Why does RAIDT need an explicit boundary note?

Answer:
Responsible AI frameworks are often criticised for vagueness or overreach. An explicit boundary note helps RAIDT explain where it fits in relation to Responsible AI, AI risk management, and Information Systems governance. It makes the framework easier to defend in supervision, publication, and implementation settings.

Practical example:
During a workshop, a supervisor asks whether RAIDT can determine fairness across all user groups. The boundary note allows a clear answer: RAIDT can require evidence relevant to fairness assessment in a run, but broader fairness analysis may require additional datasets, impact assessment methods, and domain expertise.

Link to RAIDT:
This note supports the interpretive use of the evidence pack and prevents the five-pillar profile from being oversold.

Q3. What problem does this star solve?

Answer:
It solves the problem of overclaiming. Without explicit boundaries, users may assume that a structured score profile is equivalent to proof of safety, correctness, or compliance. S11 prevents that misunderstanding by clarifying what RAIDT can evidence directly and what must remain a matter for further judgement or supplementary governance.

Practical example:
A procurement team sees a high RAIDT score and wants to treat it as automatic approval for enterprise-wide deployment. S11 clarifies that the score should instead inform a decision, alongside legal review, risk assessment, and sector-specific controls.

Link to RAIDT:
It safeguards the governance function of scoring and anchors governance interventions in evidence rather than misplaced certainty.
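The procurement example can be sketched as a function that reports what a decision-maker still needs rather than returning an approval. The function name, the 0.7 threshold, and the 0.0 to 1.0 scale are assumptions; the three review types are taken from the example above.

```python
def deployment_inputs(profile: dict, completed_reviews: set, threshold: float = 0.7) -> dict:
    """Summarise open items for a human decision; this never approves anything itself."""
    required = ("legal review", "risk assessment", "sector-specific controls")
    weak = sorted(p for p, score in profile.items() if score < threshold)
    outstanding = [r for r in required if r not in completed_reviews]
    return {
        "weak_pillars": weak,
        "outstanding_reviews": outstanding,
        "ready_for_human_decision": not weak and not outstanding,
    }


strong_profile = {
    "Responsibility": 0.9, "Auditability": 0.9, "Interpretability": 0.8,
    "Dependability": 0.85, "Traceability": 0.95,
}
result = deployment_inputs(strong_profile, completed_reviews={"risk assessment"})
```

Here a strong profile still leaves two reviews outstanding, so the result is not ready for a decision. Even when every item is complete, the output is only an input to human judgement.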

Q4. What is the difference between correctness and governance readiness?

Answer:
Correctness concerns whether an output is substantively right for the task. Governance readiness concerns whether the run was configured, documented, checked, and controlled in a way that supports accountability and appropriate use. These dimensions may overlap, but they are not identical.

Practical example:
A chatbot returns the right answer to a customer question by chance, but no reviewer checks were recorded and the source trace is missing. The answer may be correct, yet governance readiness is weak.

Link to RAIDT:
RAIDT captures this distinction through evidence fields and pillar scoring, especially Auditability, Dependability, and Traceability.

Q5. How does proportionality affect RAIDT?

Answer:
Proportionality determines how much evidence, review, and control are appropriate for a run. Higher-stakes tasks require stronger evidence capture, clearer accountability, and more robust checks. Lower-stakes tasks may justify lighter controls to avoid unnecessary governance burden.

Practical example:
An internal brainstorming prompt may only require prompt logging, model identification, and basic user acknowledgement. A healthcare discharge summary assistant may require source capture, reviewer sign-off, escalation rules, and restricted deployment conditions.

Link to RAIDT:
Proportionality shapes the design of the evidence pack, the interpretation of score thresholds, and the selection of governance interventions.
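A proportionality rule of this kind might be sketched as follows. The five factors are those named in the main message and the evidence items come from the two examples above; the 1 to 3 rating scale, the three tier names, the contents of the middle tier, and the rule that the highest single rating decides the tier are all assumptions made for illustration.

```python
FACTORS = ("task_criticality", "uncertainty", "data_sensitivity",
           "automation", "error_consequences")

# "light" and "strict" follow the brainstorming and discharge-summary examples;
# the "standard" tier in between is an assumption.
LIGHT = ["prompt logging", "model identification", "user acknowledgement"]
STANDARD = LIGHT + ["source capture", "reviewer sign-off"]
STRICT = STANDARD + ["escalation rules", "restricted deployment conditions"]
EVIDENCE_BY_TIER = {"light": LIGHT, "standard": STANDARD, "strict": STRICT}


def evidence_tier(ratings: dict) -> str:
    """Map factor ratings (1 = low, 3 = high) to a tier; the highest rating decides."""
    for factor in FACTORS:
        if ratings.get(factor) not in (1, 2, 3):
            raise ValueError(f"rating for {factor!r} must be 1, 2 or 3")
    worst = max(ratings[f] for f in FACTORS)
    return {1: "light", 2: "standard", 3: "strict"}[worst]


brainstorm = dict.fromkeys(FACTORS, 1)
discharge = dict(brainstorm, task_criticality=3, data_sensitivity=3, error_consequences=3)
```

Taking the maximum rather than an average is a deliberate choice in this sketch: one high-stakes factor should not be diluted by several low ones, which guards against reducing evidence capture where assurance is most needed.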

Q6. Why is privacy a limitation for run-level evidence capture?

Answer:
Because the evidence that improves accountability may also increase exposure of sensitive data. Effective governance therefore requires selective capture, redaction, access controls, and retention rules rather than indiscriminate logging.

Practical example:
A human resources assistant uses GenAI to draft employee communications. Full prompt capture might reveal personal or contractual information. The organisation may need to store redacted prompts, metadata, and access logs instead of raw text.

Link to RAIDT:
This shapes how the evidence pack is designed and ensures Traceability and Auditability are balanced against privacy and data protection duties.

Q7. What is metric overreach in the context of RAIDT?

Answer:
Metric overreach occurs when RAIDT scores are treated as if they measure more than they actually do. The five pillars indicate the quality of governance conditions around a run, not the total value or legitimacy of the AI system.

Practical example:
A team reports that a tool is 'safe' because it achieved a strong RAIDT profile. In reality, the profile shows that the use was well documented and controlled; safety still depends on the task, users, domain, and broader risk context.

Link to RAIDT:
S11 protects the meaning of RAIDT scoring and supports responsible communication of evidence-pack results.

Q8. How does component drift challenge RAIDT?

Answer:
Component drift means that a run environment can change over time even if the use case appears stable. Models, retrieval sources, tool permissions, moderation settings, and human review practices may all shift. This makes governance evidence time-sensitive.

Practical example:
A vendor updates the underlying model behind an API. Outputs change subtly, even though the organisation did not revise its prompt template. Prior evidence may no longer represent current behaviour.

Link to RAIDT:
Because RAIDT treats the run as the unit of governance, it can record time-stamped configurations and make drift more visible, but it cannot eliminate vendor opacity.
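Comparing time-stamped configurations could look like the sketch below, which fingerprints each recorded configuration and lists the components that differ between two runs. The configuration keys and helper names are hypothetical. The comparison only sees what was recorded: a vendor change that leaves the version label untouched would not show up here.

```python
import hashlib
import json

_MISSING = object()  # distinguishes "key absent" from "value is None"


def fingerprint(config: dict) -> str:
    """Stable hash of a run configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(earlier: dict, later: dict) -> list:
    """Names of configuration entries that differ between two recorded runs."""
    keys = set(earlier) | set(later)
    return sorted(k for k in keys
                  if earlier.get(k, _MISSING) != later.get(k, _MISSING))


january = {
    "model_version": "2024-01",
    "prompt_template": "v3",
    "retrieval_corpus": "policies-2024-01",
}
march = dict(january, model_version="2024-03")
drifted = changed_components(january, march)
```

Storing the fingerprint with each run makes it cheap to spot that something changed; the component list then shows where to look.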

Q9. Why are multimodal and agentic systems future questions rather than simple extensions?

Answer:
They introduce new governance complexities. Multimodal systems create evidence types that are harder to store, inspect, and interpret. Agentic systems create longer chains of action, more dynamic decision points, and blurred responsibility boundaries. These features require extension of the current evidence model.

Practical example:
An agentic research assistant retrieves sources, writes a draft, queries a database, and sends a report for approval. Governance now concerns not only one prompt-output pair but a sequence of actions with checkpoints and intervention points.

Link to RAIDT:
These questions point to future development of the evidence pack, scoring logic, and governance interventions for more complex run structures.
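To illustrate how the evidence model might be extended for agentic runs, the sketch below records a run as a sequence of steps and measures the longest stretch of actions with no intervention point. The `Step` and `AgenticRun` classes and the stretch measure are assumptions for discussion, not an existing part of RAIDT.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    actor: str                # "agent" or "human"
    checkpoint: bool = False  # True where a human can review or intervene


@dataclass
class AgenticRun:
    steps: list = field(default_factory=list)

    def longest_unchecked_stretch(self) -> int:
        """Largest number of consecutive steps that pass without a checkpoint."""
        longest = current = 0
        for step in self.steps:
            if step.checkpoint:
                current = 0
            else:
                current += 1
                longest = max(longest, current)
        return longest


research_run = AgenticRun(steps=[
    Step("retrieve sources", "agent"),
    Step("write draft", "agent"),
    Step("query database", "agent"),
    Step("send report for approval", "agent", checkpoint=True),
])
stretch = research_run.longest_unchecked_stretch()
```

A measure like this could be compared against an autonomy threshold set for the task, so that longer unattended chains prompt an additional intervention point.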

Q10. What evidence would help justify claims made using RAIDT?

Answer:
Useful evidence includes run identifiers, timestamps, task purpose, user role, prompt text or prompt hash, model and tool configuration, retrieved context metadata, output artefacts, review actions, exception logs, and approval status. Evidence should be sufficient to reconstruct key governance decisions without breaching privacy unnecessarily.

Practical example:
For a procurement drafting run, the evidence pack might contain the template used, the policy corpus version, the model provider, review comments, and whether the output was accepted, revised, or rejected.

Link to RAIDT:
This is the core of the RAIDT evidence pack and underpins all five pillars.
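The evidence list in this answer can be turned into a simple completeness check, sketched below. The key names are hypothetical renderings of the items listed, and treating every item as required is an assumption; in practice the required set would depend on proportionality and on whether retrieval is used.

```python
REQUIRED_FIELDS = (
    "run_id", "timestamp", "task_purpose", "user_role",
    "model_and_tool_configuration", "retrieved_context_metadata",
    "output_artefacts", "review_actions", "exception_logs", "approval_status",
)


def missing_evidence(pack: dict) -> list:
    """List evidence items that are absent; an empty list such as [] counts as present."""
    missing = [f for f in REQUIRED_FIELDS if pack.get(f) is None]
    # Either the prompt text or a hash of it is acceptable, as in the answer above.
    if not (pack.get("prompt_text") or pack.get("prompt_hash")):
        missing.append("prompt_text or prompt_hash")
    return missing


pack = {name: "recorded" for name in REQUIRED_FIELDS}
pack["exception_logs"] = []  # no exceptions occurred, which is still evidence
gaps = missing_evidence(pack)
```

Accepting a prompt hash in place of the text keeps the check compatible with privacy-sensitive runs where raw prompts are not retained.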

Q11. How does this star help supervisors understand the project?

Answer:
It gives a concise answer to a common supervisory concern: what exactly is the contribution, and what is not being claimed? By making the limits explicit, the project becomes easier to position against Responsible AI, governance theory, and empirical validation literature.

Practical example:
In a supervision meeting, S11 can be used to explain that RAIDT is not claiming to solve alignment in the broad technical sense. It is claiming to make organisational GenAI uses more governable through run-level evidence.

Link to RAIDT:
This strengthens the conceptual framing for Paper 08, the testing logic for Paper 09, and the policy translation in Paper 10.

Q12. Why do future research questions strengthen rather than weaken RAIDT?

Answer:
A framework is stronger when it identifies the next questions needed for extension and validation. Future questions show that RAIDT is designed as a research programme, not a finished universal solution. This is appropriate for a PhD project that aims to build a defensible foundation and a pathway for empirical and policy development.

Practical example:
If early case studies show that evidence capture is feasible in text-based advisory tasks but difficult in multimodal settings, that result does not invalidate RAIDT. It identifies a clear research frontier.

Link to RAIDT:
This links directly to empirical validation, sector playbooks, policy alignment, and future refinements to scoring and evidence design.

Evidence needed / what to capture

Where relevant, this star implies the need to capture or reference:

Link to RAIDT project

This note connects to the wider RAIDT project in several explicit ways.

Conclusion

RAIDT needs a boundary note because the framework becomes more defensible when it is explicit about what it does and does not claim. The core contribution is not that RAIDT proves generative AI outputs are always correct or compliant. Rather, RAIDT treats the run as the unit of governance: one configured use of a GenAI system for a particular task, with its prompt, model and tool settings, retrieved context, output, and checks. From that, RAIDT produces an evidence pack and a five-pillar score profile. S11 explains that these outputs indicate governance readiness, not absolute truth or safety. It also shows why proportionality, privacy, metric overreach, and component drift matter. This is important for the PhD because it sharpens the methodological contribution in Paper 08, defines what must be tested in Paper 09, and prevents policy overclaiming in Paper 10. In short, S11 makes RAIDT more credible by specifying the framework's scope, limits, and future research path.

Slides
Slide 1 - why this star matters

Purpose:
Frame the need for explicit boundaries in the RAIDT project.

Key message:
RAIDT becomes stronger, not weaker, when it states clearly what it can and cannot claim.

Slide content:

  • Governance frameworks fail when they overclaim
  • GenAI governance needs scope discipline
  • S11 defines RAIDT's legitimate contribution
  • Boundaries support supervision, publication, and adoption

Speaker note:
Introduce S11 as the star that protects the credibility of the whole framework. Explain that supervisors, reviewers, and organisations need to know whether RAIDT is claiming to measure truth, safety, compliance, or something narrower. The answer is narrower and more practical: RAIDT governs specific uses through evidence at run level.

Visual idea:
Boundary diagram showing 'RAIDT covers' and 'RAIDT does not by itself cover'.

Link to RAIDT:
This slide positions RAIDT as a run-level evidence and scoring framework rather than a universal Responsible AI solution.

Citation support to mention if asked:
Responsible AI critique literature on principle vagueness and operationalisation gaps.

Slide 2 - what RAIDT governs

Purpose:
Define the run as the unit of governance.

Key message:
RAIDT governs one configured GenAI use in context, not an abstract model in isolation.

Slide content:

  • A run = task + time + context + configuration
  • Includes prompt, model, tools, context, output, checks
  • Output 1: run-level evidence pack
  • Output 2: five-pillar RAIDT profile

Speaker note:
Explain that the run is the practical unit where governance decisions occur. The framework captures the actual configuration and checking conditions around a use. This is what makes RAIDT operational for organisational work.

Visual idea:
Process chain from prompt and configuration to output, checks, evidence pack, and score profile.

Link to RAIDT:
This is the core definition that connects S11 to every other star, especially the pillars and evidence-pack design.

Citation support to mention if asked:
Information Systems governance literature and AI assurance documentation concepts.

Slide 3 - correctness is not governance readiness

Purpose:
Separate substantive output quality from governance quality.

Key message:
A correct answer can be poorly governed, and a well-governed run can still produce an incorrect answer.

Slide content:

  • Correctness and governance readiness are different
  • RAIDT assesses governance conditions
  • Documentation does not equal truth
  • Missing evidence weakens accountability

Speaker note:
Use a simple example such as a chatbot producing the right answer by chance despite poor logging and no review trail. Then contrast that with a documented and reviewed run that still produces a mistake. This is the central conceptual boundary of S11.

Visual idea:
Two-by-two matrix: correct/incorrect versus well-governed/poorly governed.

Link to RAIDT:
This slide protects interpretation of the five pillars, especially Auditability, Dependability, and Traceability.

Citation support to mention if asked:
AI uncertainty, assurance case, and human oversight literature.

Slide 4 - proportionality and privacy

Purpose:
Show why evidence capture must be risk-sensitive and privacy-aware.

Key message:
RAIDT requires proportionate evidence capture, not indiscriminate logging.

Slide content:

  • High-stakes runs need stronger controls
  • Low-stakes runs may justify lighter evidence
  • Raw capture may create privacy risks
  • Redaction, hashing, and access control may be necessary

Speaker note:
Explain that governance must match consequences. A brainstorming tool and a healthcare tool cannot be treated identically. Also stress that better traceability is not the same as storing everything forever. Governance includes disciplined retention and access design.

Visual idea:
Risk ladder or proportionality scale paired with a privacy-control overlay.

Link to RAIDT:
Connects directly to evidence-pack design, scoring thresholds, and governance interventions.

Citation support to mention if asked:
Data protection, proportionality, and accountability-by-design source categories.

Slide 5 - scores need careful interpretation

Purpose:
Prevent metric overreach.

Key message:
RAIDT scores are governance indicators, not total measures of safety, legality, or value.

Slide content:

  • Five pillars summarise governance conditions
  • Scores inform judgement; they do not replace it
  • High scores do not remove uncertainty
  • Metrics should trigger questions and actions

Speaker note:
Clarify that scoring is useful precisely because it structures judgement, comparison, and intervention. But if a strong profile is treated as proof of safety or compliance, the framework is being misused. This is one of the most important supervisory and policy messages.

Visual idea:
Five-pillar score card with warning label: 'indicator, not proof'.

Link to RAIDT:
This slide interprets the RAIDT profile responsibly and links scoring to governance decision-making.

Citation support to mention if asked:
Measurement critique, governance metrics, and assurance framework literature.

Slide 6 - drift and feasibility challenges

Purpose:
Explain why run-level evidence must be time-sensitive and operationally realistic.

Key message:
Model changes, retrieval updates, and workflow shifts mean that governance evidence can degrade over time.

Slide content:

  • Component drift changes run conditions
  • Vendor opacity limits visibility
  • Evidence capture must remain feasible in practice
  • Re-scoring and review cycles may be needed

Speaker note:
Discuss how apparently stable systems can change because a provider updates a model, a retrieval corpus changes, or human review behaviour shifts. RAIDT can expose some of this through logging and time stamps, but not all of it. That is why empirical validation matters.

Visual idea:
Timeline showing repeated runs with changing components and changing evidence quality.

Link to RAIDT:
Supports Paper 09 testing logic and the practical design of review and escalation mechanisms.

Citation support to mention if asked:
MLOps drift, AI lifecycle governance, and audit trail literature.

Slide 7 - future questions and research extensions

Purpose:
Show how S11 opens a forward-looking research agenda.

Key message:
Multimodal and agentic AI extend RAIDT's relevance but also require evidence-model development.

Slide content:

  • Multimodal runs complicate capture and interpretation
  • Agentic runs create longer action chains
  • Sector playbooks will likely be needed
  • Future work should extend and test RAIDT carefully

Speaker note:
Explain that future questions are not weaknesses in the project design. They are evidence that RAIDT is a research programme with a structured extension path. The run remains useful, but the evidence model will need elaboration for more complex systems.

Visual idea:
Extension map from text-based runs to multimodal and agentic variants.

Link to RAIDT:
Connects to future scoring refinements, evidence-pack expansion, and sector-specific governance pathways.

Citation support to mention if asked:
Agentic AI governance, multimodal assurance, and sector-specific AI governance guidance.

Slide 8 - why S11 matters across the project

Purpose:
Close by linking this star to the three-paper architecture and implementation pathway.

Key message:
S11 makes RAIDT more defensible in theory, more testable in practice, and more credible in policy discussion.

Slide content:

  • Paper 08: defines conceptual scope
  • Paper 09: defines what must be validated
  • Paper 10: prevents policy overclaiming
  • Sector playbooks translate scope into practice

Speaker note:
Finish by showing that S11 is not a peripheral note. It underpins the whole project. It tells supervisors what the framework contributes, tells researchers what to validate, and tells practitioners and policymakers how to use RAIDT without misreading it.

Visual idea:
Three-paper architecture diagram with S11 as a boundary frame around all three.

Link to RAIDT:
Integrates RAIDT evidence packs, scoring, interventions, and policy alignment into one defensible project narrative.

Citation support to mention if asked:
Framework design, validation methodology, and policy translation source categories.
