S11.09 — Future extension: multimodal AI
flowchart LR
A[Text-centred governance assumptions<br/>prompt-output records only] --> B[RAIDT<br/>run-level evidence framework]
H[Practical multimodal fields<br/>file hashes, transcripts, OCR, tool logs, review notes] --> C[[Future extension: multimodal AI<br/>modality-aware evidence expansion]]
B --> C
C --> D[Evidence pack<br/>multimodal artefacts and provenance]
C --> E[RAIDT score profile<br/>pillar-based judgement]
D --> F[Reviewer reconstruction<br/>contestability across media]
E --> G[Governance readiness<br/>auditability and organisational learning]
Star: S11 - Boundaries, Limitations and Future Questions
Star context: Marks an important future direction for RAIDT by showing that the framework should not be treated as text-only; instead, its run-level logic can be extended to image, audio, video, document, and tool-mediated evidence while still preserving clear governance boundaries.
Academic picture
Definition / background
Multimodal AI refers to systems that can process, generate, or combine more than one form of data or interaction, such as text, images, audio, video, documents, and tool-mediated actions. In practice, multimodal use is increasingly common in organisational work: users upload screenshots, dictate speech, analyse scanned PDFs, review charts, inspect photographs, or combine textual instructions with visual material in the same workflow.
Within RAIDT, this item names a future extension rather than a fully solved component. The conceptual origin is straightforward: if RAIDT treats the run as the unit of governance, then any shift in the nature of the run changes the nature of the evidence required. A text-only run can often be reconstructed from prompt, context, output, and review notes. A multimodal run may additionally require provenance of media files, intermediate transformations such as OCR or transcription, compression or editing history, and modality-specific review criteria.
This matters in GenAI governance because multimodal systems create new evidential ambiguity. A generated image may depend on uploaded reference images; an audio summary may depend on a speech-to-text layer; a document understanding pipeline may combine OCR, retrieval, and model inference before producing an answer. If these stages are not visible, governance claims become weaker even if the final output looks plausible.
The item belongs inside RAIDT because RAIDT's core logic remains applicable: the run still matters, the evidence pack still matters, and the five-pillar score profile still matters. What changes is the evidence schema and review method. This item therefore sits at the boundary between current RAIDT capability and future methodological development.
Why this concept matters
This concept matters because organisations are unlikely to remain within purely text-based GenAI use. As soon as work practices involve screenshots, scanned case files, recorded calls, diagrams, photographs, or mixed-media reporting, the governance challenge shifts. Without a multimodal extension, a framework may appear conceptually sound but operationally incomplete.
It also prevents a common confusion: the belief that multimodal AI is merely a stronger model capability rather than a governance complication. In governance terms, multimodality is not only about what the system can do. It is about what must be captured, interpreted, reviewed, and challenged if a run is later questioned.
If this item is ignored, organisations may continue using RAIDT-like language while silently dropping important evidence. That creates false assurance. A run may appear documented because the prompt and final answer are preserved, yet the decisive input could have been an uploaded image, a noisy transcript, a redacted document, or a tool action that is no longer reconstructable.
Key idea: Multimodal AI matters for RAIDT because the framework can scale beyond text only if run-level evidence is extended to capture modality-specific provenance, transformations, and review points.
What this item enables
- Extension of RAIDT from text-centred runs to runs involving images, audio, video, documents, and mixed-media inputs.
- Identification of additional evidence fields needed for multimodal reconstruction, such as file provenance, transcript versions, OCR outputs, and annotation history.
- Stronger review of where errors may arise within a multimodal pipeline rather than only in the final generated answer.
- More defensible evidence packs for tasks where meaning depends on visual, auditory, or document-based material.
- Better scoring discipline across the five pillars when multimodal risk is present.
- A clearer boundary between RAIDT's current baseline formulation and future implementation work needed for broader deployment.
- A pathway for sector playbooks to specify modality-sensitive controls without abandoning the run-level governance model.
Practical example / likely audience question
Audience question
Does RAIDT scale to multimodal systems, or is it really a framework designed only for text prompts and text outputs?
Answer
The concern behind this question is that multimodal systems appear to change the governance object so much that a text-oriented framework may no longer fit. The direct answer is that RAIDT does scale conceptually, because its central claim is not that prompts are textual, but that one configured use event should be governable at the level of the run. That logic survives the move to multimodal AI.
What changes is the evidence model. If a user uploads a photograph, a PDF, or an audio clip, the run cannot be governed adequately by recording only the final prompt and final answer. A reviewer may need to know what file was provided, what preprocessing occurred, whether OCR or transcription introduced distortions, which model components were involved, and how the human reviewer checked the result against the original media.
For example, suppose a public-sector worker uses a multimodal assistant to summarise a scanned application form and an attached voice note from a claimant. A generic AI governance approach might record that a model was approved and that staff were trained. RAIDT handles the issue better because it asks whether the specific run can be reconstructed: what documents were uploaded, what transcript was generated, what summary was produced, who checked it, and what evidence supports the score assigned to that run. In that sense, RAIDT is extensible, but only if evidence capture becomes modality-aware.
Practical example in RAIDT terms
Consider a public services setting in which a caseworker uses a multimodal GenAI assistant to process a benefits claim containing scanned identity documents, handwritten notes, and a short audio explanation recorded by the claimant. The use case is efficient and plausible, but the run-level issue is whether the system interpreted each medium accurately and whether the caseworker can justify the resulting summary or recommendation.
The evidence needed would include the uploaded file identifiers or hashes, document versions, OCR output, transcript output, prompt or instruction template, model and preprocessing components used, generated summary, human edits, final decision note, and the review record showing whether the original materials were checked when uncertainty arose.
Each pillar is affected:
- Responsibility is affected because the caseworker or reviewer must remain accountable for relying on the multimodal summary.
- Auditability is affected because another reviewer must be able to trace how the summary emerged from several media sources.
- Interpretability is affected because explanation must include not only prompt logic but also transformations such as transcription and OCR.
- Dependability is affected because transcription quality, image quality, and pipeline consistency influence reliability.
- Traceability is affected because the run must connect every media artefact, transformation step, and final decision.
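The evidence fields named in this example can be sketched as a simple record structure. The following is a minimal illustration, not a validated RAIDT schema: every class name, field name, and identifier is a hypothetical choice made for this sketch, and the byte strings stand in for real uploaded files.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def sha256_hex(data: bytes) -> str:
    """Return a SHA-256 hex digest usable as a stable file identifier."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class MediaArtefact:
    artefact_id: str   # hypothetical local identifier
    modality: str      # e.g. "scanned_document", "audio"
    sha256: str        # hash of the uploaded bytes
    version: str = "1"


@dataclass
class Transformation:
    step: str          # e.g. "ocr", "transcription"
    component: str     # preprocessing component name and version
    input_id: str      # artefact_id the step was applied to
    output_text: str   # OCR or transcript output kept for review


@dataclass
class MultimodalRunRecord:
    run_id: str
    prompt_template: str
    model: str
    artefacts: List[MediaArtefact] = field(default_factory=list)
    transformations: List[Transformation] = field(default_factory=list)
    generated_summary: str = ""
    human_edits: str = ""
    final_decision_note: str = ""
    reviewer: Optional[str] = None
    originals_checked: bool = False  # were the source media consulted?


# Illustrative usage with placeholder bytes instead of real files.
scan_bytes = b"placeholder scanned form"
record = MultimodalRunRecord(
    run_id="run-001",
    prompt_template="Summarise the attached claim materials.",
    model="example-multimodal-model",
)
record.artefacts.append(
    MediaArtefact("doc-1", "scanned_document", sha256_hex(scan_bytes))
)
record.transformations.append(
    Transformation("ocr", "example-ocr-1.0", "doc-1", "text read from the scan")
)
```

Hashing the uploaded bytes gives a reviewer a way to confirm later that the file in storage is the file the run actually used, without requiring the record itself to hold sensitive media.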
This improves governance readiness because the organisation can evaluate a disputed case using actual evidence rather than speculation. It becomes possible to see whether the problem arose from poor image quality, a faulty transcript, ambiguous prompt instructions, weak human checking, or an inappropriate reliance on the generated summary.
Detailed link to RAIDT
This item links to RAIDT in four ways.
First, it reinforces the RAIDT core idea that governance should attach to real organisational use rather than to abstract system claims, even when that use spans several media types.
Second, it extends the run-level perspective by showing that a run may include multiple inputs, transformations, and artefacts beyond text, all of which may need evidential capture.
Third, it enlarges the evidence pack and affects the score profile, because multimodal runs require additional fields, checks, and judgement criteria to support defensible scoring.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning by making it possible to inspect where multimodal failure, ambiguity, or misuse entered the process.
Future extension: multimodal AI → Modality-aware run-level evidence → Evidence pack → RAIDT score profile → Governance readiness
Link to the five RAIDT pillars
Responsibility
Multimodal AI raises Responsibility questions because users may trust generated interpretations of media that they have not fully reviewed themselves. RAIDT keeps responsibility with the organisational actor and therefore requires evidence about who uploaded, checked, approved, or relied upon multimodal outputs.
Example evidence / implication:
- Named reviewer for image-, audio-, or document-based outputs before downstream use.
- Record of escalation when media quality, ambiguity, or missing provenance made the run unsuitable for routine acceptance.
Auditability
This item has a strong effect on Auditability because multimodal runs are harder to reconstruct than text-only runs unless intermediate artefacts and preprocessing steps are preserved.
Example evidence / implication:
- Stored references to original files, transcript versions, OCR layers, and generated outputs.
- Review notes showing how a later auditor could follow the route from media input to final organisational action.
Interpretability
Interpretability becomes more demanding in multimodal settings because reviewers must understand not only what the model answered, but how different media were interpreted and combined.
Example evidence / implication:
- Explanation of which modality carried the key information for the run.
- Notes on whether OCR, transcription, captioning, or image analysis affected the meaning of the final output.
Dependability
Dependability is strongly affected because multimodal performance can vary with image quality, audio noise, document structure, and pipeline configuration. Reliable governance therefore depends on evidence of conditions of use.
Example evidence / implication:
- Quality checks on source media or confidence thresholds for preprocessing stages.
- Records of recurring failure modes such as misread handwriting, poor transcription, or incorrect visual interpretation.
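A confidence threshold for preprocessing stages could be applied as a simple gate before a run is accepted routinely. The sketch below is an assumption-laden illustration: the stage names and threshold values are hypothetical, and it presumes that each preprocessing tool reports a confidence score between 0 and 1, which not every tool does.

```python
from typing import Dict, List, Optional

# Hypothetical minimum confidence per preprocessing stage.
THRESHOLDS: Dict[str, float] = {"ocr": 0.90, "transcription": 0.85}


def stages_needing_review(
    confidences: Dict[str, Optional[float]],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return the stages of a run that should be escalated to a human.

    A stage is flagged when it has no configured threshold, when the tool
    reported no confidence (None), or when the score is below the threshold.
    """
    flagged: List[str] = []
    for stage, score in confidences.items():
        minimum = thresholds.get(stage)
        if minimum is None or score is None or score < minimum:
            flagged.append(stage)
    return flagged


# Illustrative run: clean OCR, noisy audio, and a captioning step that
# reports no confidence at all.
print(stages_needing_review(
    {"ocr": 0.95, "transcription": 0.70, "captioning": None}
))
```

Treating a missing score as a reason for review, rather than as a pass, reflects the point that preprocessing layers can fail silently when nothing about them is recorded.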
Traceability
Traceability is central because multimodal governance depends on linking every relevant artefact and transformation back to the run. Without that chain, the evidence pack becomes incomplete.
Example evidence / implication:
- Timestamps, file identifiers, and component versions connecting source media to output and review actions.
- Clear mapping from original artefacts through intermediate transformations to the final decision or product.
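The mapping from original artefacts to the final decision can be checked mechanically if each transformation is recorded as a link between two identifiers. The following sketch assumes such links exist as (input, output) pairs; the function name and all identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def unlinked_artefacts(
    sources: List[str],
    links: List[Tuple[str, str]],
    final_id: str,
) -> List[str]:
    """Return source artefacts with no recorded path to the final decision."""
    graph: Dict[str, List[str]] = {}
    for parent, child in links:
        graph.setdefault(parent, []).append(child)

    def reaches_final(start: str) -> bool:
        # Depth-first walk over the recorded transformation links.
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node == final_id:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
        return False

    return [s for s in sources if not reaches_final(s)]


# Illustrative run: the photograph was uploaded, but no step linking it
# to the summary was recorded.
links = [
    ("scan.pdf", "ocr-1"),
    ("voice.wav", "transcript-1"),
    ("ocr-1", "summary-1"),
    ("transcript-1", "summary-1"),
    ("summary-1", "decision-1"),
]
print(unlinked_artefacts(["scan.pdf", "voice.wav", "photo.jpg"], links, "decision-1"))
```

An artefact returned by this check is not necessarily irrelevant to the decision; it only means that the recorded chain does not show how it was used, which is the gap a reviewer would need to question.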
This item affects all five pillars, but its strongest immediate effects are on Auditability, Dependability, and Traceability, because a multimodal run becomes hard to reconstruct or verify as soon as provenance and transformation history are missing.
Why this item is more than a generic concept
In general AI governance, multimodal AI often means a technical trend: systems that can accept or generate more than one type of content. In RAIDT, the meaning is more operational. It refers to a future extension of the evidence framework so that multimodal runs can still be reconstructed, reviewed, scored, and challenged.
That makes the RAIDT meaning narrower and more useful. The important question is not simply whether a model is multimodal. The important question is whether an organisation can produce run-level evidence for a multimodal use event in a way that supports an evidence pack, a five-pillar score profile, and governance readiness.
Common misunderstanding
Misunderstanding
If a multimodal system keeps the prompt and final answer, RAIDT already has enough evidence.
Correction
That is only true in very simple cases. In many multimodal runs, the decisive evidence lies outside the visible prompt-response pair. For example, an assistant may summarise a photographed whiteboard and an uploaded meeting recording. If the final text summary is contested, governance cannot rely only on the text output. A reviewer may need the original image, the audio file, the transcript version, any preprocessing steps, and the human checking record. RAIDT therefore requires more than a saved prompt when multimodal interpretation shapes the output.
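The gap described in this correction can be made visible with a simple completeness check. The sketch below is illustrative only: the field names and the per-modality requirements are hypothetical placeholders, not a validated RAIDT schema, and a real schema would track one original file per artefact rather than a single shared field.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical evidence fields a reviewer would expect for each modality.
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "text": {"prompt", "output"},
    "image": {"prompt", "output", "original_file",
              "preprocessing_steps", "human_check"},
    "audio": {"prompt", "output", "original_file", "transcript_version",
              "preprocessing_steps", "human_check"},
}


def missing_evidence(modalities: Iterable[str], captured: Iterable[str]) -> List[str]:
    """List the expected evidence fields that were not captured for a run."""
    required: Set[str] = set()
    for modality in modalities:
        # An unknown modality raises KeyError rather than passing silently.
        required |= REQUIRED_FIELDS[modality]
    return sorted(required - set(captured))


# Whiteboard photograph plus meeting recording, with only the prompt and
# the final answer saved.
print(missing_evidence(["image", "audio"], ["prompt", "output"]))
# -> ['human_check', 'original_file', 'preprocessing_steps', 'transcript_version']
```

For a text-only run the same saved pair would pass the check, which is the sense in which a prompt and final answer are sufficient only in very simple cases.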
Boundary and limitation
This item does not mean that RAIDT already solves multimodal governance in full. It does not by itself provide validated schemas for every medium, guarantee authenticity of images or recordings, detect deepfakes automatically, or remove legal and ethical issues around privacy, consent, copyright, or retention. It is a future extension, not a claim of complete coverage.
It may also be difficult to implement proportionately. Multimodal evidence can be large, sensitive, and technically messy. Organisations may struggle with storage, retention, redaction, and reviewer workload. Some artefacts may be unavailable because third-party tools do not expose intermediate steps.
RAIDT handles this limitation by being explicit about scope. The framework's conceptual backbone remains stable, but multimodal deployment requires modality-specific evidence design, sector-specific controls, and careful decisions about proportional capture. The value of this item is that it identifies the extension path without pretending that the problem is already closed.
Implementation levels
Manual implementation
A researcher or small team can apply this item manually by extending the run template for multimodal cases. This would include recording what media were used, how they were obtained, what transformations occurred, what the model produced, and how a human reviewer checked the result against the original artefacts.
Semi-automated implementation
Semi-automated implementation can attach structured metadata to uploads and processing steps. Templates or forms can prompt users to log file type, provenance, OCR or transcript outputs, reviewer checks, and reasons for accepting or rejecting multimodal outputs.
Fully automated implementation
At scale, a wrapper, orchestration layer, or governance platform can capture media identifiers, preprocessing logs, model versions, transformation chains, reviewer interventions, and storage references automatically. The system can then assemble a modality-aware evidence pack and feed scoring inputs into a RAIDT dashboard or governance pipeline.
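Automatic capture of this kind can be sketched as a thin wrapper around each preprocessing step. The example below is a minimal illustration under stated assumptions: the decorator, the in-memory log, and the stub transcriber are all hypothetical, and a real deployment would write to durable, access-controlled storage rather than a Python list.

```python
import hashlib
from datetime import datetime, timezone
from functools import wraps
from typing import Callable, Dict, List

# Hypothetical in-memory evidence log standing in for a governance store.
EVIDENCE_LOG: List[Dict[str, str]] = []


def captured_step(step: str, component: str) -> Callable:
    """Decorator that logs one preprocessing step each time it runs."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        @wraps(func)
        def wrapper(data: bytes) -> str:
            result = func(data)
            EVIDENCE_LOG.append({
                "step": step,
                "component": component,
                "input_sha256": hashlib.sha256(data).hexdigest(),
                "output_sha256": hashlib.sha256(result.encode("utf-8")).hexdigest(),
                "logged_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@captured_step(step="transcription", component="stub-transcriber-0.1")
def transcribe(audio: bytes) -> str:
    # Placeholder standing in for a real speech-to-text call.
    return f"[transcript of {len(audio)} bytes]"


transcript = transcribe(b"placeholder audio")
print(EVIDENCE_LOG[-1]["step"], EVIDENCE_LOG[-1]["component"])
```

Logging hashes of inputs and outputs rather than the content itself keeps the log small and lets storage, retention, and redaction of the underlying media be decided separately.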
Practical use in the RAIDT project
In Paper 08 Foundations, this item helps show that RAIDT is conceptually extensible without losing its central commitment to run-level evidence. It clarifies that the framework is not restricted in principle to text use cases, but that additional methodological work is required when evidence becomes multimodal.
In Paper 09 Empirical Validation, this item can guide future study design by identifying what should be tested once multimodal cases are included: feasibility of capture, reviewer burden, scoring consistency, and the effect of modality-specific fields on governance quality. It also helps explain negative findings if certain multimodal evidence cannot yet be captured reliably.
In Paper 10 Policy Pathways and in sector playbooks, this item provides a bridge from abstract future-facing claims to implementable governance design. It can support discussion of evidence-pack templates, scoring-rubric extensions, governance interventions, and influence methods for organisations adopting richer media workflows.
For supervision, viva defence, and journal positioning, it is useful because it allows a difficult question to be answered precisely: RAIDT is extensible to multimodal AI, but only through disciplined expansion of run-level evidence rather than by assuming that text-era governance records remain sufficient.
Key audience questions to prepare for
Q1. Is multimodal AI a separate governance problem or just a more advanced model capability?
It is both, but RAIDT focuses on the governance consequence. The capability matters because it changes the evidence needed to reconstruct and review a run.
Q2. Does this item imply that RAIDT already supports multimodal deployments fully?
No. It identifies a future extension path. The framework logic is stable, but the evidence schema, review criteria, and implementation tooling still need development for different media types.
Q3. Why is multimodal evidence harder than text evidence?
Because meaning may depend on source quality, preprocessing, intermediate transformations, and cross-modal interpretation. Those layers can fail silently if they are not recorded.
Q4. Would a generic audit trail solve the problem?
Not usually. A generic audit trail may capture access events or timestamps, but RAIDT needs governance-relevant evidence that connects media provenance, transformations, outputs, human checks, and scoring justification.
Q5. What is the main PhD-level contribution of including this item?
It shows that the thesis does not overclaim for RAIDT. The thesis can acknowledge a realistic extension path while clearly identifying the additional evidence design work required to govern future multimodal use.
Suggested citation concepts to support this item
- multimodal AI governance and accountability
- provenance and traceability for multimodal machine learning systems
- OCR and speech-to-text error propagation in decision support workflows
- auditability of image, audio, and document AI pipelines
- human oversight in multimodal generative AI systems
- sociotechnical evidence capture for AI-assisted organisational work
- governance of multimodal foundation models in public sector settings
- explainability and reviewability in cross-modal AI systems
- data lineage and artefact tracking in AI orchestration pipelines
Short explanation for presentation
Future extension: multimodal AI shows where RAIDT can develop next without overstating what it already solves. The framework is built around the run as the unit of governance, and that logic still holds when a run involves images, audio, scanned documents, video, or tool actions rather than text alone. The challenge is that multimodal runs need richer evidence. It is no longer enough to save only a prompt and output; governance may also require file provenance, OCR or transcript layers, preprocessing history, reviewer checks, and modality-specific quality controls. This item therefore matters because it protects the integrity of RAIDT's central claim. RAIDT can extend to multimodal use, but only by expanding run-level evidence, evidence packs, and scoring criteria in a disciplined and reviewable way.
One-line takeaway
Future extension: multimodal AI is RAIDT's pathway for governing image, audio, video, document, and mixed-media runs because multimodal use remains governable only when run-level evidence expands beyond text-only records.