S11.09 — Future extension: multimodal AI

flowchart LR
    A["Text-centred governance assumptions<br/>prompt-output records only"] --> B["RAIDT<br/>run-level evidence framework"]
    H["Practical multimodal fields<br/>file hashes, transcripts, OCR, tool logs, review notes"] --> C[["Future extension: multimodal AI<br/>modality-aware evidence expansion"]]
    B --> C
    C --> D["Evidence pack<br/>multimodal artefacts and provenance"]
    C --> E["RAIDT score profile<br/>pillar-based judgement"]
    D --> F["Reviewer reconstruction<br/>contestability across media"]
    E --> G["Governance readiness<br/>auditability and organisational learning"]

← Star S11 - Boundaries, Limitations and Future Questions

Star context: Marks an important future direction for RAIDT by showing that the framework should not be treated as text-only; instead, its run-level logic can be extended to image, audio, video, document, and tool-mediated evidence while still preserving clear governance boundaries.

Academic picture

Definition / background

Multimodal AI refers to systems that can process, generate, or combine more than one form of data or interaction, such as text, images, audio, video, documents, and tool-mediated actions. In practice, multimodal use is increasingly common in organisational work: users upload screenshots, dictate speech, analyse scanned PDFs, review charts, inspect photographs, or combine textual instructions with visual material in the same workflow.

Within RAIDT, this item names a future extension rather than a fully solved component. The conceptual origin is straightforward: if RAIDT treats the run as the unit of governance, then any shift in the nature of the run changes the nature of the evidence required. A text-only run can often be reconstructed from prompt, context, output, and review notes. A multimodal run may additionally require provenance of media files, intermediate transformations such as OCR or transcription, compression or editing history, and modality-specific review criteria.

This matters in GenAI governance because multimodal systems create new evidential ambiguity. A generated image may depend on uploaded reference images; an audio summary may depend on a speech-to-text layer; a document understanding pipeline may combine OCR, retrieval, and model inference before producing an answer. If these stages are not visible, governance claims become weaker even if the final output looks plausible.
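One way to make such intermediate stages visible is to log each transformation as a record that links the content hash of its input to the content hash of its output. The following is a minimal sketch under stated assumptions: the names `TransformationStep`, `record_step`, and `chain_is_connected` are hypothetical, SHA-256 is an illustrative choice of hash, and the component names and byte strings are placeholders. None of this is a RAIDT-defined schema.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class TransformationStep:
    """One visible stage in a multimodal pipeline (hypothetical schema)."""
    name: str            # e.g. "ocr", "speech_to_text", "model_inference"
    component: str       # tool or model identifier as recorded by the organisation
    input_sha256: str
    output_sha256: str


def record_step(name: str, component: str,
                input_data: bytes, output_data: bytes) -> TransformationStep:
    """Create a step record from the actual bytes consumed and produced."""
    return TransformationStep(name, component,
                              sha256_hex(input_data), sha256_hex(output_data))


def chain_is_connected(steps: list[TransformationStep]) -> bool:
    """True if every step consumed exactly what the previous step produced."""
    return all(prev.output_sha256 == nxt.input_sha256
               for prev, nxt in zip(steps, steps[1:]))


# Placeholder data for illustration only, not real pipeline output
scan = b"<bytes of scanned PDF>"
ocr_text = b"<text produced by OCR>"
answer = b"<generated answer>"

steps = [
    record_step("ocr", "example-ocr-engine", scan, ocr_text),
    record_step("model_inference", "example-model", ocr_text, answer),
]
print(chain_is_connected(steps))
```

The linear chain is a simplification. A pipeline that combines OCR with retrieval has steps with several inputs, which would need a graph of records rather than a list.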

The item belongs inside RAIDT because RAIDT's core logic remains applicable: the run still matters, the evidence pack still matters, and the five-pillar score profile still matters. What changes is the evidence schema and review method. This item therefore sits at the boundary between current RAIDT capability and future methodological development.

Why this concept matters

This concept matters because organisations are unlikely to remain within purely text-based GenAI use. As soon as work practices involve screenshots, scanned case files, recorded calls, diagrams, photographs, or mixed-media reporting, the governance challenge shifts. Without a multimodal extension, a framework may be conceptually sound yet operationally incomplete.

It also prevents a common confusion: the belief that multimodal AI is merely a stronger model capability rather than a governance complication. In governance terms, multimodality is not only about what the system can do. It is about what must be captured, interpreted, reviewed, and challenged if a run is later questioned.

If this item is ignored, organisations may continue using RAIDT-like language while silently dropping important evidence. That creates false assurance. A run may appear documented because the prompt and final answer are preserved, yet the decisive input could have been an uploaded image, a noisy transcript, a redacted document, or a tool action that is no longer reconstructable.
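The false-assurance risk can be illustrated with a small completeness check. It flags a run that preserved the prompt and final answer but not the evidence for its media inputs. This is a sketch only: the `REQUIRED_EVIDENCE` mapping, its field names, and the `missing_evidence` function are hypothetical assumptions, and an organisation would define its own requirements per modality.

```python
# Hypothetical mapping from input modality to the evidence a reviewer would need.
# Field names are illustrative assumptions, not a RAIDT-defined schema.
REQUIRED_EVIDENCE = {
    "text": {"prompt", "output"},
    "image": {"file_sha256"},
    "audio": {"file_sha256", "transcript"},
    "scanned_document": {"file_sha256", "ocr_output"},
}


def missing_evidence(modalities: set[str],
                     captured_fields: set[str]) -> dict[str, set[str]]:
    """Return, per modality used in the run, the required fields not captured."""
    gaps: dict[str, set[str]] = {}
    for modality in sorted(modalities):
        if modality not in REQUIRED_EVIDENCE:
            # An unknown modality is reported rather than silently passed.
            gaps[modality] = {"<no evidence schema defined>"}
            continue
        missing = REQUIRED_EVIDENCE[modality] - captured_fields
        if missing:
            gaps[modality] = missing
    return gaps


# A run that looks documented: prompt and output exist, the audio evidence does not.
run_modalities = {"text", "audio"}
captured = {"prompt", "output"}
gaps = missing_evidence(run_modalities, captured)
print(gaps)
```

An empty result means every modality used in the run has its required fields. It says nothing about the quality of the captured evidence, which still needs human review.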

Key idea: Multimodal AI matters for RAIDT because the framework can scale beyond text only if run-level evidence is extended to capture modality-specific provenance, transformations, and review points.

What this item enables

Extension of RAIDT from text-centred runs to runs involving images, audio, video, documents, and mixed-media inputs.
Identification of additional evidence fields needed for multimodal reconstruction, such as file provenance, transcript versions, OCR outputs, and annotation history.
Stronger review of where errors may arise within a multimodal pipeline rather than only in the final generated answer.
More defensible evidence packs for tasks where meaning depends on visual, auditory, or document-based material.
Better scoring discipline across the five pillars when multimodal risk is present.
A clearer boundary between RAIDT's current baseline formulation and future implementation work needed for broader deployment.
A pathway for sector playbooks to specify modality-sensitive controls without abandoning the run-level governance model.

Practical example / likely audience question

Audience question

Does RAIDT scale to multimodal systems, or is it really a framework designed only for text prompts and text outputs?

Answer

The concern behind this question is that multimodal systems appear to change the governance object so much that a text-oriented framework may no longer fit. The direct answer is that RAIDT does scale conceptually, because its central claim is not that prompts are textual, but that one configured use event should be governable at the level of the run. That logic survives the move to multimodal AI.

What changes is the evidence model. If a user uploads a photograph, a PDF, or an audio clip, the run cannot be governed adequately by recording only the final prompt and final answer. A reviewer may need to know what file was provided, what preprocessing occurred, whether OCR or transcription introduced distortions, which model components were involved, and how the human reviewer checked the result against the original media.

For example, suppose a public-sector worker uses a multimodal assistant to summarise a scanned application form and an attached voice note from a claimant. A generic AI governance approach might record that a model was approved and that staff were trained. RAIDT handles the issue better because it asks whether the specific run can be reconstructed: what documents were uploaded, what transcript was generated, what summary was produced, who checked it, and what evidence supports the score assigned to that run. In that sense, RAIDT is extensible, but only if evidence capture becomes modality-aware.

Practical example in RAIDT terms

Consider a public services setting in which a caseworker uses a multimodal GenAI assistant to process a benefits claim containing scanned identity documents, handwritten notes, and a short audio explanation recorded by the claimant. The use case is efficient and plausible, but the run-level issue is whether the system interpreted each medium accurately and whether the caseworker can justify the resulting summary or recommendation.

The evidence needed would include the uploaded file identifiers or hashes, document versions, OCR output, transcript output, prompt or instruction template, model and preprocessing components used, generated summary, human edits, final decision note, and the review record showing whether the original materials were checked when uncertainty arose.

Responsibility is affected because the caseworker or reviewer must remain accountable for relying on the multimodal summary. Auditability is affected because another reviewer must be able to trace how the summary emerged from several media sources. Interpretability is affected because explanation must include not only prompt logic but also transformations such as transcription and OCR. Dependability is affected because transcription quality, image quality, and pipeline consistency influence reliability. Traceability is affected because the run must connect every media artefact, transformation step, and final decision.
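The evidence items named in this example could be gathered into a single run record. The sketch below shows one possible shape, assuming hypothetical `Artefact` and `RunRecord` classes, SHA-256 content hashes, and JSON as the storage format. All identifiers and values are placeholders, not a prescribed RAIDT evidence-pack format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Artefact:
    artefact_id: str
    kind: str                           # e.g. "scanned_document", "ocr_output", "audio", "transcript"
    sha256: str                         # content hash of the stored file or text
    derived_from: Optional[str] = None  # artefact_id of the source, for transformation outputs


@dataclass
class RunRecord:
    run_id: str
    instruction_template: str
    components: list[str]               # model and preprocessing components used
    artefacts: list[Artefact] = field(default_factory=list)
    generated_summary: str = ""
    human_edits: str = ""
    decision_note: str = ""
    reviewer: str = ""
    originals_checked: bool = False     # were the original materials consulted?

    def add_artefact(self, artefact_id: str, kind: str, content: bytes,
                     derived_from: Optional[str] = None) -> None:
        """Hash the content and attach it to the run, keeping the source link."""
        digest = hashlib.sha256(content).hexdigest()
        self.artefacts.append(Artefact(artefact_id, kind, digest, derived_from))


# Placeholder values for illustration only
record = RunRecord(
    run_id="run-0001",
    instruction_template="Summarise the claim documents and voice note.",
    components=["example-ocr", "example-speech-to-text", "example-model"],
)
record.add_artefact("doc-1", "scanned_document", b"<scan bytes>")
record.add_artefact("ocr-1", "ocr_output", b"<ocr text>", derived_from="doc-1")
record.add_artefact("aud-1", "audio", b"<audio bytes>")
record.add_artefact("tr-1", "transcript", b"<transcript text>", derived_from="aud-1")
record.generated_summary = "<generated summary>"
record.reviewer = "reviewer-01"
record.originals_checked = True

pack_json = json.dumps(asdict(record), indent=2)
print(pack_json)
```

The `derived_from` links are what let a second reviewer trace the summary back through the OCR output and transcript to the original media.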

This improves governance readiness because the organisation can evaluate a disputed case using actual evidence rather than speculation. It becomes possible to see whether the problem arose from poor image quality, a faulty transcript, ambiguous prompt instructions, weak human checking, or an inappropriate reliance on the generated summary.

Detailed link to RAIDT

Future extension: multimodal AI links to RAIDT in four ways.

First, it reinforces the RAIDT core idea that governance should attach to real organisational use rather than to abstract system claims, even when that use spans several media types.

Second, it extends the run-level perspective by showing that a run may include multiple inputs, transformations, and artefacts beyond text, all of which may need evidential capture.

Third, it enlarges the evidence pack and affects the score profile, because multimodal runs require additional fields, checks, and judgement criteria to support defensible scoring.

Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning by making it possible to inspect where multimodal failure, ambiguity, or misuse entered the process.

Future extension: multimodal AI → Modality-aware run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

Link to the five RAIDT pillars

Responsibility

Multimodal AI raises Responsibility questions because users may trust generated interpretations of media that they have not fully reviewed themselves. RAIDT keeps responsibility with the organisational actor and therefore requires evidence about who uploaded, checked, approved, or relied upon multimodal outputs.
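The evidence described here, namely who uploaded, checked, approved, or relied upon an output, can be sketched as a list of attributed events with a check for roles that nobody is recorded against. The role labels follow the paragraph above. The class name, function name, actor identifier, and timestamps are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Role labels follow the text; the identifiers themselves are illustrative.
RESPONSIBILITY_ROLES = ("uploaded", "checked", "approved", "relied_upon")


@dataclass(frozen=True)
class ResponsibilityEvent:
    role: str        # one of RESPONSIBILITY_ROLES
    actor: str       # organisational identifier of the person
    at: datetime     # when the action was recorded


def unassigned_roles(events: list[ResponsibilityEvent]) -> list[str]:
    """Return the roles for which no actor has been recorded on this run."""
    recorded = {event.role for event in events}
    return [role for role in RESPONSIBILITY_ROLES if role not in recorded]


# Placeholder events for illustration only
events = [
    ResponsibilityEvent("uploaded", "caseworker-01",
                        datetime(2025, 1, 6, 9, 30, tzinfo=timezone.utc)),
    ResponsibilityEvent("checked", "caseworker-01",
                        datetime(2025, 1, 6, 9, 45, tzinfo=timezone.utc)),
]
print(unassigned_roles(events))
```

A non-empty result does not show that something went wrong. It shows that the run cannot yet demonstrate who carried that part of the responsibility.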