S8.05 - Monitoring

flowchart LR
    A["One-off approval logic<br/>Unobserved drift<br/>Hidden recurring errors"] --> B["RAIDT<br/>Run-level evidence framework"]
    B --> C["Monitoring<br/>Observe scores, evidence completeness, drift, and configuration change over time"]
    H["Public services<br/>Healthcare<br/>Finance<br/>Dashboards<br/>Logging systems"] --> C
    C --> D["Evidence pack<br/>Trend visibility"]
    C --> E["RAIDT score profile<br/>Movement across five pillars"]
    C --> I["Reviewer reconstruction<br/>What changed, when, and why"]
    D --> F["Reviewability<br/>Contestability"]
    E --> G["Governance readiness<br/>Corrective action<br/>Organisational learning"]
    I --> G

Star S8 - Implementation and Operations

Star context: Shows how RAIDT becomes part of operational governance after initial adoption, so that each governed run can be observed over time rather than treated as a one-off assessment.

Academic picture

Definition / background

Monitoring is the continuing observation of RAIDT outputs and supporting evidence across time, so that an organisation can see whether governed GenAI use remains stable, reviewable, and defensible after initial deployment. In this note, the term covers more than technical uptime or system health. It includes score movement across the five RAIDT pillars, evidence completeness, repeated reviewer concerns, changes in configuration, and signs that a once-acceptable run pattern is no longer acceptable in the current context.

Conceptually, monitoring emerges from the long-standing governance need to distinguish a point-in-time check from ongoing oversight. In conventional AI governance, monitoring is often associated with model drift, performance degradation, incident tracking, or compliance surveillance. In generative AI settings, however, the object of concern is broader: prompts, context windows, retrieval sources, user behaviour, escalation pathways, and human review practices can all change the practical risk profile even when the underlying model appears unchanged.

Monitoring belongs inside RAIDT because RAIDT treats the run as the unit of governance. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. Once runs are documented as evidence-bearing events, it becomes possible to compare them over time, identify repeated weaknesses, and ask whether current practice still matches what was originally approved or justified. Monitoring therefore extends RAIDT from run capture into temporal governance.

The relationship to run-level evidence is direct. Monitoring depends on evidence packs being sufficiently complete to support comparison across runs, and it adds value to the score profile by showing pattern, movement, and persistence rather than a single result. It is closely related to all five pillars, but it is especially important for Auditability, Dependability, and Traceability because those pillars are weakened quickly when changes are not observed or recorded.

Why this concept matters

Monitoring solves the problem of governance decay. An organisation may begin with a carefully documented use case, a sensible score profile, and a well-constructed evidence pack, yet still lose control as the operational environment changes. Without monitoring, the evidence base becomes stale, the score profile becomes historical rather than current, and governance claims increasingly rely on assumption instead of observation.

It also prevents a common confusion between deployment and oversight. A run that was acceptable last month is not automatically acceptable this month if the model version, task context, reviewer capacity, policy basis, or escalation practice has changed. Monitoring creates an explicit mechanism for noticing those changes and deciding whether they are benign, material, or in need of intervention.
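
A minimal sketch of how such a change check might look, assuming each run's context is stored as a simple dictionary. The field names mirror the factors listed above; `WATCHED_FIELDS`, `changed_fields`, and the example values are hypothetical and not part of RAIDT itself.

```python
# Hypothetical list of context fields to watch, taken from the factors named above.
WATCHED_FIELDS = (
    "model_version",
    "task_context",
    "reviewer_capacity",
    "policy_basis",
    "escalation_practice",
)


def changed_fields(approved, current):
    """Return {field: (approved_value, current_value)} for every watched
    field whose value differs between the approved and the current run."""
    return {
        name: (approved.get(name), current.get(name))
        for name in WATCHED_FIELDS
        if approved.get(name) != current.get(name)
    }


# Illustrative values only.
approved_run = {
    "model_version": "v1",
    "task_context": "internal policy summaries",
    "reviewer_capacity": "two reviewers",
    "policy_basis": "policy-2024",
    "escalation_practice": "escalate to team lead",
}
current_run = dict(approved_run, model_version="v2", reviewer_capacity="one reviewer")

for name, (before, after) in changed_fields(approved_run, current_run).items():
    print(f"{name}: {before!r} -> {after!r}")
```

The sketch only surfaces what has changed. Deciding whether a change is benign, material, or in need of intervention remains a reviewer judgement.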

For organisations using GenAI in real work, this matters because responsible governance is rarely undermined by a single dramatic failure. More often, risk accumulates through small unobserved shifts: evidence is collected less consistently, prompts are adjusted informally, reviewers stop recording exceptions, or a model update changes output style in ways that alter user dependence. Monitoring makes those shifts visible early enough for RAIDT to support correction rather than retrospective explanation.

Key idea: Monitoring matters because RAIDT is not only about assessing one run well, but about ensuring that governed runs remain evidentially visible, comparable, and reviewable over time.

What this item enables

Ongoing observation of whether RAIDT scores are stable, improving, or deteriorating across comparable runs.
Detection of evidence-pack gaps, such as missing reviewer forms, incomplete logs, or undocumented configuration changes.
Identification of drift in prompts, source materials, model settings, or operational context.
Recognition of recurring failure modes, such as repeated hallucination patterns, escalation omissions, or inconsistent human overrides.
Triggering of post-run review, gating changes, or corrective action when thresholds are breached.
Organisational learning by turning isolated run records into longitudinal governance insight.
Stronger audit readiness because reviewers can reconstruct not only one run, but changes in practice over time.
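
The first and fifth items above (score movement and threshold-based triggers) can be sketched as follows. This is an illustration under stated assumptions, not a prescribed RAIDT method: it assumes each run carries one numeric score per pillar, and the window size, the `max_drop` threshold, and the function names are hypothetical choices.

```python
from statistics import mean

PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)


def pillar_movement(runs, window=3):
    """Per pillar, return the mean score of the most recent `window` runs
    minus the mean score of all earlier runs.

    `runs` is a chronologically ordered list of dicts mapping pillar -> score.
    """
    if len(runs) <= window:
        raise ValueError("need more runs than the window size")
    earlier, recent = runs[:-window], runs[-window:]
    return {
        pillar: mean(r[pillar] for r in recent) - mean(r[pillar] for r in earlier)
        for pillar in PILLARS
    }


def review_triggers(movement, max_drop=0.5):
    """Return, in alphabetical order, the pillars whose mean score fell
    by more than `max_drop` (a hypothetical threshold)."""
    return sorted(p for p, delta in movement.items() if delta < -max_drop)


# Illustrative scores only: three stable runs, then three weaker ones.
stable = {p: 4 for p in PILLARS}
weaker = dict(stable, Auditability=3, Traceability=2)
history = [stable] * 3 + [weaker] * 3

print(review_triggers(pillar_movement(history)))
```

Comparing a recent window with the earlier baseline is only one way to express "movement"; an organisation might equally compare against the originally approved run or use a rolling average.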

Practical example / likely audience question

Audience question

If RAIDT already creates a run-level evidence pack and score profile, why is monitoring needed after deployment?

Answer

The concern behind this question is the assumption that good governance can be completed once and then left alone. That assumption may be reasonable for a static checklist, but it is weak for GenAI systems used in changing organisational environments. A run-level evidence pack captures what was true for a particular run. Monitoring asks whether later runs still look acceptable, comparable, and evidentially complete.

The direct answer is that RAIDT needs monitoring because governance quality can drift even when the formal workflow stays the same. A team may keep using the same named process while changing prompt wording, adding unofficial workarounds, relying on different source material, or reducing human review due to workload pressure. Monitoring is what reveals that operational reality has diverged from the originally justified pattern.

A practical example is a department that uses a GenAI assistant to draft internal policy summaries. The first set of runs is carefully documented and scores well. Three months later, the model provider has updated the system, staff have shortened the review step, and evidence logs are completed less consistently. A generic AI governance approach might still say the tool is "approved". RAIDT handles the issue better because monitoring compares runs over time, surfaces declining evidence quality and score movement, and provides a basis for review, contest, and intervention.

Practical example in RAIDT terms

Consider a local authority using a GenAI assistant to summarise housing-support case notes for frontline staff. Each run produces a draft summary for a specific caseworker task, using a defined prompt template, a particular model version, and the case materials available at that moment.

The run-level issue is not merely whether one summary is useful. The governance question is whether successive runs remain properly evidenced and whether the quality of governance is changing over time. Suppose the authority updates the prompt to speed up drafting, adds a new external knowledge source, and experiences pressure on staff time. The summaries remain superficially helpful, but reviewers begin recording fewer checks, exceptions are noted less consistently, and some outputs omit uncertainty cues that had previously supported safe use.

The evidence needed for monitoring includes timestamps, prompt and configuration versions, model or provider version identifiers, retrieval-source versions where relevant, reviewer sign-off records, exception logs, score profiles across the five pillars, and notes on corrective actions. The most affected RAIDT pillars are Responsibility, Auditability, Dependability, and Traceability, though Interpretability also matters if staff can no longer explain why outputs have changed.
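
As a rough sketch, the evidence items listed above could be held in a per-run record and checked for completeness. The `RunRecord` structure, its field names, and the choice of which fields count as required are assumptions made for illustration; RAIDT does not prescribe this schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run evidence record based on the items listed above."""
    run_id: str
    timestamp: Optional[datetime] = None
    prompt_version: Optional[str] = None
    model_version: Optional[str] = None
    retrieval_source_version: Optional[str] = None  # only where relevant
    reviewer_sign_off: Optional[str] = None
    # None means "not recorded"; an empty list means "recorded, none raised".
    exception_log: Optional[list] = None
    pillar_scores: dict = field(default_factory=dict)
    corrective_action_notes: Optional[str] = None  # treated as optional here


# Assumed minimum set of fields for a run to count as evidentially complete.
REQUIRED = (
    "timestamp",
    "prompt_version",
    "model_version",
    "reviewer_sign_off",
    "exception_log",
)


def missing_evidence(record, uses_retrieval=False):
    """List the names of required evidence items that are absent from a run."""
    required = list(REQUIRED)
    if uses_retrieval:
        required.append("retrieval_source_version")
    missing = [name for name in required if getattr(record, name) is None]
    if len(record.pillar_scores) < 5:  # one score expected per pillar
        missing.append("pillar_scores")
    return missing


def completeness_rate(records, uses_retrieval=False):
    """Fraction of runs with no missing evidence items."""
    complete = sum(1 for r in records if not missing_evidence(r, uses_retrieval))
    return complete / len(records)
```

Tracking `completeness_rate` across successive periods would make the pattern described above visible, for example reviewers recording fewer checks over time, without relying on anyone noticing it informally.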

Monitoring improves governance readiness here by showing that the organisation is not relying on a one-time approval. Instead, it can demonstrate an evidence trail of how run quality is observed, when changes are noticed, how they are escalated, and whether corrective steps restore acceptable practice.

Detailed link to RAIDT

Monitoring links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should be grounded in evidence rather than broad assurance claims.
Second, it extends the run as the unit of governance from a single documented event into a sequence of comparable events over time.
Third, it strengthens both the evidence pack and the score profile by making them usable for trend analysis, exception tracking, and review triggers.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by showing how governance conditions evolve rather than assuming they remain fixed.

Monitoring -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

In this sense, monitoring is the mechanism that keeps RAIDT operational after initial implementation. It converts documented runs into a living governance record from which supervisors, reviewers, managers, and auditors can see whether controls are still functioning as intended.

Link to the five RAIDT pillars

Monitoring has the strongest direct effect on Auditability, Dependability, and Traceability, but it materially supports all five pillars.

Responsibility

Monitoring supports Responsibility by showing whether accountable practice is sustained in use, not merely declared at the point of design or approval.