S8.05 - Monitoring
flowchart LR
A["One-off approval logic<br/>Unobserved drift<br/>Hidden recurring errors"] --> B["RAIDT<br/>Run-level evidence framework"]
B --> C["Monitoring<br/>Observe scores, evidence completeness, drift, and configuration change over time"]
H["Public services<br/>Healthcare<br/>Finance<br/>Dashboards<br/>Logging systems"] --> C
C --> D["Evidence pack<br/>Trend visibility"]
C --> E["RAIDT score profile<br/>Movement across five pillars"]
C --> I["Reviewer reconstruction<br/>What changed, when, and why"]
D --> F["Reviewability<br/>Contestability"]
E --> G["Governance readiness<br/>Corrective action<br/>Organisational learning"]
I --> G
Star S8 - Implementation and Operations
Star context: Shows how RAIDT becomes part of operational governance after initial adoption, so that each governed run can be observed over time rather than treated as a one-off assessment.
Academic picture
Definition / background
Monitoring is the continuing observation of RAIDT outputs and supporting evidence across time, so that an organisation can see whether governed GenAI use remains stable, reviewable, and defensible after initial deployment. In this note, the term covers more than technical uptime or system health. It includes score movement across the five RAIDT pillars, evidence completeness, repeated reviewer concerns, changes in configuration, and signs that a once-acceptable run pattern is no longer acceptable in the current context.
Conceptually, monitoring emerges from the long-standing governance need to distinguish a point-in-time check from ongoing oversight. In conventional AI governance, monitoring is often associated with model drift, performance degradation, incident tracking, or compliance surveillance. In generative AI settings, however, the object of concern is broader: prompts, context windows, retrieval sources, user behaviour, escalation pathways, and human review practices can all change the practical risk profile even when the underlying model appears unchanged.
Monitoring belongs inside RAIDT because RAIDT treats the run as the unit of governance. A run is one configured use of a GenAI system for a specific task, at a specific time, in a specific context. Once runs are documented as evidence-bearing events, it becomes possible to compare them over time, identify repeated weaknesses, and ask whether current practice still matches what was originally approved or justified. Monitoring therefore extends RAIDT from run capture into temporal governance.
The relationship to run-level evidence is direct. Monitoring depends on evidence packs being sufficiently complete to support comparison across runs, and it adds value to the score profile by showing pattern, movement, and persistence rather than a single result. It is closely related to all five pillars, but it is especially important for Auditability, Dependability, and Traceability because those pillars are weakened quickly when changes are not observed or recorded.
Why this concept matters
Monitoring solves the problem of governance decay. An organisation may begin with a carefully documented use case, a sensible score profile, and a well-constructed evidence pack, yet still lose control as the operational environment changes. Without monitoring, the evidence base becomes stale, the score profile becomes historical rather than current, and governance claims increasingly rely on assumption instead of observation.
It also prevents a common confusion between deployment and oversight. A run that was acceptable last month is not automatically acceptable this month if the model version, task context, reviewer capacity, policy basis, or escalation practice has changed. Monitoring creates an explicit mechanism for noticing those changes and deciding whether they are benign, material, or in need of intervention.
For organisations using GenAI in real work, this matters because responsible governance is rarely undermined by one dramatic failure. More often, risk accumulates through small unobserved shifts: evidence is collected less consistently, prompts are adjusted informally, reviewers stop recording exceptions, or a model update changes output style in ways that alter user dependence. Monitoring makes those shifts visible early enough for RAIDT to support correction rather than retrospective explanation.
Key idea: Monitoring matters because RAIDT is not only about assessing one run well, but about ensuring that governed runs remain evidentially visible, comparable, and reviewable over time.
What this item enables
- Ongoing observation of whether RAIDT scores are stable, improving, or deteriorating across comparable runs.
- Detection of evidence-pack gaps, such as missing reviewer forms, incomplete logs, or undocumented configuration changes.
- Identification of drift in prompts, source materials, model settings, or operational context.
- Recognition of recurring failure modes, such as repeated hallucination patterns, escalation omissions, or inconsistent human overrides.
- Triggering of post-run review, gating changes, or corrective action when thresholds are breached.
- Organisational learning by turning isolated run records into longitudinal governance insight.
- Stronger audit readiness because reviewers can reconstruct not only one run, but changes in practice over time.
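The first item in the list above, observing whether scores are stable, improving, or deteriorating, can be illustrated with a small sketch. It assumes that each run's score profile is a mapping from pillar name to a numeric score and that the runs being compared are already known to be comparable. The function name `pillar_trend`, the window size, and the tolerance are hypothetical choices for illustration, not part of RAIDT as described in this note.

```python
from statistics import mean

# The five RAIDT pillars named in this note.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


def pillar_trend(profiles, window=3, tolerance=0.25):
    """Classify each pillar as 'improving', 'stable', or 'deteriorating'.

    `profiles` is a list of score profiles (dicts keyed by pillar name),
    ordered oldest to newest. The mean of the most recent `window` runs is
    compared with the mean of the `window` runs immediately before them.
    Both `window` and `tolerance` are illustrative assumptions.
    """
    if len(profiles) < 2 * window:
        raise ValueError("need at least 2 * window score profiles")
    earlier = profiles[-2 * window:-window]
    recent = profiles[-window:]
    trend = {}
    for pillar in PILLARS:
        delta = (mean(p[pillar] for p in recent)
                 - mean(p[pillar] for p in earlier))
        if delta > tolerance:
            trend[pillar] = "improving"
        elif delta < -tolerance:
            trend[pillar] = "deteriorating"
        else:
            trend[pillar] = "stable"
    return trend
```

A label of "deteriorating" here is only a prompt for human review; it says nothing about why the scores moved.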
Practical example / likely audience question
Audience question
If RAIDT already creates a run-level evidence pack and score profile, why is monitoring needed after deployment?
Answer
The concern behind this question is the assumption that good governance can be completed once and then left alone. That assumption may be reasonable for a static checklist, but it is weak for GenAI systems used in changing organisational environments. A run-level evidence pack captures what was true for a particular run. Monitoring asks whether later runs still look acceptable, comparable, and evidentially complete.
The direct answer is that RAIDT needs monitoring because governance quality can drift even when the formal workflow stays the same. A team may keep using the same named process while changing prompt wording, adding unofficial workarounds, relying on different source material, or reducing human review due to workload pressure. Monitoring is what reveals that operational reality has diverged from the originally justified pattern.
A practical example is a department that uses a GenAI assistant to draft internal policy summaries. The first set of runs is carefully documented and scores well. Three months later, the model provider has updated the system, staff have shortened the review step, and evidence logs are completed less consistently. A generic AI governance approach might still say the tool is "approved". RAIDT handles the issue better because monitoring compares runs over time, surfaces declining evidence quality and score movement, and provides a basis for review, contest, and intervention.
Practical example in RAIDT terms
Consider a local authority using a GenAI assistant to summarise housing-support case notes for frontline staff. Each run produces a draft summary for a specific caseworker task, using a defined prompt template, a particular model version, and the case materials available at that moment.
The run-level issue is not merely whether one summary is useful. The governance question is whether successive runs remain properly evidenced and whether the quality of governance is changing over time. Suppose the authority updates the prompt to speed up drafting, adds a new external knowledge source, and experiences pressure on staff time. The summaries remain superficially helpful, but reviewers begin recording fewer checks, exceptions are noted less consistently, and some outputs omit uncertainty cues that had previously supported safe use.
The evidence needed for monitoring includes timestamps, prompt and configuration versions, model or provider version identifiers, retrieval-source versions where relevant, reviewer sign-off records, exception logs, score profiles across the five pillars, and notes on corrective actions. The most affected RAIDT pillars are Responsibility, Auditability, Dependability, and Traceability, though Interpretability also matters if staff can no longer explain why outputs have changed.
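One way to make this evidence list checkable is to hold it in a single record per run. The sketch below is a minimal illustration under stated assumptions: the class name `RunRecord`, its field names, and the convention that `None` means "not recorded" while an empty list means "recorded, nothing to report" are all hypothetical and are not prescribed by RAIDT.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record mirroring the evidence items listed above."""
    run_id: str
    timestamp: Optional[datetime] = None
    prompt_version: Optional[str] = None
    config_version: Optional[str] = None
    model_version: Optional[str] = None
    # None = not recorded; [] = recorded, no retrieval sources were used.
    retrieval_source_versions: Optional[List[str]] = None
    reviewer_signoff: Optional[str] = None
    # None = not recorded; [] = recorded, no exceptions occurred.
    exception_log: Optional[List[str]] = None
    score_profile: Optional[Dict[str, float]] = None
    corrective_actions: Optional[List[str]] = None


def missing_evidence(record):
    """Return the names of evidence fields that were never recorded."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

For example, `missing_evidence(RunRecord(run_id="run-001"))` returns every field name except `run_id`, which makes an incomplete evidence pack visible before runs are compared.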
Monitoring improves governance readiness here by showing that the organisation is not relying on a one-time approval. Instead, it can demonstrate an evidence trail of how run quality is observed, when changes are noticed, how they are escalated, and whether corrective steps restore acceptable practice.
Detailed link to RAIDT
Monitoring links to RAIDT in four ways.
First, it supports RAIDT's core idea that governance should be grounded in evidence rather than broad assurance claims.
Second, it extends the run as the unit of governance from a single documented event into a sequence of comparable events over time.
Third, it strengthens both the evidence pack and the score profile by making them usable for trend analysis, exception tracking, and review triggers.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by showing how governance conditions evolve rather than assuming they remain fixed.
Monitoring → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness
In this sense, monitoring is the mechanism that keeps RAIDT operational after initial implementation. It converts documented runs into a living governance record from which supervisors, reviewers, managers, and auditors can see whether controls are still functioning as intended.
Link to the five RAIDT pillars
Monitoring has the strongest direct effect on Auditability, Dependability, and Traceability, but it materially supports all five pillars.
Responsibility
Monitoring supports Responsibility by showing whether accountable practice is sustained in use, not merely declared at the point of design or approval.
Example evidence / implication:
- Repeated reviewer sign-off patterns can show whether human oversight is genuinely occurring.
- Escalation and exception logs can reveal whether responsibility is being exercised or bypassed under operational pressure.
Auditability
Monitoring supports Auditability by preserving a time-based record of what changed, when it changed, and how those changes affected scores, evidence completeness, and review outcomes.
Example evidence / implication:
- Time-stamped score histories allow auditors to see whether governance quality is stable or deteriorating.
- Monitoring records help reconstruct why a run pattern was accepted, challenged, or paused.
Interpretability
Monitoring supports Interpretability when organisations need to explain changing output behaviour, changing reviewer confidence, or shifts in how prompts and context shape outcomes.
Example evidence / implication:
- Versioned prompt records can help explain why summaries become more assertive or less nuanced over time.
- Reviewer annotations can show whether outputs remain understandable enough for safe human judgement.
Dependability
Monitoring supports Dependability by revealing instability, recurring errors, or degradation in run quality that would be missed by a one-time assessment.
Example evidence / implication:
- Incident and exception trends can show whether the system is becoming less reliable in real work.
- Threshold alerts can trigger post-run review when repeated failure patterns emerge.
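A threshold alert of the kind mentioned above can be sketched as a count of exception categories over recent runs. This is an illustration only: the category labels, the `review_triggers` name, and the default window and threshold are assumptions, and a real deployment would set them according to its own risk appetite.

```python
from collections import Counter


def review_triggers(exception_log, window=10, threshold=3):
    """Return exception categories that recur often enough to warrant review.

    `exception_log` holds one entry per run, oldest first; each entry is a
    list of category labels recorded for that run (empty if none). A category
    is returned when it appears in at least `threshold` of the most recent
    `window` runs. Each run counts at most once per category.
    """
    recent = exception_log[-window:]
    counts = Counter(
        category for run_exceptions in recent for category in set(run_exceptions)
    )
    return sorted(cat for cat, n in counts.items() if n >= threshold)
```

Counting runs rather than raw occurrences keeps one noisy run from triggering a review on its own; the signal of interest is a pattern that repeats across runs.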
Traceability
Monitoring supports Traceability by maintaining continuity between individual runs, their configurations, their evidence, and the decisions made in response to observed issues.
Example evidence / implication:
- Configuration-change logs enable reviewers to trace score movement back to prompt, model, or workflow updates.
- Links between runs and corrective actions show how observed issues feed into governance intervention.
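The idea of tracing score movement back to configuration changes can be shown with a small comparison of two run records. The record layout (a `config` mapping and a `scores` mapping per run) and the function names are hypothetical. The sketch only places changes side by side; deciding whether a configuration change explains a score change remains a reviewer's judgement.

```python
def config_changes(previous, current):
    """Return {key: (old, new)} for each configuration key whose value differs."""
    keys = sorted(set(previous) | set(current))
    return {
        key: (previous.get(key), current.get(key))
        for key in keys
        if previous.get(key) != current.get(key)
    }


def score_movement(previous, current):
    """Return {pillar: delta} for pillars scored in both runs whose score changed."""
    return {
        pillar: current[pillar] - previous[pillar]
        for pillar in previous
        if pillar in current and current[pillar] != previous[pillar]
    }


def trace_movement(prev_run, curr_run):
    """Place score movement next to the configuration changes between two runs."""
    return {
        "score_movement": score_movement(prev_run["scores"], curr_run["scores"]),
        "config_changes": config_changes(prev_run["config"], curr_run["config"]),
    }
```

A key that is absent from one run appears with `None` on that side, so an added retrieval source or a removed review step shows up as a change rather than being silently ignored.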
Why this item is more than a generic concept
In general AI governance, monitoring may mean broad post-deployment observation, dashboarding, incident collection, or technical performance checks. In RAIDT, it has a more specific and more operational meaning. Monitoring is tied to the run as the unit of governance, to the evidence pack as the container of reviewable material, and to the score profile as a structured expression of governance quality.
That makes the RAIDT meaning more practical than a generic appeal to "continuous monitoring". It is not merely a statement that oversight should continue. It is a method for observing whether governed runs remain evidentially complete, whether scores move in meaningful ways, whether recurring weaknesses appear, and whether those changes justify review, gating, or correction.
Common misunderstanding
Misunderstanding
Monitoring is just a technical logging function that belongs to platform engineering rather than governance.
Correction
Technical logs can be part of monitoring, but governance monitoring is broader. It includes evidential completeness, reviewer behaviour, exception handling, score movement, and changes in organisational context. For example, a platform log may confirm that the model endpoint was available, yet RAIDT monitoring may still reveal that reviewer forms were no longer completed, prompt changes were undocumented, and outputs had become harder for staff to interpret safely. In other words, system observability is useful, but it is not sufficient for governance observability.
Boundary and limitation
Monitoring does not by itself prove that a system is safe, compliant, fair, or well governed. It only provides the structured observation needed to notice change, compare runs, and trigger response. If an organisation monitors the wrong variables, monitors inconsistently, or fails to act on what it observes, monitoring becomes superficial.
It also does not replace deeper evaluation, human judgement, or corrective intervention. A deteriorating score trend still requires analysis and action. Likewise, stable monitoring results do not guarantee that all harms have been detected; some issues may remain latent or poorly instrumented.
RAIDT handles this limitation by positioning monitoring alongside gating, post-run review, corrective action, reviewer forms, and reproducibility practices. Monitoring therefore works best as part of an operational governance chain rather than as a stand-alone control.
Implementation levels
Manual implementation
A researcher, practitioner, or small team can monitor manually by maintaining a structured log of runs, recording score profiles, checking whether required evidence artefacts are present, and reviewing changes in prompts, model settings, or reviewer comments at regular intervals.
Semi-automated implementation
Semi-automated monitoring can use templates, metadata capture, form-based review, and simple dashboards to flag missing evidence, compare recent scores, highlight changed configurations, and queue runs for human review when thresholds or anomalies appear.
Fully automated implementation
At scale, monitoring can be implemented through wrappers, orchestration layers, governance dashboards, logging systems, and automated policy checks that capture run metadata, detect score drift, identify evidence gaps, compare configuration versions, and route concerning patterns into formal review or corrective-action workflows.
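As a rough illustration of the wrapper approach, the sketch below records metadata each time a governed function is called. Everything named here is an assumption made for the example: the `governed_run` decorator, the in-memory `RUN_LOG` list standing in for a real logging system, and the choice to store a hash of the output rather than the output itself.

```python
import functools
import hashlib
from datetime import datetime, timezone

RUN_LOG = []  # in-memory stand-in for a real logging system (assumption)


def governed_run(prompt_version, model_version):
    """Decorator that appends one metadata record to RUN_LOG per call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc)
            output = fn(*args, **kwargs)
            RUN_LOG.append({
                "task": fn.__name__,
                "timestamp": started.isoformat(),
                "prompt_version": prompt_version,
                "model_version": model_version,
                # A hash lets reviewers check output integrity later
                # without keeping sensitive text in the log.
                "output_sha256": hashlib.sha256(
                    str(output).encode("utf-8")
                ).hexdigest(),
            })
            return output
        return wrapper
    return decorator
```

A function decorated with `@governed_run(prompt_version="summary-v2", model_version="model-x")` behaves as before but leaves a record behind on every call. The sketch captures only what the wrapper can see; reviewer sign-off and exception notes would still come from people or from other systems.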
Practical use in the RAIDT project
Within the RAIDT project, monitoring is important for explaining how the framework moves beyond one-off conceptual scoring into operational governance. In Paper 08 Foundations, it helps articulate why a run-level model must include temporal oversight rather than static classification alone. In Paper 09 Empirical Validation, it can support analysis of whether different sectors or deployment modes show recurring governance patterns across repeated runs. In Paper 10 Policy Pathways, monitoring helps show policymakers that RAIDT can support ongoing assurance, not just initial assessment.
It is also useful in sector playbooks because organisations often ask what happens after implementation. Monitoring provides the answer: RAIDT supports an evidence-based routine for observing score movement, evidence completeness, review burden, and recurring risks. For the evidence pack and scoring rubric, it clarifies how artefacts become longitudinal rather than isolated. For supervisor explanation, viva defence, and journal positioning, it demonstrates that RAIDT addresses the practical governance question of continuity over time.
Key audience questions to prepare for
Q1. Is monitoring in RAIDT mainly about technical drift?
No. Technical drift may be one signal, but RAIDT monitoring also covers evidence completeness, reviewer behaviour, configuration changes, exception patterns, and score movement across the five pillars.
Q2. Why not just re-run the assessment occasionally instead of monitoring continuously?
Periodic reassessment is useful, but it can miss important changes between review points. Monitoring provides the continuity needed to notice emerging problems early and to decide when reassessment or intervention is necessary.
Q3. Does monitoring create too much administrative burden?
It can if designed poorly. RAIDT addresses this by allowing manual, semi-automated, and automated implementation levels, so the monitoring burden can match organisational maturity and risk.
Q4. What is the main governance benefit of monitoring?
The main benefit is that it turns governance from a point-in-time claim into an observable process. That makes organisational decisions more reviewable, contestable, and auditable.
Q5. How does monitoring support a viva or paper defence?
It shows that RAIDT is not only a scoring idea but an operational governance framework. You can explain how the framework detects change, supports intervention, and enables learning across repeated GenAI runs.
Suggested citation concepts to support this item
- continuous monitoring in AI governance
- post-deployment oversight for generative AI
- model drift and governance drift in AI systems
- audit trails and longitudinal evidence in AI assurance
- human oversight monitoring in socio-technical AI systems
- operational governance for large language model deployments
- MLOps monitoring versus governance monitoring
- accountability and traceability in AI operations
- organisational learning from AI incident and exception logs
- continuous assurance for AI in public sector decision support
Short explanation for presentation
Monitoring in RAIDT means observing governed GenAI use over time rather than treating governance as a one-off approval event. Because RAIDT treats the run as the unit of governance, monitoring can compare successive runs to see whether scores remain stable, whether evidence packs stay complete, and whether prompts, configurations, reviewer behaviour, or operational context have changed. That matters because many governance failures emerge gradually through drift, undocumented workarounds, or declining review quality rather than through a single obvious incident. In RAIDT, monitoring therefore links run-level evidence to audit readiness, contestability, and continuous improvement. It helps organisations show not only that a run was once acceptable, but that governance remains visible and defensible as real-world use evolves.
One-line takeaway
Monitoring is the ongoing observation of governed GenAI runs; RAIDT needs it because evidence from a single approved moment cannot show how practice changes over time.
Related items in implementation and operations
Mentioned in reference-paper summaries (5)
Paper summaries live in Port/93-References/pdf_summaries/. Each file listed below contains the key term at least once.
- REF-012__Ashmore-2021.md
- REF-020__Bommasani-2021.md
- REF-021__Braga-2025.md
- REF-022__Breck-2017.md
- REF-024__Charness-2009.md