S11.06 - Metric overreach

flowchart LR
    A["Background problem:<br/>scores treated as proof"] --> B["RAIDT:<br/>run-level evidence framework"]
    A2[Benchmark logic and dashboard simplification] --> B
    B --> C[[S11.06 Metric overreach]]
    C --> D[Run-level evidence pack]
    C --> E[Five-pillar score profile]
    C --> F["Governance move:<br/>evidence over assertion"]
    D --> G[Reviewer reconstruction]
    E --> G
    F --> H[Contestability and audit readiness]
    I["Healthcare, public services,<br/>education, finance, enterprise use"] --> C

Star context: This item sits in Star S11 (Boundaries, Limitations and Future Questions) because it marks a boundary on what RAIDT scores can legitimately claim. It prevents the framework from being misread as a machine that converts complex governance judgement into a single definitive number.

Academic picture

Definition / background

Metric overreach occurs when a metric, score, or quantified governance output is treated as stronger evidence than it really is. In practice, this happens when numerical results are interpreted as if they provide final proof of safety, compliance, responsibility, or suitability for use, even though the underlying judgement depends on context, assumptions, scope, and evidence quality.

In generative AI governance, this problem is especially acute because organisations often want concise indicators for oversight, benchmarking, procurement, or assurance. A score can help structure comparison, prioritisation, and review, but it cannot by itself capture the full meaning of a run, the adequacy of the evidence, the seriousness of domain risk, or the appropriateness of a model's behaviour in context. Metric overreach therefore describes a category mistake: confusing an aid to governance with governance proof.

Within RAIDT, this concept matters because RAIDT intentionally produces a five-pillar score profile while also insisting that the run remains the unit of governance. The run-level evidence pack records what system was used, for what task, under what conditions, with what configuration, constraints, outputs, and review observations. The score profile is therefore an interpretive summary of evidence, not a substitute for that evidence. Metric overreach is the warning that protects RAIDT from being misused as a simplistic rating instrument.
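As an illustration only, the relationship between a run, its evidence pack, and its score profile can be sketched as a small data structure. The class names, field names, and the 0 to 5 scale below are assumptions made for this sketch; the text above does not prescribe a schema. The point shown is structural: a score profile cannot be built without the evidence pack it summarises.

```python
from dataclasses import dataclass

# Pillar names as used in this item; the 0-5 scale is an assumption.
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


@dataclass(frozen=True)
class EvidencePack:
    """Hypothetical record of one run (field names are illustrative)."""
    run_id: str
    system: str
    task: str
    conditions: str
    configuration: dict
    constraints: list
    outputs: list
    review_observations: list


@dataclass(frozen=True)
class ScoreProfile:
    """Interpretive summary that always carries its evidence pack."""
    evidence: EvidencePack
    scores: dict  # pillar name -> score on the assumed 0-5 scale

    def __post_init__(self):
        # A profile must cover all five pillars.
        missing = [p for p in PILLARS if p not in self.scores]
        if missing:
            raise ValueError(f"missing pillar scores: {missing}")
        # A profile with no reviewer observations has nothing to rest on.
        if not self.evidence.review_observations:
            raise ValueError("a profile needs reviewer observations")


pack = EvidencePack(
    run_id="run-001",
    system="example assistant",          # placeholder values throughout
    task="draft case-note summaries",
    conditions="staff review before use",
    configuration={"prompt_set": "v1"},
    constraints=["no direct citizen contact"],
    outputs=["summary text"],
    review_observations=["tone acceptable; one factual slip noted"],
)
profile = ScoreProfile(evidence=pack, scores={p: 3 for p in PILLARS})
```

Because `ScoreProfile` holds a reference to its `EvidencePack`, anyone handed the scores can also reach the record behind them, which is the property the paragraph above relies on.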

This item is not a general criticism of quantification, and RAIDT is not anti-metric. Rather, RAIDT treats metrics as disciplined governance artefacts whose meaning depends on documented provenance, reviewer judgement, and contestable evidence. The concept belongs inside RAIDT because the framework's value depends on making scores useful without allowing them to be overclaimed.

Why this concept matters

Metric overreach matters because governance failure often begins not with the absence of a metric, but with excessive confidence in one. When organisations read a score as a verdict rather than a prompt for review, they risk approving weak systems, overlooking domain-specific harms, and presenting unjustified assurance to managers, regulators, or service users.

The concept prevents three common confusions. First, it separates governance readiness from technical performance alone. Second, it distinguishes a summary judgement from the underlying evidence that justifies it. Third, it reminds decision-makers that different runs may produce different governance implications even when they involve the same model or tool.

For organisations using GenAI, this matters operationally. Procurement teams may want a single threshold. Senior leaders may want a dashboard colour. Project teams may want quick comparability across pilots. RAIDT can support all of these needs, but only if users understand that scores guide review, comparison, and improvement rather than replace judgement. By naming metric overreach explicitly, RAIDT moves governance from principles and assertions toward evidence-backed interpretation.

Key idea: Metric overreach matters because RAIDT scores are governance signals, not self-sufficient proof, and their legitimacy depends on run-level evidence and human review.

What this item explains

It explains why a RAIDT score profile should never be treated as absolute proof of safety, compliance, or trustworthiness.
It explains the difference between a useful governance summary and an overclaimed governance conclusion.
It explains why run context, task sensitivity, evidence quality, and reviewer judgement remain necessary even when structured metrics exist.
It explains how evidence packs anchor metrics to documented runs, making over-interpretation easier to detect and challenge.
It explains why governance maturity requires contestable metrics rather than unchallengeable numbers.
It explains a core limitation boundary for RAIDT: the framework supports defensible review, not automatic certification.

Practical example / likely audience question

Audience question

Are RAIDT scores enough to show that a generative AI system is safe or compliant?

Answer

The short answer is no. The concern behind the question is understandable: if RAIDT produces a structured five-pillar score profile, it is tempting to treat that profile as the final answer. However, the score only summarises the state of the documented evidence for a particular run. It does not eliminate the need to inspect the run context, the task, the stakes, the prompts, the outputs, the reviewer observations, or the limitations of the evidence collected.

A practical example makes this clear. Two runs may receive similar Dependability scores, yet one may involve low-stakes internal drafting while the other supports a public-facing eligibility decision in a sensitive service context. The governance implications are not the same. RAIDT handles this better than an approach that reports scores alone because it keeps each score tied to the run-level evidence pack. Reviewers can inspect what the number rests on, reconstruct why it was given, and contest it if the context suggests that the score is being over-read.
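The two-run comparison can be sketched in code. This is a minimal illustration, not a RAIDT rule: the `RunSummary` fields, the stakes labels, the 0 to 5 scale, and the threshold of 3.0 are all hypothetical. What it shows is that the review route is decided by run context first, so two near-identical scores lead to different outcomes.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of a run; field names are assumptions."""
    run_id: str
    task: str
    stakes: str            # assumed labels: "low" or "high"
    public_facing: bool
    dependability: float   # assumed 0-5 scale


def review_route(run: RunSummary) -> str:
    """Pick a review route from context first, score second."""
    if run.stakes == "high" or run.public_facing:
        return "full evidence-pack review with domain oversight"
    if run.dependability < 3.0:   # illustrative threshold only
        return "targeted review of dependability evidence"
    return "standard periodic review"


# Two runs with similar Dependability scores but different contexts.
drafting = RunSummary("run-A", "internal drafting", "low", False, 4.1)
eligibility = RunSummary("run-B", "eligibility decision support",
                         "high", True, 4.2)

for run in (drafting, eligibility):
    print(run.run_id, "->", review_route(run))
```

The slightly higher score on the eligibility run does not earn it a lighter route; the context check runs before the score is consulted at all.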

So the role of the score is important but limited. It guides review, comparison, prioritisation, and improvement. It does not replace judgement, sector-specific oversight, or accountability for decisions made around the system.

Practical example in RAIDT terms

Consider a public service department using a GenAI assistant to draft responses to citizens' housing support queries. A specific run involves a configured prompt set, a selected model, internal guidance documents, and an instruction to produce concise case-note summaries for staff review.

The run-level issue is that the team obtains a reasonably strong RAIDT score profile and begins to describe the system as effectively governance-assured. This is where metric overreach appears. The score profile may indicate a good level of Auditability and Traceability because logs, prompts, and outputs were captured. However, if the evidence pack also shows that the run was tested only on narrow examples, excluded edge cases, and lacked close scrutiny of fairness impacts for vulnerable applicants, the score cannot legitimately be treated as proof of readiness for broader operational use.

The evidence needed includes the run configuration, source materials, sample prompts, outputs, reviewer notes, known failure modes, the intended user role, escalation arrangements, and an explanation of the scoring decisions. The most affected RAIDT pillars are Responsibility, Dependability, and Interpretability, although all five remain relevant. By naming metric overreach, RAIDT improves governance readiness because it stops the organisation from mistaking a structured scoring result for a complete deployment justification.
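The housing example can be expressed as a simple check on a claim. The sketch below is hypothetical: the function name, the claim strings, and the rule itself are assumptions for illustration, while the evidence items and coverage gaps are taken from the example above. Scores are deliberately not an input, so a strong profile cannot cancel out a documented gap.

```python
# Evidence items listed in the example above.
REQUIRED_EVIDENCE = (
    "run configuration", "source materials", "sample prompts", "outputs",
    "reviewer notes", "known failure modes", "intended user role",
    "escalation arrangements", "explanation of scoring decisions",
)

BROAD_CLAIM = "ready for broader operational use"   # assumed claim label


def overreach_findings(evidence_items, coverage_gaps, claim):
    """Return reasons why `claim` goes beyond what the evidence supports."""
    findings = [f"missing evidence: {item}"
                for item in REQUIRED_EVIDENCE if item not in evidence_items]
    # Coverage gaps only block the broad claim in this sketch.
    if claim == BROAD_CLAIM:
        findings += [f"unaddressed gap: {gap}" for gap in coverage_gaps]
    return findings


# Gaps recorded in the housing-support evidence pack.
gaps = [
    "tested only on narrow examples",
    "edge cases excluded",
    "no close scrutiny of fairness impacts for vulnerable applicants",
]

findings = overreach_findings(set(REQUIRED_EVIDENCE), gaps, BROAD_CLAIM)
pilot = overreach_findings(set(REQUIRED_EVIDENCE), gaps,
                           "staff-reviewed pilot")  # narrower, assumed claim

for reason in findings:
    print(reason)
```

An empty result does not certify anything. It only means this particular check found no reason to call the claim overreaching, and reviewer judgement still applies.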

Detailed link to RAIDT

Metric overreach links to RAIDT in four ways.

First, it reinforces RAIDT's core idea that governance should be grounded in evidence from concrete runs rather than broad claims about models or tools in the abstract.
Second, it connects directly to the run because the risk of overreach arises when a score is detached from the specific task, timing, configuration, and context of that run.
Third, it clarifies the relationship between the evidence pack and the score profile: the evidence pack is the substantive record, while the score profile is a structured synthesis of that record.
Fourth, it strengthens reviewability, contestability, audit readiness, and organisational learning by ensuring that metrics remain open to inspection, challenge, and revision rather than being treated as self-justifying outputs.

Metric overreach → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

In RAIDT, this chain works only if the later stages do not erase the earlier ones. Governance readiness improves when the score profile remains visibly anchored to the evidence pack and the run from which it was derived.

Link to the five RAIDT pillars

Responsibility

Metric overreach affects Responsibility because decision-makers may offload accountability onto a score instead of owning the judgement about whether a run is appropriate for use.