S5.07 - Low score

flowchart LR
    A["Background problem:<br/>low scores often misread as false output<br/>or total system failure"] --> B["RAIDT<br/>run-level evidence framework"]
    H["Practical evidence fields:<br/>prompt, source links, timestamps,<br/>review notes, sign-off, repeat runs"] --> C[["Low score<br/>evidence or pillar intent insufficient<br/>in a specific run"]]
    B --> C
    C --> D["Evidence pack<br/>reveals gaps and weaknesses"]
    C --> E["RAIDT score profile<br/>shows affected pillars"]
    D --> F["Reviewer reconstruction<br/>and contestability"]
    E --> G["Governance readiness<br/>and improvement action"]
    G --> I["Organisational learning<br/>and policy alignment"]

Star S5 - RAIDT Pillars and Scoring

Star context: Defines the five governance dimensions and shows how scoring makes governance readiness measurable, evidence-based, and comparable without erasing important trade-offs between pillars.

Academic picture

Definition / background

In RAIDT, a low score means that a run is weakly evidenced, weakly controlled, or materially short of the intent of one or more of the five pillars: Responsibility, Auditability, Interpretability, Dependability, or Traceability. It is therefore a judgement about governance quality at the level of a specific run, not a blanket statement about the truth or falsity of the generated text. A run may produce text that appears plausible or even factually correct, yet still receive a low score because the organisation cannot show who reviewed it, what inputs shaped it, which version of the tool was used, or whether the process behaved reliably.

Conceptually, low score belongs to the scoring layer of RAIDT rather than to model evaluation alone. It translates evidential weakness or pillar failure into a practical governance signal that can be interpreted, compared, and acted on. In this sense, it differs from a generic low rating, poor benchmark result, or subjective quality judgement. The score is not simply saying that the output looked weak; it is saying that the run does not presently support sufficient confidence, reconstruction, accountability, or review against RAIDT's evidential logic.

This matters because RAIDT treats the run as the unit of governance. The run-level evidence pack provides the material from which the score is justified, and the score profile shows where governance strength or weakness sits across the five pillars. A low score therefore links evidence to action: it helps an organisation identify whether the problem is unclear accountability, poor documentation, insufficient explanation, unstable performance, or broken provenance.

Within the RAIDT framework, low score also preserves an important distinction between governance readiness and output correctness. A run can be operationally risky even when the text looks acceptable, and a run can be governance-poor even when no immediate harm is visible. By making that distinction explicit, RAIDT avoids the mistake of treating apparently good outputs as automatically well governed.

Why this concept matters

The concept of low score solves a practical governance problem. Organisations using generative AI often recognise that something about a run feels weak or incomplete, but they lack a disciplined way to express that weakness. Without a score grounded in evidence and pillar intent, concerns remain vague: reviewers may say that a run seems risky, poorly documented, or hard to explain, yet have no shared structure for recording why.

Low score also helps avoid a major confusion in AI governance: the assumption that governance quality can be inferred directly from output quality. In practice, many governance failures concern missing evidence, absent review, weak hand-offs, or unstable behaviour across repeat runs. A low score marks these weaknesses clearly and allows them to be discussed before they mature into incidents, disputes, or failed audit responses.

For organisations, this matters because responsible GenAI use requires more than principles and policies. It requires a way to flag runs that are not yet governance-ready. RAIDT uses low scores to convert concern into a traceable improvement signal, making it possible to prioritise remediation, refine controls, and strengthen evidence capture over time.

Key idea: A low RAIDT score matters because it identifies where a specific GenAI run lacks sufficient evidence or pillar alignment for confident governance review.

What this item measures

Whether a run is sufficiently evidenced to justify confidence in governance review.
Whether the run meets the practical intent of the relevant RAIDT pillar or pillars.
Whether missing documentation, absent provenance, weak review, or unstable behaviour materially reduce governance readiness.
Whether a reviewer could reconstruct and evaluate the run after the event.
Whether the evidence pack supports a defensible score profile rather than an impressionistic judgement.
Whether the organisation should treat the run as a signal for remediation, escalation, or process improvement.

Practical example / likely audience question

Audience question

How should low scores be used?

Answer

The concern behind this question is that a low score may be treated either too harshly or too casually. Some people assume it means the output is simply wrong and should be discarded immediately; others assume it is just a bureaucratic label with little practical value. Neither interpretation is adequate. In RAIDT, a low score should be used as an improvement signal that points to evidential or governance weakness in a specific run.

The direct answer is that low scores should guide review, remediation, and prioritisation. They help teams identify missing logs, weak review, absent provenance, unstable behaviour, or unclear accountability before those weaknesses are normalised. For example, a team may find that a generated draft report is factually acceptable, but the run still scores low on Traceability because source material was not linked, and low on Auditability because the prompt and approval trail were not preserved. The correct response is not to treat the score as meaningless, nor to collapse it into a crude pass/fail judgement, but to improve the evidential basis of the process.

RAIDT handles this better than a generic AI governance approach because it ties the low score to one run, one evidence pack, and one multi-pillar profile. That makes the signal operational. Instead of saying only that the system is concerning in the abstract, RAIDT shows what was weak in this run, why that matters, and what should be strengthened.

Practical example in RAIDT terms

Consider a public-service setting in which a caseworker uses a GenAI drafting tool to produce a first draft of a housing eligibility letter. The use case is administratively helpful, but the run-level issue is that the caseworker pastes source notes into the tool, edits the result, and sends the final letter onward without retaining the exact prompt, the version of the tool, the review notes, or a record of which evidence from the case file was relied upon.

The letter may look coherent and may even reach a substantively reasonable outcome, but the evidence pack is weak. The organisation cannot easily reconstruct what the tool produced, why the wording took a certain form, whether important facts were omitted, or who checked the final text against policy. In RAIDT terms, Responsibility is weakened because reviewer accountability is unclear, Auditability is weakened because the run cannot be reconstructed well, Interpretability is weakened because the relation between input materials and output wording is thinly documented, and Traceability is weakened because provenance is incomplete. Dependability may also be affected if repeated drafting of similar cases produces inconsistent quality.

Here, a low score improves governance readiness because it does not wait for an obvious failure. It identifies that the run is not yet well governed, directs attention to the missing evidence, and supports a practical intervention such as mandatory prompt capture, reviewer sign-off, and source linking before the tool is used more widely.
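
The gaps in this example can be made concrete with a small sketch. It assumes a hypothetical `RunEvidence` record and a hypothetical mapping from evidence fields to pillars, both drawn loosely from the paragraphs above. RAIDT as described here does not prescribe these field names or this mapping, so treat them as illustrative only.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from evidence fields to the pillars they support.
# Illustrative only; based loosely on the housing-letter example.
PILLAR_EVIDENCE = {
    "Responsibility": ["reviewer_sign_off"],
    "Auditability": ["prompt", "review_notes"],
    "Interpretability": ["input_output_notes"],
    "Dependability": ["repeat_run_results"],
    "Traceability": ["source_links", "tool_version"],
}


@dataclass
class RunEvidence:
    """Evidence retained for one run; an empty value means nothing was kept."""
    prompt: str = ""
    tool_version: str = ""
    review_notes: str = ""
    reviewer_sign_off: str = ""
    input_output_notes: str = ""
    source_links: list = field(default_factory=list)
    repeat_run_results: list = field(default_factory=list)


def evidence_gaps(run):
    """Return {pillar: [missing fields]} for every pillar with missing evidence."""
    gaps = {}
    for pillar, required in PILLAR_EVIDENCE.items():
        missing = [name for name in required if not getattr(run, name)]
        if missing:
            gaps[pillar] = missing
    return gaps


# The housing letter run: the letter was sent, but no evidence was retained.
housing_letter_run = RunEvidence()
for pillar, missing in evidence_gaps(housing_letter_run).items():
    print(f"{pillar}: missing {', '.join(missing)}")
```

In this sketch every pillar is flagged because nothing was retained. Dependability appears only because no repeat-run evidence exists, which is weaker than showing that repeated drafting was actually inconsistent.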

Detailed link to RAIDT

Low score links to RAIDT in four ways.

First, it operationalises RAIDT's core idea that governance claims should be grounded in evidence from actual use rather than broad assurance statements alone.

Second, it depends on the run as the unit of governance, because the score is attached to one specific configured use of GenAI in one context.

Third, it translates the quality of the evidence pack into a pillar-based score profile that shows where governance strength and weakness sit.

Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by showing where a run requires explanation, remediation, escalation, or redesign.

Low score → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

A low score therefore does not sit outside the framework as an after-the-fact judgement. It is one of the ways RAIDT converts run-level evidence into practical governance action.
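
As a rough illustration of how a score profile can point to action, the sketch below flags low pillars and names the governance problem each one suggests. The 0 to 4 scale, the threshold, and the example scores are assumptions made for this sketch, not values defined by RAIDT. The pillar-to-problem wording follows the list of problems given earlier in this item.

```python
# Pillar -> governance problem a low score points to (wording from the text).
PILLAR_PROBLEM = {
    "Responsibility": "unclear accountability",
    "Auditability": "poor documentation",
    "Interpretability": "insufficient explanation",
    "Dependability": "unstable performance",
    "Traceability": "broken provenance",
}


def low_pillars(profile, threshold=2):
    """Return the pillars scored below `threshold`, in pillar order.

    The 0-4 scale and the default threshold are assumptions for this
    sketch, not values defined by RAIDT.
    """
    unknown = set(profile) - set(PILLAR_PROBLEM)
    if unknown:
        raise ValueError(f"unknown pillars: {sorted(unknown)}")
    return [p for p in PILLAR_PROBLEM if p in profile and profile[p] < threshold]


# Hypothetical profile for the draft report example: the text is factually
# acceptable, but sources were not linked and the prompt trail was not kept.
draft_report_profile = {
    "Responsibility": 3,
    "Auditability": 1,
    "Interpretability": 3,
    "Dependability": 3,
    "Traceability": 0,
}

for pillar in low_pillars(draft_report_profile):
    print(f"{pillar}: low score, investigate {PILLAR_PROBLEM[pillar]}")
```

The profile is kept per pillar rather than collapsed into one number, so the signal stays tied to what was weak in this run instead of becoming a crude pass/fail judgement.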

Link to the five RAIDT pillars

Responsibility

A low score on Responsibility indicates that ownership, human oversight, approval, or escalation duties were unclear or insufficiently evidenced in the run.