S5.13 - Evidence-based scoring

flowchart LR
    A[Abstract AI ratings and policy claims] --> B[RAIDT<br/>run-level evidence framework]
    X[Output-only judgement without reconstruction] --> B
    F1[Healthcare admin] --> C
    F2[Finance documentation] --> C
    F3[Education feedback] --> C
    F4[Enterprise productivity] --> C
    B --> C[[Evidence-based scoring<br/>score the evidence pack of one run]]
    C --> D[Run-level evidence pack]
    C --> E[Five-pillar score profile]
    C --> H[Governance move<br/>evidence over assertion]
    D --> I[Reviewer reconstruction]
    D --> J[Contestability]
    E --> K[Governance readiness]
    E --> L[Improvement priorities]
    H --> M[Audit readiness]
    H --> N[Organisational learning]

Star S5 - RAIDT Pillars and Scoring

Star context: Places scoring inside RAIDT as a governance activity grounded in run-level evidence, so readiness is judged through documented proof rather than broad claims about a model or organisation.

Academic picture

Definition / background

Evidence-based scoring means that RAIDT assigns scores to the documented evidence for a specific run, not to a model in general, a vendor claim, or the fluency of an output in isolation. A run is one configured use of a generative AI system for a defined task, at a particular time, in a particular organisational context. The score therefore reflects how well that run can be justified, examined, reconstructed, and governed.

This matters because many discussions of AI evaluation conflate technical performance, user satisfaction, policy compliance, and governance readiness. RAIDT separates these issues. A system may generate convincing output, yet still receive a weak RAIDT profile if the run lacks accountable ownership, decision trace, prompt and context capture, interpretive explanation, or dependable execution evidence. Conversely, a run with modest technical output may be much easier to govern if its evidence pack is complete, reviewable, and contestable.

Within RAIDT, evidence-based scoring belongs in the scoring layer that translates raw run documentation into a structured five-pillar profile across Responsibility, Auditability, Interpretability, Dependability, and Traceability. It is therefore the bridge between collected evidence and actionable governance judgement. Without this bridge, the evidence pack remains descriptive; with it, the organisation can make a defensible statement about readiness, weakness, and improvement priorities.
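
As a minimal sketch, a five-pillar score profile for one run could be held in a small data structure like the one below. The text does not specify a scoring scale, so the 0 to 4 range, the `ScoreProfile` name, and the `weakest_pillars` helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass, fields

# Assumed scale for illustration only; the source does not define one.
MIN_SCORE, MAX_SCORE = 0, 4


@dataclass(frozen=True)
class ScoreProfile:
    """Hypothetical five-pillar score profile for a single run."""
    run_id: str
    responsibility: int
    auditability: int
    interpretability: int
    dependability: int
    traceability: int

    def __post_init__(self):
        # Reject scores outside the assumed scale.
        for name, score in self.pillars().items():
            if not MIN_SCORE <= score <= MAX_SCORE:
                raise ValueError(
                    f"{name} score {score} is outside {MIN_SCORE}-{MAX_SCORE}"
                )

    def pillars(self) -> dict[str, int]:
        """Return the five pillar scores keyed by pillar name."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.name != "run_id"
        }

    def weakest_pillars(self) -> list[str]:
        """Return the pillar(s) with the lowest score, in declaration order."""
        lowest = min(self.pillars().values())
        return [name for name, score in self.pillars().items() if score == lowest]


profile = ScoreProfile(
    "run-001",
    responsibility=3,
    auditability=1,
    interpretability=2,
    dependability=3,
    traceability=1,
)
print(profile.weakest_pillars())  # ['auditability', 'traceability']
```

Keeping the run identifier inside the profile reflects the point that the score belongs to one run; the weakest pillars are one simple way to surface improvement priorities.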

Why this concept matters

Evidence-based scoring solves a central governance problem in generative AI: organisations often claim that a system is compliant or responsible without being able to show what happened in a particular use episode. RAIDT avoids that problem by making the score depend on what can actually be evidenced for a run.

This prevents several common confusions. It avoids scoring the brand of the model instead of the real work setting. It avoids rewarding polished narrative documentation that contains little reconstructable proof. It also avoids treating governance as a one-off approval exercise divorced from changing prompts, data, operators, stakes, or deployment conditions.

If evidence-based scoring is missing, organisations risk overconfident deployment, weak audit response, poor contestability, and superficial assurance. If it is present, scoring becomes operational: reviewers can see why a run scored as it did, compare runs over time, and identify which pillar requires intervention.

Key idea: Evidence-based scoring matters because RAIDT judges governance readiness through run-level proof, not through abstract assurances about AI quality.

What this item measures

Whether the evidence pack for a specific run is sufficiently complete to support governance judgement.
Whether scoring is anchored to demonstrable artefacts such as prompts, outputs, roles, controls, logs, review notes, and decision context.
Whether the five RAIDT pillars can be assessed on the basis of evidence rather than assertion.
Whether another reviewer could inspect the same run and understand why a score was assigned.
Whether the resulting score profile can support comparison, challenge, escalation, and continuous improvement.
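
The first two checks above can be sketched as a simple presence test over the artefact types named in the list. This is only an illustration under stated assumptions: the key names, the dictionary layout of the evidence pack, and the `missing_artefacts` and `coverage` helpers are hypothetical, and presence of an artefact says nothing about its quality, which still requires reviewer judgement.

```python
# Artefact types named in the text; the identifiers themselves are hypothetical.
EXPECTED_ARTEFACTS = (
    "prompts",
    "outputs",
    "roles",
    "controls",
    "logs",
    "review_notes",
    "decision_context",
)


def missing_artefacts(evidence_pack: dict) -> list[str]:
    """Return expected artefact types that are absent or empty in the pack."""
    return [name for name in EXPECTED_ARTEFACTS if not evidence_pack.get(name)]


def coverage(evidence_pack: dict) -> float:
    """Fraction of expected artefact types that are present and non-empty."""
    present = len(EXPECTED_ARTEFACTS) - len(missing_artefacts(evidence_pack))
    return present / len(EXPECTED_ARTEFACTS)


# Example pack for one run: logs exist as a key but contain nothing.
pack = {
    "prompts": ["appointment-letter template v3"],
    "outputs": ["draft letter", "sent letter"],
    "roles": {"reviewer": "clinic administrator"},
    "logs": [],
}
print(missing_artefacts(pack))
# ['controls', 'logs', 'review_notes', 'decision_context']
```

A check like this would only gate whether scoring can begin; it would not replace the pillar-by-pillar assessment.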

Practical example / likely audience question

Audience question

What exactly is being scored in RAIDT: the model, the output quality, or the governance record?

Answer

Behind this question lies a common misconception: that any AI score must be a judgement about the model as a whole. RAIDT takes a different position. What is scored directly is the run-level evidence pack, and with it the governance readiness that the evidence substantiates for that one run.

For example, imagine two teams using the same large language model for policy drafting. Team A stores the task purpose, versioned prompt, retrieval inputs, reviewer identity, approval route, output revisions, and reasons for final acceptance. Team B keeps only the final text. Even if the outputs look similarly polished, RAIDT should score Team A much more strongly because its run can be examined, challenged, and reconstructed. The higher score is not praise for prose style; it is recognition that the run is governable.

A generic AI governance approach may stop at policy compliance statements or broad system risk categories. RAIDT handles the issue better because it ties the judgement to one evidenced run. That makes disagreement more productive: if someone disputes a score, they can contest the evidence, the anchor, or the interpretation, rather than arguing in the abstract.

Practical example in RAIDT terms

Consider a healthcare administration setting in which a generative AI assistant drafts patient appointment summary letters. One run concerns a letter generated for a patient with multiple follow-up actions and medication changes.

The run-level issue is not simply whether the language model produced readable text. The governance issue is whether the organisation can show who initiated the run, what prompt template was used, what patient data fields were supplied, what human checks were applied, what edits were made before sending, and how any uncertainty or clinical escalation was recorded.

The evidence needed would include the prompt and template version, metadata about source inputs, output snapshots, reviewer sign-off, exception notes, timing data, and any applicable process controls. Responsibility is affected because ownership and approval must be clear. Auditability and Traceability are affected because the run must be reconstructable. Interpretability matters because reviewers must understand why the draft took the shape it did. Dependability matters because repeated use should not produce unmanaged variation. Evidence-based scoring improves governance readiness here by converting these documented elements into a transparent score profile for that specific run.
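
To make the link between evidence items and pillars concrete, the sketch below reports which pillars lack supporting evidence for a run such as the appointment letter. The text does not prescribe which item supports which pillar, so the `PILLAR_EVIDENCE` mapping, the item names, and the `evidence_gaps` helper are all illustrative assumptions that a real assessment would replace with its own agreed mapping.

```python
# Assumed mapping from pillar to the evidence items that support it.
# This is an illustration, not a mapping defined by RAIDT.
PILLAR_EVIDENCE = {
    "Responsibility": {"initiator", "reviewer_sign_off"},
    "Auditability": {"prompt_template_version", "output_snapshots", "reviewer_sign_off"},
    "Interpretability": {"source_input_metadata", "exception_notes"},
    "Dependability": {"timing_data", "process_controls"},
    "Traceability": {"prompt_template_version", "source_input_metadata", "output_snapshots"},
}


def evidence_gaps(available: set[str]) -> dict[str, list[str]]:
    """Map each pillar to the evidence items it still lacks (sorted for stable output)."""
    gaps = {}
    for pillar, needed in PILLAR_EVIDENCE.items():
        missing = sorted(needed - available)
        if missing:
            gaps[pillar] = missing
    return gaps


# Evidence actually captured for one hypothetical letter-drafting run.
captured = {
    "prompt_template_version",
    "output_snapshots",
    "reviewer_sign_off",
    "timing_data",
}
for pillar, missing in evidence_gaps(captured).items():
    print(f"{pillar}: missing {', '.join(missing)}")
```

In this example Auditability has no gaps, while the other four pillars each lack at least one item, which indicates where a reviewer would ask for more evidence before assigning a score.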

Detailed link to RAIDT

Evidence-based scoring links to RAIDT in four ways.

First, it expresses RAIDT's core idea that governance should be grounded in evidence from actual use, not only in principles or vendor assurances.
Second, it links directly to the run because the score is assigned to one configured use in one context, at one time.
Third, it converts the evidence pack into a five-pillar score profile that can be reviewed, compared, and improved.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning because the basis of the score can be inspected after the fact.

Evidence-based scoring → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

Link to the five RAIDT pillars

Responsibility

Evidence-based scoring strengthens Responsibility by requiring clear ownership, role definition, and accountable sign-off for the run being assessed. A score should not imply responsibility unless the evidence shows who initiated, reviewed, approved, or acted on the output.