S5.13 - Evidence-based scoring

flowchart LR
    A[Abstract AI ratings and policy claims] --> B[RAIDT<br/>run-level evidence framework]
    X[Output-only judgement without reconstruction] --> B
    F1[Healthcare admin] --> C
    F2[Finance documentation] --> C
    F3[Education feedback] --> C
    F4[Enterprise productivity] --> C
    B --> C[[Evidence-based scoring<br/>score the evidence pack of one run]]
    C --> D[Run-level evidence pack]
    C --> E[Five-pillar score profile]
    C --> H[Governance move<br/>evidence over assertion]
    D --> I[Reviewer reconstruction]
    D --> J[Contestability]
    E --> K[Governance readiness]
    E --> L[Improvement priorities]
    H --> M[Audit readiness]
    H --> N[Organisational learning]

Star S5 - RAIDT Pillars and Scoring

Star context: Places scoring inside RAIDT as a governance activity grounded in run-level evidence, so readiness is judged through documented proof rather than broad claims about a model or organisation.


Academic picture
Definition / background

Evidence-based scoring means that RAIDT assigns scores to the documented evidence for a specific run, not to a model in general, a vendor claim, or the fluency of an output in isolation. A run is one configured use of a generative AI system for a defined task, at a particular time, in a particular organisational context. The score therefore reflects how well that run can be justified, examined, reconstructed, and governed.

This matters because many discussions of AI evaluation conflate technical performance, user satisfaction, policy compliance, and governance readiness. RAIDT separates these issues. A system may generate convincing output, yet still receive a weak RAIDT profile if the run lacks accountable ownership, a decision trace, prompt and context capture, interpretive explanation, or dependable execution evidence. Conversely, a run with modest technical output may be much easier to govern if its evidence pack is complete, reviewable, and contestable.

Within RAIDT, evidence-based scoring belongs in the scoring layer that translates raw run documentation into a structured five-pillar profile across Responsibility, Auditability, Interpretability, Dependability, and Traceability. It is therefore the bridge between collected evidence and actionable governance judgement. Without this bridge, the evidence pack remains descriptive; with it, the organisation can make a defensible statement about readiness, weakness, and improvement priorities.

Why this concept matters

Evidence-based scoring solves a central governance problem in generative AI: organisations often claim that a system is compliant or responsible without being able to show what happened in a particular use episode. RAIDT avoids that problem by making the score depend on what can actually be evidenced for a run.

This prevents several common confusions. It avoids scoring the brand of the model instead of the real work setting. It avoids rewarding polished narrative documentation that contains little reconstructable proof. It also avoids treating governance as a one-off approval exercise divorced from changing prompts, data, operators, stakes, or deployment conditions.

If evidence-based scoring is missing, organisations risk overconfident deployment, weak audit response, poor contestability, and superficial assurance. If it is present, scoring becomes operational: reviewers can see why a run scored as it did, compare runs over time, and identify which pillar requires intervention.

Key idea: Evidence-based scoring matters because RAIDT judges governance readiness through run-level proof, not through abstract assurances about AI quality.

What this item measures
Practical example / likely audience question

Audience question

What exactly is being scored in RAIDT: the model, the output quality, or the governance record?

Answer

The concern behind this question is a common misconception that any AI score must be a judgement about the model as a whole. RAIDT takes a different position. The direct object of scoring is the run-level evidence pack and the governance readiness that the evidence substantiates for that one run.

For example, imagine two teams using the same large language model for policy drafting. Team A stores the task purpose, versioned prompt, retrieval inputs, reviewer identity, approval route, output revisions, and reasons for final acceptance. Team B keeps only the final text. Even if the outputs look similarly polished, RAIDT should score Team A much more strongly because its run can be examined, challenged, and reconstructed. The higher score is not praise for prose style; it is recognition that the run is governable.
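The Team A and Team B contrast can be made concrete with a minimal Python sketch. The artefact names come from the example, but the `EvidencePack` class, the `REQUIRED_ARTEFACTS` list, and the coverage ratio are illustrative assumptions and not part of any RAIDT rubric. Coverage here is only a precondition check on what was captured; it is not itself a RAIDT score.

```python
from dataclasses import dataclass, field

# Hypothetical artefact names drawn from the Team A example;
# this is not an official RAIDT schema.
REQUIRED_ARTEFACTS = (
    "task_purpose",
    "versioned_prompt",
    "retrieval_inputs",
    "reviewer_identity",
    "approval_route",
    "output_revisions",
    "acceptance_rationale",
)


@dataclass
class EvidencePack:
    """Documented evidence for one run (illustrative structure only)."""
    run_id: str
    artefacts: dict = field(default_factory=dict)

    def missing(self):
        """Return the required artefacts that are absent or empty."""
        return [name for name in REQUIRED_ARTEFACTS if not self.artefacts.get(name)]

    def coverage(self):
        """Share of required artefacts present, between 0.0 and 1.0."""
        present = len(REQUIRED_ARTEFACTS) - len(self.missing())
        return present / len(REQUIRED_ARTEFACTS)


# Team A stores every artefact; Team B keeps only the final text.
team_a = EvidencePack("team-a-policy-draft", {name: "stored" for name in REQUIRED_ARTEFACTS})
team_b = EvidencePack("team-b-policy-draft", {"final_text": "polished draft"})

print(team_a.coverage(), team_b.coverage())
print(team_b.missing())
```

Even with equally polished outputs, Team B's pack gives a reviewer nothing to reconstruct, which is why the two runs should not receive the same governance judgement.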

A generic AI governance approach may stop at policy compliance statements or broad system risk categories. RAIDT handles the issue better because it ties the judgement to one evidenced run. That makes disagreement more productive: if someone disputes a score, they can contest the evidence, the anchor, or the interpretation, rather than arguing in the abstract.

Practical example in RAIDT terms

Consider a healthcare administration setting in which a generative AI assistant drafts patient appointment summary letters. One run concerns a letter generated for a patient with multiple follow-up actions and medication changes.

The run-level issue is not simply whether the language model produced readable text. The governance issue is whether the organisation can show who initiated the run, what prompt template was used, what patient data fields were supplied, what human checks were applied, what edits were made before sending, and how any uncertainty or clinical escalation was recorded.

The evidence needed would include the prompt and template version, metadata about source inputs, output snapshots, reviewer sign-off, exception notes, timing data, and any applicable process control. Responsibility is affected because ownership and approval must be clear. Auditability and Traceability are affected because the run must be reconstructable. Interpretability matters because reviewers must understand why the draft took the shape it did. Dependability matters because repeated use should not produce unmanaged variation. Evidence-based scoring improves governance readiness here by converting these documented elements into a transparent score profile for that specific run.

Detailed link to RAIDT

Evidence-based scoring links to RAIDT in four ways.

First, it expresses RAIDT's core idea that governance should be grounded in evidence from actual use, not only in principles or vendor assurances.
Second, it links directly to the run because the score is assigned to one configured use in one context, at one time.
Third, it converts the evidence pack into a five-pillar score profile that can be reviewed, compared, and improved.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning because the basis of the score can be inspected after the fact.
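The third and fourth links can be sketched as a small data structure. This is a hedged illustration only: the 0 to 4 ordinal scale, the `PillarScore` fields, and the evidence identifiers are assumptions introduced for the example, not RAIDT definitions. The sketch shows one way to keep each pillar score attached to the evidence and rationale that justify it, so the score can be inspected and contested afterwards.

```python
from dataclasses import dataclass

PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


@dataclass(frozen=True)
class PillarScore:
    pillar: str
    score: int            # assumed ordinal scale: 0 (no evidence) to 4
    evidence_refs: tuple  # hypothetical identifiers of artefacts in the evidence pack
    rationale: str        # reviewer's reasoning, kept so the score can be contested


def build_profile(scores):
    """Assemble a five-pillar profile, refusing scores that cite no evidence."""
    by_pillar = {s.pillar: s for s in scores}
    missing = [p for p in PILLARS if p not in by_pillar]
    if missing:
        raise ValueError(f"no score recorded for: {missing}")
    for s in by_pillar.values():
        if s.score > 0 and not s.evidence_refs:
            raise ValueError(f"{s.pillar} is scored above 0 without evidence")
    return {p: by_pillar[p] for p in PILLARS}


def improvement_priorities(profile):
    """List pillars from weakest to strongest score (ties keep pillar order)."""
    return sorted(PILLARS, key=lambda p: profile[p].score)


profile = build_profile([
    PillarScore("Responsibility", 3, ("signoff-017",), "Named owner and approver"),
    PillarScore("Auditability", 1, ("log-extract-02",), "Partial logs only"),
    PillarScore("Interpretability", 2, ("reviewer-note-05",), "Prompt context explained"),
    PillarScore("Dependability", 3, ("repeat-run-11",), "Stable across repeat runs"),
    PillarScore("Traceability", 1, ("prompt-v4",), "Inputs not fully captured"),
])
print(improvement_priorities(profile)[:2])
```

Rejecting a non-zero score that cites no evidence is the design choice that mirrors "evidence over assertion": a reviewer cannot record a favourable judgement without pointing to something inspectable.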

Evidence-based scoring → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

Link to the five RAIDT pillars

Responsibility

Evidence-based scoring strengthens Responsibility by requiring clear ownership, role definition, and accountable sign-off for the run being assessed. A score should not imply responsibility unless the evidence shows who initiated, reviewed, approved, or acted on the output.

Example evidence / implication:

Auditability

This item has a particularly strong effect on Auditability because the score must be explainable through inspectable evidence. If the evidence pack is incomplete or poorly structured, auditability should score weakly even when the output appears useful.

Example evidence / implication:

Interpretability

Evidence-based scoring supports Interpretability by asking whether the run can be meaningfully understood by a reviewer. This does not require perfect model transparency; it requires enough contextual evidence to explain the generation process and the basis for trusting or limiting it.

Example evidence / implication:

Dependability

Dependability is affected because a score should reflect whether the run was carried out in a stable and controlled way. Evidence-based scoring therefore attends to consistency, process discipline, and exception handling.

Example evidence / implication:

Traceability

This item is also strongly tied to Traceability because the score depends on being able to follow the chain from task request to output, review, and final use. A run that cannot be traced cannot be strongly scored in RAIDT terms.

Example evidence / implication:

The strongest direct effects of this item are on Auditability and Traceability, with Responsibility and Dependability close behind. Interpretability remains important because evidence must support intelligible review rather than mere record accumulation.

Why this item is more than a generic concept

In general AI governance, evidence-based scoring may mean using data, metrics, or benchmarks to justify an assessment. In RAIDT, it has a narrower and more operational meaning: the score must be grounded in run-level evidence that shows how a specific use of generative AI was configured, executed, reviewed, and governed.

That RAIDT meaning is more practical because it prevents category drift. It stops organisations from substituting benchmark scores, policy statements, or general impressions for a governable record of use. RAIDT therefore treats scoring as a disciplined interpretation of evidence, not as a broad reputational judgement about an AI system.

Common misunderstanding

Misunderstanding

If a model performs well technically, it should automatically score well in RAIDT.

Correction

Technical performance and RAIDT scoring are related but not identical. A highly capable model can still produce a weak RAIDT score if the run lacks evidence of review, justification, traceability, or accountable use. For instance, an enterprise team may generate excellent contract summaries, but if it cannot show which prompt was used, who checked the output, or how the summary entered the workflow, the governance score should remain limited. RAIDT scores governability of the run, not model prestige.

Boundary and limitation

Evidence-based scoring does not prove that a run was ethically perfect, legally compliant in every jurisdiction, or factually correct in every detail. It also does not replace domain evaluation, safety testing, or human judgement. A well-evidenced run may still reveal poor decisions; the value of the method is that those decisions become visible and challengeable.

The concept also depends on the quality of the evidence architecture. If logging is weak, reviewers are inconsistent, or the rubric is poorly calibrated, the score may be less reliable than intended. RAIDT handles this limitation by linking scoring to anchors, repeat runs, calibration, and trade-off analysis rather than pretending that one score is self-sufficient.

Implementation levels

Manual implementation

A researcher or small team can apply evidence-based scoring manually by collecting the prompt, inputs, outputs, decision notes, and reviewer comments for each run, then rating the evidence pack against a defined RAIDT rubric. This is feasible in pilots, case studies, and early-stage governance experiments.

Semi-automated implementation

Semi-automated implementation uses templates, metadata forms, evidence-pack checklists, and structured review workflows to reduce omission and improve consistency. An Obsidian note template, spreadsheet rubric, or lightweight governance dashboard can support repeatable scoring while retaining human judgement.

Fully automated implementation

At scale, a platform or orchestration layer can capture run metadata automatically, attach logs and artefacts to a case record, trigger pillar-specific checks, and generate a provisional score profile for human confirmation. In this form, evidence-based scoring becomes part of a governance pipeline that supports monitoring, exception handling, and audit preparation across many runs.
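A minimal sketch of such a pipeline step is given here, under stated assumptions. The field names in `PILLAR_CHECKS`, the `ProvisionalProfile` class, and the sample run record are all hypothetical; a real platform would define its own checks. The sketch shows only the shape of the idea: automated checks produce a provisional profile, and a human reviewer must confirm it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical pillar-specific checks: each pillar lists run fields that
# must be captured. Field names are illustrative assumptions.
PILLAR_CHECKS = {
    "Responsibility": ("initiator", "approver"),
    "Auditability": ("log_uri", "output_snapshot"),
    "Interpretability": ("reviewer_notes",),
    "Dependability": ("exception_notes", "timing_ms"),
    "Traceability": ("prompt_version", "input_metadata"),
}


@dataclass
class ProvisionalProfile:
    run_id: str
    pass_rates: dict
    status: str = "provisional"
    confirmed_by: Optional[str] = None

    def confirm(self, reviewer):
        """Record that a human reviewer accepted the provisional profile."""
        self.status = "confirmed"
        self.confirmed_by = reviewer


def provisional_profile(run):
    """Compute, per pillar, the share of required fields the run captured."""
    rates = {}
    for pillar, fields in PILLAR_CHECKS.items():
        captured = sum(1 for f in fields if run.get(f) not in (None, "", [], {}))
        rates[pillar] = captured / len(fields)
    return ProvisionalProfile(run_id=run["run_id"], pass_rates=rates)


# Hypothetical run record for an appointment summary letter.
run = {
    "run_id": "letter-0042",
    "initiator": "clerk-07",
    "approver": "",  # sign-off not yet recorded
    "log_uri": "logs/letter-0042.json",
    "output_snapshot": "v1",
    "reviewer_notes": "checked medication wording",
    "timing_ms": 1840,
    "prompt_version": "summary-letter-v3",
    "input_metadata": {"fields": ["appointments", "medications"]},
}

draft = provisional_profile(run)
print(draft.status, draft.pass_rates)
```

The pass rates are inputs to a reviewer's judgement rather than final pillar scores; the profile stays `provisional` until `confirm` is called with a named reviewer.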

Practical use in the RAIDT project

Within the RAIDT project, this item helps Paper 08 Foundations explain why scoring must be attached to the run and its evidence rather than to abstract AI principles alone. It supports Paper 09 Empirical Validation by giving a defensible basis for comparing scored runs across settings and reviewers. It contributes to Paper 10 Policy Pathways by showing regulators and organisations how principles can be translated into inspectable governance practice.

It is also directly useful for the evidence pack and scoring rubric, because it clarifies what the score is actually about. For sector playbooks and governance interventions, it helps frame practical controls: improve the evidence architecture, and the quality of governance judgement improves with it. For supervision, viva defence, and journal positioning, it provides a precise answer to the question of what RAIDT measures and why that measurement is different from generic AI assurance language.

Key audience questions to prepare for

Q1. Why not score the model once and reuse that score everywhere?

Because RAIDT governs situated use, not abstract capability. The same model can be used under very different prompts, data conditions, stakes, and review controls. A reusable model score cannot substitute for a run-level governance judgement.

Q2. Does evidence-based scoring ignore output quality?

No. Output quality can appear within the evidence pack through reviewer assessment, correction history, exception notes, or domain checks. RAIDT's point is that quality should be evidenced and contextualised, not assumed from surface fluency.

Q3. What makes the score defensible to an auditor or supervisor?

The score is defensible when the evidence pack, scoring anchor, and rationale are available for inspection. A reviewer can then see what was observed, how it was interpreted, and where uncertainty remains.

Q4. Can evidence-based scoring work in low-resource teams?

Yes, although it may begin with lightweight templates and manual review rather than full automation. The minimum requirement is not expensive tooling; it is disciplined capture of the artefacts needed to justify a governance judgement.

Q5. How does this help organisational learning rather than simple compliance?

Because repeated scoring of evidenced runs reveals recurring weaknesses, such as poor trace capture or inconsistent review. That allows teams to target process improvement instead of merely filing governance paperwork.
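As a rough illustration of how repeated scoring can surface recurring weaknesses, the following Python sketch counts how often each pillar falls below a threshold across several scored runs. The 0 to 4 scale, the `weak_below` threshold, the `min_share` cut-off, and the sample scores are all assumptions made for the example, not RAIDT parameters or real results.

```python
from collections import Counter


def recurring_weaknesses(profiles, weak_below=2, min_share=0.5):
    """Return pillars scored below `weak_below` in at least `min_share` of runs."""
    if not profiles:
        return []
    weak_counts = Counter(
        pillar
        for profile in profiles
        for pillar, score in profile.items()
        if score < weak_below
    )
    return sorted(p for p, n in weak_counts.items() if n / len(profiles) >= min_share)


# Invented scores for four runs, for illustration only.
scored_runs = [
    {"Responsibility": 3, "Auditability": 2, "Interpretability": 2, "Dependability": 3, "Traceability": 1},
    {"Responsibility": 3, "Auditability": 1, "Interpretability": 3, "Dependability": 2, "Traceability": 1},
    {"Responsibility": 2, "Auditability": 3, "Interpretability": 2, "Dependability": 3, "Traceability": 0},
    {"Responsibility": 1, "Auditability": 2, "Interpretability": 3, "Dependability": 3, "Traceability": 2},
]

print(recurring_weaknesses(scored_runs))
```

In this invented data, Traceability is weak in three of four runs, which would point a team towards improving trace capture rather than reviewing each run in isolation.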

Suggested citation concepts to support this item
Short explanation for presentation

Evidence-based scoring means RAIDT does not score a model in the abstract. It scores the governance quality of one run by examining the evidence pack for that run. That distinction is important because generative AI use is highly contextual: the same model can be used for low-risk drafting, high-stakes recommendations, or poorly controlled ad hoc tasks. RAIDT therefore asks what can actually be shown about the prompt, inputs, outputs, reviewers, controls, and decision path. The resulting five-pillar profile is a structured judgement about governability, not a vague statement of trust. This makes scoring more useful for supervision, audit, comparison across runs, and continuous improvement in real organisational settings.

One-line takeaway

Evidence-based scoring is the practice of assigning RAIDT scores to the documented evidence of a specific run because governance readiness must be demonstrated, not merely claimed.

Related items in RAIDT pillars and scoring
Anchored questions