S3.06 - Comparability

flowchart LR
    A[Fragmented documentation and output-only judgement] --> B[RAIDT - run-level evidence framework]
    A2[Uneven criteria across teams and tools] --> B
    H[Healthcare triage]
    F[Finance reporting]
    E[Education support]
    P[Enterprise productivity]
    G[Public-service review]
    B --> C[[Comparability - common evidential anchors for runs]]
    H --> C
    F --> C
    E --> C
    P --> C
    G --> C
    C --> D[Evidence pack]
    C --> I[RAIDT score profile]
    C --> J[Reviewer reconstruction]
    C --> K[Organisational learning]
    C --> L[Procurement and workflow choice]
    D --> M[Reviewability]
    I --> N[Governance readiness]
    J --> M
    K --> N
    C --> O[Evidence over assertion]
    C --> Q[Contestability and audit readiness]

Star S3 - Run-Level Evidence Logic

Star context: Explains the proof-object logic of RAIDT by showing how separate runs can be judged against the same evidential anchors, so that reconstruction, comparison, challenge and learning are possible across contexts rather than within one isolated case.

Academic picture

Definition / background

Comparability in RAIDT means that separate generative AI runs can be assessed against sufficiently common evidential anchors to support meaningful comparison. A run is not treated as a free-standing anecdote. Instead, it is documented as a governable proof object with metadata, context, outputs, review traces and pillar-relevant evidence that allow reviewers to ask whether two runs were conducted under comparable conditions and whether any observed differences are governance-relevant.

Conceptually, comparability sits close to ideas such as standardisation, benchmarking, consistency and evaluability, but it is not identical to any of them. Standardisation seeks sameness of process; comparability seeks defensible grounds for judging difference and similarity. Benchmarking often focuses on performance outcomes; comparability in RAIDT includes governance quality, evidence quality, and the conditions under which outcomes were produced. Consistency concerns repeatability of practice; comparability asks whether runs can be fairly set side by side and interpreted through the same review logic.

This matters in generative AI governance because organisations rarely rely on one model, one task, or one team. They compare vendors, prompts, guardrails, workflows, reviewers, deployment settings and time periods. Without comparability, governance discussion collapses into assertion: one run is claimed to be better, safer, or more trustworthy, but the basis of that claim remains unstable. RAIDT makes comparability operational by linking it to run-level evidence, evidence packs and five-pillar score profiles.

Within RAIDT, comparability belongs in Run-Level Evidence Logic because a run can only become a governable unit if it can be examined in relation to other runs. The item therefore connects directly to reconstructability, replayability, audit trail quality and minimum metadata. A run-level evidence pack is not only for documenting what happened once; it is also for making runs reviewable against one another in a structured and defensible way.

Why this concept matters

Comparability solves a central governance problem: organisations need to know whether differences in quality, risk or readiness across runs are real, explainable and actionable. If each run is recorded differently, scored differently, or justified through local judgement alone, then cross-run learning becomes unreliable. Teams may overstate progress, understate risk, or choose configurations that appear effective only because the evidence basis is weak.

The concept also prevents a common confusion in AI governance, namely the assumption that policy principles automatically produce comparable operational evidence. Principles such as fairness, accountability or transparency may be shared across the organisation, but unless runs are documented through common evidential anchors, adherence to those principles cannot be compared in practice. RAIDT closes that gap by moving from broad governance aspiration to structured evidence that supports comparison.

For organisations using generative AI in operational work, comparability supports procurement decisions, internal assurance, incident analysis, policy refinement and continuous improvement. It allows reviewers to determine whether one workflow is more dependable than another, whether one team is consistently producing stronger evidence packs, or whether a change in tooling has improved traceability while weakening interpretability.

Key idea: Comparability matters because RAIDT turns separate GenAI runs into evidence-bearing objects that can be judged against shared governance anchors rather than isolated claims.

What this item enables

Comparison of runs across teams, tools, prompts, vendors and time periods using shared evidential anchors.
Identification of why one run achieved stronger governance readiness than another.
More reliable interpretation of score-profile differences across the five RAIDT pillars.
Organisational learning from repeated uses of generative AI rather than one-off case descriptions.
Stronger challenge and contestability because reviewers can ask whether like is being compared with like.
Better procurement and configuration decisions because evidence quality can be compared, not merely output quality.
Governance escalation when a run appears strong on outcomes but weak on documentation, traceability or auditability.

Practical example / likely audience question

Audience question

Why compare runs?

Answer

The concern behind this question is often that comparison may appear unnecessary if a single run already produced a useful output. In practice, however, governance decisions are rarely made on one run alone. Organisations need to know whether a result was unusually good, unusually weak, or part of a stable pattern across repeated uses and alternative configurations.

The direct answer is that comparison shows which configurations repeatedly produce stronger or weaker governance readiness. In RAIDT terms, this means comparing not only outputs but also the evidence conditions around the run: task framing, model choice, prompting approach, review steps, logging completeness, human intervention and downstream use context.

For example, two teams may use different large language models to draft policy summaries. Both outputs may look competent, but one team documents prompt history, reviewer changes, confidence concerns and source handling, while the other retains only the final text. A generic AI governance approach might call both uses acceptable because they both produced useful summaries. RAIDT handles the issue better because it can show that one run is substantially more comparable, auditable and defensible than the other, even before any later dispute arises.
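The contrast between the two teams can be sketched in code. The following is a minimal illustration, not a RAIDT-defined schema: the anchor names are hypothetical labels for the evidence conditions named in the answer above, and `RunRecord`, `missing_anchors` and `shared_anchors` are invented for this example. It shows that a run retaining only its final text shares no evidential anchors with a fully documented run, so the two cannot be compared on a common basis.

```python
from dataclasses import dataclass, field

# Hypothetical anchor names, loosely based on the evidence conditions
# listed in the answer above. RAIDT itself may define these differently.
ANCHORS = (
    "task_framing",
    "model_choice",
    "prompting_approach",
    "review_steps",
    "logging",
    "human_intervention",
    "use_context",
)


@dataclass
class RunRecord:
    """Illustrative container for one run and whatever evidence was kept."""
    run_id: str
    final_output: str
    evidence: dict[str, str] = field(default_factory=dict)

    def documented(self) -> set[str]:
        """Anchors for which some non-empty evidence was retained."""
        return {a for a in ANCHORS if self.evidence.get(a, "").strip()}


def missing_anchors(run: RunRecord) -> set[str]:
    """Anchors the run cannot be assessed against."""
    return set(ANCHORS) - run.documented()


def shared_anchors(a: RunRecord, b: RunRecord) -> set[str]:
    """Anchors on which two runs can be set side by side."""
    return a.documented() & b.documented()


# Placeholder records: one team documents every anchor, the other keeps
# only the final text.
team_a = RunRecord(
    run_id="team-a-summary",
    final_output="(summary text)",
    evidence={anchor: "retained" for anchor in ANCHORS},
)
team_b = RunRecord(run_id="team-b-summary", final_output="(summary text)")

print(sorted(missing_anchors(team_b)))
print(shared_anchors(team_a, team_b))
```

In this sketch an empty result from `shared_anchors` signals that any claim about which run is "better" would rest on output quality alone.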

Practical example in RAIDT terms

Consider an enterprise productivity setting in which two departments use generative AI to draft board briefing notes. Department A uses a managed model with a fixed template, mandatory metadata capture, reviewer sign-off and retained prompt-output history. Department B uses a different model informally, copies outputs into email, and records only the final briefing note.

The run-level issue is not simply which briefing note reads better. The governance question is whether the two runs can be compared on a common basis. RAIDT would require evidence about the task purpose, prompt structure, source inputs, model version, human edits, review decisions, output use context and any risk flags. With that evidence, the two runs can be compared across Responsibility, Auditability, Interpretability, Dependability and Traceability.

In this case, comparability improves governance readiness because Department A's run can be set alongside similar runs and judged through a stable evidential structure, whereas Department B's run remains difficult to compare and therefore difficult to defend. The practical result is that RAIDT helps the organisation see not only which output was acceptable, but which workflow is more governable over time.
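A pillar-by-pillar comparison of the two departments might look like the sketch below. It rests on several assumptions that do not come from RAIDT itself: the integer scoring scale, the function name `profile_gaps`, and the score values, which are placeholders chosen for illustration and not findings. The one design choice worth noting is that the function refuses to compare profiles that lack a score for any pillar, since a partial profile gives no common basis for comparison.

```python
# The five pillar names come from the text above; everything else here
# (scale, scores, function name) is an assumption for illustration.
PILLARS = (
    "Responsibility",
    "Auditability",
    "Interpretability",
    "Dependability",
    "Traceability",
)


def profile_gaps(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Return a[p] - b[p] for each pillar p.

    Raises ValueError if either profile lacks a pillar score, because the
    two runs are then not comparable on that pillar.
    """
    missing = [p for p in PILLARS if p not in a or p not in b]
    if missing:
        raise ValueError(f"profiles not comparable, missing: {missing}")
    return {p: a[p] - b[p] for p in PILLARS}


# Placeholder scores on an assumed 0-4 scale (not real assessment results).
department_a = {
    "Responsibility": 3,
    "Auditability": 4,
    "Interpretability": 3,
    "Dependability": 3,
    "Traceability": 4,
}
department_b = {
    "Responsibility": 1,
    "Auditability": 0,
    "Interpretability": 2,
    "Dependability": 2,
    "Traceability": 0,
}

gaps = profile_gaps(department_a, department_b)
for pillar, gap in gaps.items():
    print(f"{pillar}: {gap:+d}")
```

With these placeholder values the largest gaps fall on Auditability and Traceability, which matches the scenario: Department B's informal workflow retains no prompt-output history or review record.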

Detailed link to RAIDT

Comparability links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should rest on evidence rather than broad assurance claims.
Second, it depends on the run being documented as the unit of governance, because comparison is only possible when runs carry structured evidence.
Third, it strengthens the evidence pack and score profile by making cross-run differences interpretable instead of impressionistic.
Fourth, it supports reviewability, contestability, audit readiness and organisational learning because reviewers can examine why one run scored differently from another.

Comparability -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

This chain matters because comparability is the bridge between documenting one run and learning systematically from many runs.

Link to the five RAIDT pillars

Responsibility

Comparability supports Responsibility by making it possible to judge whether similar tasks are being governed to similar standards across teams or units. It exposes uneven practice that may otherwise remain hidden.