S3.06 - Comparability

flowchart LR
    A[Fragmented documentation and output-only judgement] --> B[RAIDT - run-level evidence framework]
    A2[Uneven criteria across teams and tools] --> B
    H[Healthcare triage]
    F[Finance reporting]
    E[Education support]
    P[Enterprise productivity]
    G[Public-service review]
    B --> C[[Comparability - common evidential anchors for runs]]
    H --> C
    F --> C
    E --> C
    P --> C
    G --> C
    C --> D[Evidence pack]
    C --> I[RAIDT score profile]
    C --> J[Reviewer reconstruction]
    C --> K[Organisational learning]
    C --> L[Procurement and workflow choice]
    D --> M[Reviewability]
    I --> N[Governance readiness]
    J --> M
    K --> N
    C --> O[Evidence over assertion]
    C --> Q[Contestability and audit readiness]

Star S3 - Run-Level Evidence Logic

Star context: Explains the proof-object logic of RAIDT by showing how separate runs can be judged against the same evidential anchors, so that reconstruction, comparison, challenge and learning are possible across contexts rather than within one isolated case.


Academic picture
Definition / background

Comparability in RAIDT means that separate generative AI runs can be assessed against sufficiently common evidential anchors to support meaningful comparison. A run is not treated as a free-standing anecdote. Instead, it is documented as a governable proof object with metadata, context, outputs, review traces and pillar-relevant evidence that allow reviewers to ask whether two runs were conducted under comparable conditions and whether any observed differences are governance-relevant.
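As a minimal sketch of what "a governable proof object" could look like as a data structure, the Python below models a run with the five evidence categories named above. The class name `RunRecord`, its field names and the sample values are illustrative assumptions, not a schema defined by RAIDT.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One generative AI run documented as a proof object.

    Field names are illustrative assumptions, not a RAIDT-defined schema.
    """
    run_id: str
    metadata: dict = field(default_factory=dict)         # e.g. model version, timestamp
    context: dict = field(default_factory=dict)          # task purpose, use context
    outputs: list = field(default_factory=list)          # generated outputs
    review_traces: list = field(default_factory=list)    # reviewer edits and decisions
    pillar_evidence: dict = field(default_factory=dict)  # evidence keyed by pillar name

    def populated_anchors(self) -> set:
        """Return the evidence categories that actually contain something."""
        names = ("metadata", "context", "outputs", "review_traces", "pillar_evidence")
        return {name for name in names if getattr(self, name)}


# Hypothetical runs: one fully documented, one retaining only the output.
documented = RunRecord(
    run_id="run-001",
    metadata={"model_version": "model-x-1"},
    context={"task_purpose": "policy summary"},
    outputs=["Draft summary text"],
    review_traces=["Reviewer shortened section 2"],
    pillar_evidence={"Traceability": "prompt history retained"},
)
output_only = RunRecord(run_id="run-002", outputs=["Final text only"])

print(sorted(documented.populated_anchors()))
print(sorted(output_only.populated_anchors()))
```

The second run is still a record, but with only its outputs populated it offers reviewers almost nothing to compare against.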

Conceptually, comparability sits close to ideas such as standardisation, benchmarking, consistency and evaluability, but it is not identical to any of them. Standardisation seeks sameness of process; comparability seeks defensible grounds for judging difference and similarity. Benchmarking often focuses on performance outcomes; comparability in RAIDT includes governance quality, evidence quality, and the conditions under which outcomes were produced. Consistency concerns repeatability of practice; comparability asks whether runs can be fairly set side by side and interpreted through the same review logic.

This matters in generative AI governance because organisations rarely rely on one model, one task, or one team. They compare vendors, prompts, guardrails, workflows, reviewers, deployment settings and time periods. Without comparability, governance discussion collapses into assertion: one run is claimed to be better, safer, or more trustworthy, but the basis of that claim remains unstable. RAIDT makes comparability operational by linking it to run-level evidence, evidence packs and five-pillar score profiles.

Within RAIDT, comparability belongs in Run-Level Evidence Logic because a run can only become a governable unit if it can be examined in relation to other runs. The item therefore connects directly to reconstructability, replayability, audit trail quality and minimum metadata. A run-level evidence pack is not only for documenting what happened once; it is also for making runs reviewable against one another in a structured and defensible way.

Why this concept matters

Comparability solves a central governance problem: organisations need to know whether differences in quality, risk or readiness across runs are real, explainable and actionable. If each run is recorded differently, scored differently, or justified through local judgement alone, then cross-run learning becomes unreliable. Teams may overstate progress, understate risk, or choose configurations that appear effective only because the evidence basis is weak.

The concept also prevents a common confusion in AI governance, namely the assumption that policy principles automatically produce comparable operational evidence. Principles such as fairness, accountability or transparency may be shared across the organisation, but unless runs are documented through common evidential anchors, there is no way to compare in practice how well individual runs meet those principles. RAIDT closes that gap by moving from broad governance aspiration to structured evidence that supports comparison.

For organisations using generative AI in operational work, comparability supports procurement decisions, internal assurance, incident analysis, policy refinement and continuous improvement. It allows reviewers to determine whether one workflow is more dependable than another, whether one team is consistently producing stronger evidence packs, or whether a change in tooling has improved traceability while weakening interpretability.

Key idea: Comparability matters because RAIDT turns separate GenAI runs into evidence-bearing objects that can be judged against shared governance anchors rather than isolated claims.

What this item enables
Practical example / likely audience question

Audience question

Why compare runs?

Answer

The concern behind this question is often that comparison may appear unnecessary if a single run already produced a useful output. In practice, however, governance decisions are rarely made on one run alone. Organisations need to know whether a result was unusually good, unusually weak, or part of a stable pattern across repeated uses and alternative configurations.

The direct answer is that comparison shows which configurations repeatedly produce stronger or weaker governance readiness. In RAIDT terms, this means comparing not only outputs but also the evidence conditions around the run: task framing, model choice, prompting approach, review steps, logging completeness, human intervention and downstream use context.

For example, two teams may use different large language models to draft policy summaries. Both outputs may look competent, but one team documents prompt history, reviewer changes, confidence concerns and source handling, while the other retains only the final text. A generic AI governance approach might call both uses acceptable because they both produced useful summaries. RAIDT handles the issue better because it can show that one run is substantially more comparable, auditable and defensible than the other, even before any later dispute arises.

Practical example in RAIDT terms

Consider an enterprise productivity setting in which two departments use generative AI to draft board briefing notes. Department A uses a managed model with a fixed template, mandatory metadata capture, reviewer sign-off and retained prompt-output history. Department B uses a different model informally, copies outputs into email, and records only the final briefing note.

The run-level issue is not simply which briefing note reads better. The governance question is whether the two runs can be compared on a common basis. RAIDT would require evidence about the task purpose, prompt structure, source inputs, model version, human edits, review decisions, output use context and any risk flags. With that evidence, the two runs can be compared across Responsibility, Auditability, Interpretability, Dependability and Traceability.

In this case, comparability improves governance readiness because Department A's run can be set alongside similar runs and judged through a stable evidential structure, whereas Department B's run remains difficult to compare and therefore difficult to defend. The practical result is that RAIDT helps the organisation see not only which output was acceptable, but which workflow is more governable over time.
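The two departments can be sketched as a simple evidence-completeness check. This is an illustration under stated assumptions: the field names in `REQUIRED_EVIDENCE` paraphrase the evidence items listed in the example, and the sample values for both departments are invented for the sketch.

```python
# Field names paraphrase the evidence items in the example; they are assumptions.
REQUIRED_EVIDENCE = (
    "task_purpose", "prompt_structure", "source_inputs", "model_version",
    "human_edits", "review_decisions", "output_use_context", "risk_flags",
)


def evidence_completeness(run: dict) -> tuple:
    """Return (fraction of required fields recorded, list of missing fields).

    A field counts as recorded when the key is present and not None, so an
    empty list of risk flags still counts as a recorded answer.
    """
    missing = [name for name in REQUIRED_EVIDENCE if run.get(name) is None]
    fraction = (len(REQUIRED_EVIDENCE) - len(missing)) / len(REQUIRED_EVIDENCE)
    return fraction, missing


# Hypothetical records for the two departments.
department_a = {
    "task_purpose": "board briefing note",
    "prompt_structure": "fixed template",
    "source_inputs": ["Q3 finance report"],
    "model_version": "managed-model-1",
    "human_edits": ["tone adjusted by reviewer"],
    "review_decisions": ["signed off"],
    "output_use_context": "board meeting",
    "risk_flags": [],
}
department_b = {"final_text": "Briefing note sent by email"}

for name, run in (("A", department_a), ("B", department_b)):
    fraction, missing = evidence_completeness(run)
    print(f"Department {name}: {fraction:.0%} complete, missing {missing}")
```

The check says nothing about which briefing note reads better; it only shows which run carries enough evidence to be set beside others.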

Detailed link to RAIDT

Comparability links to RAIDT in four ways.

First, it supports RAIDT's core idea that governance should rest on evidence rather than broad assurance claims.
Second, it depends on the run being documented as the unit of governance, because comparison is only possible when runs carry structured evidence.
Third, it strengthens the evidence pack and score profile by making cross-run differences interpretable instead of impressionistic.
Fourth, it supports reviewability, contestability, audit readiness and organisational learning because reviewers can examine why one run scored differently from another.

Comparability -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

This chain matters because comparability is the bridge between documenting one run and learning systematically from many runs.
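One way to make cross-run differences interpretable rather than impressionistic is to compare two score profiles pillar by pillar. The sketch below assumes a 0 to 4 scoring scale and invented scores; neither is specified in this item, and `profile_delta` is a hypothetical helper.

```python
PILLARS = ("Responsibility", "Auditability", "Interpretability",
           "Dependability", "Traceability")


def profile_delta(profile_a: dict, profile_b: dict) -> dict:
    """Per-pillar score difference (a minus b); both profiles must cover all pillars."""
    for profile in (profile_a, profile_b):
        absent = [p for p in PILLARS if p not in profile]
        if absent:
            raise ValueError(f"score profile is missing pillars: {absent}")
    return {p: profile_a[p] - profile_b[p] for p in PILLARS}


# Hypothetical scores on an assumed 0-4 scale.
run_a = {"Responsibility": 3, "Auditability": 4, "Interpretability": 2,
         "Dependability": 3, "Traceability": 4}
run_b = {"Responsibility": 3, "Auditability": 1, "Interpretability": 3,
         "Dependability": 2, "Traceability": 1}

delta = profile_delta(run_a, run_b)
for pillar, diff in delta.items():
    print(f"{pillar}: {diff:+d}")
```

Refusing to compute a delta when a pillar is unscored is a deliberate choice in this sketch: a partial profile would make the comparison look more complete than the evidence allows.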

Link to the five RAIDT pillars

Responsibility

Comparability supports Responsibility by making it possible to judge whether similar tasks are being governed to similar standards across teams or units. It exposes uneven practice that may otherwise remain hidden.

Example evidence / implication:

Auditability

Comparability has a particularly strong effect on Auditability because auditors need stable grounds for comparing one run with another. If evidence structures vary too widely, formal review becomes weak.

Example evidence / implication:

Interpretability

Comparability supports Interpretability by helping reviewers explain why one run appears stronger or weaker. It prevents superficial conclusions based only on output quality.

Example evidence / implication:

Dependability

Comparability supports Dependability by showing whether governance readiness is stable across repeated or variant runs. It helps distinguish robust practice from one-off success.

Example evidence / implication:

Traceability

Comparability strongly supports Traceability because runs can only be compared well if their lineage, context and evidence chain are visible. Weak traceability weakens comparison.

Example evidence / implication:

Comparability affects all five pillars, but its effect is strongest on Auditability, Traceability and Dependability, because these pillars rely most directly on a stable cross-run evidential structure.

Why this item is more than a generic concept

In general AI governance, comparability may simply mean that systems, models or evaluations can be set beside one another for broad assessment. In RAIDT, comparability has a narrower and more operational meaning: runs are comparable when they contain enough shared evidential structure to support review, challenge, explanation and scoring at the point of governance.

The RAIDT meaning is more operational because it is tied to run-level evidence rather than abstract policy alignment alone. It is therefore not satisfied by saying that two systems both comply with the same principle. It requires concrete evidence showing how each run was configured, conducted, reviewed and documented so that differences can be interpreted responsibly.

Common misunderstanding

Misunderstanding

Comparability means every run must be identical or fully standardised.

Correction

Comparability does not require identical runs. It requires sufficient common anchors to make differences intelligible and reviewable. A healthcare triage-support run and an education feedback-support run will not be identical, but both can still be comparable in RAIDT if they retain core metadata, review traces, output context and pillar-relevant evidence.

A practical example is vendor comparison. Two vendors may use different model architectures and interfaces, so the runs are not the same. RAIDT still allows comparison if both runs preserve the evidence needed to judge how each performed as a governable run. The point is not sameness; it is defensible comparison.
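The distinction between sameness and comparability can be sketched in code. In the hypothetical check below, two vendor runs differ in most of their recorded values yet remain comparable because both retain the same core anchors. The anchor names follow the correction above; the function `comparability_report` and the vendor data are assumptions made for illustration.

```python
# Anchor names follow the text; treating them as dictionary keys is an assumption.
CORE_ANCHORS = ("core_metadata", "review_traces", "output_context", "pillar_evidence")


def comparability_report(run_a: dict, run_b: dict) -> dict:
    """Report missing anchors per run and which shared anchors hold different values."""
    missing = {
        label: [anchor for anchor in CORE_ANCHORS if not run.get(anchor)]
        for label, run in (("a", run_a), ("b", run_b))
    }
    comparable = not missing["a"] and not missing["b"]
    differences = [
        anchor for anchor in CORE_ANCHORS
        if run_a.get(anchor) and run_b.get(anchor) and run_a[anchor] != run_b[anchor]
    ]
    return {"comparable": comparable, "missing": missing,
            "documented_differences": differences}


# Hypothetical vendor runs with different models but the same anchors retained.
vendor_x = {
    "core_metadata": {"model": "vendor-x-llm"},
    "review_traces": ["two reviewer edits"],
    "output_context": "procurement pilot",
    "pillar_evidence": {"Auditability": "full prompt log"},
}
vendor_y = {
    "core_metadata": {"model": "vendor-y-llm"},
    "review_traces": ["no edits needed"],
    "output_context": "procurement pilot",
    "pillar_evidence": {"Auditability": "partial prompt log"},
}

report = comparability_report(vendor_x, vendor_y)
print(report)
```

Differences appear in the report as documented differences, not as failures; only a missing anchor makes the pair non-comparable.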

Boundary and limitation

Comparability does not prove that a run is good, safe, lawful or ethically justified. It only ensures that runs can be placed into a common frame of review. A weak run can still be highly comparable if it is well documented, and a strong output can remain poorly comparable if the surrounding evidence is incomplete.

The concept also depends on minimum metadata, stable review logic and disciplined documentation practice. If organisations compare runs that differ radically in task purpose, risk exposure or evidence completeness without acknowledging those differences, comparability can become misleading. RAIDT handles this limitation by insisting that comparison should be anchored in run context and evidence quality, not just surface outcome similarity.
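A small guard can make such differences explicit before any comparison is reported. The sketch below is illustrative only: the field names, the 0.25 completeness-gap threshold and the sample runs are assumptions, not RAIDT requirements.

```python
def comparison_caveats(run_a: dict, run_b: dict,
                       max_completeness_gap: float = 0.25) -> list:
    """List the contextual differences a reviewer should acknowledge.

    The threshold is an arbitrary illustrative value.
    """
    caveats = []
    if run_a["task_purpose"] != run_b["task_purpose"]:
        caveats.append("task purposes differ")
    if run_a["risk_exposure"] != run_b["risk_exposure"]:
        caveats.append("risk exposure differs")
    gap = abs(run_a["evidence_completeness"] - run_b["evidence_completeness"])
    if gap > max_completeness_gap:
        caveats.append(f"evidence completeness differs by {gap:.2f}")
    return caveats


# Hypothetical runs from two different sectors.
triage_run = {"task_purpose": "healthcare triage support",
              "risk_exposure": "high", "evidence_completeness": 0.9}
feedback_run = {"task_purpose": "education feedback support",
                "risk_exposure": "medium", "evidence_completeness": 0.5}

for caveat in comparison_caveats(triage_run, feedback_run):
    print("Caveat:", caveat)
```

An empty list does not mean the comparison is sound; it only means none of these three checks found a difference that needs to be stated alongside the result.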

Implementation levels

Manual implementation

A researcher or small team can apply comparability manually by using a fixed RAIDT template for each run, recording the same metadata fields, capturing prompt-output history, and reviewing each run against a common pillar rubric. Manual comparison tables can then be used to compare runs across cases.

Semi-automated implementation

Semi-automated implementation can use forms, templates, structured review sheets and metadata validation rules to ensure that runs contain the same minimum evidence fields. Dashboards or spreadsheets can then compare score profiles and evidence completeness across runs.
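Metadata validation rules of this kind could be sketched as a table of per-field checks. The field names and rules below are hypothetical; a real deployment would substitute whatever minimum evidence fields the organisation has agreed.

```python
from datetime import datetime


def _non_empty_text(value) -> bool:
    """True for strings that contain something other than whitespace."""
    return isinstance(value, str) and bool(value.strip())


def _iso_timestamp(value) -> bool:
    """True for strings that datetime.fromisoformat can parse."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


# Hypothetical minimum evidence fields and the rule each must satisfy.
VALIDATION_RULES = {
    "model_identifier": _non_empty_text,
    "task_purpose": _non_empty_text,
    "reviewer": _non_empty_text,
    "run_timestamp": _iso_timestamp,
}


def validate_run(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field_name, rule in VALIDATION_RULES.items():
        if field_name not in record:
            problems.append(f"{field_name}: missing")
        elif not rule(record[field_name]):
            problems.append(f"{field_name}: invalid value {record[field_name]!r}")
    return problems


incomplete = {"model_identifier": "model-x-1", "task_purpose": " ",
              "run_timestamp": "yesterday"}
print(validate_run(incomplete))
```

Running the same rules on every submitted form is what keeps the later spreadsheet or dashboard comparison honest, since every row is known to carry the same minimum fields.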

Fully automated implementation

At scale, comparability can be implemented through wrappers, orchestration layers, logging pipelines and governance dashboards that automatically capture run metadata, model identifiers, reviewer interventions, prompt chains, timestamps and scoring rationales. A fully automated system can flag when runs are no longer comparable because required evidence is missing or inconsistent.
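A wrapper of this kind might look like the following sketch, in which a decorator records metadata for every call to a generation function and flags missing evidence. Everything here is hypothetical: `capture_run`, the in-memory `RUN_LOG` standing in for a logging pipeline, the required fields, and the stub `draft_note` function that stands in for a real model call.

```python
import functools
import uuid
from datetime import datetime, timezone

RUN_LOG = []  # stands in for a logging pipeline (assumption)

REQUIRED_FIELDS = ("run_id", "timestamp", "model_identifier",
                   "prompt", "output", "reviewer")


def capture_run(model_identifier: str, reviewer=None):
    """Decorator that logs run metadata around a text-generation function."""
    def decorator(generate):
        @functools.wraps(generate)
        def wrapper(prompt: str) -> str:
            output = generate(prompt)
            record = {
                "run_id": str(uuid.uuid4()),
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "model_identifier": model_identifier,
                "prompt": prompt,
                "output": output,
                "reviewer": reviewer,
            }
            # Flag evidence gaps that would make this run hard to compare.
            record["missing_evidence"] = [
                name for name in REQUIRED_FIELDS if not record[name]
            ]
            RUN_LOG.append(record)
            return output
        return wrapper
    return decorator


@capture_run(model_identifier="stub-model-1")
def draft_note(prompt: str) -> str:
    """Stub generator used in place of a real model call."""
    return f"DRAFT: {prompt}"


draft_note("Summarise Q3 risks")
print(RUN_LOG[-1]["missing_evidence"])
```

Because no reviewer was supplied, the logged record is flagged as missing reviewer evidence, which is the kind of signal a governance dashboard could surface automatically.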

Practical use in the RAIDT project

In the RAIDT project, comparability is useful for explaining why run-level evidence matters in the first place. In Paper 08 Foundations, it helps justify the move from principle-based governance language to evidence-based comparison across runs. In Paper 09 Empirical Validation, it supports analysis of whether different workflows, sectors or tool configurations produce measurably different governance readiness patterns. In Paper 10 Policy Pathways, it helps show policymakers and organisational leaders why evidence structures must support cross-run learning rather than one-off compliance reporting.

The concept also strengthens sector playbooks, evidence-pack design and scoring-rubric refinement. It is particularly valuable in supervision meetings and viva defence because it answers the question of how RAIDT moves beyond documenting a single case toward building an evidence architecture for organisational learning, comparison and intervention.

Key audience questions to prepare for

Q1. Is comparability just benchmarking under a different name?

No. Benchmarking usually centres on outcome performance against a task or test set. Comparability in RAIDT includes outcome quality, but it also includes the evidential conditions under which the run occurred, making it a governance concept rather than a performance concept alone.

Q2. Can runs still be comparable if they use different models or prompts?

Yes, provided they retain enough shared evidential anchors. The purpose is not to erase difference but to document difference so that it can be interpreted responsibly.

Q3. Why is comparability important for audit readiness?

Audit readiness depends on being able to explain why one run was treated differently from another, or why one was scored higher. Without comparability, those explanations become ad hoc and difficult to defend.

Q4. Does comparability mean the highest-scoring run is always the best choice?

No. A higher-scoring run may still be unsuitable for strategic, legal or operational reasons. Comparability improves the quality of decision-making, but it does not remove the need for contextual judgement.

Q5. What happens if comparability is weak?

If comparability is weak, organisations cannot distinguish between real governance improvement and documentation noise. This weakens learning, assurance, procurement decisions and response to challenge.

Suggested citation concepts to support this item
Short explanation for presentation

Comparability in RAIDT means that different generative AI runs can be judged against the same evidential anchors rather than treated as isolated anecdotes. That matters because organisations do not govern one run once; they govern repeated uses, alternative configurations, different teams and changing tools over time. By making runs comparable, RAIDT allows reviewers to see why one workflow is more governance-ready than another, not just whether an output looked good on the day. This strengthens evidence packs, makes score profiles more interpretable, and supports audit readiness, contestability and organisational learning. In short, comparability is what allows RAIDT to move from documenting single cases to building a defensible governance picture across many real-world uses of generative AI.

One-line takeaway

Comparability is the capacity to judge GenAI runs against shared evidential anchors, made possible because RAIDT treats each run as a governable evidence object rather than an isolated output.

