Evidence Architecture and Artefacts

flowchart LR
    A[Run-level uncertainty] --> B[RAIDT governance logic]
    B --> C[Star S4 evidence architecture]
    C --> D[Run-level evidence pack]
    C --> E[Five-pillar score profile]
    D --> F[Reviewer reconstruction]
    D --> G[Audit-ready artefacts]
    E --> H[Governance intervention]
    G --> I[Policy and standards alignment]
    J[Prompts, models, tools, review] --> C

Circle 2 - Operational governance mechanism

Ring: Operational star

Function

Defines the evidence architecture that turns a single RAIDT run into an inspectable, reviewable, and governable record. This star specifies which fields, identifiers, hashes, traces, review artefacts, and access controls must be captured so that a run-level evidence pack can support scoring across Responsibility, Auditability, Interpretability, Dependability, and Traceability.

Role in the project

This note sits at the operational centre of the RAIDT project. It translates RAIDT from a conceptual governance framework into a concrete evidence model that can be implemented, tested, audited, and discussed with supervisors and organisational stakeholders. In project terms, Star S4 belongs primarily to evidence architecture and implementation, but it also supports foundations, empirical validation, policy alignment, and sector application because the quality of RAIDT depends on whether a run can be evidenced consistently across settings.

Main message

RAIDT treats the run, rather than the model in the abstract, as the unit of governance. That shift matters because organisational risk rarely arises from a model in isolation. It arises when a particular person, using a particular prompt, on a particular system configuration, at a particular time, generates an output that affects work. Star S4 addresses the practical consequence of that claim: if the run is the unit of governance, the project needs an evidence architecture that can capture what actually happened in that run in a structured and inspectable way.

In this note, evidence architecture means the design of the fields, identifiers, traces, and linked artefacts that together make a run record intelligible. Artefacts are the concrete objects that populate that architecture: prompt versions, hashes, retrieval traces, model identifiers, reviewer notes, access controls, and related metadata. The aim is not to create an exhaustive technical log of everything a system could emit. The aim is to create a governance-ready record that is proportionate, decision-relevant, and usable by supervisors, auditors, operators, and policy-facing audiences.

The problem this star solves is straightforward but serious. Generative AI systems are dynamic, configurable, and often partially opaque. The same apparent task can produce different outputs because the prompt changed, the provider updated the model, the retrieval index was refreshed, the decoding parameters varied, a PEFT or LoRA adapter was switched, or an alignment policy altered refusal behaviour. If none of that is captured, organisations are left with a weak form of accountability. They may know that an answer was produced, but they cannot explain which configuration produced it, whether the run followed policy, whether a reviewer intervened, or whether the output can be contested after the fact. In managerial terms, this is an uncertainty problem. Decision-makers cannot govern what they cannot reconstruct.

RAIDT therefore requires a run-level evidence pack. Star S4 specifies the architecture of that pack. The run identifier and timestamp establish the event. User role, task label, and domain label establish organisational context and help distinguish authorised from unauthorised usage. Prompt registry, prompt ID, prompt version, and prompt hash establish which instruction was actually used, which is essential for prompt engineering governance. Model, provider, version, and decoding parameters establish which generative configuration produced the output. Retrieval query, index ID, and retrieved document identifiers with hashes show whether a retrieval-augmented generation workflow shaped the response and whether the supporting documents can later be inspected. Tool-chain trace records downstream actions and therefore becomes crucial when a model does more than generate text, for example by calling a database, workflow engine, or external API. Adapter lineage and alignment policy ID matter because fine-tuning, PEFT, LoRA, and alignment controls can alter behaviour materially even when the headline model name remains constant. Output hash, review decision, reviewer notes, and retention or access controls close the loop by supporting inspection, contestability, and secure stewardship of evidence.
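
As a minimal sketch of how the fields listed above could be grouped into one record, the following Python dataclass may help. It is illustrative only: the class name `RunEvidencePack`, the field names, the example values, and the choice of SHA-256 are assumptions made for this example, not a schema or hash function prescribed by RAIDT.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class RunEvidencePack:
    # Event
    run_id: str
    timestamp: str  # ISO 8601 string
    # Organisational context
    user_role: str
    task_label: str
    domain_label: str
    # Prompt governance
    prompt_registry: str
    prompt_id: str
    prompt_version: str
    prompt_hash: str
    # Generative configuration
    model: str
    provider: str
    model_version: str
    decoding_params: dict
    # Augmentation (left empty when the run uses no retrieval or tools)
    retrieval_query: str = ""
    index_id: str = ""
    retrieved_doc_hashes: dict = field(default_factory=dict)  # doc ID -> hash
    tool_trace: list = field(default_factory=list)
    # Adaptation and alignment
    adapter_lineage: list = field(default_factory=list)
    alignment_policy_id: str = ""
    # Output integrity, review, and stewardship
    output_hash: str = ""
    review_decision: str = ""
    reviewer_notes: str = ""
    retention_policy: str = ""
    access_roles: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise with sorted keys so the same record always yields the same text."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical values for one student-support drafting run.
prompt_text = "Draft a supportive reply that cites current welfare policy."
output_text = "Dear student, thank you for getting in touch."

pack = RunEvidencePack(
    run_id="run-0001",
    timestamp="2025-01-15T10:30:00Z",
    user_role="student-support-officer",
    task_label="draft-support-email",
    domain_label="student-welfare",
    prompt_registry="central-prompt-registry",
    prompt_id="welfare-reply",
    prompt_version="1.2",
    prompt_hash=sha256_hex(prompt_text),
    model="example-model",
    provider="example-provider",
    model_version="2025-01",
    decoding_params={"temperature": 0.2, "top_p": 0.9},
    output_hash=sha256_hex(output_text),
    review_decision="approved",
)
print(pack.to_json())
```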

This architecture matters because it provides the bridge between data capture and governance judgement. RAIDT does not merely collect evidence for archival purposes. It uses that evidence to support scoring across five pillars. Responsibility is supported by fields that identify roles, decisions, approval points, and policy ownership. Auditability depends on whether an independent reviewer can inspect the sequence of artefacts and determine what happened. Interpretability is supported indirectly through evidence about prompts, retrieved context, model selection, and reviewer rationale, even when the internal model remains technically opaque. Dependability depends on whether repeated or similar runs can be compared and whether failure points can be diagnosed. Traceability is the most obvious beneficiary because the architecture is explicitly designed to connect an output back to its generating conditions.
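
One way to picture the bridge from captured fields to the five pillars is a simple coverage check. The sketch below is an assumption-laden illustration: the mapping in `PILLAR_FIELDS`, the field names, and the idea of reporting a populated-field fraction are all hypothetical, since this note does not define a scoring formula. Coverage of this kind would at most be one input to a governance judgement, not the judgement itself.

```python
# Hypothetical mapping from each pillar to the evidence fields that support it.
PILLAR_FIELDS = {
    "Responsibility": ["user_role", "review_decision", "reviewer_notes", "alignment_policy_id"],
    "Auditability": ["run_id", "timestamp", "prompt_hash", "output_hash"],
    "Interpretability": ["prompt_version", "retrieved_doc_ids", "model_version", "reviewer_notes"],
    "Dependability": ["model_version", "decoding_params", "index_id"],
    "Traceability": ["run_id", "prompt_id", "model_version", "retrieved_doc_ids", "tool_trace", "output_hash"],
}

EMPTY_VALUES = (None, "", [], {})


def coverage_profile(record: dict) -> dict:
    """For each pillar, return the fraction of its supporting fields that are populated."""
    profile = {}
    for pillar, names in PILLAR_FIELDS.items():
        present = sum(1 for name in names if record.get(name) not in EMPTY_VALUES)
        profile[pillar] = round(present / len(names), 2)
    return profile


# A partially captured run: no retrieval, tool, or alignment evidence, and empty reviewer notes.
record = {
    "run_id": "run-0001",
    "timestamp": "2025-01-15T10:30:00Z",
    "user_role": "student-support-officer",
    "prompt_id": "welfare-reply",
    "prompt_version": "1.2",
    "prompt_hash": "placeholder-hash",
    "model_version": "2025-01",
    "decoding_params": {"temperature": 0.2},
    "output_hash": "placeholder-hash",
    "review_decision": "approved",
    "reviewer_notes": "",
}

profile = coverage_profile(record)
for pillar, share in profile.items():
    print(f"{pillar}: {share}")
```

In this example the record is fully covered for Auditability but only half covered for Responsibility and Interpretability, which illustrates how the same run can show different evidence gaps across pillars.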

A practical example makes the logic clearer. Suppose a university uses a GenAI assistant to draft student support communications. A problematic email is later challenged because it gave misleading advice. Without Star S4 evidence, the institution might only see the final text. With Star S4 evidence, it can inspect the user role, task label, prompt version, model version, retrieved policy documents, reviewer notes, and retention settings. It may discover that the operator used an outdated prompt, the retrieval index did not include the latest student welfare policy, and the reviewer approved the response without escalating uncertainty. The value of the evidence pack here is not only retrospective explanation. It supports governance intervention: update the prompt registry, rebuild the index, revise the review rule, and adjust the RAIDT score profile.

A second example concerns sector sensitivity. In healthcare triage support, retrieved document hashes and tool traces may be especially important because clinical pathways, document provenance, and decision escalation are safety-relevant. In legal drafting support, prompt versioning and alignment-policy identifiers may matter more because acceptable refusal behaviour, jurisdictional constraints, and source boundaries must be documented carefully. The same S4 architecture can therefore support sector playbooks without assuming that all fields carry equal importance in every setting.

Star S4 also supports the research logic of the RAIDT programme. Paper 08 needs a defensible methodological pathway showing that RAIDT is more than a normative slogan; it requires an operational grammar of evidence. Paper 09 needs observable fields that can be tested empirically across cases, scored for completeness and usefulness, and evaluated with practitioners. Paper 10 needs a way to show how organisational controls can map onto policy and standards discussions, including the EU AI Act, ISO/IEC 42001, and the NIST AI RMF. This star supplies the artefactual layer that makes those connections credible.

The boundaries of the star are equally important. Evidence architecture does not eliminate uncertainty, and it does not make a stochastic model fully reproducible in the scientific sense. Some vendor systems are closed, some model internals remain inaccessible, and some outputs are influenced by changing upstream services. Star S4 therefore supports accountable inspection rather than perfect determinism. It also does not claim that every run should capture unlimited detail. Over-capture creates privacy, security, cost, and usability burdens. The design principle should be sufficiency for governance, not maximal logging. For RAIDT, that distinction is essential: the evidence pack must be rigorous enough to support review and proportionate enough to be usable in real organisations.

Key questions and answers

Q1. What is meant by evidence architecture in RAIDT?

Answer:
Evidence architecture is the structured design of the fields and artefacts that make a single GenAI run inspectable. It specifies what must be recorded, how records relate to each other, and which elements are necessary for governance rather than only technical debugging.

Practical example:
A run record captures the prompt version, model version, retrieval source IDs, and reviewer decision for one procurement-support task.

Link to RAIDT:
This is the backbone of the run-level evidence pack and enables scoring across all five pillars because each pillar depends on evidence rather than assertion.

Q2. Why does RAIDT focus on the run rather than only on the model?

Answer:
The model alone does not explain organisational risk. Risk emerges when a configured system is used for a defined task in a real context. The run captures the actual combination of prompt, model, tools, retrieved context, output, and checks.

Practical example:
The same foundation model behaves differently in a customer-service run with a strict prompt template than in a policy-drafting run with a retrieval pipeline.

Link to RAIDT:
RAIDT is explicitly a run-level governance framework, so S4 operationalises the basic unit that the framework governs.

Q3. What problem does S4 solve for organisational governance?

Answer:
S4 solves the problem of weak reconstructability. Many organisations keep outputs but cannot show how those outputs were generated, reviewed, or constrained. That limits accountability, contestability, and learning.

Practical example:
An executive asks why a generated report cited the wrong internal policy. Without S4 evidence, the team can only guess. With S4 evidence, it can inspect the retrieval query, index version, and document hashes.

Link to RAIDT:
Better reconstructability improves Auditability and Traceability and makes governance interventions evidence-based.

Q4. Why are prompt identifiers, versions, and hashes necessary?

Answer:
Prompt text is not trivial input; it is a material part of system behaviour. Versioning and hashing allow an organisation to distinguish approved prompts from ad hoc prompt edits and to verify what instruction was used at the time of the run.

Practical example:
Two analysts claim they used the same template, but the hash shows that one inserted a hidden instruction that changed the tone and scope of the output.

Link to RAIDT:
These artefacts strengthen Responsibility, Auditability, and Interpretability because they connect outputs to governed prompt engineering practices.
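
The hash comparison in the example above can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 over the exact prompt text and a hypothetical in-memory `registry` keyed by prompt ID and version; the prompt wording and identifiers are invented for the example.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Hash the exact prompt text; any edit, however small, changes the digest."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


# Hypothetical registry: (prompt ID, version) -> hash of the approved text.
approved_text = "Summarise the attached policy for procurement staff in neutral language."
registry = {("procurement-summary", "2.0"): prompt_hash(approved_text)}


def is_approved(prompt_id: str, version: str, text_used: str) -> bool:
    """Check whether the text actually used matches the registered version."""
    return registry.get((prompt_id, version)) == prompt_hash(text_used)


# Both analysts claim to have used version 2.0 of the template.
analyst_a = approved_text
analyst_b = approved_text + " Emphasise the benefits of supplier X."

print(is_approved("procurement-summary", "2.0", analyst_a))
print(is_approved("procurement-summary", "2.0", analyst_b))
```

Hashing the exact text means that even whitespace changes alter the digest. Whether to normalise whitespace before hashing is a design choice this sketch leaves open.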

Q5. Why does retrieval evidence matter in RAG systems?

Answer:
In retrieval-augmented generation, the retrieved context may shape the output as much as the prompt does. Governance therefore requires visibility into the query, index, and retrieved documents, not only the final answer.

Practical example:
A compliance chatbot gives outdated advice because the retrieval index excluded the latest policy memo. The retrieved document hashes make that gap visible.

Link to RAIDT:
Retrieval evidence feeds the evidence pack directly and supports Traceability, Dependability, and policy-facing review.
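
A reviewer's check on retrieval evidence might look like the following sketch. It assumes the run recorded document IDs with content hashes, and that the reviewer has a list of documents that should have informed the answer. The field names, document IDs, and the `retrieval_gaps` helper are hypothetical.

```python
import hashlib


def doc_hash(content: str) -> str:
    """Hash document content so the version that was retrieved can be verified later."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Evidence captured at run time (hypothetical field names and values).
retrieval_evidence = {
    "retrieval_query": "remote working expenses policy",
    "index_id": "policy-index-2024-11",
    "retrieved_docs": {"memo-017": doc_hash("Expenses memo, version 1")},
}

# Current authoritative documents, hashed the same way.
current_docs = {
    "memo-017": doc_hash("Expenses memo, version 1"),
    "memo-021": doc_hash("Expenses memo, version 2"),
}


def retrieval_gaps(evidence: dict, current: dict, expected_ids: list) -> dict:
    """Report expected documents that were not retrieved, and retrieved ones that have since changed."""
    retrieved = evidence["retrieved_docs"]
    missing = sorted(set(expected_ids) - set(retrieved))
    changed = sorted(
        doc_id for doc_id, digest in retrieved.items()
        if doc_id in current and current[doc_id] != digest
    )
    return {"missing": missing, "changed": changed}


gaps = retrieval_gaps(retrieval_evidence, current_docs, expected_ids=["memo-021"])
print(gaps)
```

Here the latest memo never reached the model, which is the kind of gap that stays invisible if only the final answer is kept.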

Q6. Why should tool-chain traces be included?

Answer:
When a model calls tools, the risk moves beyond text generation into action pathways. Tool traces show whether the system queried a database, triggered a workflow, or used an external service, which is crucial for understanding downstream consequences.

Practical example:
A GenAI assistant drafts an email and also pulls account data from a CRM. The tool trace shows which system was accessed and in what sequence.

Link to RAIDT:
This evidence expands the run record into operational governance and is especially relevant for Responsibility and Dependability.
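
A tool-chain trace can be as simple as an ordered list of call records. The sketch below assumes a hypothetical `ToolCall` structure with a sequence number, tool name, target system, action type, and status; none of these names come from RAIDT or from any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    sequence: int        # position of the call within the run
    tool: str            # hypothetical tool name
    target_system: str   # system the call touched
    action: str          # "read" or "write"
    status: str          # outcome reported by the tool


# Trace for the email-drafting example: CRM lookup first, then a draft is created.
trace = [
    ToolCall(2, "email.create_draft", "Mail", "write", "ok"),
    ToolCall(1, "crm.lookup_account", "CRM", "read", "ok"),
]


def systems_accessed(calls: list) -> list:
    """Return target systems in call order, without duplicates."""
    ordered = []
    for call in sorted(calls, key=lambda c: c.sequence):
        if call.target_system not in ordered:
            ordered.append(call.target_system)
    return ordered


def write_actions(calls: list) -> list:
    """Return the tools that changed state, which usually deserve closer review."""
    return [c.tool for c in sorted(calls, key=lambda c: c.sequence) if c.action == "write"]


print(systems_accessed(trace))
print(write_actions(trace))
```

Separating read from write actions is one possible design choice: it lets a reviewer see quickly whether the run only consulted systems or also acted on them.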

Q7. Why capture adapter lineage and alignment policy identifiers?

Answer:
A model name can hide meaningful behavioural differences. PEFT or LoRA adapters may specialise outputs for a domain, while alignment policies can alter refusal thresholds, safety behaviour, or stylistic boundaries.

Practical example:
A healthcare deployment uses a clinical adapter and a stricter alignment policy than a general knowledge assistant. Those choices affect output quality and acceptable risk.

Link to RAIDT:
S4 ensures the evidence pack reflects the real configured system, which is necessary for accurate scoring and defensible governance decisions.
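
To show why the headline model name is not enough, the following sketch fingerprints the full configuration, including adapter lineage and alignment policy ID. The configuration keys, the example values, and the `config_fingerprint` helper are assumptions for illustration only.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two deployments that share a model name but differ in adapter and alignment policy.
general = {
    "model": "base-model",
    "model_version": "2025-01",
    "adapter_lineage": [],
    "alignment_policy_id": "general-v1",
}
clinical = {
    "model": "base-model",
    "model_version": "2025-01",
    "adapter_lineage": ["clinical-lora-v3"],
    "alignment_policy_id": "clinical-strict-v2",
}

same_name = general["model"] == clinical["model"]
same_system = config_fingerprint(general) == config_fingerprint(clinical)
print(same_name, same_system)
```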

Q8. How does S4 support the five RAIDT pillars?

Answer:
S4 provides the artefacts from which each pillar can be assessed. Responsibility depends on role and review fields; Auditability on inspectable records; Interpretability on prompt, retrieval, and rationale visibility; Dependability on stable comparison of runs; and Traceability on end-to-end linkage.

Practical example:
A run with complete prompt and review records may score well on Auditability but poorly on Dependability if repeated runs show unstable behaviour under the same task conditions.

Link to RAIDT:
This star is a direct input to the five-pillar score profile and therefore to governance prioritisation.

Q9. Does S4 make GenAI fully reproducible?

Answer:
No. S4 supports accountable inspection, not perfect reproduction. Closed vendor models, non-deterministic generation, and changing upstream services mean that exact replay is not always possible.

Practical example:
Even with the same prompt and parameters, a provider-side model update may produce a different output one month later.

Link to RAIDT:
RAIDT uses evidence to improve oversight and contestability under uncertainty, not to promise impossible determinism.
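
Even when exact replay is impossible, two run records can still be compared to see which recorded conditions changed. The sketch below is a hypothetical illustration: the field names, the values, and the `config_diff` helper are assumptions, and a real comparison would cover whatever fields the evidence pack actually holds.

```python
CONFIG_FIELDS = ["prompt_hash", "model", "model_version", "decoding_params", "index_id"]


def config_diff(run_a: dict, run_b: dict, fields: list) -> dict:
    """Return the fields whose recorded values differ, with both values side by side."""
    return {
        name: (run_a.get(name), run_b.get(name))
        for name in fields
        if run_a.get(name) != run_b.get(name)
    }


# Same prompt and parameters, one month apart, after a provider-side model update.
run_january = {
    "prompt_hash": "placeholder-prompt-hash",
    "model": "example-model",
    "model_version": "2025-01",
    "decoding_params": {"temperature": 0.2},
    "index_id": "policy-index-2024-11",
    "output_hash": "placeholder-output-hash-a",
}
run_february = dict(run_january, model_version="2025-02", output_hash="placeholder-output-hash-b")

diff = config_diff(run_january, run_february, CONFIG_FIELDS)
print(diff)
```

The diff does not reproduce the earlier output, but it gives a reviewer a documented candidate explanation for why the two outputs differ.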

Q10. How should supervisors understand the value of this star?

Answer:
Supervisors can read S4 as the artefactual proof that RAIDT is operationally serious. It demonstrates how abstract governance claims are translated into inspectable fields, auditable records, and testable research instruments.

Practical example:
In a supervision meeting, S4 can be shown as the schema that links theoretical governance claims in Paper 08 to measurable case evidence in Paper 09.

Link to RAIDT:
S4 anchors the project's methodological coherence by connecting theory, empirical validation, policy pathways, and implementation.

Practical examples
  1. A bank uses a GenAI assistant for internal policy summarisation. S4 reveals that a low-quality summary was caused by an unapproved prompt variant and an outdated retrieval index, leading to a governance intervention that locks prompt versions and refreshes the index on schedule.
  2. A hospital pilots a triage support tool. S4 records the adapter lineage, alignment policy ID, and reviewer notes, allowing the team to distinguish model behaviour from workflow decisions when a recommendation is challenged.
  3. A public-sector team uses GenAI to draft citizen responses. S4 shows which operator initiated the run, which policy documents were retrieved, and which reviewer authorised release, improving contestability and audit readiness.
  4. A university deploys a writing-support assistant for administrative staff. S4 makes it possible to compare runs across departments, identify where review controls are inconsistent, and refine RAIDT scoring for local governance maturity.
Conclusion

Star S4 is where RAIDT becomes concrete. The broader project argues that governance should focus on the run, meaning one configured use of a generative AI system for a specific task in a specific context. This note explains what evidence must be captured if that claim is to be operationally credible. The key point is that governance cannot rely on outputs alone. We need a structured evidence architecture covering prompt versioning, model and provider identifiers, decoding settings, retrieval context, tool traces, adapter lineage, alignment controls, review decisions, and retention rules. Together these artefacts form the run-level evidence pack. That pack then supports the five-pillar RAIDT score profile: Responsibility, Auditability, Interpretability, Dependability, and Traceability. For supervision purposes, S4 matters because it links theory to implementation. It shows how RAIDT can be empirically tested, how it can align with standards and policy discussions, and how organisations can move from vague Responsible AI principles to inspectable governance practice.

Suggested slide order for oral presentation
  1. Why evidence architecture matters in RAIDT
  2. The governance problem S4 solves
  3. What sits inside the run-level evidence pack
  4. How S4 supports the five pillars
  5. Worked organisational examples
  6. Research and policy relevance
  7. Limits and design choices
  8. Why S4 matters for the overall RAIDT project
Slides
Slide 1 - Why S4 matters

Purpose:
Frame the concept for supervisors and workshop participants.

Key message:
Star S4 gives RAIDT its operational evidence backbone by defining what must be captured for one governable GenAI run.

Slide content:

  • RAIDT governs the run, not only the model
  • A run needs inspectable evidence, not just an output
  • S4 defines the evidence architecture and artefacts
  • This makes review, scoring, and intervention possible

Speaker note:
Open by explaining that RAIDT becomes meaningful only if each run can be inspected after the fact. S4 is the star that defines what must be captured so that a run is not a black box event. The emphasis is on governance-ready evidence rather than raw system logging.

Visual idea:
A simple flow from run to evidence pack to five-pillar score profile.

Link to RAIDT:
This slide introduces the operational layer that connects the run-level evidence pack to RAIDT scoring.

Citation support to mention if asked:
Responsible AI documentation, audit trail design, and run-level governance concepts.

Slide 2 - The problem S4 solves

Purpose:
Explain the governance gap that motivates the note.

Key message:
Without structured run evidence, organisations cannot reliably reconstruct, contest, or govern GenAI outputs.

Slide content:

  • Outputs alone do not explain how they were produced
  • Prompts, models, retrieval, and tools may all change behaviour
  • Weak evidence creates weak accountability
  • S4 reduces reconstructability gaps under uncertainty

Speaker note:
Stress that the same task can produce different outputs because of prompt changes, model updates, retrieval differences, or tool calls. The managerial issue is uncertainty: leaders cannot govern a system if they cannot reconstruct the conditions of use.

Visual idea:
Comparison graphic: output-only record versus full run evidence record.

Link to RAIDT:
This slide justifies why RAIDT requires a run-level evidence pack rather than informal usage logs.

Citation support to mention if asked:
AI uncertainty, Information Systems governance, and contestability literature.

Slide 3 - What sits inside the evidence pack

Purpose:
Show the main artefact categories defined by S4.

Key message:
The evidence pack combines context, configuration, augmentation, output integrity, review, and stewardship fields.

Slide content:

  • Run identity, timestamp, task, and user role
  • Prompt registry, version, and hash
  • Model, provider, decoding, adapter, alignment policy
  • Retrieval, tool trace, output hash, review, retention

Speaker note:
Walk through the categories rather than every field. Explain that S4 covers not only technical configuration but also human review and access control, because governable evidence must span the socio-technical chain, not just the model invocation.

Visual idea:
Layered table or evidence-chain graphic with six evidence groups.

Link to RAIDT:
These artefacts are the core contents of the run-level evidence pack.

Citation support to mention if asked:
Documentation practices in MLOps, prompt engineering governance, RAG provenance, and alignment control lineage.

Slide 4 - How S4 supports the five pillars

Purpose:
Connect evidence capture directly to RAIDT scoring.

Key message:
S4 provides the observable basis for judging Responsibility, Auditability, Interpretability, Dependability, and Traceability.

Slide content:

  • Responsibility: roles, review, ownership
  • Auditability: inspectable records and hashes
  • Interpretability: prompts, context, rationale visibility
  • Dependability and Traceability: stable comparison and linkage

Speaker note:
Clarify that S4 does not produce good governance automatically; it supplies the evidence from which governance judgements can be made. Each pillar depends on different parts of the evidence pack, which also helps explain how scoring can reveal different weaknesses across the same run.

Visual idea:
Five-column pillar table with example S4 fields under each pillar.

Link to RAIDT:
This is the direct bridge from evidence architecture to the RAIDT score profile.

Citation support to mention if asked:
RAIDT scoring logic, auditability research, and explainability or interpretability governance sources.

Slide 5 - Organisational examples

Purpose:
Make the concept concrete through applied GenAI cases.

Key message:
S4 is useful because it turns ambiguous incidents into inspectable governance cases.

Slide content:

  • University student-support drafting
  • Bank policy summarisation
  • Hospital triage support
  • Public-sector citizen response drafting

Speaker note:
Use one example in detail and mention the others briefly. The point is that S4 allows an organisation to identify whether the failure came from prompt design, retrieval quality, adapter selection, review weakness, or retention and access problems.

Visual idea:
Four-box sector comparison with one governance lesson per sector.

Link to RAIDT:
Shows how a common evidence architecture can support future RAIDT sector playbooks.

Citation support to mention if asked:
Sector-specific AI governance cases, documentation practice, and organisational control literature.

Slide 6 - Research and policy relevance

Purpose:
Show why this star matters beyond operational implementation.

Key message:
S4 supports RAIDT's foundations, empirical testing, and policy alignment.

Slide content:

  • Paper 08: operational grammar of the run
  • Paper 09: observable fields for empirical validation
  • Paper 10: pathway to standards and policy alignment
  • Useful for EU AI Act, ISO/IEC 42001, NIST AI RMF discussions

Speaker note:
Explain that S4 is one of the clearest places where the project's theoretical, empirical, and policy strands meet. It gives the project something concrete to test in workshops, pilots, and case studies while also making policy discussions less abstract.

Visual idea:
Three-part bridge diagram: foundations, validation, policy.

Link to RAIDT:
Positions S4 as the artefactual layer that supports the whole programme, not only one operational note.

Citation support to mention if asked:
Standards and regulatory guidance, plus RAIDT Papers 08, 09, and 10.

Slide 7 - Limits and design choices

Purpose:
Show methodological realism and avoid overclaiming.

Key message:
S4 supports accountable inspection under uncertainty, not perfect reproducibility or unlimited surveillance.

Slide content:

  • Closed systems limit full reproducibility
  • Evidence capture must be proportionate
  • More logging is not always better governance
  • Human judgement remains necessary

Speaker note:
This slide is important for credibility. Make clear that S4 is not a promise of perfect technical replay. It is a framework for sufficient governance evidence. Also note that excessive capture can create privacy and usability problems, so the design principle is proportionate evidence.

Visual idea:
Balance graphic: governance sufficiency versus over-capture burden.

Link to RAIDT:
Protects RAIDT from overclaiming while keeping the focus on practical governance value.

Citation support to mention if asked:
Responsible AI proportionality, privacy governance, and reproducibility limitations in generative systems.

Slide 8 - Why supervisors should care

Purpose:
Close the deck by connecting S4 back to the doctoral project.

Key message:
S4 demonstrates that RAIDT is methodologically serious because it translates governance theory into inspectable evidence design.

Slide content:

  • Makes the run a practical unit of analysis
  • Converts abstract governance into artefacts and fields
  • Enables scoring, comparison, and intervention
  • Strengthens the coherence of the RAIDT thesis

Speaker note:
End by stating that S4 is not a peripheral implementation detail. It is the operational proof that RAIDT can function as a real governance framework. It shows supervisors how the project moves from concept to evidence, from evidence to scoring, and from scoring to governance action.

Visual idea:
Closing hierarchy: theory -> evidence pack -> scoring -> governance action.

Link to RAIDT:
This slide ties S4 back to the full RAIDT logic and its relevance for supervision, workshops, and future publications.

Citation support to mention if asked:
Methodological design, governance instrumentation, and RAIDT programme papers.
