Evidence Architecture and Artefacts

#raidt/S4

flowchart LR
    A[Run-level uncertainty] --> B[RAIDT governance logic]
    B --> C[Star S4 evidence architecture]
    C --> D[Run-level evidence pack]
    C --> E[Five-pillar score profile]
    D --> F[Reviewer reconstruction]
    D --> G[Audit-ready artefacts]
    E --> H[Governance intervention]
    G --> I[Policy and standards alignment]
    J[Prompts models tools review] --> C

Ring: Operational star

Function

Defines the evidence architecture that turns a single RAIDT run into an inspectable, reviewable, and governable record. This star specifies which fields, identifiers, hashes, traces, review artefacts, and access controls must be captured so that a run-level evidence pack can support scoring across Responsibility, Auditability, Interpretability, Dependability, and Traceability.

Role in the project

This note sits at the operational centre of the RAIDT project. It translates RAIDT from a conceptual governance framework into a concrete evidence model that can be implemented, tested, audited, and discussed with supervisors and organisational stakeholders. In project terms, Star S4 belongs primarily to evidence architecture and implementation, but it also supports foundations, empirical validation, policy alignment, and sector application because the quality of RAIDT depends on whether a run can be evidenced consistently across settings.

Main questions answered by this star

What does evidence architecture mean in RAIDT terms?
Why does RAIDT need a defined set of run-level artefacts rather than informal logging?
What problem does this star solve for governance, audit, and managerial oversight?
What evidence fields are required to make a run inspectable and reproducible enough for governance purposes?
How do prompts, models, tools, retrieval components, adapters, and review decisions become part of one evidence pack?
How does this star support the five RAIDT pillars and the resulting score profile?
What does this star allow supervisors to see about the methodological seriousness of RAIDT?
How does this star connect to policy alignment, standards, and future sector playbooks?

Workshop discussion prompts

10-20 min: Which minimum evidence fields are necessary before a GenAI run can be governed rather than merely used?
20-40 min: How should the evidence pack distinguish between prompt design, model behaviour, retrieved context, and human review so that responsibility can be assigned fairly?
40-60 min: Which S4 artefacts are essential for scoring the five RAIDT pillars in different organisational settings such as healthcare, finance, higher education, or public administration?

Main message

RAIDT treats the run, rather than the model in the abstract, as the unit of governance. That shift matters because organisational risk rarely arises from a model in isolation. It arises when a particular person, using a particular prompt, on a particular system configuration, at a particular time, generates an output that affects work. Star S4 addresses the practical consequence of that claim: if the run is the unit of governance, the project needs an evidence architecture that can capture what actually happened in that run in a structured and inspectable way.

In this note, evidence architecture means the design of the fields, identifiers, traces, and linked artefacts that together make a run record intelligible. Artefacts are the concrete objects that populate that architecture: prompt versions, hashes, retrieval traces, model identifiers, reviewer notes, access controls, and related metadata. The aim is not to create an exhaustive technical log of everything a system could emit. The aim is to create a governance-ready record that is proportionate, decision-relevant, and usable by supervisors, auditors, operators, and policy-facing audiences.

The problem this star solves is straightforward but serious. Generative AI systems are dynamic, configurable, and often partially opaque. The same apparent task can produce different outputs because the prompt changed, the provider updated the model, the retrieval index was refreshed, the decoding parameters varied, a PEFT or LoRA adapter was switched, or an alignment policy altered refusal behaviour. If none of that is captured, organisations are left with a weak form of accountability. They may know that an answer was produced, but they cannot explain which configuration produced it, whether the run followed policy, whether a reviewer intervened, or whether the output can be contested after the fact. In managerial terms, this is an uncertainty problem. Decision-makers cannot govern what they cannot reconstruct.

RAIDT therefore requires a run-level evidence pack. Star S4 specifies the architecture of that pack. The run identifier and timestamp establish the event. User role, task label, and domain label establish organisational context and help distinguish authorised from unauthorised usage. Prompt registry, prompt ID, prompt version, and prompt hash establish which instruction was actually used, which is essential for prompt engineering governance. Model, provider, version, and decoding parameters establish which generative configuration produced the output. Retrieval query, index ID, and retrieved document identifiers with hashes show whether a retrieval-augmented generation workflow shaped the response and whether the supporting documents can later be inspected. Tool-chain trace records downstream actions and therefore becomes crucial when a model does more than generate text, for example by calling a database, workflow engine, or external API. Adapter lineage and alignment policy ID matter because fine-tuning, PEFT, LoRA, and alignment controls can alter behaviour materially even when the headline model name remains constant. Output hash, review decision, reviewer notes, and retention or access controls close the loop by supporting inspection, contestability, and secure stewardship of evidence.

This architecture matters because it provides the bridge between data capture and governance judgement. RAIDT does not merely collect evidence for archival purposes. It uses that evidence to support scoring across five pillars. Responsibility is supported by fields that identify roles, decisions, approval points, and policy ownership. Auditability depends on whether an independent reviewer can inspect the sequence of artefacts and determine what happened. Interpretability is supported indirectly through evidence about prompts, retrieved context, model selection, and reviewer rationale, even when the internal model remains technically opaque. Dependability depends on whether repeated or similar runs can be compared and whether failure points can be diagnosed. Traceability is the most obvious beneficiary because the architecture is explicitly designed to connect an output back to its generating conditions.
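
The step from captured fields to a pillar-level view can be illustrated with a minimal sketch. It assumes a hypothetical mapping from each pillar to a handful of evidence fields and reports only whether those fields are present. The field names and the mapping are illustrative assumptions, not the RAIDT scoring rules, and presence is a much weaker signal than the quality or governance usefulness of the evidence.

```python
# Hypothetical mapping from RAIDT pillars to supporting evidence fields.
# Field names and groupings are illustrative assumptions, not project definitions.
PILLAR_FIELDS = {
    "Responsibility": ["user_role", "review_decision", "reviewer_notes"],
    "Auditability": ["run_id", "timestamp", "prompt_hash", "output_hash"],
    "Interpretability": ["prompt_version", "retrieved_doc_ids", "reviewer_notes"],
    "Dependability": ["model_version", "decoding_params"],
    "Traceability": ["run_id", "prompt_hash", "index_id",
                     "retrieved_doc_hashes", "output_hash"],
}


def pillar_coverage(record: dict) -> dict:
    """Return, per pillar, the fraction of mapped fields that are present and non-empty."""
    coverage = {}
    for pillar, fields in PILLAR_FIELDS.items():
        present = sum(
            1 for name in fields if record.get(name) not in (None, "", [], {})
        )
        coverage[pillar] = present / len(fields)
    return coverage
```

A record that holds only identity and hash fields would show full coverage for Auditability and none for Dependability, which is the kind of uneven profile the pillars are meant to expose.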

A practical example makes the logic clearer. Suppose a university uses a GenAI assistant to draft student support communications. A problematic email is later challenged because it gave misleading advice. Without Star S4 evidence, the institution might only see the final text. With Star S4 evidence, it can inspect the user role, task label, prompt version, model version, retrieved policy documents, reviewer notes, and retention settings. It may discover that the operator used an outdated prompt, the retrieval index did not include the latest student welfare policy, and the reviewer approved the response without escalating uncertainty. The value of the evidence pack here is not only retrospective explanation. It supports governance intervention: update the prompt registry, rebuild the index, revise the review rule, and adjust the RAIDT score profile.

A second example concerns sector sensitivity. In healthcare triage support, retrieved document hashes and tool traces may be especially important because clinical pathways, document provenance, and decision escalation are safety-relevant. In legal drafting support, prompt versioning and alignment-policy identifiers may matter more because acceptable refusal behaviour, jurisdictional constraints, and source boundaries must be documented carefully. The same S4 architecture can therefore support sector playbooks without assuming that all fields carry equal importance in every setting.
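
One way to express that sector sensitivity is a shared base of required fields plus sector-specific additions. The sketch below is a minimal illustration under stated assumptions: the sector keys echo the two examples above, but the field names and the required sets are hypothetical and would need to be agreed in each sector playbook.

```python
# Hypothetical baseline that every run must evidence, regardless of sector.
BASE_REQUIRED = {
    "run_id", "timestamp", "prompt_hash",
    "model_version", "output_hash", "review_decision",
}

# Hypothetical sector additions, loosely following the examples in this note.
SECTOR_EXTRA = {
    "healthcare_triage": {"retrieved_doc_hashes", "tool_trace"},
    "legal_drafting": {"prompt_version", "alignment_policy_id"},
}


def missing_required(record: dict, sector: str) -> list:
    """List required fields that are absent or empty for the sector, sorted for stable output."""
    required = BASE_REQUIRED | SECTOR_EXTRA.get(sector, set())
    return sorted(name for name in required if not record.get(name))
```

Keeping the base set small and pushing detail into sector additions is one way to keep capture proportionate rather than maximal.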

Star S4 also supports the research logic of the RAIDT programme. Paper 08 needs a defensible methodological pathway showing that RAIDT is more than a normative slogan; it requires an operational grammar of evidence. Paper 09 needs observable fields that can be tested empirically across cases, scored for completeness and usefulness, and evaluated with practitioners. Paper 10 needs a way to show how organisational controls can map onto policy and standards discussions, including the EU AI Act, ISO/IEC 42001, and the NIST AI RMF. This star supplies the artefactual layer that makes those connections credible.

The boundaries of the star are equally important. Evidence architecture does not eliminate uncertainty, and it does not make a stochastic model fully reproducible in the scientific sense. Some vendor systems are closed, some model internals remain inaccessible, and some outputs are influenced by changing upstream services. Star S4 therefore supports accountable inspection rather than perfect determinism. It also does not claim that every run should capture unlimited detail. Over-capture creates privacy, security, cost, and usability burdens. The design principle should be sufficiency for governance, not maximal logging. For RAIDT, that distinction is essential: the evidence pack must be rigorous enough to support review and proportionate enough to be usable in real organisations.

Key questions and answers

Q1. What is meant by evidence architecture in RAIDT?

Answer:
Evidence architecture is the structured design of the fields and artefacts that make a single GenAI run inspectable. It specifies what must be recorded, how records relate to each other, and which elements are necessary for governance rather than only technical debugging.

Practical example:
A run record captures the prompt version, model version, retrieval source IDs, and reviewer decision for one procurement-support task.

Link to RAIDT:
This is the backbone of the run-level evidence pack and enables scoring across all five pillars because each pillar depends on evidence rather than assertion.

Q2. Why does RAIDT focus on the run rather than only on the model?

Answer:
The model alone does not explain organisational risk. Risk emerges when a configured system is used for a defined task in a real context. The run captures the actual combination of prompt, model, tools, retrieved context, output, and checks.

Practical example:
The same foundation model behaves differently in a customer-service run with a strict prompt template than in a policy-drafting run with a retrieval pipeline.

Link to RAIDT:
RAIDT is explicitly a run-level governance framework, so S4 operationalises the basic unit that the framework governs.

Q3. What problem does S4 solve for organisational governance?

Answer:
S4 solves the problem of weak reconstructability. Many organisations keep outputs but cannot show how those outputs were generated, reviewed, or constrained. That limits accountability, contestability, and learning.

Practical example:
An executive asks why a generated report cited the wrong internal policy. Without S4 evidence, the team can only guess. With S4 evidence, it can inspect the retrieval query, index version, and document hashes.

Link to RAIDT:
Better reconstructability improves Auditability and Traceability and makes governance interventions evidence-based.
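
Illustrative sketch:
Reconstruction often starts by comparing a challenged run with an earlier, unproblematic one. The function below is a minimal sketch that assumes each run record is a flat dictionary of evidence fields; the field names in the usage example are hypothetical.

```python
def diff_runs(earlier: dict, later: dict) -> dict:
    """Return {field: (earlier_value, later_value)} for every field whose value differs."""
    changed = {}
    for name in sorted(set(earlier) | set(later)):
        if earlier.get(name) != later.get(name):
            changed[name] = (earlier.get(name), later.get(name))
    return changed


# Two runs with the same prompt and model but a different retrieval index.
good_run = {"prompt_hash": "p1", "model_version": "m1", "index_id": "policies-2024-11"}
bad_run = {"prompt_hash": "p1", "model_version": "m1", "index_id": "policies-2024-06"}
changes = diff_runs(good_run, bad_run)
```

Here `changes` contains only `index_id`, which points the review towards the retrieval layer rather than the prompt or the model.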

Q4. Why are prompt identifiers, versions, and hashes necessary?

Answer:
Prompt text is not trivial input; it is a material part of system behaviour. Versioning and hashing allow an organisation to distinguish approved prompts from ad hoc prompt edits and to verify what instruction was used at the time of the run.

Practical example:
Two analysts claim they used the same template, but the hash shows that one inserted a hidden instruction that changed the tone and scope of the output.

Link to RAIDT:
These artefacts strengthen Responsibility, Auditability, and Interpretability because they connect outputs to governed prompt engineering practices.
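
Illustrative sketch:
A prompt hash can be as simple as a SHA-256 digest of the exact prompt text, checked against a registry entry. The sketch below assumes a hypothetical in-memory registry keyed by prompt ID and version; a real prompt registry would be a governed store, and the prompt names here are invented for illustration.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """SHA-256 hex digest of the exact prompt text, encoded as UTF-8."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def matches_registry(prompt_text: str, registry: dict, prompt_id: str, version: str) -> bool:
    """True if the text hashes to the value registered for (prompt_id, version)."""
    return registry.get((prompt_id, version)) == prompt_hash(prompt_text)


approved = "Summarise the attached policy for internal staff."
edited = approved + " Ignore section 4."

# Hypothetical registry: (prompt ID, version) -> approved hash.
registry = {("policy-summary", "v3"): prompt_hash(approved)}

used_approved = matches_registry(approved, registry, "policy-summary", "v3")  # True
used_edited = matches_registry(edited, registry, "policy-summary", "v3")      # False
```

Hashing the exact text means that even a trailing space changes the digest. Whether to normalise whitespace before hashing is a design choice that should itself be documented.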

Q5. Why does retrieval evidence matter in RAG systems?

Answer:
In retrieval-augmented generation, the retrieved context may shape the output as much as the prompt does. Governance therefore requires visibility into the query, index, and retrieved documents, not only the final answer.

Practical example:
A compliance chatbot gives outdated advice because the retrieval index excluded the latest policy memo. The retrieved document hashes make that gap visible.

Link to RAIDT:
Retrieval evidence feeds the evidence pack directly and supports Traceability, Dependability, and policy-facing review.
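
Illustrative sketch:
Retrieved document hashes become useful when they can be compared with a reference set. The sketch below assumes a hypothetical register that maps each document ID to the hash of its current approved version; both the register and the document IDs are illustrative.

```python
def check_retrieved(retrieved: dict, register: dict) -> dict:
    """Compare retrieved {doc_id: hash} pairs with a register of current {doc_id: hash} pairs.

    "outdated" lists documents whose retrieved hash differs from the registered one.
    "unknown" lists retrieved documents that the register does not contain at all.
    """
    outdated = sorted(
        doc_id for doc_id, digest in retrieved.items()
        if doc_id in register and register[doc_id] != digest
    )
    unknown = sorted(doc_id for doc_id in retrieved if doc_id not in register)
    return {"outdated": outdated, "unknown": unknown}


# Hypothetical evidence from one run and the current register.
retrieved = {"policy-memo": "hash-2023", "faq": "hash-faq", "draft-note": "hash-x"}
register = {"policy-memo": "hash-2024", "faq": "hash-faq"}
findings = check_retrieved(retrieved, register)
```

In this example the run relied on an older version of `policy-memo` and on a `draft-note` that is not in the register, which are two different governance findings.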

Q6. Why should tool-chain traces be included?

Answer:
When a model calls tools, the risk moves beyond text generation into action pathways. Tool traces show whether the system queried a database, triggered a workflow, or used an external service, which is crucial for understanding downstream consequences.

Practical example:
A GenAI assistant drafts an email and also pulls account data from a CRM. The tool trace shows which system was accessed and in what sequence.

Link to RAIDT:
This evidence expands the run record into operational governance and is especially relevant for Responsibility and Dependability.
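
Illustrative sketch:
A tool-chain trace can be modelled as an ordered list of calls attached to one run. The class and field names below are assumptions made for illustration; real orchestration frameworks expose their own trace formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    sequence: int   # 1-based position in the run
    tool_id: str    # which system was accessed
    action: str     # what was requested
    called_at: str  # UTC timestamp in ISO 8601 format


@dataclass
class ToolTrace:
    run_id: str
    calls: list = field(default_factory=list)

    def record(self, tool_id: str, action: str) -> ToolCall:
        """Append a call with the next sequence number and the current UTC time."""
        call = ToolCall(
            sequence=len(self.calls) + 1,
            tool_id=tool_id,
            action=action,
            called_at=datetime.now(timezone.utc).isoformat(),
        )
        self.calls.append(call)
        return call

    def systems_accessed(self) -> list:
        """Tool IDs in first-access order, without duplicates."""
        seen = []
        for call in self.calls:
            if call.tool_id not in seen:
                seen.append(call.tool_id)
        return seen
```

For the CRM example above, a reviewer could call `systems_accessed()` to confirm which systems the run touched and read `calls` to see the sequence.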

Q7. Why capture adapter lineage and alignment policy identifiers?

Answer:
A model name can hide meaningful behavioural differences. PEFT or LoRA adapters may specialise outputs for a domain, while alignment policies can alter refusal thresholds, safety behaviour, or stylistic boundaries.

Practical example:
A healthcare deployment uses a clinical adapter and a stricter alignment policy than a general knowledge assistant. Those choices affect output quality and acceptable risk.

Link to RAIDT:
S4 ensures the evidence pack reflects the real configured system, which is necessary for accurate scoring and defensible governance decisions.
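
Illustrative sketch:
One way to make hidden configuration differences visible is a fingerprint over the full configuration rather than the model name alone. The function below is a sketch under stated assumptions: the parameter names are hypothetical, and the choice of canonical JSON plus SHA-256 is one option among several.

```python
import hashlib
import json


def config_fingerprint(model_name, model_version, adapter_id, alignment_policy_id, decoding):
    """SHA-256 digest of a canonical JSON rendering of the generative configuration."""
    payload = json.dumps(
        {
            "model_name": model_name,
            "model_version": model_version,
            "adapter_id": adapter_id,
            "alignment_policy_id": alignment_policy_id,
            "decoding": decoding,
        },
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # fixed separators for a stable byte string
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same headline model, different adapter and alignment policy (hypothetical IDs).
general = config_fingerprint("model-x", "2025-01", None, "policy-general", {"temperature": 0.7})
clinical = config_fingerprint("model-x", "2025-01", "clinical-lora-2", "policy-strict", {"temperature": 0.7})
```

The two fingerprints differ even though the model name and version are identical, so runs from the two deployments cannot be mistaken for the same configuration.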

Q8. How does S4 support the five RAIDT pillars?

Answer:
S4 provides the artefacts from which each pillar can be assessed. Responsibility depends on role and review fields; Auditability on inspectable records; Interpretability on prompt, retrieval, and rationale visibility; Dependability on stable comparison of runs; and Traceability on end-to-end linkage.

Practical example:
A run with complete prompt and review records may score well on Auditability but poorly on Dependability if repeated runs show unstable behaviour under the same task conditions.

Link to RAIDT:
This star is a direct input to the five-pillar score profile and therefore to governance prioritisation.
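
Illustrative sketch:
Comparing runs under the same task conditions requires grouping them by those conditions. The sketch below uses the share of approved review decisions per group as a crude proxy for stable behaviour. That proxy, the grouping key, and the field names are assumptions made for illustration, not a RAIDT Dependability measure.

```python
from collections import defaultdict


def approval_rate_by_condition(runs: list) -> dict:
    """Group runs by (task_label, prompt_hash, model_version) and return the share approved."""
    counts = defaultdict(lambda: [0, 0])  # key -> [approved, total]
    for run in runs:
        key = (run["task_label"], run["prompt_hash"], run["model_version"])
        counts[key][1] += 1
        if run["review_decision"] == "approved":
            counts[key][0] += 1
    return {key: approved / total for key, (approved, total) in counts.items()}
```

A group with complete records but a low approval share would illustrate the case described above: strong Auditability evidence alongside weak Dependability.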

Q9. Does S4 make GenAI fully reproducible?

Answer:
No. S4 supports accountable inspection, not perfect reproduction. Closed vendor models, non-deterministic generation, and changing upstream services mean that exact replay is not always possible.

Practical example:
Even with the same prompt and parameters, a provider-side model update may produce a different output one month later.

Link to RAIDT:
RAIDT uses evidence to improve oversight and contestability under uncertainty, not to promise impossible determinism.

Q10. How should supervisors understand the value of this star?

Answer:
Supervisors can read S4 as the artefactual proof that RAIDT is operationally serious. It demonstrates how abstract governance claims are translated into inspectable fields, auditable records, and testable research instruments.

Practical example:
In a supervision meeting, S4 can be shown as the schema that links theoretical governance claims in Paper 08 to measurable case evidence in Paper 09.

Link to RAIDT:
S4 anchors the project's methodological coherence by connecting theory, empirical validation, policy pathways, and implementation.

Practical examples

A bank uses a GenAI assistant for internal policy summarisation. S4 reveals that a low-quality summary was caused by an unapproved prompt variant and an outdated retrieval index, leading to a governance intervention that locks prompt versions and refreshes the index on schedule.
A hospital pilots a triage support tool. S4 records the adapter lineage, alignment policy ID, and reviewer notes, allowing the team to distinguish model behaviour from workflow decisions when a recommendation is challenged.
A public-sector team uses GenAI to draft citizen responses. S4 shows which operator initiated the run, which policy documents were retrieved, and which reviewer authorised release, improving contestability and audit readiness.
A university deploys a writing-support assistant for administrative staff. S4 makes it possible to compare runs across departments, identify where review controls are inconsistent, and refine RAIDT scoring for local governance maturity.

Evidence needed / what to capture

Core run identity: run ID, timestamp, task label, domain label, session or case reference where appropriate.
Human accountability fields: user role, operator role, reviewer identity or role class, review decision, reviewer notes, escalation flag.
Prompt governance fields: prompt registry name, prompt ID, prompt version, prompt hash, prompt status such as approved or experimental.
Model configuration fields: provider, model name, model version, endpoint identifier where available, decoding parameters, system-level configuration markers.
Retrieval fields: retrieval query, index ID or knowledge-base version, retrieved document IDs, retrieved document hashes, retrieval timestamp.
Tooling and augmentation fields: tool-chain trace, plugin or API identifiers, adapter ID, PEFT or LoRA lineage, alignment policy ID.
Output integrity fields: output hash, output class or disposition, confidence or uncertainty annotation where used, storage location reference.
Stewardship fields: retention rule, access-control class, lawful or policy basis for retention where relevant, deletion or review schedule.
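
The field groups above can be sketched as a single record type. The dataclass below is a minimal illustration, not the project schema: the attribute names, types, and defaults are assumptions, and several listed fields (for example endpoint identifier and deletion schedule) are omitted for brevity.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class RunEvidence:
    # Core run identity
    run_id: str
    timestamp: str
    task_label: str
    domain_label: str
    # Human accountability
    user_role: str
    review_decision: Optional[str] = None
    reviewer_notes: str = ""
    escalation_flag: bool = False
    # Prompt governance
    prompt_id: str = ""
    prompt_version: str = ""
    prompt_hash: str = ""
    prompt_status: str = "experimental"  # or "approved"
    # Model configuration
    provider: str = ""
    model_name: str = ""
    model_version: str = ""
    decoding_params: dict = field(default_factory=dict)
    # Retrieval
    retrieval_query: Optional[str] = None
    index_id: Optional[str] = None
    retrieved_doc_hashes: dict = field(default_factory=dict)  # doc ID -> hash
    # Tooling and augmentation
    tool_trace: list = field(default_factory=list)
    adapter_id: Optional[str] = None
    alignment_policy_id: Optional[str] = None
    # Output integrity
    output_hash: str = ""
    # Stewardship
    retention_rule: str = ""
    access_control_class: str = ""

    def to_record(self) -> dict:
        """Plain dictionary form, suitable for serialisation or storage."""
        return asdict(self)
```

Optional fields default to empty values so that a run without retrieval or tool use still produces a valid record, and the gap is visible rather than hidden.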

Link to RAIDT project

Paper 08 (foundations and methodological pathways): S4 gives RAIDT an operational evidence grammar. It shows how the run becomes the unit of governance in practical rather than merely conceptual terms.
Paper 09 (empirical validation): S4 provides observable variables that can be tested across cases, interviews, workshops, and pilot implementations. It allows completeness, usability, and scoring reliability to be examined empirically.
Paper 10 (policy pathways): S4 creates a bridge between organisational evidence capture and external governance expectations, including standards and regulatory discussions.
Sector playbooks: S4 is adaptable across sectors because the same evidence architecture can be weighted differently according to domain risk, review intensity, and documentation needs.
RAIDT scoring: The five-pillar score profile depends on the presence, quality, and governance usefulness of S4 artefacts.
RAIDT evidence pack: S4 effectively defines the core structure of the run-level evidence pack.
RAIDT governance interventions: S4 supports targeted intervention such as prompt revision, retrieval redesign, reviewer escalation rules, retention control, or adapter governance.

Citation ideas to support this note

Responsible AI governance literature on accountability, documentation, and contestability.
Information Systems governance research on controls, audit trails, and organisational accountability.
Literature on AI and uncertainty, especially managerial uncertainty and decision support under incomplete visibility.
Prompt engineering and RAG literature that shows how behaviour changes with instructions and retrieved context.
Work on PEFT, LoRA, RLHF, and alignment controls to support the importance of configuration lineage.
Standards and policy materials related to the EU AI Act, ISO/IEC 42001, and the NIST AI RMF.
Empirical studies on AI incident review, documentation practice, model operations, and governance implementation.
RAIDT project materials, especially Paper 08 foundations, Paper 09 empirical validation, and Paper 10 policy pathways.

Boundaries and limitations

S4 does not claim that capturing evidence removes the need for human judgement.
S4 does not guarantee full technical reproducibility where providers, models, or services are opaque or change over time.
S4 does not imply that every possible field should be captured; evidence capture must remain proportionate to risk, privacy, and operational burden.
S4 does not replace broader Responsible AI measures such as ex ante risk assessment, training, policy design, or red-teaming.
S4 supports contestability and auditability, but it cannot by itself prove that an output is true, fair, or lawful.

Conclusion

Star S4 is where RAIDT becomes concrete. The broader project argues that governance should focus on the run, meaning one configured use of a generative AI system for a specific task in a specific context. This note explains what evidence must be captured if that claim is to be operationally credible. The key point is that governance cannot rely on outputs alone. We need a structured evidence architecture covering prompt versioning, model and provider identifiers, decoding settings, retrieval context, tool traces, adapter lineage, alignment controls, review decisions, and retention rules. Together these artefacts form the run-level evidence pack. That pack then supports the five-pillar RAIDT score profile: Responsibility, Auditability, Interpretability, Dependability, and Traceability. For supervision purposes, S4 matters because it links theory to implementation. It shows how RAIDT can be empirically tested, how it can align with standards and policy discussions, and how organisations can move from vague Responsible AI principles to inspectable governance practice.

Suggested slide order for oral presentation

Why evidence architecture matters in RAIDT
The governance problem S4 solves
What sits inside the run-level evidence pack
How S4 supports the five pillars
Worked organisational examples
Research and policy relevance
Limits and design choices
Why S4 matters for the overall RAIDT project

Slides

Slide 1: Why S4 matters

Purpose:
Frame the concept for supervisors and workshop participants.

Key message:
Star S4 gives RAIDT its operational evidence backbone by defining what must be captured for one governable GenAI run.

Slide content:

RAIDT governs the run, not only the model
A run needs inspectable evidence, not just an output
S4 defines the evidence architecture and artefacts
This makes review, scoring, and intervention possible

Speaker note:
Open by explaining that RAIDT becomes meaningful only if each run can be inspected after the fact. S4 is the star that defines what must be captured so that a run is not a black box event. The emphasis is on governance-ready evidence rather than raw system logging.

Visual idea:
A simple flow from run to evidence pack to five-pillar score profile.

Link to RAIDT:
This slide introduces the operational layer that connects the run-level evidence pack to RAIDT scoring.

Citation support to mention if asked:
Responsible AI documentation, audit trail design, and run-level governance concepts.

Slide 2: The problem S4 solves

Purpose:
Explain the governance gap that motivates the note.

Key message:
Without structured run evidence, organisations cannot reliably reconstruct, contest, or govern GenAI outputs.

Slide content:

Outputs alone do not explain how they were produced
Prompts, models, retrieval, and tools may all change behaviour
Weak evidence creates weak accountability
S4 reduces reconstructability gaps under uncertainty

Speaker note:
Stress that the same task can produce different outputs because of prompt changes, model updates, retrieval differences, or tool calls. The managerial issue is uncertainty: leaders cannot govern a system if they cannot reconstruct the conditions of use.

Visual idea:
Comparison graphic: output-only record versus full run evidence record.

Link to RAIDT:
This slide justifies why RAIDT requires a run-level evidence pack rather than informal usage logs.

Citation support to mention if asked:
AI uncertainty, Information Systems governance, and contestability literature.

Slide 3: What sits inside the evidence pack

Purpose:
Show the main artefact categories defined by S4.

Key message:
The evidence pack combines context, configuration, augmentation, output integrity, review, and stewardship fields.

Slide content:

Run identity, timestamp, task, and user role
Prompt registry, version, and hash
Model, provider, decoding, adapter, alignment policy
Retrieval, tool trace, output hash, review, retention

Speaker note:
Walk through the categories rather than every field. Explain that S4 covers not only technical configuration but also human review and access control, because governable evidence must span the socio-technical chain, not just the model invocation.

Visual idea:
Layered table or evidence-chain graphic with six evidence groups.

Link to RAIDT:
These artefacts are the core contents of the run-level evidence pack.

Citation support to mention if asked:
Documentation practices in MLOps, prompt engineering governance, RAG provenance, and alignment control lineage.

Slide 4: How S4 supports the five pillars

Purpose:
Connect evidence capture directly to RAIDT scoring.

Key message:
S4 provides the observable basis for judging Responsibility, Auditability, Interpretability, Dependability, and Traceability.

Slide content:

Responsibility: roles, review, ownership
Auditability: inspectable records and hashes
Interpretability: prompts, context, rationale visibility
Dependability and Traceability: stable comparison and linkage

Speaker note:
Clarify that S4 does not produce good governance automatically; it supplies the evidence from which governance judgements can be made. Each pillar depends on different parts of the evidence pack, which also helps explain how scoring can reveal different weaknesses across the same run.

Visual idea:
Five-column pillar table with example S4 fields under each pillar.

Link to RAIDT:
This is the direct bridge from evidence architecture to the RAIDT score profile.

Citation support to mention if asked:
RAIDT scoring logic, auditability research, and explainability or interpretability governance sources.

Slide 5: Organisational examples

Purpose:
Make the concept concrete through applied GenAI cases.

Key message:
S4 is useful because it turns ambiguous incidents into inspectable governance cases.

Slide content:

University student-support drafting
Bank policy summarisation
Hospital triage support
Public-sector citizen response drafting

Speaker note:
Use one example in detail and mention the others briefly. The point is that S4 allows an organisation to identify whether the failure came from prompt design, retrieval quality, adapter selection, review weakness, or retention and access problems.

Visual idea:
Four-box sector comparison with one governance lesson per sector.

Link to RAIDT:
Shows how a common evidence architecture can support future RAIDT sector playbooks.

Citation support to mention if asked:
Sector-specific AI governance cases, documentation practice, and organisational control literature.

Slide 6: Research and policy relevance

Purpose:
Show why this star matters beyond operational implementation.

Key message:
S4 supports RAIDT's foundations, empirical testing, and policy alignment.

Slide content:

Paper 08: operational grammar of the run
Paper 09: observable fields for empirical validation
Paper 10: pathway to standards and policy alignment
Useful for EU AI Act, ISO/IEC 42001, NIST AI RMF discussions

Speaker note:
Explain that S4 is one of the clearest places where the project's theoretical, empirical, and policy strands meet. It gives the project something concrete to test in workshops, pilots, and case studies while also making policy discussions less abstract.

Visual idea:
Three-part bridge diagram: foundations, validation, policy.

Link to RAIDT:
Positions S4 as the artefactual layer that supports the whole programme, not only one operational note.

Citation support to mention if asked:
Standards and regulatory guidance, plus RAIDT Papers 08, 09, and 10.

Slide 7: Limits and design choices

Purpose:
Show methodological realism and avoid overclaiming.

Key message:
S4 supports accountable inspection under uncertainty, not perfect reproducibility or unlimited surveillance.

Slide content:

Closed systems limit full reproducibility
Evidence capture must be proportionate
More logging is not always better governance
Human judgement remains necessary

Speaker note:
This slide is important for credibility. Make clear that S4 is not a promise of perfect technical replay. It is a framework for sufficient governance evidence. Also note that excessive capture can create privacy and usability problems, so the design principle is proportionate evidence.

Visual idea:
Balance graphic: governance sufficiency versus over-capture burden.

Link to RAIDT:
Protects RAIDT from overclaiming while keeping the focus on practical governance value.

Citation support to mention if asked:
Responsible AI proportionality, privacy governance, and reproducibility limitations in generative systems.

Slide 8: Why supervisors should care

Purpose:
Close the deck by connecting S4 back to the doctoral project.

Key message:
S4 demonstrates that RAIDT is methodologically serious because it translates governance theory into inspectable evidence design.

Slide content:

Makes the run a practical unit of analysis
Converts abstract governance into artefacts and fields
Enables scoring, comparison, and intervention
Strengthens the coherence of the RAIDT thesis

Speaker note:
End by stating that S4 is not a peripheral implementation detail. It is the operational proof that RAIDT can function as a real governance framework. It shows supervisors how the project moves from concept to evidence, from evidence to scoring, and from scoring to governance action.

Visual idea:
Closing hierarchy: theory -> evidence pack -> scoring -> governance action.

Link to RAIDT:
This slide ties S4 back to the full RAIDT logic and its relevance for supervision, workshops, and future publications.

Citation support to mention if asked:
Methodological design, governance instrumentation, and RAIDT programme papers.