S10.16 - Manual_Healthcare_Structured_Prompt_Experiment
Manual Healthcare Experiment (RAIDT )
Structured Prompt for a Synthetic Health Database
Manual execution guide: prompts, evidence pack, run logs, scoring, and workshop script
| Workshop purpose This manual shows supervisors how RAIDT can be executed manually with a structured healthcare summarisation prompt. The experiment is deliberately small, inspectable, and evidence-focused. The aim is not to prove clinical performance; the aim is to show how a GenAI run becomes a governed evidence object that can be reconstructed, reviewed, questioned, and scored. |
|---|
| Safety and ethics boundary Use only synthetic or fully anonymised records. Do not enter personal, identifiable, sensitive, or real patient data into a public model. The worked example is for governance demonstration and PhD supervision only. It does not replace clinical judgement, clinical safety practice, legal compliance, or institutional ethics approval. |
|---|
One-page executive overview
The practical example is a clinical-note summarisation workflow. A synthetic health database contains short consultation-style records. A registered structured prompt asks a model to produce a draft summary with symptoms, assessment discussed, treatment or actions, red flags, uncertainty and limits, evidence links, and reviewer attention points. The researcher then saves the exact prompt, the exact output, and the metadata needed to reconstruct the run. This creates the run-level evidence pack. A reviewer then scores the evidence pack across the five RAIDT pillars: Responsibility, Auditability, Interpretability, Dependability, and Traceability.
The experiment is intentionally manual because the workshop goal is educational. Manual RAIDT execution makes the governance logic visible before any automation or orchestration is introduced. Supervisors can see what the run is, what evidence is captured, how the pillar scores are justified, and how the same method can later be semi-automated or automated.
| RAIDT idea | How the experiment makes it visible |
|---|---|
| Run as unit of governance | One record, one prompt version, one model execution, one output, one review and one score profile become one run. |
| Evidence pack | The run ID, timestamp, prompt ID/version/hash, input hash, model details, output hash and reviewer notes are stored together. |
| Score profile | The run is scored on R, A, I, D and T using visible evidence rather than broad impressions. |
| Manual implementation | A researcher can run the experiment with a CSV/SQLite database, a prompt file, output text, a log sheet and a scoring sheet. |
| Workshop learning | Supervisors can ask: what evidence exists, what is missing, what score is justified, and what would improve the score? |
Good to read
Table 1. Conceptual progression from DSS to RAIDT
| Concept / system | What it is | What it is not | Main concern | Aims | Boundaries / limits | How it grows into the next stage | Practical example |
| Decision Support Systems (DSS) | Information systems that help humans structure choices, compare options, and use data in decisions. | Not autonomous decision-making and not necessarily AI-driven. | Decision quality, efficiency, information availability, and user support. | Provide timely, relevant information and analysis for human decisions. | Often assumes clearer inputs, more stable logic, and less attention to ethical AI issues. | DSS shows the need for structured support, but uncertainty and modern AI behaviour expose new governance gaps. | A manager uses a dashboard to compare supplier cost, lead time, and risk before choosing a vendor. |
| Managerial decision-making under uncertainty | A context where managers must act despite incomplete, conflicting, fast-changing, or misleading information. | Not routine optimisation and not a purely technical prediction problem. | Ambiguity, incomplete evidence, misinformation, speed, accountability, and trust in advice. | Support judgement when evidence is partial and consequences are high. | There may be no single correct answer, and human judgement remains central. | This stage explains why basic DSS is insufficient and why decision tools must address uncertainty, trust, and governance together. | A hospital manager must decide whether to escalate a service incident when data are incomplete and stakeholder reports conflict. |
| Responsible AI | A broad governance and design agenda for making AI systems fairer, safer, more transparent, accountable, and aligned with human values. | Not one tool, not one metric, and not only a technical XAI method. | Ethics, transparency, fairness, accountability, oversight, and risk management. | Move AI from raw capability to socially and organisationally acceptable use. | Often principle-heavy: it can say what should matter without defining the minimum evidence needed for one specific use event. | Responsible AI creates the governance expectation, but RAIDT answers the operational question of what evidence should exist for each important GenAI run. | A bank says its AI process is accountable and transparent, but still struggles to prove what happened in one disputed adverse-action explanation. |
| RAIDT | A run-level evidence framework for responsible governance of generative AI in organisational work, operationalised through an evidence pack and five-pillar score profile. | Not a single software product, not a model card, not a general ethics theory, and not just a checklist. | Reconstructable governance at the point of use: one configured run in one real context. | Make governance readiness inspectable, comparable, challengeable, and improvable through recorded evidence. | It does not guarantee correctness or replace domain safety practice; it depends on evidence capture and proportionate implementation. | RAIDT grows from the earlier stages by turning general governance expectations into a bounded proof object for one configured GenAI use. | A clinical note summarisation run records the prompt version, model ID, output hash, review note, and RAIDT scores so supervisors can inspect the exact governed use. |
Table 2. RAIDT Pillars at a glance
| Pillar | What it is | What it is not | Why GenAI needs it | Problem it solves | How to score it | Evidence to collect | Simple example |
| Responsibility | Whether the run was used for an appropriate purpose under clear limits, with suitable oversight, escalation, and boundary setting. | Not a claim that the model is morally good in general. Not the same as overall legal compliance or factual accuracy. | GenAI can sound confident and helpful even when it should defer, warn, or refuse. Responsibility keeps the system within acceptable use. | Unsafe overreach, hidden misuse, no escalation route, and reliance on outputs beyond intended scope. | Score low when purpose, limits, oversight, or escalation are absent. Score high when intended use, human review, and clear limits are documented and followed. | Task purpose, risk tier, role definition, prompt constraints, policy links, reviewer role, review decision, escalation or refusal flags, uncertainty wording. | A clinical-summary prompt tells the model to summarise only the note, not diagnose, and to flag missing evidence for clinician review. |
| Auditability | Whether another reviewer can reconstruct what happened in the run and inspect the evidence later. | Not just having a nice summary or a screenshot. Not a vague statement that 'the system was logged'. | GenAI outputs are shaped by prompts, settings, tools, and context. If these are not reconstructable, governance claims cannot be tested. | Post-incident confusion, weak internal audit, inability to explain what configuration produced a contested output. | Score low when core artefacts are missing. Score high when prompts, versions, hashes, model IDs, timestamps, and reviewer rationale are complete and retained. | Run ID, timestamp, prompt ID/version, prompt hash, model/provider/version, decoding settings, tool trace, output hash, reviewer notes, retention metadata. | A supervisor questions one summary. Auditability lets the team reopen the exact run record and see the prompt, model version, output hash, and review note. |
| Interpretability | Whether the output and its limits are understandable enough for the intended user and task. | Not a generic XAI promise, not merely fluent prose, and not the same as traceability. | GenAI can produce readable text that is still misleading, over-compressed, or unclear about uncertainty and assumptions. | Opaque reasoning, user over-trust, hidden assumptions, and outputs that cannot be safely used by the real audience. | Score low when outputs are confusing, unsupported, or hide uncertainty. Score high when structure, rationale, and limits are clear for the stakeholder. | Structured output schema, explanation fields, uncertainty statements, rationale sections, stakeholder-appropriate language, reviewer comment on clarity. | A note summary is organised into Symptoms, Diagnosis, Treatment, and Red Flags, with a short 'uncertainty/limits' line for the clinician. |
| Dependability | Whether behaviour is stable, predictable, and acceptably robust across repeated runs and small variations. | Not a claim that outputs never change. Not the same as one good-looking answer. | GenAI can vary across runs or small prompt changes. In high-stakes use, unmanaged variance weakens trust and safe deployment. | Unstable recommendations, drift, inconsistent safety behaviour, and unreliable operational use. | Score low when only one run exists or variance is unmanaged. Score high when repeat-run testing, thresholds, and stability checks are documented. | Repeated-run outputs, seeds/settings where available, perturbation tests, variance notes, drift checks, reviewer comparison notes, change-control records. | Three repeats of the same healthcare prompt produce similar summaries and all preserve the red-flag warning; that supports a stronger dependability score. |
| Traceability | Whether key claims can be linked back to inputs, sources, retrieved passages, configurations, and decision steps. | Not just adding citations after the fact. Not the same as general transparency. | GenAI often produces plausible statements without showing where they came from. Traceability anchors claims to evidence and provenance. | Hallucinated support, weak provenance, and inability to contest or verify specific statements. | Score low when claims cannot be linked to source material or lineage. Score high when provenance is preserved end-to-end and reviewers can follow it. | Input record ID, source-note line references, retrieval query, retrieved document IDs/hashes, citation map, model and adapter lineage, tool outputs. | A summary says 'wheeze worsened at night'; traceability means the tester can point to the exact sentence in the source note or retrieved guideline that supports it. |
1. Experiment design
The minimum successful experiment uses one synthetic health record and one structured prompt. The stronger workshop version uses three repeated runs on the same record to show Dependability, plus one baseline prompt as a comparison. The optional RAG-lite condition adds approved governance guidance snippets so supervisors can see how retrieval-style evidence improves Traceability and Auditability when snapshot IDs and hashes are stored.
| Element | Recommended setting | Purpose |
|---|---|---|
| Domain | Healthcare / clinical note summarisation | High-impact context that makes evidence, oversight, red flags and contestability easy to explain. |
| Data source | Synthetic health database included in the pack | Avoids PII and ethics problems while preserving realistic governance pressure. |
| Task | Summarise a clinical-style note into fixed sections | Shows how structure improves interpretability and reviewability. |
| Primary condition | Structured prompt | Treats prompting as a governance intervention rather than only a writing tactic. |
| Evidence method | Manual logging plus SHA-256 hashing | Creates reconstructable evidence without external infrastructure. |
| Scoring | 1-5 across Responsibility, Auditability, Interpretability, Dependability and Traceability | Turns evidence completeness and governance readiness into a comparable profile. |
| Repeat runs | At least three runs for one record | Supports Dependability by checking output stability and variance. |
2. Health database used in the demonstration
The included file health_records_synthetic.csv contains six synthetic health records. The script converts this CSV into a small SQLite database called health_demo.db. This keeps the demonstration close to the user’s idea of a health database while avoiding the risk of using real patient data in a workshop.
| Record ID | Specialty | Risk tier | Workshop title |
|---|---|---|---|
| HC001 | Respiratory | Medium | Allergic rhinitis with asthma history |
| HC002 | Endocrine and Surgery | High | Diabetes and planned procedure |
| HC003 | Cardiology | High | Chest pain assessment |
| HC004 | Primary Care | Low | Medication refill request |
| HC005 | Neurology | High | Headache with warning features |
| HC006 | Respiratory | Medium | Asthma symptom review |
Recommended first record
Start with HC001 because it is clinically understandable, medium risk, and contains red flag advice without being too complex. Then use HC003 or HC005 to show why high-risk records require stronger Responsibility, Auditability and human review evidence.
3. Prompt registry and prompt versions
The prompt registry makes the prompt a governed artefact. Instead of saying “I used a prompt”, the experiment records a prompt ID, a prompt version, an owner role, the task type, the output schema and the reason for the prompt design. This is important because small prompt changes can materially change the output and the governance score.
| Prompt ID | Version | Task type | Purpose |
| PROMPT-HC-SUMM-BASE | v1.0 | clinical_note_summary | Baseline comparison prompt for unstructured clinical note summarisation in a RAIDT workshop. |
| PROMPT-HC-SUMM-STRUCT | v1.0 | clinical_note_summary | Structured prompt for evidence-grounded clinical note summarisation with scope limits, uncertainty and line references. |
| PROMPT-HC-SUMM-RAG-LITE | v1.0 | clinical_note_summary | Structured prompt plus approved governance guidance snippets for manual RAG-like traceability demonstration. |
4. Baseline prompt for contrast
The baseline prompt is useful only as a comparator. It usually produces fluent text, but the output may be weak on structured uncertainty, explicit limits and line-referenced traceability. This helps supervisors see why prompt engineering alone is not the novelty, while structured prompting can still act as one governance intervention inside RAIDT.
PROMPT-HC-SUMM-BASE v1.0
| You are assisting with a research demonstration. Summarise the following clinical note. Important: This is not clinical advice and not a final clinical decision. Use only the note below. Record ID: {{record_id}} Specialty: {{specialty}} Risk tier: {{risk_tier}} Source note: {{line_numbered_note}} |
|---|
5. Structured prompt for the main manual RAIDT run
This is the main prompt for the workshop. It is intentionally structured, bounded, and evidence-oriented. It instructs the model to use only the source note, avoid invention, state missing information, surface red flags, preserve uncertainty and provide line references. These instructions make the output easier to review, but the prompt itself is not the whole RAIDT contribution. The RAIDT contribution appears when the prompt, output, metadata, review decision and scoring are captured as a run-level evidence pack.
PROMPT-HC-SUMM-STRUCT v1.0
| SYSTEM / ROLE You are a clinical note summarisation assistant used only for a RAIDT governance demonstration. You are not a clinician, you are not making a final diagnosis, and you must not provide treatment advice beyond summarising what is explicitly stated in the source note. Your output is a draft for human clinical review only. GOVERNANCE CONSTRAINTS 1. Use only the provided source note and, if present, the approved governance guidance snippets. 2. Do not invent facts, diagnoses, tests, medicines, demographic details, or clinical conclusions. 3. If information is missing, state: "Not stated in the source note". 4. Preserve uncertainty and limits. 5. Surface any red flags or escalation instructions that are present in the note. 6. Use line references such as [N1] and guidance references such as [GOV-HC-001] where relevant. 7. Do not include private identifiers. This demonstration uses synthetic records only. RUN CONTEXT Record ID: {{record_id}} Specialty: {{specialty}} Risk tier: {{risk_tier}} Prompt ID: PROMPT-HC-SUMM-STRUCT Prompt version: v1.0 Output schema: HC_SUMMARY_SCHEMA_v1 SOURCE NOTE WITH LINE REFERENCES {{line_numbered_note}} REQUIRED OUTPUT FORMAT Draft status 1. Symptoms or presenting issues 2. Diagnosis, assessment, or condition discussed 3. Treatment, actions, or follow-up mentioned 4. Red flags or escalation points 5. Uncertainty and limits 6. Evidence links used 7. Reviewer attention points |
6. Optional RAG-lite prompt condition
The optional RAG-lite condition adds approved governance guidance snippets to the prompt. This is not a full technical RAG system; it is a manual demonstration of the same governance idea. The key point is that the guidance block is identified and hashed, so the run can show exactly what external governance guidance was present. This usually strengthens Auditability and Traceability compared with a prompt-only run.
PROMPT-HC-SUMM-RAG-LITE v1.0 guidance block
| Approved governance guidance snippets for this run: [GOV-HC-001 v1.0] AI-generated clinical summaries are draft outputs for human clinical review only. [GOV-HC-002 v1.0] If a detail is not provided in the source note, state that it is not stated. [GOV-HC-003 v1.0] Separate symptoms, assessment, treatment/actions, red flags, and uncertainty/limits. [GOV-HC-004 v1.0] When red flags are present, surface them clearly. [GOV-HC-005 v1.0] Every material summary claim should point back to a note line reference or guidance ID. Then use the same structured prompt and required output format. |
7. Step-by-step manual execution workflow
This section is written as the workshop operating procedure. It assumes the researcher will manually paste the prompt into a GenAI interface, then copy the exact output back into the experiment folder. The code does not call an external model; it only prepares prompts, hashes artefacts, logs metadata and creates the evidence pack.
| Step | Action | What supervisors should observe | Command / action |
| 1 | Initialise | Run the script once to create the SQLite database, folders and log files. | python raidt_manual_experiment.py init |
| 2 | List records | Show supervisors the available synthetic records and risk tiers. | python raidt_manual_experiment.py list-records |
| 3 | Prepare prompt | Generate a complete prompt file for one record and one configuration. | python raidt_manual_experiment.py prepare --record-id HC001 --config structured --replicate 1 |
| 4 | Run model manually | Open the generated prompt file, paste it into the GenAI interface, and run it once. | Copy/paste manually |
| 5 | Save exact output | Save the exact model output into outputs/<RUN_ID>_output.txt without editing. | Use the filename printed by the script |
| 6 | Log evidence | Hash the prompt, input and output, then create the run evidence pack. | python raidt_manual_experiment.py log --run-id <RUN_ID> --model-provider "..." --model-id "..." |
| 7 | Score run | Use the RAIDT rubric to assign pillar scores and rationale. | python raidt_manual_experiment.py score --run-id <RUN_ID> --r 4 --a 4 --i 5 --d 3 --t 4 --notes "..." |
| 8 | Repeat | Repeat the same record and prompt at least three times to support Dependability scoring. | Change --replicate to 2 and 3 |
| 9 | Compare | Compare baseline, structured and optional RAG-lite evidence packs and scores. | Open run_log.csv and raidt_scoring_sheet.csv |
8. Evidence pack: what to collect and why
The evidence pack is the central practical object in the demonstration. It is not just a screenshot or a paragraph of explanation. It is the structured record that lets a later reviewer reconstruct what happened. In this manual experiment, the evidence pack is stored as JSON and mirrored in the run log CSV.
| Evidence field | Meaning | RAIDT reason |
| run_id | Unique ID for one configured use | Links all evidence elements together for reconstruction and audit sampling. |
| timestamp_utc | UTC time of execution | Shows which model, policy and prompt version were active at that time. |
| record_id and input_sha256 | Synthetic database row and input fingerprint | Proves which record was used without relying on memory. |
| prompt_id, version and prompt_sha256 | Controlled prompt metadata and fingerprint | Shows the exact instruction set and prevents invisible prompt drift. |
| model provider, ID and version | Manual entry for the model used | Anchors the output to a specific model or interface as far as the platform allows. |
| decoding parameters | Temperature, top_p, max_tokens where available | Supports Dependability analysis because these settings affect variance. |
| retrieval snapshot ID and hash | Optional guidance block or retrieved context fingerprint | Improves Traceability and Auditability when retrieval-like context is used. |
| output file and output_sha256 | Exact generated output and fingerprint | Supports integrity checking and prevents later untracked changes. |
| review decision and notes | Human accept, edit, reject or escalate decision | Shows human oversight and Responsibility evidence. |
| RAIDT scores and evidence pointers | Five pillar scores plus rationale | Turns evidence into a governance-readiness profile. |
9. Scoring rubric for the workshop
The scoring should be conservative and evidence-based. Score the run evidence pack, not the attractiveness of the generated prose. A high score does not certify clinical correctness, legal compliance, or patient safety. It means the run is better evidenced for reconstruction, review and improvement.
| Pillar | Score 1 | Score 3 | Score 5 |
| Responsibility | Purpose unclear; no limits or oversight; output could be used outside intended context. | Purpose and limits stated, but oversight or escalation only partial. | Clear draft status, intended use, limits, human review, escalation and accountability evidence. |
| Auditability | Cannot reconstruct the run; missing prompt, model, output or review evidence. | Core artefacts exist, but versions, hashes, settings or rationale are incomplete. | Complete run record with versioned prompt, model/configuration, hashes, retention and review rationale. |
| Interpretability | Output is opaque, misleading, unstructured, or lacks uncertainty. | Readable output, but uncertainty, limits or red flag logic is incomplete. | Clear structured sections, line references, explicit uncertainty, reviewer attention points and appropriate wording. |
| Dependability | Single run only or unstable behaviour unmanaged. | Basic repeated runs or observation exists, but variance is not fully analysed. | Repeated runs show stable structure and bounded variance, with settings and mitigation recorded. |
| Traceability | Claims cannot be linked to source note or guidance. | Some line references or source links exist but not complete or immutable. | Every material claim links to source note lines and/or stored guidance snapshots with hashes. |
10. Worked scoring example for one structured prompt run
The following worked example uses HC001 and the structured prompt. The score is illustrative because the actual score should be assigned after the live model output is inspected. The example is intentionally conservative on Dependability because a single run cannot prove stability.
| Dimension | Score | Evidence-based rationale |
| Responsibility | 4 | The prompt states draft status, source-only use, no diagnosis, no treatment advice beyond source note, and human review. A score of 5 would require stronger workflow-level evidence of accountable clinical review and escalation. |
| Auditability | 4 | The run has run ID, timestamp, prompt ID/version/hash, input hash, output hash and reviewer notes. A score of 5 would require fuller platform-level metadata, controlled retention and reproducibility strategy. |
| Interpretability | 5 | The output schema forces symptoms, assessment, treatment/actions, red flags, uncertainty and reviewer attention. This is highly legible for review. |
| Dependability | 3 | One run is useful but insufficient. Repeat the same prompt and record three times before scoring 4 or 5. |
| Traceability | 4 | The prompt requires line references to the source note. A score of 5 would be stronger if external guidance or retrieval snapshots were also stored with IDs and hashes. |
| Composite | 4.0 | Use the composite only as a summary. Keep the pillar profile visible because it shows that Dependability is the current weakness. |
11. Workshop time plan for a three-hour supervision session
| Time | Activity | Purpose |
| 0-15 min | Frame RAIDT | Explain that the demonstration is about GenAI governance, not clinical automation. |
| 15-30 min | Inspect data and prompt registry | Show the synthetic health database, prompt IDs, versions and risk tiers. |
| 30-55 min | Run structured prompt | Generate one real output in the room and save it exactly. |
| 55-80 min | Build evidence pack | Run logging code and inspect JSON/CSV evidence with supervisors. |
| 80-110 min | Score pillars | Score R, A, I, D and T together; discuss what evidence supports or limits each score. |
| 110-135 min | Repeat or compare baseline | Run a second/third replicate or compare against baseline prompt. |
| 135-160 min | Discuss optional RAG-lite | Show how guidance snapshots and hashes can improve Traceability and Auditability. |
| 160-180 min | Q&A and implications | Connect the manual demonstration back to RAIDT as run-level governance method. |
12. Prepared supervisor questions and answers
| Likely question | Workshop answer |
| Is this just prompt engineering? | No. The structured prompt is only the intervention used to make the run easy to inspect. RAIDT is the governance method: evidence pack plus score profile. |
| Is the model output the contribution? | No. The output is the event being governed. The contribution is showing what evidence must exist to reconstruct, review and score that event. |
| Can RAIDT be done manually? | Yes. The manual pilot saves prompt, output, metadata, hashes and scoring sheet. This proves the logic before automation. |
| What makes this healthcare example useful? | Healthcare makes risk, oversight, uncertainty, escalation and duty of care visible, but the same run-level logic can transfer to finance, law, public services and cyber. |
| Does a high RAIDT score mean the output is clinically correct? | No. RAIDT scores governance readiness from evidence. Clinical correctness still requires domain review and safety practice. |
| Why score Dependability differently? | Dependability needs repeat-run evidence. A single fluent output cannot prove stability. |
| What evidence most improves Auditability? | Stable run ID, timestamp, prompt version/hash, model/configuration details, output hash, review notes and retention/access rules. |
| What evidence most improves Traceability? | Line references to the input note, retrieved context IDs, document hashes, prompt lineage and source-to-claim mapping. |
| What would automation add? | Automation would reduce burden and capture platform metadata directly, but the governance logic remains the same. |
13. Code package and commands
The included Python script is designed to support manual execution. It does not call an AI API and therefore avoids dependency on a particular provider. It prepares prompts, creates a SQLite database, computes SHA-256 hashes, appends CSV logs and writes JSON evidence packs.
Copy-ready command sequence
| # Initialise the synthetic health database and folder structure python raidt_manual_experiment.py init # See available synthetic records python raidt_manual_experiment.py list-records # Prepare the main structured-prompt run python raidt_manual_experiment.py prepare --record-id HC001 --config structured --replicate 1 # After manually running the prompt and saving the output, log the evidence python raidt_manual_experiment.py log --run-id <RUN_ID> --model-provider "Manual interface" --model-id "Model used" --model-version "Version if known" # Score the run after review python raidt_manual_experiment.py score --run-id <RUN_ID> --r 4 --a 4 --i 5 --d 3 --t 4 --notes "Structured run; single-run dependability remains preliminary. |
14. File list in the experiment pack
| File | Role in the experiment |
| health_records_synthetic.csv | Synthetic health records used to populate the SQLite database. |
| approved_health_governance_guidance.csv | Optional governance guidance snippets for RAG-lite traceability demonstration. |
| prompt_registry.csv | Registered prompt IDs, versions, status, owner and purpose statements. |
| raidt_manual_experiment.py | Python helper for database creation, prompt preparation, evidence logging and scoring. |
| run_log_template.csv | CSV header template for run metadata. |
| raidt_scoring_sheet_template.csv | CSV header template for pillar scores and rationale. |
| example_output_for_demo_only.txt | Illustrative output only; replace with live GenAI output during workshop. |
| sample_scored_evidence_pack.json | Example evidence pack generated from the demonstration files. |
15. What counts as a successful demonstration?
The demonstration succeeds if supervisors can inspect the chain from record to prompt, from prompt to output, from output to evidence pack, and from evidence pack to pillar scores. It does not need to produce a medically perfect output. In fact, a flawed output can be useful because it shows that RAIDT can surface governance weaknesses rather than hide them.
| Level | Observable outcome |
| Minimum success | One structured prompt run; exact output saved; evidence pack created; five pillar scores assigned with rationale. |
| Stronger success | Three repeated structured runs for the same record; dependability discussed using observed variance. |
| Best workshop success | Baseline, structured, and RAG-lite runs compared; supervisors can see which evidence fields improve which pillars. |
| Evidence of learning | The team identifies which missing evidence would improve scores and how manual RAIDT could become semi-automated. |
Appendix A. Experiment Main Message
“What I want to show today is not that this prompt is clinically superior, and not that prompt engineering is the novelty of the PhD. I want to show how RAIDT works as a governance method. We will take one synthetic health record, run one structured GenAI task, save the evidence, and score the run. The important object is the run-level evidence pack. If we can reconstruct what happened, we can review it. If we can review it, we can score it. If we can score it, we can compare and improve it.”
“The reason I am starting manually is that manual execution exposes the logic very clearly. Automation can come later. Here, we can see every object: the database row, the registered prompt, the prompt hash, the model output, the output hash, the reviewer decision and the RAIDT profile. This is the practical meaning of run-level evidence.”
Appendix B. Minimum evidence checklist for printing
| Done | Evidence item |
| ☐ | Run ID created |
| ☐ | Synthetic record selected |
| ☐ | Prompt ID and version recorded |
| ☐ | Prompt hash stored |
| ☐ | Input hash stored |
| ☐ | Model/provider/version recorded as far as visible |
| ☐ | Decoding settings recorded if visible |
| ☐ | Output saved exactly |
| ☐ | Output hash stored |
| ☐ | Reviewer decision recorded |
| ☐ | Five RAIDT pillar scores entered |
| ☐ | Score rationale linked to evidence |
| ☐ | Repeat runs completed for Dependability if applicable |
| ☐ | Evidence pack saved and retained securely |
Appendix C. Internal project basis used for this manual
This manual is aligned with the RAIDT project materials that define RAIDT as a run-level evidence framework, treat the run as the unit of governance, identify the two practical outputs as the run-level evidence pack and RAIDT score profile, define the five pillars, and allow manual implementation as an early pilot. It also follows the healthcare playbook logic for clinical note summarisation, prompt engineering, metadata logging, reviewer checkpoints and scoring.
| Source family | How it informs this manual |
| RAIDT Q&A Guide | Manual implementation, evidence pack questions, healthcare applicability and 100-question architecture. |
| RAIDT Mind Map | Implementation modes, governance interventions, evidence architecture and pillar definitions. |
| Healthcare Sector Playbook | Clinical note summarisation task, structured prompts, metadata logging, review checkpoints and scoring examples. |
| RAIDT scoring appendix | 1-5 anchors, composite/profile distinction, calibration and evidence-based scoring. |
| RAIDT rules | Prompt transparency, run logs, output hashes, reviewer forms and reproducibility package expectations. |
Appendix D. Executed local command walkthrough and tester comments
This appendix translates the command-line workflow into a supervision-ready operating note. Each step shows the exact command, the expected result, what the tester should look at, and why the step matters in RAIDT terms. The purpose is not only to run the script successfully, but to help a reviewer understand how manual execution becomes a run-level evidence pack.
Step D1. Initialise the experiment workspace
Command:
& python.exe .\raidt_manual_experiment.py init
Expected output:
Initialised SQLite database: raidt_health_manual_pack\health_demo.db
Created folders: prompts_to_run, outputs, pending_runs, evidence_packs
What this step means: This command creates the local experiment environment. It prepares the SQLite database that stores the synthetic health records and also creates the folders that will later hold the generated prompt files, the saved model outputs, any runs waiting for completion, and the final evidence packs.
Tester comment – what to look at: The tester should check that the database file now exists and that all four folders are visible inside the working pack. This confirms that the experiment has a stable place to store artefacts rather than relying on ad hoc screenshots or temporary copy-paste notes.
Why it matters for RAIDT: In RAIDT terms, this step creates the storage structure that makes later auditability and traceability possible. Without a controlled local workspace, the run cannot be reconstructed cleanly.
Step D2. List the available synthetic health records
Command:
& python.exe .\raidt_manual_experiment.py list-records
Expected output:
HC001 | Respiratory | Medium | Allergic rhinitis with asthma history
HC002 | Endocrine and Surgery | High | Diabetes and planned procedure
HC003 | Cardiology | High | Chest pain assessment
HC004 | Primary Care | Low | Medication refill request
HC005 | Neurology | High | Headache with warning features
HC006 | Respiratory | Medium | Asthma symptom review
What this step means: This command reads the database and shows the available records that can be used for a manual run. It is the selection stage of the experiment. The researcher decides which record will be used in the next step.
Tester comment – what to look at: The tester should confirm that the records are visible, that each one has a record ID, domain label, risk level, and short description, and that the chosen example matches the demonstration goal. For a first workshop run, HC001 is a sensible low-friction choice because it is clinically understandable and medium risk.
Why it matters for RAIDT: This step matters because RAIDT is not scored in the abstract. It is always tied to one configured run against one specific task record in context.
Step D3. Prepare one structured-prompt run
Command:
& python.exe .\raidt_manual_experiment.py prepare --record-id HC001 --config structured --replicate 1
Expected output:
Prepared run: RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477
Prompt file: raidt_health_manual_pack\prompts_to_run\RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477_prompt.txt
Next step: paste the prompt into your chosen GenAI interface.
Then save the exact model output into: raidt_health_manual_pack\outputs\RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477_output.txt
What this step means: This command creates a unique run identifier and writes the exact prompt that must be used for the manual GenAI execution. The script has now frozen the run identity, the selected record, the chosen configuration type, and the replicate number.
Tester comment – what to look at: The tester should inspect the generated prompt text file and check that it contains the expected structured instructions and the chosen health record content. The tester should also note the full run ID, because this same ID must travel through the output file, evidence pack, and scoring step.
Why it matters for RAIDT: This is the key bridge between concept and practice. The prompt becomes a governed artefact with a stable identifier, rather than an invisible piece of transient user behaviour.
Step D4. Save the model output exactly as generated
Command:
notepad .\outputs\RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477_output.txt
Expected output:
Open the output text file and paste the exact response returned by the chosen GenAI interface. Save the file without rewriting or cleaning the answer.
What this step means: This is the manual capture step. After pasting the prepared prompt into the GenAI interface, the researcher copies the returned answer into the named output file. The file must contain the exact output because this text is what will later be hashed and treated as evidence.
Tester comment – what to look at: The tester should check that the file is no longer empty, that the response includes the expected structured sections, and that no silent edits were introduced after generation. If the file is empty, the later SHA-256 hash will be the well-known empty-file hash, which means the experiment has logged an empty output rather than a real run result.
Why it matters for RAIDT: This step directly affects auditability, interpretability, and traceability. If the recorded output is incomplete or altered, the evidence pack will not faithfully represent what the model actually produced.
Step D5. Log the run and generate the evidence pack
Command:
& python.exe .\raidt_manual_experiment.py log --run-id "RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477" --model-provider "OpenAI" --model-id "ChatGPPT" --model-version "model used in interface"
Expected output:
Logged run evidence pack: raidt_health_manual_pack\evidence_packs\RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477_evidence_pack.json
Output SHA-256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
What this step means: This command assembles the run metadata and writes the JSON evidence pack. It records the provider name, the model identifier, the model version description, the run ID, and the hash of the saved output text.
Tester comment – what to look at: The tester should open the JSON evidence pack and confirm that the run ID, prompt reference, output file, provider, model fields, and hash are present. In the example above, the hash shown is the empty-file SHA-256 value, which strongly suggests that the output file was still blank when the log command was executed. That is a useful teaching moment: RAIDT can surface evidential weakness immediately. The tester should rerun this step after saving a real output so that the hash changes to a content-specific value. The tester should also replace the placeholder model text with the real interface name and version used in practice.
Why it matters for RAIDT: This step is the clearest illustration of RAIDT's run-level logic. The evidence pack is the practical object that later supports reconstruction, review, challenge, and scoring.
Step D6. Score the run across the five RAIDT pillars
Command:
& python.exe .\raidt_manual_experiment.py score --run-id "RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477" --r 4 --a 4 --i 5 --d 3 --t 4 --notes "Structured output with source-note line references. Single run only, so dependability is conservative."
Expected output:
Updated scoring in evidence pack: raidt_health_manual_pack\evidence_packs\RUN-HC001-STRUCTURED-R1-20260419T215828Z-AAED6477_evidence_pack.json
Composite score: 4.0
What this step means: This command appends the reviewer judgement to the evidence pack. The five numeric inputs correspond to Responsibility, Auditability, Interpretability, Dependability, and Traceability. The note explains the reason for the scores and preserves the review logic.
Tester comment – what to look at: The tester should first confirm that all five pillar scores and the written notes were saved into the JSON evidence pack. Then the tester should read the scores as a governance judgement about the run evidence, not as a judgement about medical correctness alone. Responsibility asks whether the run was used for an appropriate purpose, under clear limits, with suitable human oversight and escalation. To justify Responsibility, the tester should look for evidence that the prompt states the task clearly, avoids making a final diagnosis, includes uncertainty where information is incomplete, and supports safe human review. Auditability asks whether the run can be reconstructed later. To justify Auditability, the tester should look for the run ID, timestamp, prompt reference, model/provider details, output file path, and output hash, because these allow another reviewer to inspect what happened. Interpretability asks whether the output is understandable and structured for the intended user. To justify Interpretability, the tester should check that the response uses the expected structured sections, stays readable, and makes its limits visible rather than hiding uncertainty behind fluent text. Dependability asks whether the system behaves in a stable and predictable way across repeated runs. To justify Dependability, the tester should look for repeat-run evidence, comparison logs, or variance checks. In this example, Dependability is lower because only one run was executed, so behavioural stability has not yet been tested through repeats. Traceability asks whether key statements can be linked back to the input note, sources, or recorded decision steps. To justify Traceability, the tester should check whether the summary stays anchored to the source record and whether the evidence pack preserves enough information to connect the output back to the prompt and input case. The tester should finally emphasise that the composite score is only a summary. The real governance picture is the five-pillar profile, because trade-offs between Responsibility, Auditability, Interpretability, Dependability, and Traceability are usually more informative than the average alone.
Why it matters for RAIDT: This is where RAIDT becomes measurable. The score is not a claim that the answer is medically correct. It is a structured judgement about governance readiness as evidenced by the recorded run.
Appendix D closing note
When demonstrated live, these six steps show the complete RAIDT chain: select one record, prepare one governed prompt, run the model manually, save the exact output, generate the evidence pack, and score the run profile. If a step is weak or incomplete, the weakness becomes visible in the evidence itself. That is precisely the educational value of the manual demonstration.