Q044 — What does an Alignment policy ID add to RAIDT evidence?

Star S4 - Evidence Architecture and Artefacts · primary item: S4.14 · Alignment policy ID

It shows which behavioural constraints governed the run, rather than treating safety as an invisible default.

Answer

An Alignment policy ID adds a distinct provenance element to the run-level evidence pack: it identifies which active safety or preference layer shaped the response in that specific run. Across the RAIDT papers, this matters because deployed behaviour is not determined by the base model alone. It is also shaped by prompts, retrieval, adapters, toolchains, and alignment layers. The Foundations paper is explicit that alignment controls must be logged as an active policy layer, and that traceability extends to adapter and alignment policy versions. In RAIDT terms, the field therefore turns a vague claim that a system was "safety aligned" into inspectable run evidence tied to one configured use.
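As an illustration only, a run-level evidence record carrying this field could be sketched as below. The RAIDT papers are not quoted here as prescribing a schema, so the class name `RunEvidence`, every field name, the helper `missing_provenance`, and all example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical run-level evidence record; field names are illustrative."""
    run_id: str
    model_deployment: str
    prompt_hash: str
    adapter_version: Optional[str] = None
    # Identifies the active safety or preference layer for this run.
    alignment_policy_id: Optional[str] = None


def missing_provenance(record: RunEvidence) -> list[str]:
    """Return the names of provenance fields left unrecorded for this run."""
    return [name for name, value in asdict(record).items() if value is None]


# Example values are placeholders, not real identifiers.
record = RunEvidence(
    run_id="run-0001",
    model_deployment="model-deploy-a",
    prompt_hash="sha256:example",
    adapter_version="adapter-v1",
)
print(missing_provenance(record))  # ['alignment_policy_id']
```

A check of this kind makes the absence of the alignment layer visible when the run is recorded, rather than leaving it to be discovered during a later review.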

This addition matters across the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability). For Responsibility, it helps show which safeguards or refusal patterns were active when a user relied on the output. For Auditability and Traceability, it lets reviewers later reconstruct why one run refused, hedged, or reformulated content differently from another apparently similar run. For Dependability, it supports comparison of output stability across runs when the alignment layer changes. For Interpretability, it makes clear that part of the observable behaviour came from a policy layer rather than from the prompt alone. Because RAIDT treats the run as the unit of governance and influence methods as governance interventions, an Alignment policy ID strengthens the score profile by converting an otherwise hidden governance intervention into a recorded, reviewable artefact. Without it, reviewers must place the run against the scoring anchors (1 = missing, 3 = partial, 5 = audit-ready) on the basis of inference rather than evidence.

Practical example

In a healthcare note-summarisation workflow, a hospital may use the same underlying model and the same prompt template on two different dates, yet receive noticeably different outputs about uncertainty, escalation, or refusal to infer missing clinical details. If the run-level evidence pack records the prompt hash and model deployment but omits the Alignment policy ID, reviewers can see that behaviour changed but cannot show which active safety layer produced that change.

If the pack includes an Alignment policy ID such as a policy-layer version linked to refusal templates and safety settings, the organisation can reconstruct the run properly. That makes it possible to explain why a summary was more conservative, why certain inferences were blocked, and whether the change improved or degraded the score profile for a high-stakes clinical use.
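The reconstruction described above can be sketched as a field-by-field comparison of two evidence packs. This is a minimal illustration, not a RAIDT-specified procedure: the function `changed_fields`, the dictionary keys, and the identifiers such as `policy-v3` are hypothetical.

```python
def changed_fields(run_a: dict, run_b: dict) -> dict:
    """Map each field whose value differs to its (run_a, run_b) pair."""
    keys = sorted(set(run_a) | set(run_b))
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }


# Two runs of the same workflow on different dates (placeholder values).
earlier = {
    "model_deployment": "summariser-deploy-1",
    "prompt_hash": "sha256:aaa",
    "alignment_policy_id": "policy-v3",
}
later = {
    "model_deployment": "summariser-deploy-1",
    "prompt_hash": "sha256:aaa",
    "alignment_policy_id": "policy-v4",
}
print(changed_fields(earlier, later))
# {'alignment_policy_id': ('policy-v3', 'policy-v4')}

# The same two runs with the Alignment policy ID omitted from the pack.
earlier_no_id = {k: v for k, v in earlier.items() if k != "alignment_policy_id"}
later_no_id = {k: v for k, v in later.items() if k != "alignment_policy_id"}
print(changed_fields(earlier_no_id, later_no_id))  # {}
```

With the field recorded, the comparison points to the policy layer as the only logged difference between the runs. Without it, the packs look identical even though the outputs differed, which is the gap the practical example describes.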
