S4.14 - Alignment policy ID
flowchart LR
    A["Problem: behavioural differences are often unexplained<br/>if policy layers remain hidden"] --> B["RAIDT<br/>run-level evidence framework"]
    H["Practical artefacts<br/>model version, safety profile, moderation settings,<br/>wrapper rules, reviewer notes, timestamps"] --> C[["Alignment policy ID<br/>active behavioural-governance layer"]]
    B --> C
    C --> D["Evidence pack"]
    C --> E["RAIDT score profile"]
    D --> F["Reviewer reconstruction<br/>and contestability"]
    E --> G["Governance readiness<br/>and organisational learning"]
Star context: Star S4 (Evidence Architecture and Artefacts) specifies the concrete fields and artefacts that make a run record inspectable. In this star, the alignment policy ID identifies which behavioural governance layer was active during a run, so that response conduct, refusal style, safety intervention, and policy-conditioned behaviour can be reviewed as evidence rather than inferred after the fact.
Academic picture
Definition / background
The alignment policy ID records the specific safety, preference, or behavioural-governance layer that shaped a model's conduct in a given run. In practice, this may point to an RLHF-influenced policy release, a DPO-style preference layer, a provider safety profile, a moderation policy pack, or an institutional wrapper that constrains outputs through refusal rules, escalation thresholds, or domain-specific safeguards. The item does not merely say that a model was aligned; it identifies which alignment regime was active when the output was produced.
Conceptually, this item emerges from a practical difficulty in generative AI governance: response behaviour is often partly determined by layers that sit between raw model capability and the final observed answer. A model/provider/version identifier tells us which model family was invoked. An adapter ID or PEFT lineage tells us whether a fine-tuned derivative influenced behaviour. A prompt ID tells us what instructions were supplied. The alignment policy ID is different. It names the active behavioural policy layer that may govern refusal, compliance, tone, warning behaviour, protected-topic handling, and the boundary between helpfulness and restraint.
Within RAIDT, this belongs inside Evidence Architecture and Artefacts because it is a concrete field needed to make a run record inspectable. If a reviewer cannot identify the active alignment policy layer, then a material cause of behavioural variation remains hidden. That weakens run reconstruction and reduces confidence in governance judgements about why an output was produced, blocked, reframed, or escalated.
The item also matters because RAIDT is built around run-level evidence, evidence packs, and five-pillar scoring. Alignment policy ID contributes to the evidence pack by recording one of the conditions under which the run occurred. It contributes to the score profile because the adequacy of behavioural controls, traceability of safety intervention, and reviewability of refusals or constrained outputs all depend in part on whether the active policy layer can be identified and interpreted.
Why this concept matters
This concept solves a recurring governance problem: organisations often observe differences in output behaviour without being able to explain whether those differences arose from the prompt, the model, retrieval context, user role, or an unseen alignment layer. Without an explicit alignment policy ID, behavioural governance remains partly opaque. Reviewers may know that a system responded cautiously or refused a request, but they cannot reliably determine whether that conduct reflected the expected policy baseline, an updated safety regime, or an unintended policy mismatch.
The item also prevents a common confusion between model identity and behavioural policy identity. Two runs can use the same model version yet behave differently because different alignment policies, moderation settings, or wrapper-level safety rules were active. Conversely, a provider may update a policy layer while the model name shown to end users stays the same. Recording the alignment policy ID helps organisations avoid attributing behaviour to the wrong cause.
If this field is missing, several risks arise: weak accountability for refusal or safety decisions, poor comparison between runs, difficulty explaining changes after platform updates, and limited ability to contest or defend behaviour during review. In organisational settings, these gaps matter because stakeholders often ask not only what the system produced, but why it behaved in that way under that set of conditions.
RAIDT uses this concept to move from principles to operational governance. Rather than simply asserting that a system is aligned or safe, RAIDT asks whether the active behavioural policy layer for a specific run can be evidenced, reviewed, and linked to governance outcomes.
Key idea: Alignment policy ID matters because it makes the active behavioural-governance layer of a GenAI run inspectable, allowing response conduct and safety intervention to be reviewed as evidence rather than guessed from outputs alone.
What this item captures
- The identifier or version label of the active alignment, safety, or preference-governance layer in a run.
- The behavioural regime that may shape refusal patterns, warning behaviour, tone, escalation, and compliance boundaries.
- A distinction between model identity and policy-layer identity when the same model can operate under different behavioural constraints.
- Evidence needed to explain why similar prompts may yield materially different outputs across runs.
- A linkage point between provider-side controls and organisation-side governance review.
- A traceable artefact for comparing policy changes, regressions, overrides, or region-specific policy configurations.
- A practical basis for scoring reviewability, auditability, and traceability when behaviour is conditioned by hidden or semi-hidden controls.
Practical example / likely audience question
Audience question
Why should RAIDT record an alignment policy ID if we already log the model version and prompt version?
Answer
The concern behind this question is that model and prompt identifiers may appear sufficient to explain behaviour. In many cases they are not. A model version identifies the underlying system, and a prompt version identifies the task instructions, but neither necessarily tells a reviewer which behavioural-governance layer was active when the response was generated. If refusal logic, safety thresholds, or preference tuning changed between runs, then the observed behaviour may not be fully explained by model and prompt data alone.
Consider a public-service drafting assistant used to help staff prepare responses to citizens' enquiries. Two staff members run the same prompt template against the same model family a week apart. In the first run, the system gives a direct procedural answer. In the second, it refuses to provide a similar answer and instead redirects the user to formal channels. If the organisation has logged only the model and prompt, reviewers may wrongly assume inconsistency or user error. If the alignment policy ID is also captured, they may discover that a new safety policy pack was activated after a provider update or an internal governance wrapper change.
RAIDT handles this better than a generic AI governance approach because it treats behavioural policy as a run-level evidential condition rather than an abstract assumption. The framework does not ask reviewers to infer alignment from surface behaviour alone. It asks whether the specific policy layer that governed the run can be named, compared, and assessed alongside the rest of the evidence pack.
Practical example in RAIDT terms
Consider a finance setting in which an analyst uses a GenAI assistant to draft an internal explanation of suspicious transaction patterns for escalation to a compliance team. The use case is legitimate, but the run-level issue is that the assistant may refuse to discuss certain scenarios, produce heavily hedged text, or block content that appears to offer procedural guidance on financial crime patterns. The organisation therefore needs to know whether the behaviour reflected the intended compliance-aligned policy layer or an over-restrictive configuration that reduced operational usefulness.
The evidence needed includes the run ID, timestamp, user role, prompt ID, model/provider/version identifier, any retrieval inputs, the alignment policy ID, the generated output, and reviewer notes on whether the result was usable or inappropriately constrained. Responsibility is affected because a compliance lead may need to justify whether the system was configured appropriately for internal analytical work. Auditability is affected because reviewers need to reconstruct why the assistant behaved cautiously or refused. Interpretability is affected because the organisation must explain how policy-layer constraints shaped the answer. Dependability is affected because repeated over-refusal or inconsistent safety behaviour undermines reliable workflow support. Traceability is affected because policy-conditioned behaviour must be linked to the specific run and its artefacts.
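The evidence fields listed above can be sketched as a simple record type. This is a minimal illustration, not a RAIDT-defined schema: the class name `RunRecord`, the field names, the `REQUIRED_FIELDS` set, and all sample values are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """One run's evidence fields, following the list in the paragraph above."""
    run_id: str
    timestamp: str                      # ISO 8601 string
    user_role: str
    prompt_id: str
    model_id: str                       # model/provider/version identifier
    alignment_policy_id: Optional[str]  # None when the policy layer was not captured
    output: str
    retrieval_inputs: List[str] = field(default_factory=list)
    reviewer_notes: str = ""


# Assumed set of fields a reviewer needs before behaviour can be reconstructed.
REQUIRED_FIELDS = ("run_id", "timestamp", "user_role", "prompt_id",
                   "model_id", "alignment_policy_id", "output")


def missing_evidence(record: RunRecord) -> List[str]:
    """Return the names of required fields that are empty or None."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


# Hypothetical finance run in which the policy layer was never logged.
record = RunRecord(
    run_id="run-0001",
    timestamp="2025-01-15T09:30:00",
    user_role="analyst",
    prompt_id="prompt-escalation-v3",
    model_id="example-provider/example-model/2025-01",
    alignment_policy_id=None,
    output="(draft text)",
)
print(missing_evidence(record))
```

A check of this kind only shows that a field is absent; whether the recorded policy was proportionate to the task remains a reviewer judgement.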
In governance-readiness terms, alignment policy ID improves the organisation's position because it allows behavioural variation to be diagnosed rather than merely observed. The evidence pack can show whether the run was governed by the expected safety policy, whether that policy was proportionate to the task, and whether changes in alignment settings created operational or accountability implications.
Detailed link to RAIDT
Alignment policy ID links to RAIDT in four ways.
First, it supports the RAIDT core idea that governance should rest on inspectable evidence from actual GenAI use, including the behavioural constraints that shaped the run.
Second, it links directly to the run because alignment is not treated only as a system property in the abstract; in RAIDT it is captured as a run condition that may affect what happened in one concrete use event.
Third, it strengthens the evidence pack and the score profile by documenting a hidden but materially relevant factor in response behaviour, refusal, and safety intervention.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by making behavioural policy changes visible across runs rather than leaving them as unexplained drift.
Alignment policy ID -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness
This chain matters because RAIDT treats behavioural governance as something that should be reconstructable. If the active alignment layer is known, reviewers can assess whether the run's conduct matched organisational expectations. If it is unknown, behavioural explanation remains partial and governance claims weaken.
Link to the five RAIDT pillars
Responsibility
Alignment policy ID supports Responsibility by clarifying which behavioural policy regime the organisation relied upon when a run was conducted and reviewed. This matters where an output was refused, softened, or redirected and someone must explain whether that behaviour reflected an approved governance setting.
Example evidence / implication:
- The run record shows that a named policy profile was approved for a particular task class or user role.
- Reviewer notes indicate whether the active alignment layer was appropriate for the organisational purpose.
Auditability
This item has a strong effect on Auditability because review of behavioural outcomes is weakened if the governing policy layer cannot be identified. When auditors examine why an answer was blocked or constrained, the alignment policy ID becomes part of the reconstruction logic.
Example evidence / implication:
- The evidence pack includes the policy-layer identifier alongside model, prompt, and output records.
- Reviewers can compare two runs and see whether a policy change explains a difference in refusal or safety behaviour.
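The two-run comparison described above can be sketched as a small function. It assumes run records are held as dictionaries with hypothetical keys `model_id`, `prompt_id`, and `alignment_policy_id`; the sample values are invented for illustration.

```python
from typing import Dict, List

# Conditions compared between runs (an assumed, deliberately short list).
COMPARED_FIELDS = ("model_id", "prompt_id", "alignment_policy_id")


def differing_conditions(run_a: Dict[str, str], run_b: Dict[str, str]) -> List[str]:
    """Return the names of compared fields whose values differ between two runs."""
    return [name for name in COMPARED_FIELDS if run_a.get(name) != run_b.get(name)]


# Same model and prompt a week apart, but a different policy pack was active.
week_1 = {"run_id": "run-101", "model_id": "model-x-2025-01",
          "prompt_id": "enquiry-template-v2", "alignment_policy_id": "policy-pack-a"}
week_2 = {"run_id": "run-187", "model_id": "model-x-2025-01",
          "prompt_id": "enquiry-template-v2", "alignment_policy_id": "policy-pack-b"}

print(differing_conditions(week_1, week_2))
```

If only the policy field differs, a reviewer has a candidate explanation for a change in refusal behaviour; the comparison does not by itself prove that the policy change caused it.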
Interpretability
Alignment policy ID supports Interpretability by improving explanation of behavioural conduct even when internal model reasoning remains opaque. It does not reveal the full internal mechanism, but it helps explain the governance layer that framed permissible response behaviour.
Example evidence / implication:
- The run record shows that a stricter safety profile was active during a refusal-heavy session.
- Analysts can distinguish between prompt-induced behaviour and policy-induced behaviour when interpreting outputs.
Dependability
This item supports Dependability because stable and appropriate behaviour across runs depends partly on consistent policy-layer control. Untracked alignment changes can create unexpected shifts in usefulness, caution, or refusal patterns.
Example evidence / implication:
- A platform team can identify whether recurring over-refusal is linked to one policy release rather than to the base model itself.
- Repeated monitoring can test whether a policy configuration remains fit for task conditions over time.
Traceability
Alignment policy ID is especially important for Traceability because it links one run to the specific behavioural-governance regime under which it occurred. This makes policy-conditioned conduct visible in the evidential chain.
Example evidence / implication:
- The run record links output behaviour to a named alignment or safety policy version.
- Investigators can trace when a policy change entered operational use and which runs it affected.
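Tracing when a policy entered use and which runs it affected can be sketched as a grouping step over run records. The record keys, the function name `policy_timeline`, and the sample data are hypothetical; timestamps are assumed to be ISO 8601 strings without a timezone suffix.

```python
from datetime import datetime
from typing import Dict, Iterable


def policy_timeline(records: Iterable[Dict[str, str]]) -> Dict[str, dict]:
    """Group runs by policy ID and record the earliest timestamp seen per policy."""
    timeline: Dict[str, dict] = {}
    for rec in records:
        seen = datetime.fromisoformat(rec["timestamp"])
        entry = timeline.setdefault(
            rec["alignment_policy_id"], {"first_seen": seen, "run_ids": []}
        )
        entry["first_seen"] = min(entry["first_seen"], seen)
        entry["run_ids"].append(rec["run_id"])
    return timeline


runs = [
    {"run_id": "run-1", "timestamp": "2025-01-13T10:00:00", "alignment_policy_id": "policy-a"},
    {"run_id": "run-2", "timestamp": "2025-01-14T11:30:00", "alignment_policy_id": "policy-a"},
    {"run_id": "run-3", "timestamp": "2025-01-20T08:00:00", "alignment_policy_id": "policy-b"},
    {"run_id": "run-4", "timestamp": "2025-01-21T15:45:00", "alignment_policy_id": "policy-b"},
]

for policy_id, entry in policy_timeline(runs).items():
    print(policy_id, entry["first_seen"].isoformat(), entry["run_ids"])
```

The first-seen time is only as good as the logged runs: it shows when a policy first appears in the evidence, which may be later than when it was actually deployed.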
This item affects all five pillars, but its strongest effects are on Auditability, Traceability, and Responsibility because those pillars depend directly on being able to reconstruct and justify behavioural control conditions.
Why this item is more than a generic concept
In general AI governance, alignment may be discussed as a broad aspiration: making systems safer, more helpful, or more consistent with human preferences. In RAIDT, alignment policy ID has a narrower and more operational meaning. It is the recorded identifier of the specific behavioural-governance layer active in one run.
The RAIDT meaning is more operational because it is tied to evidence architecture, run reconstruction, evidence-pack assembly, and five-pillar scoring. It therefore shifts alignment from a high-level normative claim to an inspectable evidential field. What matters is not simply whether an organisation says a system is aligned, but whether a reviewer can identify which policy layer shaped a particular output and assess whether that was appropriate.
Common misunderstanding
Misunderstanding
Alignment policy ID is just another name for the model version.
Correction
The model version and the alignment policy ID may be related, but they are not the same thing. A model identifier tells us which underlying model instance or release was used. An alignment policy ID tells us which behavioural-governance layer shaped how that model responded in the run. For example, the same base model might operate under different safety profiles for internal research, public-facing support, or high-risk domain use. If a system refuses a prompt in one context but answers it in another, the explanation may lie in the active alignment policy rather than in a different model altogether. RAIDT therefore records the policy layer separately so that behavioural differences are not misattributed.
Boundary and limitation
Alignment policy ID does not prove that a run was safe, appropriate, lawful, or substantively correct. It only identifies the behavioural-governance layer that was active. A well-recorded policy ID cannot by itself show that the policy was well designed, correctly configured, consistently enforced, or suitable for the context of use.
The item also depends on implementation visibility. In some environments, providers do not expose policy-layer identifiers clearly, or policy changes may occur behind managed services with limited transparency. In such cases, organisations may need to record the nearest available proxy, such as a wrapper policy profile, deployment configuration, or provider release label. RAIDT handles this limitation by treating the item as part of a broader evidence architecture. Where exact identifiers are unavailable, the framework still asks for the most specific reconstructable evidence and for clear acknowledgement of uncertainty.
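Recording the nearest available proxy, together with an explicit note of uncertainty, might look like the sketch below. The type `PolicyEvidence`, the source labels, and in particular the preference order among proxies are assumptions made for illustration; the text above names the proxy types but does not rank them.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed preference order among the proxies named in the text.
PROXY_PREFERENCE = ("wrapper_policy_profile", "deployment_configuration",
                    "provider_release_label")


@dataclass(frozen=True)
class PolicyEvidence:
    value: Optional[str]   # the identifier or proxy value, None if nothing is known
    kind: str              # "exact", "proxy", or "unknown"
    source: str            # where the value came from
    uncertainty_note: str  # plain-language acknowledgement of what is not known


def resolve_policy_evidence(exact_id: Optional[str],
                            proxies: Dict[str, str]) -> PolicyEvidence:
    """Prefer an exact policy ID; otherwise fall back to the first available proxy."""
    if exact_id:
        return PolicyEvidence(exact_id, "exact", "provider_policy_id", "")
    for source in PROXY_PREFERENCE:
        if proxies.get(source):
            return PolicyEvidence(
                proxies[source], "proxy", source,
                "Exact policy-layer identifier not exposed; proxy recorded.")
    return PolicyEvidence(None, "unknown", "none",
                          "No policy-layer identifier or proxy was available.")


evidence = resolve_policy_evidence(
    None,
    {"provider_release_label": "release-2025-02",
     "deployment_configuration": "deploy-eu-1"},
)
print(evidence.kind, evidence.source, evidence.value)
```

Keeping `kind` and `uncertainty_note` alongside the value lets a reviewer see at a glance how much weight the identifier can bear.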
Implementation levels
Manual implementation
A researcher or small team can apply this item manually by recording, for each important run, the named safety or behavioural policy profile that was selected or believed to be active. This may be captured in a structured note, template, or spreadsheet alongside the run ID, prompt, model, and reviewer observations.
Semi-automated implementation
Semi-automated implementation can use templates, wrappers, or form-based tooling that automatically attach a policy-profile label when a user selects a task mode such as public-facing support, internal analysis, or restricted-domain drafting. Reviewers can then confirm whether the selected alignment setting matched the task.
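A wrapper that attaches a policy-profile label from the selected task mode, leaving confirmation to a reviewer, could be sketched as follows. The profile labels in `TASK_MODE_PROFILES`, the function names, and the record keys are hypothetical.

```python
from typing import Dict

# Hypothetical mapping from the task modes named above to policy-profile labels.
TASK_MODE_PROFILES = {
    "public-facing support": "profile-public-v1",
    "internal analysis": "profile-internal-v1",
    "restricted-domain drafting": "profile-restricted-v1",
}


def attach_policy_profile(run: Dict[str, object], task_mode: str) -> Dict[str, object]:
    """Return a copy of the run record labelled with the profile for the task mode."""
    if task_mode not in TASK_MODE_PROFILES:
        raise ValueError(f"Unknown task mode: {task_mode!r}")
    labelled = dict(run)
    labelled["task_mode"] = task_mode
    labelled["alignment_policy_id"] = TASK_MODE_PROFILES[task_mode]
    labelled["policy_confirmed_by_reviewer"] = False  # pending human review
    return labelled


def confirm_policy(run: Dict[str, object], reviewer: str,
                   matches_task: bool) -> Dict[str, object]:
    """Record the reviewer's judgement on whether the profile suited the task."""
    confirmed = dict(run)
    confirmed["reviewer"] = reviewer
    confirmed["policy_confirmed_by_reviewer"] = matches_task
    return confirmed


draft = attach_policy_profile({"run_id": "run-9"}, "internal analysis")
reviewed = confirm_policy(draft, reviewer="compliance-lead", matches_task=True)
print(reviewed)
```

Note that the label records which profile was selected, not which one the provider actually enforced; where those can diverge, the record should say so.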
Fully automated implementation
At scale, a platform, orchestration layer, or governance pipeline can automatically log the active policy-layer identifier, moderation configuration, refusal profile, escalation rules, and any policy-release metadata as part of the run record. These fields can feed dashboards, evidence packs, change-monitoring workflows, and alerts when behavioural policy changes affect governance-critical use cases.
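The alerting step can be sketched as a pass over logged runs that flags a policy change on any governance-critical task. The key names (`task_label`, `alignment_policy_id`), the function name, and the sample task labels are assumptions; records are assumed to arrive in chronological order.

```python
from typing import Dict, Iterable, List, Set


def policy_change_alerts(records: Iterable[Dict[str, str]],
                         critical_tasks: Set[str]) -> List[Dict[str, str]]:
    """Flag runs where a critical task's policy differs from its previous run."""
    last_policy: Dict[str, str] = {}
    alerts: List[Dict[str, str]] = []
    for rec in records:  # assumed chronological
        task = rec["task_label"]
        policy = rec["alignment_policy_id"]
        previous = last_policy.get(task)
        if previous is not None and previous != policy and task in critical_tasks:
            alerts.append({
                "run_id": rec["run_id"],
                "task_label": task,
                "previous_policy": previous,
                "current_policy": policy,
            })
        last_policy[task] = policy
    return alerts


log = [
    {"run_id": "run-1", "task_label": "compliance-escalation", "alignment_policy_id": "policy-a"},
    {"run_id": "run-2", "task_label": "marketing-draft", "alignment_policy_id": "policy-a"},
    {"run_id": "run-3", "task_label": "compliance-escalation", "alignment_policy_id": "policy-b"},
    {"run_id": "run-4", "task_label": "marketing-draft", "alignment_policy_id": "policy-b"},
]

for alert in policy_change_alerts(log, critical_tasks={"compliance-escalation"}):
    print(alert)
```

Only the critical task raises an alert here; the same policy change on the non-critical task is logged but not escalated.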
Practical use in the RAIDT project
Within the RAIDT project, this item is useful for Paper 08 Foundations because it clarifies that governance-relevant behaviour is shaped not only by the model and prompt, but also by the active policy layer that mediates safety and preference alignment in practice. It strengthens the conceptual claim that run-level evidence must include the artefacts needed to explain behavioural outcomes, not just the artefacts that are easiest to log.
For Paper 09 Empirical Validation, alignment policy ID provides a testable field for examining whether behavioural variation between runs can be explained more accurately when policy-layer evidence is captured. It supports comparative analysis of refusals, caution levels, and task fitness across settings. For Paper 10 Policy Pathways, it helps translate abstract discussion of alignment into operational governance controls that organisations can document, review, and improve.
The item also has value for sector playbooks, evidence-pack design, scoring rubrics, and governance interventions. In a supervision meeting or viva, it helps answer a pointed question: how does RAIDT deal with the fact that response behaviour is often governed by hidden safety or preference layers? The answer is that RAIDT treats those layers as inspectable run artefacts wherever possible, rather than leaving them outside the evidential frame.
Key audience questions to prepare for
Q1. Why is alignment policy ID needed if the output itself already shows whether the system was cautious or restrictive?
Because surface behaviour alone does not reliably explain cause. The same cautious output could result from prompt wording, retrieval context, user permissions, or a stricter policy layer. Recording the alignment policy ID reduces guesswork and supports defensible review.
Q2. What if the provider does not expose a clear alignment policy identifier?
The organisation should record the closest available reconstructable proxy, such as a deployment profile, moderation configuration, wrapper safety mode, or documented provider release condition. RAIDT values evidential clarity, including transparent acknowledgement of what remains unknown.
Q3. Is this item mainly relevant for refusals and safety blocks?
Those are major use cases, but the item also matters for tone, caution, ranking of acceptable answers, escalation behaviour, and other policy-conditioned response features. It is relevant wherever behavioural governance shapes outputs in ways that matter organisationally.
Q4. Does recording an alignment policy ID mean the organisation fully understands the model's internal alignment mechanism?
No. The item supports governance-level explanation, not full mechanistic interpretability. It identifies the active behavioural policy layer so that reviewers can reconstruct conditions of use more accurately, even if deeper internal processes remain opaque.
Q5. What makes this concept distinctive in RAIDT?
RAIDT makes alignment policy operational at the level of one run. Instead of discussing alignment only as a principle or provider claim, the framework asks whether the active behavioural policy layer can be evidenced, included in an evidence pack, and used to justify governance judgements.
Suggested citation concepts to support this item
- alignment layers in generative AI governance
- RLHF and DPO implications for governance documentation
- safety policy versioning in large language model deployments
- moderation profiles and behavioural control in enterprise GenAI systems
- auditability of refusal behaviour in AI assistants
- run-level traceability for policy-conditioned AI outputs
- operational documentation of safety configurations in AI systems
- behavioural drift after policy updates in generative AI platforms
- governance implications of hidden alignment layers in LLM services
- evidence-based review of safety interventions in organisational AI use
Short explanation for presentation
Alignment policy ID is the run-level record of which safety or behavioural policy layer shaped a GenAI response. In RAIDT, this matters because behaviour is not determined only by the base model or the prompt. Refusal style, caution level, escalation behaviour, and other response boundaries may be governed by policy layers that are easy to overlook unless they are explicitly captured. By recording the active alignment policy ID, RAIDT makes those behavioural constraints inspectable within the evidence pack for one run. That improves reviewability, supports more defensible scoring across the five pillars, and helps organisations explain why two apparently similar runs behaved differently. In short, the item turns alignment from a vague governance claim into a concrete evidential field tied to governance readiness.
One-line takeaway
Alignment policy ID is the recorded identifier of the behavioural-governance layer active in a run; RAIDT records it so that response conduct and safety intervention can be reconstructed as evidence.
Related items in evidence architecture and artefacts
- S4.01 - run_id
- S4.02 - Timestamp
- S4.03 - User role / operator role
- S4.04 - Task and domain label
- S4.05 - Prompt registry
- S4.06 - Prompt ID and version
- S4.07 - Prompt hash
- S4.08 - Model/provider/version identifier
- S4.09 - Decoding parameters
- S4.10 - Retrieval query and index ID
- S4.11 - Retrieved document IDs and hashes
- S4.12 - Tool-chain trace
- S4.13 - Adapter ID / PEFT lineage
- S4.15 - Output hash
- S4.16 - Review decision and reviewer notes