S4.14 - Alignment policy ID

flowchart LR
    A["Problem: behavioural differences are often unexplained<br/>if policy layers remain hidden"] --> B["RAIDT<br/>run-level evidence framework"]
    H["Practical artefacts<br/>model version, safety profile, moderation settings,<br/>wrapper rules, reviewer notes, timestamps"] --> C[["Alignment policy ID<br/>active behavioural-governance layer"]]
    B --> C
    C --> D["Evidence pack"]
    C --> E["RAIDT score profile"]
    D --> F["Reviewer reconstruction<br/>and contestability"]
    E --> G["Governance readiness<br/>and organisational learning"]


Star context: Specifies the concrete fields and artefacts that make a run record inspectable. In this star, the alignment policy ID identifies which behavioural governance layer was active during a run, so that response conduct, refusal style, safety intervention, and policy-conditioned behaviour can be reviewed as evidence rather than inferred after the fact.

Academic picture

Definition / background

The alignment policy ID records the specific safety, preference, or behavioural-governance layer that shaped a model's conduct in a given run. In practice, this may point to an RLHF-influenced policy release, a DPO-style preference layer, a provider safety profile, a moderation policy pack, or an institutional wrapper that constrains outputs through refusal rules, escalation thresholds, or domain-specific safeguards. The item does not merely say that a model was aligned; it identifies which alignment regime was active when the output was produced.
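
As a minimal sketch of how such an identifier might be represented, the Python below pairs the kind of layer with a name and a version. The class names, the `kind:name@version` string format, and the example values are assumptions made for illustration; RAIDT as described here does not prescribe a schema.

```python
from dataclasses import dataclass
from enum import Enum


class PolicyLayerKind(Enum):
    """Kinds of behavioural-governance layer mentioned in the definition above."""
    RLHF_POLICY_RELEASE = "rlhf_policy_release"
    PREFERENCE_LAYER = "preference_layer"  # e.g. a DPO-style layer
    PROVIDER_SAFETY_PROFILE = "provider_safety_profile"
    MODERATION_POLICY_PACK = "moderation_policy_pack"
    INSTITUTIONAL_WRAPPER = "institutional_wrapper"


@dataclass(frozen=True)
class AlignmentPolicyId:
    """Hypothetical identifier: which layer was active, and in which version."""
    kind: PolicyLayerKind
    name: str      # illustrative label chosen by the organisation
    version: str   # illustrative version string

    def as_field(self) -> str:
        # Single string suitable for storing in a run record.
        return f"{self.kind.value}:{self.name}@{self.version}"


# Hypothetical example values.
policy = AlignmentPolicyId(PolicyLayerKind.MODERATION_POLICY_PACK, "citizen-enquiries", "2.1")
print(policy.as_field())
```

Keeping the kind, name, and version separate lets a reviewer see both which regime was active and which release of it applied, which is the point of the final sentence of the definition.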

Conceptually, this item emerges from a practical difficulty in generative AI governance: response behaviour is often partly determined by layers that sit between raw model capability and the final observed answer. A model/provider/version identifier tells us which model family was invoked. An adapter ID or PEFT lineage tells us whether a fine-tuned derivative influenced behaviour. A prompt ID tells us what instructions were supplied. The alignment policy ID is different. It names the active behavioural policy layer that may govern refusal, compliance, tone, warning behaviour, protected-topic handling, and the boundary between helpfulness and restraint.

Within RAIDT, this belongs inside Evidence Architecture and Artefacts because it is a concrete field needed to make a run record inspectable. If a reviewer cannot identify the active alignment policy layer, then a material cause of behavioural variation remains hidden. That weakens run reconstruction and reduces confidence in governance judgements about why an output was produced, blocked, reframed, or escalated.

The item also matters because RAIDT is built around run-level evidence, evidence packs, and five-pillar scoring. Alignment policy ID contributes to the evidence pack by recording one of the conditions under which the run occurred. It contributes to the score profile because the adequacy of behavioural controls, traceability of safety intervention, and reviewability of refusals or constrained outputs all depend in part on whether the active policy layer can be identified and interpreted.

Why this concept matters

This concept solves a recurring governance problem: organisations often observe differences in output behaviour without being able to explain whether those differences arose from the prompt, the model, retrieval context, user role, or an unseen alignment layer. Without an explicit alignment policy ID, behavioural governance remains partly opaque. Reviewers may know that a system responded cautiously or refused a request, but they cannot reliably determine whether that conduct reflected the expected policy baseline, an updated safety regime, or an unintended policy mismatch.

The item also prevents a common confusion between model identity and behavioural policy identity. Two runs can use the same model version yet behave differently because different alignment policies, moderation settings, or wrapper-level safety rules were active. Conversely, a provider may update a policy layer without any change to the model name that end users can see. Recording the alignment policy ID helps organisations avoid attributing behaviour to the wrong cause.

If this field is missing, several risks arise: weak accountability for refusal or safety decisions, poor comparison between runs, difficulty explaining changes after platform updates, and limited ability to contest or defend behaviour during review. In organisational settings, these gaps matter because stakeholders often ask not only what the system produced, but why it behaved in that way under that set of conditions.

RAIDT uses this concept to move from principles to operational governance. Rather than simply asserting that a system is aligned or safe, RAIDT asks whether the active behavioural policy layer for a specific run can be evidenced, reviewed, and linked to governance outcomes.

Key idea: Alignment policy ID matters because it makes the active behavioural-governance layer of a GenAI run inspectable, allowing response conduct and safety intervention to be reviewed as evidence rather than guessed from outputs alone.

What this item captures

The identifier or version label of the active alignment, safety, or preference-governance layer in a run.
The behavioural regime that may shape refusal patterns, warning behaviour, tone, escalation, and compliance boundaries.
A distinction between model identity and policy-layer identity when the same model can operate under different behavioural constraints.
Evidence needed to explain why similar prompts may yield materially different outputs across runs.
A linkage point between provider-side controls and organisation-side governance review.
A traceable artefact for comparing policy changes, regressions, overrides, or region-specific policy configurations.
A practical basis for scoring reviewability, auditability, and traceability when behaviour is conditioned by hidden or semi-hidden controls.
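
The comparison of policy changes across runs mentioned in the list above could be sketched as follows. This is an illustration under stated assumptions: runs are plain dictionaries, the field names `timestamp` and `alignment_policy_id` are hypothetical, and timestamps are ISO 8601 strings in the same time zone so that they sort correctly as text.

```python
from typing import Dict, List, Tuple


def policy_changes(runs: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, old_policy, new_policy) for each change between consecutive runs."""
    ordered = sorted(runs, key=lambda r: r["timestamp"])
    changes = []
    for prev, curr in zip(ordered, ordered[1:]):
        if prev["alignment_policy_id"] != curr["alignment_policy_id"]:
            changes.append(
                (curr["timestamp"], prev["alignment_policy_id"], curr["alignment_policy_id"])
            )
    return changes


# Hypothetical run log, deliberately out of order.
runs = [
    {"timestamp": "2024-03-01T09:00:00Z", "alignment_policy_id": "policy-A"},
    {"timestamp": "2024-03-08T09:00:00Z", "alignment_policy_id": "policy-B"},
    {"timestamp": "2024-03-04T09:00:00Z", "alignment_policy_id": "policy-A"},
]
print(policy_changes(runs))
```

The timestamp returned is that of the first run observed under the new policy, not the moment the policy itself changed, which the run log alone cannot show.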

Practical example / likely audience question

Audience question

Why should RAIDT record an alignment policy ID if we already log the model version and prompt version?

Answer

The concern behind this question is that model and prompt identifiers may appear sufficient to explain behaviour. In many cases they are not. A model version identifies the underlying system, and a prompt version identifies the task instructions, but neither necessarily tells a reviewer which behavioural-governance layer was active when the response was generated. If refusal logic, safety thresholds, or preference tuning changed between runs, then the observed behaviour may not be fully explained by model and prompt data alone.

Consider a public-service drafting assistant used to help staff prepare responses to citizens' enquiries. Two staff members run the same prompt template against the same model family a week apart. In the first run, the system gives a direct procedural answer. In the second, it refuses to provide a similar answer and instead redirects the user to formal channels. If the organisation has logged only the model and prompt, reviewers may wrongly assume inconsistency or user error. If the alignment policy ID is also captured, they may discover that a new safety policy pack was activated after a provider update or an internal governance wrapper change.
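
The scenario above can be made concrete with a small comparison of two run records. The sketch below is illustrative only: the field names and values are hypothetical, and a real run record would carry more fields.

```python
from typing import Dict, List


def differing_fields(run_a: Dict[str, str], run_b: Dict[str, str]) -> List[str]:
    """List the field names whose values differ between two run records."""
    keys = sorted(set(run_a) | set(run_b))
    return [k for k in keys if run_a.get(k) != run_b.get(k)]


# Hypothetical records for the two runs, one week apart.
first_run = {
    "model_version": "model-x-1.0",
    "prompt_id": "enquiry-template-v3",
    "alignment_policy_id": "safety-pack-A",
}
second_run = {
    "model_version": "model-x-1.0",
    "prompt_id": "enquiry-template-v3",
    "alignment_policy_id": "safety-pack-B",
}

# With the policy ID logged, the difference is visible.
with_policy = differing_fields(first_run, second_run)

# With only model and prompt logged, the two runs look identical.
logged_only = ("model_version", "prompt_id")
without_policy = differing_fields(
    {k: first_run[k] for k in logged_only},
    {k: second_run[k] for k in logged_only},
)

print(with_policy)
print(without_policy)
```

When the policy ID is absent from the log, the comparison finds no difference at all, which is how reviewers end up attributing the change in behaviour to inconsistency or user error.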

RAIDT is better placed to explain this case than a governance approach that logs only model and prompt identifiers, because it treats behavioural policy as a run-level evidential condition rather than an abstract assumption. The framework does not ask reviewers to infer alignment from surface behaviour alone. It asks whether the specific policy layer that governed the run can be named, compared, and assessed alongside the rest of the evidence pack.

Practical example in RAIDT terms

Consider a finance setting in which an analyst uses a GenAI assistant to draft an internal explanation of suspicious transaction patterns for escalation to a compliance team. The use case is legitimate, but the run-level issue is that the assistant may refuse to discuss certain scenarios, produce heavily hedged text, or block content that appears to offer procedural guidance on financial crime patterns. The organisation therefore needs to know whether the behaviour reflected the intended compliance-aligned policy layer or an over-restrictive configuration that reduced operational usefulness.

The evidence needed includes the run ID, timestamp, user role, prompt ID, model/provider/version identifier, any retrieval inputs, the alignment policy ID, the generated output, and reviewer notes on whether the result was usable or inappropriately constrained. Responsibility is affected because a compliance lead may need to justify whether the system was configured appropriately for internal analytical work. Auditability is affected because reviewers need to reconstruct why the assistant behaved cautiously or refused. Interpretability is affected because the organisation must explain how policy-layer constraints shaped the answer. Dependability is affected because repeated over-refusal or inconsistent safety behaviour undermines reliable workflow support. Traceability is affected because policy-conditioned behaviour must be linked to the specific run and its artefacts.
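
The evidence fields listed above could be gathered into a single run record along the following lines. This is a sketch rather than a RAIDT-defined schema: the class, the field names, the choice of required fields, and the example values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical run record holding the evidence fields named in the text."""
    run_id: str
    timestamp: str
    user_role: str
    prompt_id: str
    model_id: str  # model/provider/version identifier
    output: str
    alignment_policy_id: Optional[str] = None
    retrieval_inputs: List[str] = field(default_factory=list)
    reviewer_notes: str = ""


# Assumed here: retrieval inputs and reviewer notes may legitimately be empty.
REQUIRED_FIELDS = (
    "run_id", "timestamp", "user_role", "prompt_id",
    "model_id", "alignment_policy_id", "output",
)


def missing_evidence(record: RunRecord) -> List[str]:
    """Names of required fields that are empty or None."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def to_evidence_entry(record: RunRecord) -> str:
    """Serialise the record as JSON for inclusion in an evidence pack."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example from the finance scenario.
record = RunRecord(
    run_id="run-0001",
    timestamp="2024-03-08T09:00:00Z",
    user_role="analyst",
    prompt_id="escalation-draft-v2",
    model_id="provider-x/model-x/1.0",
    output="Draft explanation of the flagged transaction pattern.",
)
print(missing_evidence(record))

record.alignment_policy_id = "compliance-policy-v4"
print(missing_evidence(record))
entry = to_evidence_entry(record)
```

Treating the alignment policy ID as a required field means its absence is flagged when the record is created, rather than discovered later during review.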

In governance-readiness terms, alignment policy ID improves the organisation's position because it allows behavioural variation to be diagnosed rather than merely observed. The evidence pack can show whether the run was governed by the expected safety policy, whether that policy was proportionate to the task, and whether changes in alignment settings created operational or accountability implications.

Detailed link to RAIDT

Alignment policy ID links to RAIDT in four ways.

First, it supports the RAIDT core idea that governance should rest on inspectable evidence from actual GenAI use, including the behavioural constraints that shaped the run.

Second, it links directly to the run because alignment is not treated only as a system property in the abstract; in RAIDT it is captured as a run condition that may affect what happened in one concrete use event.

Third, it strengthens the evidence pack and the score profile by documenting a hidden but materially relevant factor in response behaviour, refusal, and safety intervention.

Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by making behavioural policy changes visible across runs rather than leaving them as unexplained drift.

Alignment policy ID -> Run-level evidence -> Evidence pack -> RAIDT score profile -> Governance readiness

This chain matters because RAIDT treats behavioural governance as something that should be reconstructable. If the active alignment layer is known, reviewers can assess whether the run's conduct matched organisational expectations. If it is unknown, behavioural explanation remains partial and governance claims weaken.
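
The distinction drawn above between a known and an unknown alignment layer can be expressed as a simple review check. The sketch below is an assumption-laden illustration, not a RAIDT scoring rule: the status labels and the idea of a single expected baseline per deployment are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def policy_review_status(recorded: Optional[str], expected: str) -> str:
    """Classify a run by whether its recorded policy ID matches the expected baseline."""
    if not recorded:
        return "unknown"           # behavioural explanation remains partial
    if recorded == expected:
        return "matches_expected"  # conduct can be assessed against the baseline
    return "mismatch"              # a different policy layer governed the run


def summarise(recorded_ids: List[Optional[str]], expected: str) -> Dict[str, int]:
    """Count runs in each review status."""
    return dict(Counter(policy_review_status(r, expected) for r in recorded_ids))


# Hypothetical recorded policy IDs for three runs.
summary = summarise(["policy-A", "policy-B", None], expected="policy-A")
print(summary)
```

A high share of runs in the unknown category would indicate that governance claims about those runs rest on inference from outputs rather than on recorded evidence.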

Link to the five RAIDT pillars

Responsibility

Alignment policy ID supports Responsibility by clarifying which behavioural policy regime the organisation relied upon when a run was conducted and reviewed. This matters where an output was refused, softened, or redirected and someone must explain whether that behaviour reflected an approved governance setting.