S6.12 - RLHF-type_DPO_controls

S6.12 - RLHF-type / DPO controls

flowchart LR
    A["Background problem:<br/>aligned behaviour can hide<br/>reward provenance and tuning history"] --> B["RAIDT:<br/>run-level evidence framework"]
    B --> C[["RLHF-type / DPO controls:<br/>preference-based behavioural influence"]]
    H["Practical fields:<br/>healthcare, finance, public services,<br/>enterprise copilots"] --> C
    I["Operational metadata:<br/>policy version, rater protocol,<br/>checkpoint lineage, preference-data reference"] --> C
    C --> D["Evidence pack:<br/>document active control and provenance"]
    C --> E["RAIDT score profile:<br/>Responsibility, Auditability,<br/>Interpretability, Dependability, Traceability"]
    D --> F["Reviewer reconstruction<br/>and contestability"]
    E --> G["Governance readiness<br/>and organisational learning"]

Star S6 - Influence Methods as Governance Interventions

Star context: Positions prompting, RAG, PEFT/LoRA, RLHF/DPO and stacked influence as components that shape governance evidence, not as the project core. In RAIDT, these are governance-relevant influence methods only when their effects can be evidenced, reviewed, and contested at run level.

Academic picture

Definition / background

RLHF-type and DPO controls are post-training or policy-shaping mechanisms that use human preferences, preference-labelled comparisons, reward modelling, or closely related optimisation routines to steer how a generative AI system behaves. Reinforcement Learning from Human Feedback (RLHF) typically uses human judgements to train a reward model and then optimises the system's behaviour against that reward signal. Direct Preference Optimisation (DPO) instead optimises the model directly on preference comparisons, without a separately trained reward model or the same online reinforcement loop. In practical governance discussions, both are often treated as families of preference-based behavioural controls.
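To make the contrast concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. The function name `dpo_pair_loss` and the default `beta=0.1` are illustrative assumptions, not part of RAIDT or of any particular training library. The inputs are assumed to be the total log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being tuned and under a frozen reference model.

```python
import math


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) preference pair (hypothetical helper)."""
    # How much more the tuned policy favours each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), computed as softplus(-logit) for numerical stability.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls as the policy shifts probability towards the chosen response relative to the reference. From a governance perspective, the point of the sketch is that the preference pairs, the reference checkpoint, and `beta` are all inputs whose provenance can be recorded.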

Within RAIDT, the main issue is not the algorithmic detail alone but the governance status of the control. A model that has been shaped by RLHF-type or DPO-style optimisation may appear more helpful, safer, more cautious, or more policy-compliant, yet those behavioural gains do not automatically create auditability. If the provenance of the preference data, reward logic, rater protocol, or tuned checkpoint is hidden, then a run may be easier to defend rhetorically than to reconstruct evidentially.

This item belongs in RAIDT because RAIDT is concerned with governable use, not merely improved outputs. At run level, an organisation needs to know whether a response was influenced by a preference-tuned checkpoint, an added policy layer, a vendor safety overlay, or another post-training control. That information affects the run-level evidence pack and may alter how the run is scored across Responsibility, Auditability, Interpretability, Dependability, and Traceability.

RLHF-type or DPO controls therefore differ from prompting and RAG in an important way. Prompting shapes the immediate instruction context for a run; RAG shapes the information context; PEFT or LoRA may adapt the model efficiently for a task; RLHF-type or DPO controls shape the behavioural preference structure that governs how the system tends to act. In RAIDT, that makes them a governance intervention only when the active control can be evidenced and related to the specific run under review.

Why this concept matters

This concept matters because organisations can easily confuse better-behaved output with better-governed output. A model that appears more responsible after RLHF or DPO may indeed reduce certain harmful responses, but if nobody can identify which policy layer, preference dataset, or tuned checkpoint produced that behaviour, the organisation has improved conduct without securing reviewability.

The concept also avoids a common governance error: treating alignment claims as sufficient evidence. In many operational settings, teams inherit a vendor model or internal checkpoint that has already been "safety tuned". Without structured logging, reviewers cannot tell whether a contested answer came from the prompt, the retrieved evidence, the preference-tuned control layer, or a later policy overlay. That ambiguity weakens assurance, incident analysis, and organisational learning.

For RAIDT, the value of the concept is that it converts a broad alignment narrative into a governable run-level question: what behavioural control was active for this run, how was it produced, and what evidence exists to justify trust in it? This helps organisations move from general principles to operational governance, because the control becomes inspectable, comparable, and contestable inside the evidence pack and score profile.

Key idea: RLHF-type or DPO controls matter in RAIDT because behavioural alignment only becomes governance-ready when the active control and its provenance are visible at run level.

What this item controls

The behavioural preferences a model is encouraged to follow, such as caution, refusal style, tone, deference to policy, or prioritisation of safety over completeness.
The active preference or policy layer that may sit behind a run, including a tuned checkpoint, reward-shaped model variant, or vendor safety configuration.
The provenance fields needed to evaluate that control, such as preference-data source, rater protocol, reward-function reference, and checkpoint lineage.
The distinction between a model that is merely said to be aligned and a model whose alignment-related controls can be evidenced and reviewed.
The interaction between behavioural controls and other influence methods, including prompts, retrieval layers, adapters, and stacked governance interventions.
The conditions under which a run can be reconstructed after a challenge, incident, or audit request.
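One way to picture the items above is as a small record attached to each run. The sketch below is a hypothetical schema, not a RAIDT-defined format: the class names, field names, and all example values are placeholders chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferenceControlRecord:
    """Provenance of one preference-based control (hypothetical schema)."""
    control_type: str  # e.g. "rlhf", "dpo", "vendor_safety_overlay"
    policy_version: str
    preference_data_ref: Optional[str] = None
    rater_protocol_ref: Optional[str] = None
    reward_function_ref: Optional[str] = None
    checkpoint_lineage: tuple[str, ...] = ()  # oldest checkpoint first


@dataclass(frozen=True)
class RunRecord:
    """Links a single run to the behavioural control that was active for it."""
    run_id: str
    base_model_id: str
    active_control: Optional[PreferenceControlRecord] = None
    other_influences: tuple[str, ...] = ()  # e.g. prompts, retrieval, adapters


# Placeholder values only; none of these identifiers refer to real artefacts.
control = PreferenceControlRecord(
    control_type="dpo",
    policy_version="policy-v3",
    preference_data_ref="pref-set-A",
    rater_protocol_ref="rater-protocol-2",
    checkpoint_lineage=("base", "sft-1", "dpo-1"),
)
run = RunRecord(
    run_id="run-0001",
    base_model_id="base",
    active_control=control,
    other_influences=("system_prompt", "rag"),
)
```

The records are frozen so that a run's evidence cannot be silently altered after the fact; a correction would have to be recorded as a new record rather than an in-place edit.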

Practical example / likely audience question

Audience question

If a model has already been improved through RLHF or DPO, why does RAIDT still need to log that control at the level of the individual run?

Answer

The concern behind this question is the assumption that once alignment has happened upstream, governance has effectively been solved downstream. RAIDT rejects that assumption. RLHF-type or DPO controls shape behaviour, but they do not remove the need to identify which control was active when a specific output was produced.

A practical example is a compliance assistant that has been tuned to avoid unsupported legal or regulatory claims. If a reviewer later finds that the assistant became overly conservative and withheld a relevant answer, the organisation needs to know whether that behaviour came from the user prompt, the retrieval context, the base model, or the preference-tuned safety layer. Without run-level evidence, the team can only guess.

RAIDT handles this better than a generic AI governance approach because it asks for operational evidence rather than assurance language. Instead of recording only that the system is "aligned", RAIDT would expect evidence such as the policy version, preference dataset or reward-function reference, reviewer or rater protocol, checkpoint lineage, approval status, and the run context in which the control was active. That makes the behaviour reviewable rather than merely asserted.
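The difference between an asserted and an evidenced control can be sketched as a simple completeness check. The field names below mirror the evidence listed in the preceding paragraph, but the names themselves, the helper `missing_evidence`, and the example values are assumptions made for illustration; RAIDT may structure this differently.

```python
from typing import Any, Mapping

# Hypothetical field names, one per evidence item named in the text above.
REQUIRED_EVIDENCE_FIELDS = (
    "policy_version",
    "preference_or_reward_ref",
    "rater_protocol",
    "checkpoint_lineage",
    "approval_status",
    "run_context",
)


def missing_evidence(record: Mapping[str, Any]) -> list[str]:
    """Return the required fields that are absent or empty in a run record."""
    return [name for name in REQUIRED_EVIDENCE_FIELDS if not record.get(name)]


# A record that only asserts alignment provides none of the required evidence.
asserted_only = {"aligned": True}

# A record with placeholder values for every required field.
documented = {
    "policy_version": "policy-v3",
    "preference_or_reward_ref": "pref-set-A",
    "rater_protocol": "rater-protocol-2",
    "checkpoint_lineage": ["base", "sft-1", "dpo-1"],
    "approval_status": "approved",
    "run_context": {"run_id": "run-0001", "use_case": "compliance assistant"},
}

print(missing_evidence(asserted_only))
print(missing_evidence(documented))
```

A check of this kind says nothing about whether the referenced artefacts are of good quality; it only flags runs where a reviewer would have nothing to reconstruct from.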

Practical example in RAIDT terms

Consider a healthcare organisation using a generative AI assistant to draft patient-facing discharge instructions from clinician notes.

The run-level issue is that the deployed model uses a DPO-style preference-tuned checkpoint designed to favour caution, explicit uncertainty, and escalation language when the source notes appear incomplete. That may be clinically prudent, but it may also cause the assistant to omit useful practical guidance or over-refer routine issues back to staff.

In RAIDT terms, the evidence needed for that run would include the base model identifier, the active DPO-tuned checkpoint or policy-layer version, a summary reference for the preference data used, the rater or reviewer protocol that defined "better" answers, any checkpoint lineage linking the model to prior versions, and the governance approval status for that control. The evidence pack would also benefit from the prompt, retrieved notes, clinician edits, and any override or escalation taken by the user.

The most affected RAIDT pillars would be Responsibility, Auditability, Dependability, and Traceability, with Interpretability also relevant. Governance readiness improves because reviewers can distinguish whether a problematic discharge instruction arose from poor source notes, weak prompting, retrieval failure, or the preference-based behavioural control itself.

Detailed link to RAIDT

The RLHF-type / DPO controls item links to RAIDT in four ways.

First, it connects to RAIDT's core idea that governance should be grounded in evidence about actual system use rather than in abstract claims about model quality.
Second, it links directly to the run because a run may be materially shaped by a preference-tuned checkpoint, reward-informed policy layer, or vendor alignment overlay that changes response behaviour.
Third, it influences both the evidence pack and the score profile, because undocumented behavioural controls weaken the quality of evidence and can depress confidence across multiple RAIDT pillars.
Fourth, it supports reviewability, contestability, audit readiness, and organisational learning by making it possible to reconstruct how and why a model behaved in a given way.

RLHF-type / DPO controls → Run-level evidence → Evidence pack → RAIDT score profile → Governance readiness

When the control is visible, RAIDT can assess it. When it is hidden, the organisation inherits behavioural shaping without adequate governance visibility.

Link to the five RAIDT pillars

Responsibility

RLHF-type or DPO controls often aim to make outputs safer, more compliant, or more socially acceptable, so they have a direct relationship to responsibility. However, responsibility is stronger when the organisation can explain what values or preferences were encoded and who authorised them.