Q249 — RLHF-type / DPO — definition, example, and why it matters in RAIDT

Appears in sources

workshop_dense_100#slide 72

Answer

RLHF-type / DPO controls are preference-based controls that shape how an LLM behaves, applied after baseline prompting or alongside other influence levers. In the papers, RLHF is the fuller stack: supervised fine-tuning, human preference collection, reward modelling, and policy optimisation against the learned reward. DPO is the leaner variant: it optimises the policy directly on pairs of preferred and dispreferred responses, without training an explicit reward model. In RAIDT terms, both are best understood as behavioural controls, not provenance controls: they are meant to improve how the system speaks, refuses, escalates, and signals risk.
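To make the "no explicit reward model" point concrete, the sketch below computes the standard DPO loss for a single preference pair from four sequence log-probabilities. It is a minimal illustration and is not taken from the RAIDT papers: the function name `dpo_pair_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for readability.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under a frozen reference model.
    Names and the default beta are illustrative assumptions.
    """
    # How much more the policy favours each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; no separate reward model is involved.
    z = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero, loss is log(2).
    print(dpo_pair_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has moved toward the chosen response: loss falls below log(2).
    print(dpo_pair_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss drops only when the policy raises the preferred response relative to the dispreferred one, measured against the reference model. The behaviour is learned straight from the pairs, which is also why the pairs themselves are the evidence an auditor would need.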

A concrete example from the corpus is high-stakes healthcare summarisation, where preference learning increases red-flag surfacing and safer wording; parallel finance and public-policy cases show similar gains in caution, neutrality, and regulation-aware phrasing.

This matters in RAIDT because the governance benefit is real but incomplete. RLHF/DPO can lift Responsibility and sometimes Interpretability, yet on their own they do not tell an auditor where the content came from or how a preference was formed. That is why the papers repeatedly recommend stacking them with RAG for provenance and with PEFT/LoRA for stable domain phrasing. The practical lesson is simple: use RLHF/DPO when behaviour needs steering, but judge the intervention by its score profile, document it in a run-level evidence pack, and keep the run as the unit of governance so that preference-based control does not become a new source of opacity.

Practical example

In cybersecurity, a team could apply DPO to an intrusion-narrative assistant so that its incident reports use disciplined, non-sensational wording and consistently surface uncertainty when evidence is incomplete. Analysts would compare candidate narratives and prefer outputs that tie severity claims to observable indicators rather than dramatic language.
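The analyst comparison described above could be captured as preference pairs along the following lines. This is a sketch under assumptions: the `prompt` / `chosen` / `rejected` layout follows a convention common in DPO tooling, while `PreferencePair`, `make_pair`, `analyst_id`, and `rationale` are hypothetical names introduced here, and the sample narratives are invented for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferencePair:
    prompt: str      # the incident context shown to the model
    chosen: str      # narrative the analyst preferred
    rejected: str    # narrative the analyst did not prefer
    analyst_id: str  # who made the judgement (hypothetical field)
    rationale: str   # why, in the analyst's words (hypothetical field)


def make_pair(prompt, candidate_a, candidate_b, preferred, analyst_id, rationale):
    """Turn one A/B judgement into a preference pair."""
    if preferred not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    if candidate_a == candidate_b:
        raise ValueError("candidates are identical; no preference signal")
    if preferred == "a":
        chosen, rejected = candidate_a, candidate_b
    else:
        chosen, rejected = candidate_b, candidate_a
    return PreferencePair(prompt, chosen, rejected, analyst_id, rationale)


if __name__ == "__main__":
    pair = make_pair(
        prompt="Summarise the alert on host web-01.",
        candidate_a="Catastrophic breach! Attackers own the network.",
        candidate_b=("Three failed logins, then one success from a new IP. "
                     "Severity: medium. Lateral movement is not confirmed."),
        preferred="b",
        analyst_id="analyst-07",
        rationale="Ties severity to observed indicators and states limits.",
    )
    print(asdict(pair))
```

Recording the rationale next to each pair is a design choice rather than a DPO requirement: training needs only the three text fields, but the rationale is what lets a later reviewer see how a preference was formed.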

This matters because an aligned narrative can improve operational Responsibility: it reduces overstatement, flags limits, and supports safer escalation decisions. But if the security team keeps only the polished final report and not the preference pairs, guidance notes, model version, and output hashes, post-incident review becomes weaker. RAIDT therefore treats the behavioural gain and the audit burden together: better conduct is useful, but only evidence-backed conduct is governance-ready.
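One way to keep those artefacts together is a small run-level record that stores the model version alongside SHA-256 digests of the guidance notes, each preference pair, and the final report. The sketch below is illustrative only: the function `build_evidence_record` and all field names are assumptions, not a schema defined in the RAIDT papers.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_evidence_record(run_id, model_version, guidance_notes,
                          preference_pairs, final_report):
    """Assemble a run-level evidence record (field names are hypothetical).

    preference_pairs is a list of dicts, e.g. with the keys
    'prompt', 'chosen', and 'rejected'.
    """
    # Serialise each pair with sorted keys so the digest is reproducible.
    pair_hashes = [
        sha256_hex(json.dumps(pair, sort_keys=True, ensure_ascii=False))
        for pair in preference_pairs
    ]
    return {
        "run_id": run_id,
        "model_version": model_version,
        "guidance_notes_sha256": sha256_hex(guidance_notes),
        "preference_pair_sha256": pair_hashes,
        "output_sha256": sha256_hex(final_report),
    }


if __name__ == "__main__":
    record = build_evidence_record(
        run_id="run-001",
        model_version="intrusion-narrator-dpo-v1",
        guidance_notes="Prefer wording tied to observable indicators.",
        preference_pairs=[{"prompt": "p", "chosen": "c", "rejected": "r"}],
        final_report="Severity: medium. Lateral movement is not confirmed.",
    )
    print(json.dumps(record, indent=2))
```

Digests let a post-incident reviewer confirm that the stored pairs, notes, and report are the ones actually used in the run. The underlying artefacts still need to be retained, because a digest alone cannot be read back.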

Sources in RAIDT papers

07-RAIDT_RLHF_V1
06-RAIDT_RAG_V1
05-RAIDT_LoRA_V2