Q074 — What makes RLHF-type controls or DPO acceptable as governance interventions?

← RAIDT · Star S6 - Influence Methods as Governance Interventions · primary item: S6.12 · RLHF-type / DPO controls

Preference-based alignment helps only when the preference process itself is documented as evidence.

Answer

RLHF-type controls or DPO become acceptable in RAIDT when they are governed as influence methods, that is, as governance interventions, rather than treated as hidden optimisation tricks. Across the papers, RLHF is presented as supervised fine-tuning followed by human preference collection, reward modelling, and policy optimisation, while DPO keeps the preference signal but removes the explicit reward-model layer. What makes either acceptable is not the label but the resulting score profile against the five pillars (Responsibility, Auditability, Interpretability, Dependability, Traceability). The evidence base is consistent: these controls can improve Responsibility by moderating unsafe tone, surfacing red flags, and making outputs more socially aligned, but they are only defensible where those gains are measured and tied to explicit reviewer protocols.
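To make the difference between the two methods concrete, the following is a minimal sketch of the standard per-pair DPO objective, written here in plain Python rather than taken from the RAIDT papers. It shows that the loss is computed directly from the policy's and a frozen reference model's log-probabilities for the preferred and rejected responses, with no separate reward model in between. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (illustrative sketch).

    All names and the default beta are assumptions made for this example.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response over the rejected one than the frozen reference does.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

A pair on which the policy and the reference agree exactly has zero margin and a loss of log 2; the loss falls as the policy shifts probability towards the preferred response. Because the preference pairs are the only training signal, the pair dataset is the natural artefact to document when DPO is used.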

The acceptability threshold is therefore evidential. The papers argue that organisations should treat the run as the unit of governance, so each aligned run carries prompt/version data, model or adapter lineage, preference or reward identifiers, reviewer calibration, hashes, and adjudication notes. In item language, that amounts to a run-level evidence pack rather than a general claim that the model is better behaved. Using anchors 1=missing / 3=partial / 5=audit-ready, RLHF or DPO is acceptable only when annotator consent, rubric versions, inter-rater checks, reward-model cards or pair-dataset cards, and rollback notes are present and reviewable. If those artefacts are absent, the intervention may still raise Responsibility, but it is not a satisfactory governance control because Auditability and Traceability remain too weak for serious oversight.
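As an illustration of the evidential threshold described above, the sketch below records which artefacts are present for a run and maps them onto the 1 / 3 / 5 anchors. The class, field names, and scoring rule are hypothetical: the papers name the artefacts and the anchors, while the mapping used here (all present gives 5, none gives 1, anything in between gives 3) is an assumption made for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidencePack:
    """Run-level checklist of reviewable artefacts (hypothetical field names)."""
    annotator_consent: bool
    rubric_versions: bool
    inter_rater_checks: bool
    model_or_dataset_card: bool  # reward-model card (RLHF) or pair-dataset card (DPO)
    rollback_notes: bool


def anchor_score(pack: EvidencePack) -> int:
    """Map a pack onto the anchors 1=missing, 3=partial, 5=audit-ready.

    Assumed rule for illustration: every artefact present gives 5,
    none present gives 1, and any mixture gives 3.
    """
    present = [getattr(pack, f.name) for f in fields(pack)]
    if all(present):
        return 5
    if not any(present):
        return 1
    return 3


def is_acceptable(pack: EvidencePack) -> bool:
    """A run counts as an acceptable governance control only when audit-ready."""
    return anchor_score(pack) == 5
```

Under this rule, a run that lacks only its rollback notes scores 3 and is not treated as an acceptable control, even if the aligned behaviour itself has improved.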

Practical example

In healthcare, a hospital could use RLHF to improve a discharge-summary assistant that already runs with retrieval over local guidance. Clinicians rank candidate summaries, rewarding outputs that surface polypharmacy risks, frailty flags, and appropriate uncertainty language, while penalising unsupported reassurance. The control is acceptable only if the hospital keeps the preference rubric, annotator eligibility records, reward or preference IDs, and output hashes for every reviewed run.

If a governance officer can inspect the run and see which prompt version, retrieval context, reviewer guidance, and preference labels shaped the final summary, the intervention functions as a defensible governance control. If the same hospital only says that RLHF made the assistant safer, but cannot show who labelled what and under which rubric, it has improved behaviour without producing evidence strong enough for audit or incident review.
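The inspection described above could be sketched roughly as follows. The record layout, field names, and the `audit_run` helper are hypothetical, and SHA-256 is assumed as the hash function since the source does not specify one. The idea is only that a stored output hash ties the reviewed summary to the prompt version, retrieval context, and preference labels that shaped it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceLabel:
    label_id: str
    annotator_id: str
    rubric_version: str


@dataclass(frozen=True)
class RunRecord:
    prompt_version: str
    retrieval_context_ids: tuple
    labels: tuple  # tuple of PreferenceLabel
    output_sha256: str


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_run(record: RunRecord, summary_text: str) -> list:
    """Return a list of findings; an empty list means the run is inspectable."""
    findings = []
    if sha256_text(summary_text) != record.output_sha256:
        findings.append("output hash mismatch")
    if not record.prompt_version:
        findings.append("missing prompt version")
    if not record.retrieval_context_ids:
        findings.append("missing retrieval context")
    if not record.labels:
        findings.append("no preference labels")
    for label in record.labels:
        if not label.annotator_id or not label.rubric_version:
            findings.append(f"label {label.label_id} lacks annotator or rubric")
    return findings
```

A record that passes this check lets a reviewer answer who labelled what and under which rubric; a record that fails it corresponds to the second case, where behaviour may have improved but the evidence does not support audit or incident review.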
