Direct Preference Optimisation (DPO)

Alignment technique that optimises a model directly from preference pairs (chosen vs rejected responses) without training a separate reward model.
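
To make "directly from preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name dpo_loss, its argument names, and the default beta=0.1 are illustrative assumptions, not taken from any particular library. Each input is assumed to be the summed log-probability of a whole response under either the policy being trained or a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities, for illustration only.
loss_no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
loss_prefers_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy favours chosen
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. No separate reward model appears anywhere: the log-ratios act as an implicit reward, and beta controls how far the policy may drift from the reference.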

1. Zephyr (Hugging Face, 2023) uses DPO to align Mistral 7B on preference data, reaching an MT-Bench score of 7.34, higher than Llama 2 Chat 70B, without the reward-model complexity of RLHF.
2. Mistral AI uses DPO in its instruction-tuning pipeline to align Mistral 7B Instruct v0.2; DPO's stability advantage over PPO-based RLHF lets smaller teams run alignment without reward-model training infrastructure.
3. Llama 3.1's post-training pipeline combines SFT with rejection sampling and DPO; Meta reports that DPO required less compute than PPO for large models and performed better, particularly on instruction-following benchmarks.
