Glossary term
Foundations
Reinforcement Learning from Human Feedback (RLHF) - a training technique that fine-tunes a model with reinforcement learning against a reward signal learned from human preference judgments.
OpenAI used RLHF to turn GPT-4 from a base language model into a helpful, harmless assistant - human raters provided preference labels on model outputs, those labels trained a reward model, and the reward model supplied the reward signal for PPO (Proximal Policy Optimization) fine-tuning.
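As an illustration of how preference labels can train a reward model, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for this step. It is an assumption that this formulation matches any particular production system, and the function name `preference_loss` is hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    output above the rejected one, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one labeled pair of outputs.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

Averaging this loss over many labeled pairs and minimizing it pushes the reward model to rank outputs the way the raters did.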
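Anthropic's Constitutional AI extends RLHF by having an AI model critique and revise its own responses against a written set of principles, and by using AI-generated preference labels for harmlessness in place of human ones. This reduces the number of human labels needed. Anthropic uses the approach in training Claude.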
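DeepSeek R1 uses GRPO (Group Relative Policy Optimisation) to train its reasoning model. GRPO is a PPO-style reinforcement learning algorithm that drops the separate value model and instead scores each sampled response relative to the other responses drawn for the same prompt. R1's reasoning-focused training relies largely on rule-based rewards (such as answer correctness and output format) rather than human preference labels, so it is closely related to RLHF rather than strictly a variant of it. DeepSeek reports math and coding performance comparable to OpenAI's o1, and removing the value model lowers training cost relative to PPO-based RLHF.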
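The group-relative idea behind GRPO can be sketched as follows: each response's reward is normalized against the mean and standard deviation of the rewards in its own group. This is a simplified illustration of the advantage computation only, not the full algorithm. The function name, the `eps` stabilizer, and the example rewards are assumptions made for the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled responses.

    A response is "good" only relative to the other responses sampled for
    the same prompt, so no separate value model is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed stabilizer) avoids division by zero when every
    # response in the group received the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four responses to one prompt:
# 1.0 if the final answer was correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct responses get positive advantages and incorrect ones negative. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.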