Alignment Tax

Performance degradation on capability benchmarks observed after applying safety alignment (RLHF, SFT) to a pre-trained base model.

1. Llama 2 Chat is more helpful but scores lower on MMLU (68.9%) than its unaligned base model (69.8%), a small but measurable alignment tax from the safety SFT and RLHF applied to the base model.
2. OpenAI's InstructGPT paper (2022) reported that aligning GPT-3 with RLHF caused performance regressions on several public NLP benchmarks, which the authors called an 'alignment tax' and largely mitigated by mixing pretraining gradients into the PPO updates (PPO-ptx).
3. Anthropic's Constitutional AI research aims to reduce the alignment tax by using AI-generated preference data rather than human labels for harmlessness, with the goal of reaching comparable harmlessness at a smaller cost to helpfulness.
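
As a rough illustration of how the tax in the definition can be quantified, the sketch below computes the absolute and relative drop between a base model's score and its aligned counterpart's score. The function name `alignment_tax` is hypothetical, and the inputs are simply the MMLU figures quoted in example 1; this is one plausible convention, not a standard metric definition.

```python
def alignment_tax(base_score: float, aligned_score: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction of the base score).

    Hypothetical helper for illustration: a positive result means the
    aligned model scores lower than the base model on the benchmark.
    """
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    absolute = base_score - aligned_score
    relative = absolute / base_score
    return absolute, relative


# MMLU accuracies (in percent) as quoted in example 1.
abs_drop, rel_drop = alignment_tax(base_score=69.8, aligned_score=68.9)
print(f"{abs_drop:.1f} points, {rel_drop:.1%} relative")  # 0.9 points, 1.3% relative
```

Reporting both numbers is a design choice: the absolute drop is easier to compare across models on the same benchmark, while the relative drop is easier to compare across benchmarks with different score ranges.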
