Glossary term
Safety and Alignment
Reinforcement Learning from AI Feedback (RLAIF) - an alignment technique that uses AI models, rather than human raters, to generate the preference labels used for training.
Anthropic's Constitutional AI uses RLAIF - a language model critiques and revises its own outputs according to a set of written principles (the constitution), and an AI model then generates the harmlessness preference labels that train the preference model, without human annotation of harmlessness.
Google Research's RLAIF work uses a PaLM 2 model as a preference judge to generate AI preference labels in place of human ones - collecting preference data at a fraction of the cost of human labelling.
Llama 2's alignment is a useful contrast - it relies on RLHF, with human labellers providing preference data for both helpfulness and safety and a separate reward model trained on each, rather than on AI-generated preference labels.
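
The AI labelling step in the definition can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_judge` is a hypothetical keyword heuristic standing in for a call to a real judge model, `PreferencePair`, `label_pairs`, and `pairwise_loss` are invented names, and averaging over both response orders is one common way to reduce position bias, not a description of any specific system named above.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge signature: returns the probability that response_a
# is better than response_b. A real pipeline would query an LLM here.
Judge = Callable[[str, str, str], float]


def toy_judge(prompt: str, response_a: str, response_b: str) -> float:
    """Stand-in for an AI labeller: penalises responses containing a flagged word."""
    flagged = "insult"
    score_a = 0.0 if flagged in response_a.lower() else 1.0
    score_b = 0.0 if flagged in response_b.lower() else 1.0
    # Logistic of the score difference gives P(a preferred over b).
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def label_pairs(
    samples: List[Tuple[str, str, str]], judge: Judge
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) samples into AI-labelled pairs."""
    pairs = []
    for prompt, a, b in samples:
        # Query both orders and average, to reduce position bias (assumption).
        p_ab = judge(prompt, a, b)
        p_ba = judge(prompt, b, a)
        p_a = 0.5 * (p_ab + (1.0 - p_ba))
        if p_a >= 0.5:
            pairs.append(PreferencePair(prompt, a, b))
        else:
            pairs.append(PreferencePair(prompt, b, a))
    return pairs


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))
```

The resulting `PreferencePair` records play the role that human-labelled comparisons play in RLHF: a reward model is trained so that `pairwise_loss` is small when it scores the chosen response above the rejected one.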