RLAIF

Reinforcement Learning from AI Feedback: an alignment technique in which AI models, rather than human raters, generate the preference labels used to train a reward model or policy.

1. Anthropic's Constitutional AI uses RLAIF: a Claude model critiques and revises its own outputs according to a set of written principles, and an AI model then generates the harmlessness preference labels that train the reward model, so no human harmlessness annotation is needed.
2. Google's RLAIF research uses an off-the-shelf PaLM 2 model as a preference judge to generate training data, and reports results comparable to RLHF on summarization at a fraction of the cost of human labelling.
3. Llama 2's safety alignment is mostly RLHF, with one AI-feedback-style component: human labellers provide preference data for both helpfulness and safety, while safety context distillation fine-tunes the model on its own responses generated under a safety preprompt.
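
The labelling step shared by the examples above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any lab's actual pipeline: `toy_judge`, `label_pairs`, and `bradley_terry_loss` are hypothetical names, and the judge is a keyword rule standing in for a real language-model call. The loss shown is the standard pairwise (Bradley-Terry) objective commonly used to train reward models on preference pairs.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical judge signature: (prompt, response_a, response_b) -> P(a is preferred).
Judge = Callable[[str, str, str], float]


def toy_judge(prompt: str, a: str, b: str) -> float:
    """Stand-in for an AI labeller (assumption: a real system would query a
    language model with the prompt, both responses, and a written principle).
    This toy version only penalises a placeholder 'UNSAFE' marker."""
    score_a = -1.0 if "UNSAFE" in a else 1.0
    score_b = -1.0 if "UNSAFE" in b else 1.0
    # Convert the score gap into a probability that response a is preferred.
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def label_pairs(
    judge: Judge, samples: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str]]:
    """Turn (prompt, a, b) triples into (prompt, chosen, rejected) triples
    using the judge's preference instead of a human rater's."""
    labelled = []
    for prompt, a, b in samples:
        p_a = judge(prompt, a, b)
        chosen, rejected = (a, b) if p_a >= 0.5 else (b, a)
        labelled.append((prompt, chosen, rejected))
    return labelled


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


samples = [
    ("How do I reset my password?",
     "Use the account settings page.",
     "UNSAFE: guess the admin password."),
    ("How do I reset my password?",
     "UNSAFE: guess the admin password.",
     "Contact support to verify your identity."),
]
pairs = label_pairs(toy_judge, samples)
```

The resulting `pairs` list has the same shape as a human-labelled preference dataset, which is why AI labels can replace or supplement human labels without changing the reward-model training step. The loss shrinks as the reward model scores the chosen response above the rejected one.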
