Glossary term
Safety and Alignment
A machine learning approach that trains models from human or AI comparisons of output pairs rather than from absolute labels.
Anthropic's preference learning approach for Claude uses comparisons of response pairs rated by human contractors. A preference model trained on these comparisons then serves as the reward signal in PPO-based alignment training.
OpenAI's InstructGPT uses preference learning via the Bradley-Terry model to train a reward model from 50,000 pairwise comparisons of GPT-3 outputs, creating a scalar reward signal for RLHF training.
Google's Gemini alignment uses preference learning from both human raters and model-generated critiques (the latter known as RLAIF), which scales preference data collection 10x relative to purely human labelling.
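The Bradley-Terry model mentioned in the InstructGPT example turns a pair of scalar rewards into a preference probability, and the reward model is fit by minimizing the negative log-likelihood of the observed choices. The following is a minimal sketch of that objective in plain Python, not any vendor's actual implementation; the function names and the example reward values are hypothetical and chosen only for illustration.

```python
import math


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the 'chosen' response is preferred under Bradley-Terry.

    P(chosen > rejected) = sigmoid(reward_chosen - reward_rejected)
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one observed preference.

    Equals log(1 + exp(-diff)), written in a numerically stable form.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # Hypothetical reward-model scores for (preferred, rejected) response pairs.
    pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
    print(f"mean pairwise loss: {mean_loss:.4f}")
```

In a real training setup the two rewards would come from a learned model scoring each response, and the loss would be averaged over a batch and minimized by gradient descent. The loss shrinks as the preferred response is scored further above the rejected one.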