Glossary term
Training and Fine-Tuning
A model trained to predict human preference scores for LLM outputs; its score serves as a surrogate reward signal during RLHF training.
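As a rough illustration of how such a model is commonly trained, the sketch below computes a Bradley-Terry style pairwise loss on one preference comparison. The function names and the use of plain floats as reward scores are assumptions for illustration; a real reward model would produce these scores from a neural network and average the loss over a batch.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that the 'chosen' response is preferred.

    Assumes a Bradley-Terry model: sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human preference: -log(sigmoid(margin)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the model ranks the preferred response higher,
# and large when it ranks the rejected response higher.
loss_when_correct = pairwise_preference_loss(2.0, 0.0)
loss_when_wrong = pairwise_preference_loss(0.0, 2.0)
```

Minimising this loss pushes the score of the preferred response above the score of the rejected one, which is what lets the trained model stand in for human raters later.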
OpenAI's RLHF pipeline, described for InstructGPT and also applied to GPT-4, trains a reward model on human preference comparisons between candidate responses; the reward model then scores new responses so PPO can optimise the policy without further human labels.
Anthropic's Constitutional AI trains a preference model on AI-generated preference labels, guided by a written set of principles, rather than purely human labels; this reduces the amount of human labelling needed to scale alignment training.
Cohere trains custom reward models for enterprise clients based on their specific quality criteria; for example, a legal firm's reward model would score responses on citation accuracy and regulatory conservatism rather than general helpfulness.
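In PPO-based RLHF, the reward model's score is often combined with a penalty that keeps the policy close to a reference model. The sketch below shows one common form of that shaped reward. The function name, the `beta` default, and the single-sample log-probability difference used as the KL estimate are illustrative assumptions, not any specific vendor's implementation.

```python
def kl_shaped_reward(
    rm_score: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,
) -> float:
    """Reward model score minus a KL-style penalty.

    logprob_policy and logprob_reference are the log-probabilities of the
    same sampled response under the policy being trained and under the
    frozen reference model. Their difference is a single-sample estimate
    of the KL divergence between the two models.
    """
    kl_estimate = logprob_policy - logprob_reference
    return rm_score - beta * kl_estimate


# When the policy matches the reference, the reward model score passes through unchanged.
unchanged = kl_shaped_reward(1.0, -2.0, -2.0)
# When the policy assigns the response more probability than the reference does,
# the shaped reward is reduced.
penalised = kl_shaped_reward(1.0, -1.0, -2.0, beta=0.5)
```

The penalty discourages the policy from drifting towards outputs that score well under the reward model but are unlikely under the reference model, a common symptom of reward hacking.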