Reward Model

A model trained to predict human preference scores for LLM outputs, used as a surrogate reward signal during RLHF training.
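As a minimal sketch of how such a model can be fitted to preference comparisons, the snippet below computes the pairwise (Bradley-Terry style) loss commonly used for this purpose. It assumes the reward model emits one scalar score per response. The function name `pairwise_preference_loss` and the example scores are illustrative and do not describe any particular lab's implementation.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the model scores the human-preferred response
    well above the rejected one, and large when the order is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one labelled comparison.
agrees_with_label = pairwise_preference_loss(2.0, 0.0)
disagrees_with_label = pairwise_preference_loss(0.0, 2.0)
```

Averaging this loss over a dataset of comparisons and minimising it pushes the model to rank preferred responses higher; only the difference between the two scores matters, not their absolute values.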

1. OpenAI trained a reward model on 50,000 human preference comparisons to align GPT-4 via RLHF. The reward model scores candidate responses so that PPO can optimise the policy without further human labels.
2. Anthropic's Constitutional AI trains a preference model from AI-generated preference labels rather than purely human labels, scaling alignment training by 10x compared to pure human-labelling approaches.
3. Cohere trains custom reward models for enterprise clients based on their specific quality criteria. A legal firm's reward model, for example, scores responses on citation accuracy and regulatory conservatism rather than general helpfulness.
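To illustrate what "scoring candidate responses" looks like from the caller's side, here is a small sketch that ranks candidates by a reward function. A real reward model is a learned network; `toy_legal_reward` is a hypothetical hand-written stand-in with made-up keyword weights, used only so the example runs without any model. Ranking candidates this way is a simpler use of the scores than the PPO training mentioned above.

```python
from typing import Callable, Sequence


def rank_by_reward(
    candidates: Sequence[str], reward_fn: Callable[[str], float]
) -> list[tuple[float, str]]:
    """Score every candidate and return (score, text) pairs, best first."""
    scored = [(reward_fn(text), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_legal_reward(text: str) -> float:
    """Hypothetical stand-in for a learned, domain-specific reward model."""
    lowered = text.lower()
    citations = lowered.count("§")  # crude proxy for citing a provision
    cautious = sum(lowered.count(w) for w in ("may", "subject to"))
    overclaims = sum(lowered.count(w) for w in ("guaranteed", "always"))
    # Assumed weights: reward citations and caution, penalise overclaiming.
    return 1.0 * citations + 0.5 * cautious - 2.0 * overclaims


candidates = [
    "This is always guaranteed to be legal.",
    "Under § 12, disclosure may be required, subject to review.",
]
ranked = rank_by_reward(candidates, toy_legal_reward)
best_score, best_text = ranked[0]
```

Swapping `toy_legal_reward` for a call into a trained model leaves `rank_by_reward` unchanged, which is the sense in which the reward model acts as a pluggable surrogate for human judgment.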
