Glossary term
Training and Fine-Tuning
Process reward model (PRM): a reward model that evaluates each intermediate reasoning step rather than only the final answer, enabling step-level feedback.
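To make the contrast between step-level and final-answer feedback concrete, here is a minimal Python sketch that scores every step of a solution and then aggregates the step scores. The `toy_step_scorer` heuristic and all function names are hypothetical stand-ins for a trained PRM, and taking the minimum or the product of step scores are just two common aggregation choices, not a prescribed method.

```python
from math import prod
from typing import Callable, List

# Signature assumed for illustration: (problem, previous_steps, step) -> P(step is correct)
StepScorer = Callable[[str, List[str], str], float]


def toy_step_scorer(problem: str, previous_steps: List[str], step: str) -> float:
    """Hypothetical stand-in for a trained PRM.

    Only understands steps of the form 'a + b = c' or 'a * b = c' and
    returns a high score when the arithmetic is right, a low one otherwise.
    """
    lhs, _, rhs = step.partition("=")
    a, op, b = lhs.split()
    value = {"+": int(a) + int(b), "*": int(a) * int(b)}[op]
    return 0.9 if value == int(rhs) else 0.1


def score_steps(problem: str, steps: List[str], scorer: StepScorer) -> List[float]:
    """Score each step given the problem and the steps that precede it."""
    return [scorer(problem, steps[:i], step) for i, step in enumerate(steps)]


def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse step scores into one solution-level score."""
    return min(step_scores) if how == "min" else prod(step_scores)


problem = "Compute (2 + 3) * 4 + 1."
steps = ["2 + 3 = 5", "5 * 4 = 21", "21 + 1 = 22"]  # second step is wrong

scores = score_steps(problem, steps, toy_step_scorer)
solution_score = aggregate(scores, how="min")
```

In this toy run the last step is internally consistent, so a check on the final line alone would miss the error, while the per-step scores flag the second step.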
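Math-Shepherd (Wang et al., 2023; not an OpenAI system) trains a PRM on automatically constructed step labels to score each step in a chain-of-thought math solution, using it both to rerank sampled solutions and as a step-level reward in reinforcement learning. OpenAI's own process-supervision work is "Let's Verify Step by Step" (Lightman et al., 2023), which trains a PRM on the human-labeled PRM800K dataset to reward correct reasoning processes rather than just correct final answers. OpenAI has not published the details of o1's training, so any use of a PRM there is unconfirmed.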
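Google DeepMind's AlphaCode 2 is a related example at the outcome level rather than the step level: per its technical report, it samples up to about 1M candidate programs per problem, filters and clusters them, and uses a learned scoring model to choose which candidates to submit. The scoring model rates whole programs, so it is closer to an outcome reward model than to a PRM.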
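Narrowing a large candidate pool down to one submission can be sketched as a filter followed by a score-based pick. This is only an illustration of the general filter-then-rank idea, not AlphaCode 2's actual pipeline; `select_best`, the toy candidates, and the `toy_scores` table (standing in for a learned scoring model) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A candidate is (name, program); the programs here are tiny Python callables.
Candidate = Tuple[str, Callable[[int], int]]


def select_best(
    candidates: List[Candidate],
    passes_tests: Callable[[Candidate], bool],
    score: Callable[[Candidate], float],
) -> Optional[Candidate]:
    """Drop candidates that fail the tests, then return the highest-scoring survivor."""
    survivors = [c for c in candidates if passes_tests(c)]
    if not survivors:
        return None
    return max(survivors, key=score)


# Task assumed for illustration: implement "double the input".
candidates: List[Candidate] = [
    ("add_two", lambda x: x + 2),  # passes the single test below by coincidence
    ("double", lambda x: x * 2),
    ("square", lambda x: x * x),   # also passes by coincidence
    ("triple", lambda x: x * 3),   # fails the test
]


def passes_tests(candidate: Candidate) -> bool:
    """One example test: input 2 must map to 4."""
    return candidate[1](2) == 4


# Hypothetical scores, standing in for a learned scoring model's output.
toy_scores = {"add_two": 0.3, "double": 0.8, "square": 0.5, "triple": 0.95}

best = select_best(candidates, passes_tests, lambda c: toy_scores[c[0]])
```

The filter removes `triple` despite its high score, and the score then separates the three candidates that pass the weak test.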
Numina AI uses PRMs trained on competition mathematics to verify multi-step proofs, identifying the specific step where the reasoning first goes wrong so that generation can be resampled from that step onward.
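Targeted resampling can be sketched as: find the first step whose score falls below a threshold, keep everything before it, and ask the generator to continue from that prefix. The threshold value, the function names, and the `toy_continue` generator are assumptions made for illustration, not a description of any particular system.

```python
from typing import Callable, List, Optional


def first_bad_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold, or None."""
    for i, step_score in enumerate(step_scores):
        if step_score < threshold:
            return i
    return None


def resample_from_error(
    steps: List[str],
    step_scores: List[float],
    continue_fn: Callable[[List[str]], List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Keep the trusted prefix and regenerate everything from the first bad step."""
    bad = first_bad_step(step_scores, threshold)
    if bad is None:
        return steps  # nothing flagged, keep the solution as is
    prefix = steps[:bad]
    return prefix + continue_fn(prefix)


def toy_continue(prefix: List[str]) -> List[str]:
    """Hypothetical generator: a real system would sample from a language model."""
    return ["5 * 4 = 20", "20 + 1 = 21"]


steps = ["2 + 3 = 5", "5 * 4 = 21", "21 + 1 = 22"]
step_scores = [0.9, 0.1, 0.9]  # assumed PRM output; the second step is flagged

repaired = resample_from_error(steps, step_scores, toy_continue)
```

Only the suffix after the trusted prefix is regenerated, which is cheaper than discarding the whole solution. In practice the new steps would be scored again, since a resample can also be wrong.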