Process Reward Model (PRM)

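A reward model that scores each intermediate reasoning step rather than only the final answer, providing step-level feedback. This contrasts with an outcome reward model (ORM), which assigns a single score to the complete solution.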
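A minimal sketch of step-level scoring may make the definition concrete. Everything here is illustrative: `toy_step_scorer` is a hypothetical stand-in for a learned PRM (it only checks simple `a op b = c` arithmetic), and the `min` and product aggregations are two common choices, not a prescribed method.

```python
import math
import operator
from typing import Callable, List

# Hypothetical interface: (problem, previous steps, current step) -> P(step is correct)
StepScorer = Callable[[str, List[str], str], float]

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_step_scorer(problem: str, prefix: List[str], step: str) -> float:
    """Stand-in for a learned PRM: checks steps of the form 'a op b = c'."""
    parts = step.split()
    if len(parts) != 5 or parts[1] not in _OPS or parts[3] != "=":
        return 0.5  # step is not checkable by this toy rule: stay neutral
    try:
        a, b, c = float(parts[0]), float(parts[2]), float(parts[4])
    except ValueError:
        return 0.5
    return 0.95 if _OPS[parts[1]](a, b) == c else 0.05


def score_steps(problem: str, steps: List[str], scorer: StepScorer) -> List[float]:
    """Score every step, giving the scorer the steps that precede it."""
    return [scorer(problem, steps[:i], step) for i, step in enumerate(steps)]


def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse step scores into one solution-level score."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


steps = ["2 + 3 = 5", "5 * 4 = 21", "21 - 1 = 20"]
scores = score_steps("Compute (2 + 3) * 4 - 1.", steps, toy_step_scorer)
print(scores)             # [0.95, 0.05, 0.95]
print(aggregate(scores))  # 0.05
```

The toy scorer judges each step in isolation, so the third step scores high even though it builds on a wrong value; a learned PRM would normally condition on the preceding steps, which is why `prefix` is part of the assumed interface.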

1.
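Math-Shepherd (Wang et al., 2023) trains a PRM to score each step in a chain-of-thought math solution, using automatically constructed step labels instead of human annotation. OpenAI's own PRM work, "Let's Verify Step by Step" (Lightman et al., 2023), found that rewarding correct reasoning steps outperforms rewarding only correct final answers on the MATH benchmark. Whether a PRM was used in o1 training has not been publicly confirmed.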
2.
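Google DeepMind's AlphaCode 2 samples up to about one million candidate programs per problem, then filters, clusters, and ranks them with a learned scoring model before submitting. Its technical report describes scoring whole candidate programs rather than individual lines, so it is closer to an outcome-level verifier than to a PRM, but it illustrates the same verify-and-rerank pattern.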
3.
Numina AI uses PRMs trained on competition mathematics to verify multi-step proofs, identifying the first step at which the reasoning goes wrong so that only the solution from that step onward needs to be resampled.
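The two uses that recur in these examples, locating the first faulty step for targeted resampling and reranking several candidates, can be sketched as follows. This is a sketch under stated assumptions, not any of the named systems' implementations: step scores are assumed to lie in [0, 1], the 0.5 threshold is arbitrary, and `regenerate` is a hypothetical callback standing in for a call to the generator model.

```python
from typing import Callable, List, Optional


def first_bad_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def resample_from_first_error(
    steps: List[str],
    step_scores: List[float],
    regenerate: Callable[[List[str]], List[str]],  # hypothetical generator call
    threshold: float = 0.5,
) -> List[str]:
    """Keep the steps before the first low-scoring one and regenerate the rest."""
    bad = first_bad_step(step_scores, threshold)
    if bad is None:
        return steps  # every step passed: nothing to resample
    kept = steps[:bad]
    return kept + regenerate(kept)


def best_of_n(candidates: List[List[str]],
              score_fn: Callable[[List[str]], List[float]]) -> List[str]:
    """Pick the candidate whose weakest step scores highest (min aggregation)."""
    return max(candidates, key=lambda steps: min(score_fn(steps)))


# Usage with stub data: step index 1 is flagged, so steps from there on are replaced.
fixed = resample_from_first_error(
    ["2 + 3 = 5", "5 * 4 = 21", "21 - 1 = 20"],
    [0.95, 0.05, 0.95],
    regenerate=lambda kept: ["5 * 4 = 20", "20 - 1 = 19"],
)
print(fixed)  # ['2 + 3 = 5', '5 * 4 = 20', '20 - 1 = 19']
```

Resampling from the first flagged step reuses the verified prefix, which is the practical advantage of step-level scores over a single score for the whole solution.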
