Glossary term
Evaluation and Benchmarks
Benchmark of 8,500 grade-school math word problems requiring multi-step numerical reasoning.
GSM8K (Cobbe et al. 2021, OpenAI) is used to benchmark mathematical reasoning. GPT-4o achieves 95.3% with chain-of-thought prompting, vs 56% for GPT-3.5; scores like these are used to select models for financial calculation tasks.
DeepSeek R1 is reported to achieve 97.3% on GSM8K after reinforcement-learning training for reasoning. Math benchmark performance became a key differentiator between frontier models in 2024-2025.
The MATH benchmark (Hendrycks et al. 2021) covers competition-level maths (AMC, AIME) and differentiates frontier models more sharply than GSM8K. Quantitative finance firms use it to select models for formula derivation.
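GSM8K percentages such as those quoted in this entry are usually exact-match accuracy on the final numeric answer. The sketch below illustrates one way such a score might be computed. It relies on the fact that GSM8K reference solutions end with a line of the form "#### <answer>". Taking the last number in the model output as its answer is a common heuristic assumed here, not a fixed standard, and the names to_number, reference_answer and accuracy are hypothetical helpers, not part of any official harness.

```python
import re
from decimal import Decimal

# A number with optional sign, thousands separators and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_number(text):
    """Return the last number found in text as a Decimal, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    # Strip thousands separators so "1,250" and "1250" compare equal.
    return Decimal(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Extract the final answer that follows the '####' marker in a reference solution."""
    return to_number(solution.split("####")[-1])


def accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = to_number(prediction)
        if predicted is not None and predicted == reference_answer(reference):
            correct += 1
    return correct / len(references)


# Illustrative data only; these are not real model outputs.
references = [
    "Natalia sold 48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72",
    "The two payments add up to 1,250.\n#### 1,250",
]
predictions = [
    "She sold 72 clips in total.",
    "The total is 1300.",
]
print(accuracy(predictions, references))  # 0.5
```

Comparing parsed Decimal values rather than raw strings means "18", "18.00" and "$18" all count as the same answer. A production harness would need stricter answer extraction than the last-number heuristic.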