Glossary term
Evaluation and Benchmarks
A metric for an LLM's accuracy at solving math problems within k attempts. For example, math-pass@2 measures how often an LLM solves a math problem within two attempts. A score of 0.85 on math-pass@2 means that the LLM solved 85% of the problems within two attempts.
math-pass@k is identical to the pass@k metric, except that the term math-pass@k is specifically used for math evaluation.
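As a rough illustration of the definition above, the following Python sketch scores a set of problems. It is not taken from the source glossary: the function names, the input layout (one list of correct/incorrect flags per problem), and the sample data are all assumptions made for this example. The second function shows the combinatorial estimator commonly used when reporting pass@k from n samples per problem, which is one possible way to compute the score and not necessarily what any given evaluation uses.

```python
from math import comb


def math_pass_at_k(attempts_per_problem, k):
    """Fraction of problems with a correct answer among the first k attempts.

    attempts_per_problem: hypothetical layout, one list of booleans per
    problem, where True marks a correct attempt.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_per_problem:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in attempts_per_problem if any(attempts[:k]))
    return solved / len(attempts_per_problem)


def pass_at_k_estimate(n, c, k):
    """Per-problem estimate from n samples, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): the chance that a random subset
    of k samples contains at least one correct answer.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative data only: four problems, two attempts each.
results = [
    [True, False],   # solved on the first attempt
    [False, True],   # solved on the second attempt
    [False, False],  # not solved
    [False, False],  # not solved
]
score_at_1 = math_pass_at_k(results, k=1)
score_at_2 = math_pass_at_k(results, k=2)
```

With this sample data, one of four problems is solved on the first attempt and two of four within two attempts, so the score rises as k grows. That is why the value of k should always be reported alongside the score.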
Created for this library
An LLM evaluation team uses math-pass@k in its model release reviews to track how often the model produces a correct math solution among k samples.
A research lab reports math-pass@k scores in its preprint to compare the reasoning ability of fine-tuned versions of its model.
A model release team includes math-pass@k as one of several reasoning benchmarks gating production promotion.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License