Glossary term
Evaluation and Benchmarks
A benchmark is a standardized evaluation that measures AI model performance on tasks like reasoning or language understanding, enabling consistent comparison, progress tracking, and identification of strengths and weaknesses.
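As a rough illustration of the definition above, a benchmark can be reduced to a fixed task set plus a fixed scoring rule applied identically to every model. The sketch below is a toy, not any real benchmark's harness: the task list, the exact-match scoring rule, and the names `TASKS`, `evaluate`, `model_a`, and `model_b` are all hypothetical, and the "models" are plain functions standing in for real LLM calls.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy task set: (prompt, expected answer) pairs.
# Real benchmarks use far larger, carefully curated sets.
TASKS: Sequence[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 5", "15"),
]


def evaluate(model: Callable[[str], str], tasks: Sequence[Tuple[str, str]] = TASKS) -> float:
    """Return the fraction of tasks the model answers with an exact match."""
    correct = sum(1 for prompt, expected in tasks if model(prompt).strip() == expected)
    return correct / len(tasks)


# Two stand-in "models" (lookup tables, not real LLMs).
def model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 5": "15"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Lyon"}
    return answers.get(prompt, "")


# Same tasks and same scoring rule for both models, so scores are comparable.
scores = {"model_a": evaluate(model_a), "model_b": evaluate(model_b)}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Because the tasks and the scoring rule are held constant, any difference in score reflects the models rather than the test, which is what makes comparison and progress tracking possible.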
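MMLU (multi-subject knowledge), GPQA Diamond (graduate-level science questions), and HumanEval (Python code generation) are widely used LLM benchmarks; OpenAI, Anthropic, and Google have reported scores on them in model releases.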
SWE-bench Verified measures how well coding agents resolve real GitHub issues and is used to compare agents such as Claude Code, Devin, and Cursor.
ARC-AGI by François Chollet is a high-profile reasoning benchmark with a public leaderboard and prize.