Benchmark

A benchmark is a standardized evaluation that measures AI model performance on tasks like reasoning or language understanding, enabling consistent comparison, progress tracking, and identification of strengths and weaknesses.
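As a minimal sketch of what "standardized" means in practice: every model is run on the same fixed items and scored with the same rule, so the results are directly comparable. The item set, the exact-match scoring rule, and the two stand-in models below are all hypothetical and chosen only for illustration. Real benchmarks use far larger item sets and often more elaborate scoring.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: a fixed list of (prompt, expected answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]


def score(model: Callable[[str], str], items: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Return the exact-match accuracy of `model` on the fixed item set."""
    correct = sum(model(prompt).strip() == expected for prompt, expected in items)
    return correct / len(items)


# Stand-in "models": plain functions mapping a prompt to an answer (assumed for this sketch).
def model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "opposite of hot": "warm"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    return "4"  # always gives the same answer


# Same items and same scoring rule for every model, so the scores can be compared.
results: Dict[str, float] = {"model_a": score(model_a), "model_b": score(model_b)}
ranking = sorted(results, key=results.get, reverse=True)

for name in ranking:
    print(f"{name}: {results[name]:.2f}")
```

Keeping the item set and scoring rule fixed is what allows scores to be tracked over time and compared across models.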

Examples

1. MMLU, GPQA Diamond, and HumanEval are standard LLM benchmarks reported by OpenAI, Anthropic, and Google.
2. SWE-bench Verified measures coding agent performance and is used to rank Claude Code, Devin, and Cursor.
3. ARC-AGI by François Chollet is a high-profile reasoning benchmark with a public leaderboard and prize.
