Glossary term
Evaluation and Benchmarks
Massive Multitask Language Understanding benchmark covering 57 academic and professional subjects to test breadth of knowledge.
MMLU (Hendrycks et al., 2021) is reported in most major LLM evaluations. Vendor-reported scores include 86.4% for GPT-4, 86.8% for Claude 3 Opus, and 88.6% for Llama 3.1 405B. Enterprises use these scores as one input to model selection.
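Vendors cite MMLU scores in marketing, and buyers use them in procurement. For example, a hospital purchasing an AI assistant might look for models scoring above 80% on MMLU's medicine-related subjects (such as clinical knowledge, college medicine, and professional medicine) as a baseline for professional knowledge.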
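A minimal sketch of how such a threshold check might be computed is shown here. It assumes per-question predictions and answer keys are already available. The subject grouping, the helper names (`subject_accuracy`, `medical_average`, `meets_threshold`), and the 80% cutoff are illustrative assumptions, not part of the benchmark or of any vendor's process.

```python
from statistics import mean

# Assumed grouping: MMLU subjects related to medicine.
# The choice of which subjects to include is an illustrative assumption.
MEDICAL_SUBJECTS = [
    "anatomy",
    "clinical_knowledge",
    "college_medicine",
    "medical_genetics",
    "professional_medicine",
]


def subject_accuracy(predictions, answers):
    """Fraction of questions answered correctly for one subject."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def medical_average(per_subject_accuracy, subjects=MEDICAL_SUBJECTS):
    """Unweighted (macro) mean accuracy over the chosen subjects."""
    missing = [s for s in subjects if s not in per_subject_accuracy]
    if missing:
        raise KeyError(f"missing subject scores: {missing}")
    return mean(per_subject_accuracy[s] for s in subjects)


def meets_threshold(per_subject_accuracy, threshold=0.80):
    """True if the medical-subject average reaches the (hypothetical) cutoff."""
    return medical_average(per_subject_accuracy) >= threshold
```

A macro average weights every subject equally regardless of how many questions it contains. A buyer who cares more about one subject (for instance, professional medicine) might prefer a weighted average or a per-subject minimum instead.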
MMLU-Pro (2024) extends MMLU with harder, expert-level questions and up to ten answer options per question instead of four. It is used to differentiate frontier models that have saturated standard MMLU: GPT-4o and Claude 3.5 Sonnet score in the low-to-mid 70s on it.
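One reason the larger option set matters is that it lowers the score a model can reach by guessing. The sketch below works through that arithmetic, assuming every question has the stated number of options (four for MMLU, ten for an MMLU-Pro question with the full option set). The `chance_adjusted` rescaling is an illustrative convention, not a metric defined by either benchmark.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy from uniform random guessing."""
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 1.0 / num_options


def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Illustrative convention only; neither benchmark reports this number.
    """
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Guessing baseline: 25% with four options, 10% with ten options.
    print(chance_accuracy(4), chance_accuracy(10))
```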