Glossary term
Evaluation and Benchmarks
A structured test of what a model or system can do, including both intended benefits and dangerous or unexpected capabilities. It differs from ordinary performance testing because it looks for risk-relevant capability thresholds. Capability evaluations should be performed before broad release and repeated when models, tools, scaffolding, or deployment context changes.
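As a rough illustration of the two ideas in this definition (checking results against risk-relevant thresholds, and re-running the evaluation when the setup changes), here is a minimal Python sketch. The names `EvalResult`, `crossed_thresholds`, and `needs_reevaluation`, the configuration keys, and all numeric values are hypothetical and chosen only for illustration; they do not come from any evaluator's published methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One capability measurement (hypothetical structure)."""
    capability: str   # e.g. "cyber" or "self_proliferation"
    score: float      # fraction of evaluation tasks solved, 0.0 to 1.0
    threshold: float  # assumed risk-relevant threshold for this capability


def crossed_thresholds(results):
    """Return the capabilities whose score meets or exceeds its threshold."""
    return [r.capability for r in results if r.score >= r.threshold]


# Factors named in the definition that should trigger a repeat evaluation.
CHANGE_KEYS = ("model", "tools", "scaffolding", "deployment_context")


def needs_reevaluation(evaluated_config, current_config):
    """Return True if any tracked factor differs from the evaluated setup."""
    return any(
        evaluated_config.get(key) != current_config.get(key)
        for key in CHANGE_KEYS
    )
```

A caller might flag any capability returned by `crossed_thresholds` for review before broad release, and call `needs_reevaluation` whenever the deployed configuration is updated.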
METR (formerly ARC Evals) specializes in evaluating the autonomous task capabilities of frontier models, working with major AI developers.
The UK AISI and US AISI publish capability evaluation results for frontier models, including pre-deployment testing of Anthropic's and OpenAI's models.
Google DeepMind's paper "Evaluating Frontier Models for Dangerous Capabilities" (2024) presents evaluations covering persuasion and deception, cybersecurity, self-proliferation, and self-reasoning.