Glossary term
Evaluation and Benchmarks
An evaluation harness is the tooling and process used to run repeatable model evaluations across datasets, prompts, scenarios, metrics, and model versions. It supports release decisions, regression testing, and audit evidence: a mature harness enables repeatable pre-release checks, regression tests after model changes, and a record showing that acceptance criteria were actually tested.
EleutherAI's lm-evaluation-harness is a widely used open-source LLM evaluation framework that supports hundreds of tasks.
Stanford CRFM's HELM (Holistic Evaluation of Language Models) is an evaluation harness used to compare foundation models on standardized metrics.
OpenAI Evals is an open-source framework for creating and running LLM evaluations, often used as a starting point for custom evaluation pipelines.
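To make the definition concrete, here is a minimal sketch of the core loop a harness runs: score a model version on a fixed dataset, check the score against an acceptance threshold, and compare it with a baseline to flag regressions. It is an illustration only and does not reflect the API of any framework named above. All names (`run_eval`, `EvalReport`, `exact_match`, the toy models and dataset) are hypothetical, and exact-match accuracy is assumed as the metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: a "model" is any function mapping a prompt to an answer string.
Model = Callable[[str], str]
Dataset = List[Tuple[str, str]]  # (prompt, expected answer) pairs


def exact_match(prediction: str, expected: str) -> float:
    """Return 1.0 if the normalized strings are equal, otherwise 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())


@dataclass
class EvalReport:
    model_version: str
    score: float      # mean metric value over the dataset
    passed: bool      # score meets the acceptance threshold
    regressed: bool   # score fell below the baseline by more than the tolerance


def run_eval(
    model: Model,
    model_version: str,
    dataset: Dataset,
    threshold: float,
    baseline_score: Optional[float] = None,
    tolerance: float = 0.0,
) -> EvalReport:
    """Score one model version and record pass/fail and regression status."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = [exact_match(model(prompt), expected) for prompt, expected in dataset]
    score = sum(scores) / len(scores)
    regressed = baseline_score is not None and score < baseline_score - tolerance
    return EvalReport(model_version, score, score >= threshold, regressed)


# Hypothetical toy data and models, standing in for a real dataset and LLM.
DATASET: Dataset = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
ANSWERS = dict(DATASET)


def model_v1(prompt: str) -> str:
    return ANSWERS[prompt]  # answers every prompt correctly


def model_v2(prompt: str) -> str:
    return "6" if prompt == "3*3" else ANSWERS[prompt]  # one wrong answer


report_v1 = run_eval(model_v1, "v1", DATASET, threshold=0.7)
report_v2 = run_eval(
    model_v2, "v2", DATASET, threshold=0.7, baseline_score=report_v1.score
)
print(report_v1)
print(report_v2)
```

In this sketch the second version still clears the acceptance threshold but is flagged as a regression against the first, which is the kind of distinction a release decision might rest on. Persisting each `EvalReport` alongside the dataset and model version is one way such a harness could produce audit evidence.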