Glossary term
Evaluation and Benchmarks
Systematic assessment of model or agent quality, risk, and behavior.
The process of measuring a model's quality or comparing different models against each other.
To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.
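As a rough illustration of the validation/test distinction, the sketch below tunes a toy threshold classifier on a validation set and reports accuracy on a separate test set. The helper names (`split_dataset`, `accuracy`, `pick_threshold`), the split fractions, and the toy data are all assumptions made for this example, not part of any particular library.

```python
import random

def split_dataset(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle and split examples into train, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def accuracy(predict, examples):
    """Fraction of (x, label) pairs where predict(x) equals the label."""
    if not examples:
        return 0.0
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

def pick_threshold(val, candidates):
    """Model selection: keep the threshold with the best validation accuracy."""
    return max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), val))

# Toy data (hypothetical): the label is 1 when x >= 0.5.
data = [(i / 100, int(i >= 50)) for i in range(100)]
train, val, test = split_dataset(data)

# Choose the setting on the validation set, then report once on the test set.
best_t = pick_threshold(val, [0.3, 0.5, 0.7])
test_acc = accuracy(lambda x: int(x >= best_t), test)
print(f"threshold={best_t} test accuracy={test_acc:.2f}")
```

Keeping the test set out of the selection step is the point of the design: the reported number then reflects data the chosen setting has never been tuned on.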
OpenAI's Evals framework is open source and is used by hundreds of teams to define and run regression tests on GPT-4, measuring accuracy, refusal rates, and format compliance across thousands of curated prompts.
HELM (Holistic Evaluation of Language Models, from Stanford) evaluates 30+ LLMs across 42 scenarios on metrics including accuracy, calibration, fairness, robustness, and efficiency; procurement teams use it to compare models.
Ragas (open source) provides RAG-specific evaluation metrics (faithfulness, answer relevance, context recall, and context precision) that engineering teams use to measure RAG pipeline quality before production deployment.
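The kinds of metrics mentioned in these examples (accuracy, refusal rate, format compliance) can be sketched with a small scoring function. This is a minimal illustration only: it does not use the OpenAI Evals, HELM, or Ragas APIs, and the refusal markers, the JSON format requirement, and the sample cases are hypothetical choices made for the example.

```python
import json

# Hypothetical heuristic: phrases treated as signs of a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def is_refusal(output):
    """True if the response contains one of the refusal phrases."""
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def parse_json(output):
    """Return (True, value) if output is valid JSON, else (False, None)."""
    try:
        return True, json.loads(output)
    except json.JSONDecodeError:
        return False, None

def score_run(cases, outputs):
    """Score model outputs against curated cases, given in the same order."""
    if not cases or len(cases) != len(outputs):
        raise ValueError("cases and outputs must be non-empty and aligned")
    n = len(cases)
    correct = refusals = well_formed = 0
    for case, output in zip(cases, outputs):
        ok, value = parse_json(output)
        well_formed += ok
        refusals += is_refusal(output)
        # An answer counts as correct only if it parses and matches exactly.
        correct += ok and value == case["expected"]
    return {
        "accuracy": correct / n,
        "refusal_rate": refusals / n,
        "format_compliance": well_formed / n,
    }

# Hypothetical curated cases and recorded model outputs.
cases = [
    {"expected": {"answer": 4}},
    {"expected": {"answer": "Paris"}},
    {"expected": {"answer": 7}},
    {"expected": {"answer": "blue"}},
]
outputs = [
    '{"answer": 4}',
    "The answer is Paris",
    "I can't help with that.",
    '{"answer": "red"}',
]
metrics = score_run(cases, outputs)
print(metrics)
```

In practice, refusal detection and correctness checks are usually more nuanced than substring and exact-match tests; the structure (fixed cases, recorded outputs, aggregate rates) is what carries over.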
Created for this library
A SaaS company runs evaluation on a fixed held-out month of data each release so model quality is comparable across versions.
A medical AI team runs evaluation on a curated test set of edge cases that clinicians find most informative for safety review.
A search team runs evaluation on rated queries weekly to monitor quality drift between scheduled model updates.
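Comparing versions on a fixed evaluation set, as in the examples above, often comes down to checking whether any metric dropped by more than an agreed tolerance. The sketch below assumes higher values are better for every metric; the function name `find_regressions`, the tolerance, and the metric values are hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is worse than the baseline by more
    than `tolerance`. Assumes higher is better; metrics missing from the
    candidate are skipped."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            continue
        drop = base_value - new_value
        if drop > tolerance:
            regressions[name] = drop
    return regressions

# Hypothetical scores for two releases on the same held-out set.
baseline = {"accuracy": 0.91, "format_compliance": 0.99}
candidate = {"accuracy": 0.88, "format_compliance": 0.99}

regressions = find_regressions(baseline, candidate)
for name, drop in regressions.items():
    print(f"{name} dropped by {drop:.3f}")
```

Because both releases are scored on the same data, a drop can be attributed to the model change rather than to a shift in the evaluation set.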
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License