LLM Evaluations (evals)

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.

Are useful in comparing different LLMs and identifying the best LLM for a particular task.

Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

Created for this library

1. An LLM product team runs evals nightly so that any prompt or model change must show a measurable quality improvement before launch.
2. An evaluation team curates domain-specific evals for its legal-tech product so that model selection is grounded in client-relevant tasks.
3. A research lab uses public and private evals together to gate model promotions for production use.
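
The gating pattern in these examples can be sketched roughly as follows. This is only an illustration under stated assumptions, not any particular team's harness: the eval set, the two stand-in "models" (plain Python functions in place of real LLM calls), the exact-match metric, and the names `exact_match_accuracy` and `passes_gate` are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical eval set: (prompt, expected answer) pairs.
EVAL_SET: Sequence[Tuple[str, str]] = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def exact_match_accuracy(
    model: Callable[[str], str],
    eval_set: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose normalized output equals the expected answer."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in eval_set
    )
    return correct / len(eval_set)


def passes_gate(
    candidate_score: float,
    baseline_score: float,
    min_gain: float = 0.0,
) -> bool:
    """True only if the candidate beats the baseline by more than min_gain."""
    return candidate_score - baseline_score > min_gain


# Stand-in "models": lookup tables in place of real LLM calls (assumption).
def baseline_model(prompt: str) -> str:
    return {"2 + 2 =": "4", "Capital of France?": "Lyon"}.get(prompt, "")


def candidate_model(prompt: str) -> str:
    answers = {
        "2 + 2 =": "4",
        "Capital of France?": "paris",
        "Opposite of hot?": "warm",
    }
    return answers.get(prompt, "")


if __name__ == "__main__":
    baseline = exact_match_accuracy(baseline_model, EVAL_SET)
    candidate = exact_match_accuracy(candidate_model, EVAL_SET)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    print("promote" if passes_gate(candidate, baseline) else "hold")
```

Exact match is used here only because it is easy to read. A real eval would likely choose a metric suited to the task, use a much larger eval set, and account for run-to-run variance before treating a score difference as an improvement.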
