Automatic Evaluation

Using software to judge the quality of a model's output.

When model output is relatively straightforward, a script or program can compare the model's output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation. Metrics such as ROUGE or BLEU are often useful for programmatic evaluation.
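As a minimal sketch of programmatic evaluation, the snippet below compares a model response to a golden response in two ways: a normalized exact match, and a unigram-overlap F1 in the spirit of ROUGE-1. The function names `exact_match` and `unigram_f1` are hypothetical, and splitting on whitespace is a simplifying assumption. Established ROUGE and BLEU implementations apply their own tokenization and additional rules.

```python
from collections import Counter


def exact_match(candidate: str, golden: str) -> bool:
    """Return True when both strings agree after trimming and lowercasing."""
    return candidate.strip().lower() == golden.strip().lower()


def unigram_f1(candidate: str, golden: str) -> float:
    """ROUGE-1-style F1 over whitespace tokens, with overlap clipped by count."""
    cand = Counter(candidate.lower().split())
    gold = Counter(golden.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

Exact match suits outputs with a single correct form, such as a class label or a number. An overlap score is more forgiving when wording can vary slightly around the golden response.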

When model output is complex or has no one right answer, a separate ML program called an autorater sometimes performs the automatic evaluation.

Contrast with human evaluation.

Real-world uses

Created for this library

1. An LLM team runs automatic evaluation nightly against a held-out prompt set so model regressions are caught before any human review.
2. A search-quality team relies on automatic evaluation with NDCG to compare ranking variants quickly before sending the top candidates for human rating.
3. A translation vendor uses automatic evaluation with BLEU as a fast first pass before paying for full human assessment on new language pairs.
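
To make the NDCG use case above concrete, here is a small sketch of how one ranking might be scored. It assumes graded relevance labels listed in the order the system ranked the results, and it uses the linear-gain form of DCG (relevance divided by log2 of rank plus one). Some implementations use an exponential gain instead. The names `dcg` and `ndcg` are illustrative, not taken from a particular library.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain; ranks start at 1."""
    # enumerate() is 0-based, so the discount for 1-based rank r is log2(r + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    rels = list(relevances)
    k = len(rels) if k is None else k
    # The ideal ordering sorts all labels first, then keeps the top k.
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0
```

Two ranking variants can then be compared by averaging `ndcg` over the same set of queries. A variant whose average is clearly higher is a reasonable candidate to send on for human rating.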
