Glossary term
Glossary term
Agentic Systems
Using software to judge the quality of a model's output.
When model output is relatively straightforward, a script or program can compare the model's output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation. Metrics such as ROUGE or BLEU are often useful for programmatic evaluation.
When model output is complex or has no one right answer, a separate ML program called an autorater sometimes performs the automatic evaluation.
Contrast with human evaluation.
Created for this library
An LLM team runs automatic evaluation nightly against a held-out prompt set so model regressions are caught before any human review.
A search-quality team relies on automatic evaluation with NDCG to compare ranking variants quickly before sending the top candidates for human rating.
A translation vendor uses automatic evaluation with BLEU as a fast first pass before paying for full human assessment on new language pairs.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License