Glossary term
Agentic Systems
A process in which people judge the quality of an ML model's output; for example, having bilingual people judge the quality of an ML translation model. Human evaluation is particularly useful for judging models that have no one right answer.
Contrast with automatic evaluation and autorater evaluation.
Created for this library
A search-quality team runs human evaluation each release on a curated set of queries to validate offline metrics before launching.
An LLM product team has paid raters perform human evaluation on a sample of generated answers before promoting any prompt change.
A translation vendor has professional translators perform human evaluation on a sample of source-translation pairs before standardizing on a new model.
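As a rough illustration of the workflows above, the sketch below averages rater scores per item and checks the overall mean against a release threshold. The 1 to 5 score scale, the function names `aggregate_ratings` and `passes_gate`, the item and rater IDs, and the threshold values are all hypothetical and do not come from the glossary definition; real programs typically also track rater agreement and sample size.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Average the scores each item received across raters.

    `ratings` is an iterable of (item_id, rater_id, score) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    scores_by_item = defaultdict(list)
    for item_id, _rater_id, score in ratings:
        scores_by_item[item_id].append(score)
    return {item_id: mean(scores) for item_id, scores in scores_by_item.items()}


def passes_gate(item_means, threshold):
    """Return True when the mean of the per-item means reaches the threshold."""
    return mean(item_means.values()) >= threshold


# Hypothetical ratings: two raters score two generated answers on a 1-5 scale.
ratings = [
    ("answer-1", "rater-a", 4),
    ("answer-1", "rater-b", 5),
    ("answer-2", "rater-a", 2),
    ("answer-2", "rater-b", 3),
]

item_means = aggregate_ratings(ratings)
print(item_means)
print(passes_gate(item_means, threshold=3.0))
```

Averaging per item first keeps an item with many ratings from dominating the overall score; whether that weighting is appropriate depends on how the sample was drawn.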
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License