Glossary term
Evaluation and Benchmarks
Comparing the quality of two models by judging their responses to the same prompt. For example, suppose the following prompt is given to two different models:
Create an image of a cute dog juggling three balls.
In a side-by-side evaluation, a rater would pick which image was "better" (More accurate? More beautiful? Cuter?).
Created for this library
An LLM team runs side-by-side evaluation where raters compare two candidate models on the same prompts and pick the preferred response.
A search-quality team runs side-by-side evaluation on a fixed set of queries, with raters comparing the results of two rankers in a controlled review.
A translation team has professional translators perform side-by-side evaluation of two model variants before production rollout.
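The scenarios above all reduce to collecting one preference per prompt and summarizing the results. The following is a minimal sketch of that tally, not a prescribed method: the function name `summarize_side_by_side`, the labels "A", "B", and "tie", and the sample judgments are all hypothetical and chosen for illustration. It assumes each rater judgment has already been recorded as one of those three labels.

```python
from collections import Counter

VALID_LABELS = {"A", "B", "tie"}  # hypothetical label set for this sketch


def summarize_side_by_side(judgments):
    """Tally rater preferences between model A and model B.

    `judgments` is an iterable of "A", "B", or "tie", one per prompt.
    The win rate for A is computed over decided (non-tie) judgments only.
    """
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {unknown!r}")

    decided = counts["A"] + counts["B"]
    # No win rate is defined if every judgment was a tie (or there were none).
    win_rate_a = counts["A"] / decided if decided else None
    return {
        "A": counts["A"],
        "B": counts["B"],
        "tie": counts["tie"],
        "win_rate_a": win_rate_a,
    }


# Illustrative data only: seven prompts, one rater judgment each.
judgments = ["A", "B", "A", "tie", "A", "B", "A"]
summary = summarize_side_by_side(judgments)
print(summary)
```

Excluding ties from the denominator is one design choice among several; counting a tie as half a win for each model is another common convention. In practice, raters are usually shown the two responses in randomized order so that position does not bias the preference.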
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License