Glossary term
Evaluation and Benchmarks
A set of manually curated data that captures ground truth. Teams can use one or more golden datasets to evaluate a model's quality.
Some golden datasets capture different subdomains of ground truth. For example, a golden dataset for image classification might capture lighting conditions and image resolution.
Created for this library
A search-quality team curates a golden dataset of human-rated queries to evaluate every new ranker before launch.
An LLM evaluation team maintains a golden dataset of prompts and expected outputs so model regressions are caught before any rollout.
A medical AI team curates a golden dataset of edge-case radiology images that clinicians find most informative for safety review.
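The evaluation workflow described in these examples can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `GoldenExample` record, its field names, the `evaluate` and `passes_gate` helpers, the toy data, and the exact-match scoring rule are all assumptions made for the sketch. It scores a model against a golden dataset, reports accuracy overall and per subdomain, and applies a simple pass/fail gate.

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class GoldenExample(NamedTuple):
    """One manually curated example. All field names are hypothetical."""
    input_text: str
    expected: str   # the ground-truth output
    subdomain: str  # the slice of ground truth this example covers


def evaluate(model: Callable[[str], str],
             golden: list[GoldenExample]) -> dict[str, float]:
    """Return exact-match accuracy overall and for each subdomain."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in golden:
        hit = int(model(ex.input_text) == ex.expected)
        # Count every example once overall and once in its own subdomain.
        # Assumes no subdomain is itself named "overall".
        for key in ("overall", ex.subdomain):
            correct[key] += hit
            total[key] += 1
    return {key: correct[key] / total[key] for key in total}


def passes_gate(scores: dict[str, float], threshold: float) -> bool:
    """True only if the overall score and every subdomain meet the threshold."""
    return all(score >= threshold for score in scores.values())


# Toy golden dataset with two subdomains (illustrative data only).
GOLDEN = [
    GoldenExample("2+2", "4", "arithmetic"),
    GoldenExample("3+5", "8", "arithmetic"),
    GoldenExample("capital of France", "Paris", "geography"),
    GoldenExample("capital of Japan", "Tokyo", "geography"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model; it misses one geography question."""
    answers = {"2+2": "4", "3+5": "8", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


scores = evaluate(toy_model, GOLDEN)
print(scores)
print("launch allowed:", passes_gate(scores, threshold=0.9))
```

Reporting a score per subdomain, rather than a single aggregate, is what lets a weak slice (here, geography) block a launch even when the overall number looks acceptable.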
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License