Glossary term
Evaluation and Benchmarks
A condition in which evaluation data or answers influence training, tuning, prompt design, or model selection, so that reported results overstate real-world performance. Assess contamination risk when vendors publish high scores, or when internal test prompts are reused in tuning, demonstrations, or prompt engineering.
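One rough way to check whether internal test prompts have been reused is to measure word n-gram overlap between each evaluation item and the text used for tuning or demonstrations. The sketch below is illustrative only: the function names, the n-gram length, and the threshold are assumptions chosen for the example, not an established standard. Exact overlap also misses paraphrased or translated leaks.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(eval_items, tuning_docs, n=8, threshold=0.5):
    """Return (item, overlap) pairs whose n-gram overlap meets the threshold.

    n and threshold are illustrative defaults (assumptions), not tuned values.
    """
    # Collect every n-gram that appears anywhere in the tuning/demo text.
    seen = set()
    for doc in tuning_docs:
        seen |= ngrams(doc, n)

    flagged = []
    for item in eval_items:
        grams = ngrams(item, n)
        if not grams:
            continue  # item is shorter than n words and cannot be scored
        overlap = len(grams & seen) / len(grams)
        if overlap >= threshold:
            flagged.append((item, overlap))
    return flagged


if __name__ == "__main__":
    # Hypothetical data: one test prompt was copied into a demo prompt.
    tuning_docs = [
        "Demo prompt: if a train travels 60 miles in 2 hours, "
        "what is its average speed?"
    ]
    eval_items = [
        "If a train travels 60 miles in 2 hours, what is its average speed?",
        "Name the largest planet in the solar system and explain why.",
    ]
    for item, overlap in flag_contaminated(eval_items, tuning_docs, n=5):
        print(f"{overlap:.2f}  {item}")
```

A flagged item is a prompt for manual review, not proof of contamination; a low score likewise does not rule it out.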
Zhou et al. (2023), "Don't Make Your LLM an Evaluation Benchmark Cheater," documented widespread benchmark contamination in open models.
Sainz et al. (2023), "NLP Evaluation in Trouble," surveyed contamination across MMLU, GSM8K, and HumanEval in leading models.
Scale AI's SEAL leaderboards (2024) introduced held-out private evaluation sets specifically to mitigate contamination.