Glossary term
Evaluation and Benchmarks
Phenomenon where a model's training data includes test set examples, artificially inflating benchmark performance.
In the GPT-4 technical report, OpenAI acknowledged possible contamination on HumanEval, disclosing that some coding problems may appear in the training data and recommending contamination-aware evaluation.
Praveen et al. (2024) found evidence of MMLU benchmark contamination in multiple open-source LLMs, indicating that memorised answers, rather than genuine reasoning, explain the high scores of some models.
LiveBench (2024) releases new benchmark questions monthly, drawn from recent sources, to prevent contamination. It is used by model evaluation teams who distrust static benchmarks whose test set creation date falls near or before a model's training cutoff.
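The two ideas in the examples above, checking test items for overlap with training text and only trusting questions newer than the training cutoff, can be sketched in a few lines of Python. This is an illustrative heuristic under stated assumptions, not the method used by OpenAI, Praveen et al., or LiveBench: the function names, the word-level 8-gram window, and the 0.5 threshold are all hypothetical choices made for this sketch.

```python
from __future__ import annotations

from datetime import date


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_example: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the test example's n-grams that also occur in the training document."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(test_grams & ngrams(training_doc, n)) / len(test_grams)


def flag_contaminated(
    test_set: list[str],
    corpus: list[str],
    n: int = 8,
    threshold: float = 0.5,  # assumed cut-off; tune per benchmark
) -> list[str]:
    """Return test examples whose overlap with any corpus document meets the threshold."""
    return [
        example
        for example in test_set
        if any(overlap_ratio(example, doc, n) >= threshold for doc in corpus)
    ]


def released_after_cutoff(question_date: date, training_cutoff: date) -> bool:
    """True if a question was published strictly after the model's training cutoff."""
    return question_date > training_cutoff


# Toy data, invented for illustration only.
test_set = [
    "write a function that returns the sum of two integers a and b",
    "compute the factorial of n recursively",
]
corpus = [
    "blog post: write a function that returns the sum of two integers a and b in python",
]
flagged = flag_contaminated(test_set, corpus, n=5)
```

On the toy data, only the first test item is flagged, because every one of its 5-grams appears verbatim in the corpus document. Exact n-gram matching misses paraphrased or translated leaks, so a clean result from a check like this is weak evidence, which is part of why date-based benchmarks are attractive.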