Glossary term
Evaluation and Benchmarks
A dataset prepared specifically to test model behavior under defined conditions. Good evaluation datasets include realistic, edge, adversarial, and underrepresented cases relevant to the intended use. Evaluation datasets should be protected from training contamination, versioned, and refreshed as deployment context changes.
MMLU, GSM8K, HumanEval, and BIG-Bench Hard are publicly available evaluation datasets widely used in LLM benchmarking.
BBQ (Bias Benchmark for QA), developed by researchers at New York University, and StereoSet are evaluation datasets for measuring social bias in language models.
Internal evaluation datasets built by enterprises typically include domain-specific edge cases, regulatory scenarios, and adversarial inputs collected through red teaming.
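Two of the practices named in the definition, protecting an evaluation dataset from training contamination and versioning it, can be sketched in a few lines of Python. This is a minimal illustration rather than a standard method: the function names, the 8-word n-gram window, and the 12-character hash length are all assumptions chosen for the example, and real contamination checks usually add text normalization and fuzzy matching.

```python
import hashlib
import json


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_items(eval_items, training_docs, n=8):
    """Return eval items that share at least one word n-gram with any training document.

    Hypothetical helper: exact n-gram overlap is only a rough proxy for contamination.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]


def dataset_version(eval_items):
    """Derive a short content hash so that any change to the items yields a new version id.

    Items are sorted first, so reordering alone does not change the version.
    """
    payload = json.dumps(sorted(eval_items), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Items flagged by `contaminated_items` would typically be removed or replaced before the dataset is used, and the value from `dataset_version` can be recorded alongside each evaluation result so that scores remain comparable only within the same dataset version.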