Glossary term
Training and Fine-Tuning
Data used to evaluate model performance before release or approval of a material change. Test data should reflect intended operating conditions and important edge cases. Results should be reproducible and tied to acceptance criteria that the system must meet before that approval is granted.
The MMLU benchmark, released by Hendrycks et al. (2020), is a widely used multi-task test dataset for measuring LLM general knowledge.
The GLUE and SuperGLUE benchmarks keep their test-set labels private and score submissions through an evaluation server, so natural language understanding results are measured on held-out data.
Hugging Face's Open LLM Leaderboard automates evaluation of open-weight models across multiple benchmark test sets.
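As a rough illustration of tying test results to acceptance criteria, the sketch below scores predictions against test labels and checks the measured metrics against minimum thresholds. The names `Criterion`, `evaluate_accuracy`, and `release_gate`, the metric names, and the threshold values are all hypothetical and chosen only for this example. They do not come from any particular benchmark or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion: a metric must reach at least `minimum`."""
    metric: str
    minimum: float


def evaluate_accuracy(predictions, labels):
    """Fraction of predictions that match the test labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("test set is empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def release_gate(results, criteria):
    """Return (approved, failures) for measured results against criteria.

    A metric that was never measured counts as a failure, so a missing
    test cannot silently pass the gate.
    """
    failures = [
        c.metric
        for c in criteria
        if results.get(c.metric, float("-inf")) < c.minimum
    ]
    return (not failures, failures)


# Hypothetical test labels and model outputs, for illustration only.
labels = ["a", "b", "a", "c", "b"]
predictions = ["a", "b", "c", "c", "b"]

results = {"accuracy": evaluate_accuracy(predictions, labels)}
criteria = [Criterion("accuracy", 0.75), Criterion("edge_case_accuracy", 0.60)]
approved, failures = release_gate(results, criteria)
```

In this sketch the gate rejects the release because no edge-case metric was reported, even though overall accuracy clears its threshold. Because the scoring is deterministic and the test set is fixed, rerunning it on the same predictions yields the same decision.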