Glossary term
Evaluation and Benchmarks
SuperGLUE
An ensemble of datasets for rating an LLM's overall ability to understand and generate text. The ensemble consists of the following datasets:
Boolean Questions (BoolQ)
CommitmentBank (CB)
Choice of Plausible Alternatives (COPA)
Multi-sentence Reading Comprehension (MultiRC)
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)
Recognizing Textual Entailment (RTE)
Words in Context (WiC)
Winograd Schema Challenge (WSC)
For details, see SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.
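Because SuperGLUE is an ensemble, results are usually summarized as a single overall score. The sketch below follows the convention described in the SuperGLUE paper: average the metrics within each task, then take an unweighted average across tasks. The function name overall_score and all numbers are hypothetical placeholders, not reported results, and only a subset of tasks is shown for brevity.

```python
from statistics import mean


def overall_score(task_metrics):
    """Average each task's metrics, then average the task scores (unweighted).

    task_metrics maps a task name to a list of metric values on a 0-100 scale.
    Tasks reported with two metrics (for example MultiRC) contribute their mean.
    """
    if not task_metrics:
        raise ValueError("no task scores supplied")
    per_task = [mean(values) for values in task_metrics.values()]
    return mean(per_task)


# Placeholder numbers for illustration only; these are not real results.
example = {
    "COPA": [90.0],           # single metric
    "MultiRC": [80.0, 60.0],  # two metrics, averaged to 70.0
    "RTE": [80.0],            # single metric
}
print(overall_score(example))
```

Averaging within a task first keeps a two-metric task from counting twice in the overall score.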
Created for this library
An LLM evaluation team uses SuperGLUE in its standard NLU benchmark suite for model release reviews.
A research lab reports SuperGLUE scores in model cards so downstream users can compare NLU performance across versions.
A model release team uses SuperGLUE as a baseline NLU benchmark that a model must pass before it is promoted to production.
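A release gate of the kind described in these examples could be sketched as a per-task comparison between a baseline model and a candidate. This is a minimal illustration, not a standard procedure: the function names, the 0.5-point tolerance, and the scores are all assumptions.

```python
def find_regressions(baseline, candidate, tolerance=0.5):
    """Return tasks where the candidate is missing or trails the baseline.

    baseline and candidate map task names to scores on a 0-100 scale.
    tolerance is an assumed allowance (in points) for run-to-run noise.
    """
    regressions = {}
    for task, base_score in baseline.items():
        cand_score = candidate.get(task)
        if cand_score is None or cand_score < base_score - tolerance:
            regressions[task] = (base_score, cand_score)
    return regressions


def passes_gate(baseline, candidate, tolerance=0.5):
    """The candidate passes only if no task regresses beyond the tolerance."""
    return not find_regressions(baseline, candidate, tolerance)


# Placeholder scores for illustration only; these are not real results.
baseline = {"COPA": 90.0, "RTE": 80.0}
candidate = {"COPA": 91.0, "RTE": 78.0}
print(find_regressions(baseline, candidate))
print(passes_gate(baseline, candidate))
```

Checking each task separately, rather than only the overall average, keeps a large gain on one dataset from hiding a drop on another.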
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License