Glossary term
Evaluation and Benchmarks
ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) is a dataset for evaluating an LLM's ability to perform commonsense reasoning. Each example in the dataset contains three components:
A paragraph or two from a news article.
A query in which one of the entities explicitly or implicitly identified in the passage is masked.
The answer (the name of the entity that belongs in the mask).
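The three components above can be sketched as a small data structure. This is a minimal illustration, not the dataset's official schema: the class name `RecordExample`, the helper functions, the `@placeholder` mask token, and the sample passage are all assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Assumed mask token for this sketch; the glossary text does not name one.
MASK = "@placeholder"


@dataclass(frozen=True)
class RecordExample:
    """Hypothetical container for one ReCoRD-style example."""
    passage: str  # a paragraph or two from a news article
    query: str    # a statement with one entity replaced by MASK
    answer: str   # the name of the entity that belongs in the mask


def fill_mask(example: RecordExample, candidate: str) -> str:
    """Return the query with the mask replaced by a candidate entity."""
    return example.query.replace(MASK, candidate)


def is_correct(example: RecordExample, prediction: str) -> bool:
    """Compare prediction and answer, ignoring case and outer whitespace."""
    return prediction.strip().lower() == example.answer.strip().lower()


# Invented example text, for illustration only.
example = RecordExample(
    passage=(
        "Acme Corp said on Monday that Jane Doe will become its chief "
        "executive. Doe previously ran the company's research division."
    ),
    query="@placeholder previously led research at Acme Corp.",
    answer="Jane Doe",
)

filled = fill_mask(example, "Jane Doe")
```

Here the model must infer from the passage that the masked entity is the person who ran the research division; `is_correct(example, prediction)` then checks the model's choice against the stored answer.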
See ReCoRD for an extensive list of examples.
ReCoRD is a component of the SuperGLUE ensemble.
Created for this library
An LLM evaluation team includes ReCoRD in its standard reasoning benchmark suite to test commonsense reading comprehension.
A research lab reports ReCoRD scores in model cards so downstream users can compare commonsense reasoning across model versions.
A model release team uses ReCoRD as one of several reading-comprehension benchmarks that gate promotion to production.
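As a rough sketch of how a score from such a benchmark might feed a release gate, the snippet below computes exact-match accuracy over a batch of predictions and compares it with a threshold. The function names, the normalization rule, and the 0.85 threshold are hypothetical choices for illustration; a real pipeline would define its own metric and cutoff.

```python
def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions equal to the answer (case-insensitive)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


def passes_gate(score: float, threshold: float = 0.85) -> bool:
    """Return True if the score meets the (hypothetical) release threshold."""
    return score >= threshold


# Invented predictions and answers, for illustration only.
score = exact_match_accuracy(
    ["Jane Doe", "Paris", "Berlin"],
    ["jane doe", "Paris", "Rome"],
)
```

With two of three predictions matching, `score` is 2/3, so `passes_gate(score)` is false under the default threshold and true under a looser one such as `passes_gate(score, threshold=0.5)`.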
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License