Glossary term
Evaluation and Benchmarks
RTE (Recognizing Textual Entailment): a dataset for evaluating an LLM's ability to determine whether a hypothesis is entailed by (can be logically drawn from) a text passage. Each example in an RTE evaluation consists of three parts:
A passage, typically from news or Wikipedia articles
A hypothesis
The correct answer, which is either:
True, meaning the hypothesis can be entailed from the passage
False, meaning the hypothesis can't be entailed from the passage
For example:
Passage: The Euro is the currency of the European Union.
Hypothesis: France uses the Euro as currency.
Entailment: True, because France is part of the European Union.
RTE is a component of the SuperGLUE ensemble.
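The three-part example format described above can be sketched in code. The following is a minimal illustration, not an official RTE loader or scorer: the names RteExample, accuracy, and always_true are hypothetical, the second example is made up for contrast, and scoring by plain accuracy is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RteExample:
    """One RTE item: a passage, a hypothesis, and the gold label (hypothetical type)."""
    passage: str
    hypothesis: str
    entailed: bool  # True if the hypothesis can be entailed from the passage


def accuracy(examples: List[RteExample],
             predict: Callable[[str, str], bool]) -> float:
    """Return the fraction of examples where predict() matches the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        predict(ex.passage, ex.hypothesis) == ex.entailed for ex in examples
    )
    return correct / len(examples)


def always_true(passage: str, hypothesis: str) -> bool:
    # Placeholder baseline that ignores its inputs; a real evaluation
    # would query the model under test here instead.
    return True


examples = [
    # The example from the definition above.
    RteExample(
        passage="The Euro is the currency of the European Union.",
        hypothesis="France uses the Euro as currency.",
        entailed=True,
    ),
    # Made-up contrast example, for illustration only.
    RteExample(
        passage="The Euro is the currency of the European Union.",
        hypothesis="Japan uses the Euro as currency.",
        entailed=False,
    ),
]

score = accuracy(examples, always_true)
print(f"accuracy: {score:.2f}")
```

Swapping always_true for a function that prompts a model and parses its True/False answer would turn this into a small evaluation loop; the baseline is only there to keep the sketch self-contained.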
Created for this library
An LLM evaluation team includes RTE in its benchmark suite to measure textual entailment ability across model versions.
A research lab reports RTE scores in model cards so downstream users can compare reasoning ability across model versions.
A model release team uses RTE as one of several reasoning benchmarks to gate production promotion.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License