Glossary term
Evaluation and Benchmarks
A dataset for evaluating an LLM's proficiency in answering yes-or-no questions. Each of the challenges in the dataset has three components:
A query.
A passage implying the answer to the query.
The correct answer, which is either yes or no.
For example:
Query: Are there any nuclear power plants in Michigan?
Passage: ...three nuclear power plants supply Michigan with about 30% of its electricity.
Correct answer: Yes
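The three components above can be sketched as a small record type, with accuracy as the score. This is a minimal illustration, not the official data loader or evaluation harness: the names `BoolQExample`, `accuracy`, and `always_yes` are hypothetical, and storing the answer as a boolean is an assumption made for convenience.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BoolQExample:
    """One challenge: a query, a supporting passage, and the correct answer."""
    query: str
    passage: str
    answer: bool  # Assumption: True stands for "yes", False for "no".


def accuracy(examples: Iterable[BoolQExample],
             predict: Callable[[str, str], bool]) -> float:
    """Return the fraction of examples where predict() matches the answer."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex.query, ex.passage) == ex.answer for ex in examples)
    return correct / len(examples)


# The example from the glossary entry, expressed as a record.
example = BoolQExample(
    query="Are there any nuclear power plants in Michigan?",
    passage="...three nuclear power plants supply Michigan with about 30% of its electricity.",
    answer=True,
)


def always_yes(query: str, passage: str) -> bool:
    # Hypothetical placeholder baseline standing in for a real model call.
    return True


score = accuracy([example], always_yes)
```

In practice, `predict` would wrap a call to the model under evaluation and map its output to yes or no; the always-yes baseline is only there to make the sketch runnable.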
Researchers gathered the questions from anonymized, aggregated Google Search queries and then used Wikipedia pages to ground the information.
For more information, see BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
BoolQ is one of the tasks in the SuperGLUE benchmark suite.
Created for this library
A research team uses the BoolQ benchmark to evaluate yes-or-no question answering as part of a model release readiness check.
An LLM evaluation team includes BoolQ scores in its model card to give downstream developers a quick view of reading comprehension on yes-or-no questions.
A vendor reports BoolQ results for its open-weights LLM in its model release notes so enterprise buyers can compare yes-or-no question-answering accuracy between checkpoints.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License