Glossary term
Evaluation and Benchmarks
Acronym for Stanford Question Answering Dataset, introduced in the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The questions in this dataset come from people posing questions about Wikipedia articles. Some of the questions in SQuAD have answers, but other questions intentionally don't have answers (the unanswerable questions were added in SQuAD 2.0, described in the follow-up paper Know What You Don't Know: Unanswerable Questions for SQuAD). Therefore, you can use SQuAD to evaluate an LLM's ability to do both of the following:
Answer questions that can be answered.
Identify questions that cannot be answered.
Exact match and F1 are the most common metrics for evaluating LLMs against SQuAD.
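As a rough illustration of how these two metrics are typically computed for a single question, here is a minimal Python sketch. It is not the official evaluation script; the helper names (normalize, exact_match, f1_score) are hypothetical, and the normalization steps (lowercasing, dropping punctuation and English articles) and the convention of representing "no answer" as an empty string are assumptions modeled on common practice.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Assumption: an unanswerable question has an empty gold answer, so the
    # score is 1.0 only when the model also abstains (empty prediction).
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a prediction of "Eiffel Tower in Paris" against a gold answer of "Eiffel Tower" scores 0 on exact match but gets partial credit on F1, while an empty prediction on an unanswerable question scores 1 on both. A benchmark score would then be the average of these per-question values over the dataset, usually taking the best score when a question has several gold answers.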
Created for this library
An LLM evaluation team uses SQuAD in its standard reading-comprehension benchmark suite for model release reviews.
A research lab reports SQuAD scores in model cards so downstream users can compare reading comprehension across versions.
A model release team uses SQuAD as one of several reading-comprehension benchmarks gating production promotion.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License