Glossary term
Evaluation and Benchmarks
WikiLingua is a dataset for evaluating an LLM's ability to summarize short articles. WikiHow, an encyclopedia of articles explaining how to do various tasks, is the human-authored source for both the articles and the summaries. Each entry in the dataset consists of:
An article, which is created by appending each step of the prose (paragraph) version of the numbered list, minus the opening sentence of each step.
A summary of that article, consisting of the opening sentence of each step in the numbered list.
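The construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dataset authors' actual pipeline: the function names `split_first_sentence` and `build_entry`, the sample steps, and the naive regex-based sentence splitting are all hypothetical and chosen only for the example.

```python
import re


def split_first_sentence(step: str) -> tuple[str, str]:
    """Split one step's paragraph into (opening sentence, remaining text).

    Assumption: a sentence ends at the first '.', '!' or '?' that is
    followed by whitespace. Real text needs a proper sentence splitter.
    """
    parts = re.split(r"(?<=[.!?])\s+", step.strip(), maxsplit=1)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return first, rest


def build_entry(steps: list[str]) -> dict[str, str]:
    """Build an (article, summary) pair from the steps of a how-to guide."""
    opening_sentences = []
    remaining_text = []
    for step in steps:
        first, rest = split_first_sentence(step)
        opening_sentences.append(first)
        if rest:  # a step with only one sentence adds nothing to the article
            remaining_text.append(rest)
    return {
        # Article: each step's prose minus its opening sentence.
        "article": " ".join(remaining_text),
        # Summary: the opening sentence of each step.
        "summary": " ".join(opening_sentences),
    }


# Hypothetical two-step guide, invented for illustration.
steps = [
    "Boil the water. Use a large pot and fill it about two thirds full.",
    "Add the pasta. Stir occasionally so the pieces do not stick together.",
]
entry = build_entry(steps)
print(entry["summary"])  # Boil the water. Add the pasta.
print(entry["article"])
```

The sketch shows why the pairing works as a summarization target: the opening sentences are short, human-written statements of each step, and the article deliberately excludes them so the summary cannot be copied verbatim from the input.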
For details, see WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization.
Created for this library
An LLM evaluation team uses WikiLingua to measure cross-lingual summarization quality across language pairs.
A research lab reports WikiLingua scores in its model card so downstream users can compare multilingual summarization.
A multilingual NLP team uses WikiLingua as one of several benchmarks for cross-lingual generation quality.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License