Glossary term
Evaluation and Benchmarks
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a family of metrics that measure recall-oriented n-gram overlap between a generated summary and one or more reference summaries.
ROUGE-L, the variant based on the longest common subsequence, is used to evaluate summarisation models on the CNN/DailyMail benchmark. Claude 3.5 Sonnet achieves a ROUGE-L of 42.1 on this benchmark, a figure enterprises use to compare summarisation providers.
Microsoft Azure's document summarisation service reports ROUGE scores in its evaluation dashboard. News agencies use these scores to validate AI summarisation quality before replacing human abstractors.
Meta's BART reports a ROUGE-1 of 45.14 on XSum (extreme summarisation) and 44.16 on CNN/DailyMail. Academic researchers developing new summarisation architectures commonly use it as a baseline model.
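To make the definition concrete, the following is a minimal sketch of ROUGE-N recall (clipped n-gram overlap) and ROUGE-L recall (longest common subsequence), assuming lowercased whitespace tokenisation and a single reference. The function names are illustrative and do not belong to any particular library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram matches divided by n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(candidate.lower().split(), ref) / len(ref)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 reference bigrams matched
rl = rouge_l_recall(candidate, reference)       # LCS "the cat on the mat" has length 5
```

Published ROUGE figures are usually F-measures computed with a standard toolkit, often with stemming and a specific tokeniser, so this sketch illustrates the idea and should not be expected to reproduce leaderboard numbers exactly.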