Glossary term
Evaluation and Benchmarks
ROUGE-N is a set of metrics within the ROUGE family that compare the shared N-grams of a given size between the reference text and the generated text. For example:
ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.
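As a rough illustration of what "shared N-grams" means, the following minimal sketch counts them for N = 1, 2, and 3. It assumes whitespace tokenization, lowercasing, and clipped counting (a repeated N-gram is credited only as many times as it appears in both texts); the function names `ngrams` and `shared_ngram_count` and the two sentences are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams (as tuples) in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shared_ngram_count(reference, generated, n):
    """Count n-grams present in both texts, clipping repeats to the smaller count."""
    ref = ngrams(reference.lower().split(), n)  # assumption: whitespace tokens
    gen = ngrams(generated.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    return sum((ref & gen).values())


# Hypothetical texts, chosen only for illustration.
reference = "the cat sat on the mat"
generated = "the cat lay on the mat today"

for n in (1, 2, 3):
    print(f"ROUGE-{n} shared {n}-grams:", shared_ngram_count(reference, generated, n))
# prints 5, 3, and 1 shared n-grams for N = 1, 2, and 3 respectively
```

Production ROUGE implementations typically add their own tokenization and optional stemming, so their counts can differ from this sketch.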
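You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:
$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}}{\text{number of N-grams in the reference text}}$$
$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}}{\text{number of N-grams in the generated text}}$$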
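You can then use F1 to roll up ROUGE-N recall and ROUGE-N precision into a single metric:
$$\text{ROUGE-N F}_1 = \frac{2 \cdot \text{ROUGE-N recall} \cdot \text{ROUGE-N precision}}{\text{ROUGE-N recall} + \text{ROUGE-N precision}}$$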
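The arithmetic can be sketched in a few lines. This is an illustrative helper, not a library API: the name `rouge_n_scores` and the example counts are assumptions, and the zero-division guards are a convention chosen here rather than part of the definition.

```python
def rouge_n_scores(matches, reference_total, generated_total):
    """Return (recall, precision, f1) from N-gram counts.

    matches:          number of N-grams shared by both texts
    reference_total:  number of N-grams in the reference text
    generated_total:  number of N-grams in the generated text
    """
    # Assumption: report 0.0 instead of raising when a text has no N-grams.
    recall = matches / reference_total if reference_total else 0.0
    precision = matches / generated_total if generated_total else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical ROUGE-2 counts: reference "the cat sat on the mat" has 5 bigrams,
# generated "the cat lay on the mat today" has 6, and 3 bigrams appear in both.
recall, precision, f1 = rouge_n_scores(matches=3, reference_total=5, generated_total=6)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
# prints recall=0.600 precision=0.500 f1=0.545
```

Because recall divides by the reference length and precision by the generated length, a longer generated text lowers precision without changing recall, which is why the two are usually reported together or combined through F1.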
Created for this library
A summarization team reports ROUGE-1 and ROUGE-2 to compare model versions on unigram and bigram overlap with reference summaries.
A research team uses ROUGE-N as a baseline metric for summarization evaluation across model versions.
A news platform uses ROUGE-N as part of its weekly summarization quality monitoring.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License