Glossary term
Glossary term
Evaluation and Benchmarks
A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text. Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.
Each ROUGE family member typically generates the following metrics:
Precision
Recall
F1
Note: ROUGE uses precision and recall somewhat differently than traditional precision and recall.
For details and examples, see:
Note: BLEU and BLEURT optimize for precision while ROUGE optimizes for recall. Consequently, BLEU and BLEURT are better metrics for evaluating machine translation (since the focus is precision) while ROUGE is a better metric for summarization (since the focus is recall).
Created for this library
A summarization team reports ROUGE scores against editor-written reference summaries to compare model versions.
A research team reports ROUGE in its preprint so other researchers can compare summarization performance.
A news platform uses ROUGE as an offline metric to compare summarization models before paying for human review.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License