Glossary term
Evaluation and Benchmarks
BLEURT is a metric for evaluating machine translations from one language to another, particularly to and from English.
For translations to and from English, BLEURT aligns more closely with human ratings than BLEU does. Unlike BLEU, which counts overlapping word sequences, BLEURT emphasizes semantic (meaning) similarities and can accommodate paraphrasing.
BLEURT relies on a pre-trained language model (BERT, to be exact) that is then fine-tuned on human ratings of translation quality.
The original paper on this metric is "BLEURT: Learning Robust Metrics for Text Generation".
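To illustrate why a meaning-aware metric can help, the following minimal sketch scores two candidate translations with a simplified clipped bigram precision. This is an assumption-laden stand-in for BLEU (no brevity penalty, a single n-gram order), and it does not run BLEURT itself. The function name `ngram_precision` and the example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision: a simplified stand-in for BLEU, not full BLEU."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Count every n-gram in the candidate and in the reference.
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


reference = "the meeting was postponed until next week"
# Nearly identical wording, but the meaning changes (week -> month).
literal = "the meeting was postponed until next month"
# Different wording, same meaning.
paraphrase = "the meeting has been pushed back to next week"

literal_score = ngram_precision(literal, reference)
paraphrase_score = ngram_precision(paraphrase, reference)

print(f"{literal_score:.2f}")     # 0.83
print(f"{paraphrase_score:.2f}")  # 0.25
```

The overlap-based score rewards the candidate that changes the meaning and penalizes the faithful paraphrase. A learned metric such as BLEURT is intended to reduce this kind of mismatch, although how it would score these particular sentences is not shown here.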
Created for this library
A translation vendor adopts BLEURT alongside BLEU because BLEURT better captures meaning preservation on free-form translations of marketing copy.
A multilingual search team uses BLEURT to score paraphrased query rewrites against the original intent across European languages.
A localization team reports BLEURT as a complementary metric to BLEU when comparing translation models for natural-sounding customer-facing strings.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License