Glossary term
Evaluation and Benchmarks
A metric between 0.0 and 1.0 for evaluating machine translations, for example, from Spanish to Japanese.
To calculate a score, BLEU typically compares an ML model's translation (generated text) to a human expert's translation (reference text). The degree to which N-grams in the generated text and reference text match determines the BLEU score.
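The N-gram matching described above can be illustrated with a small sketch. It is a simplified, sentence-level approximation and not a reference implementation: it assumes whitespace tokenization, a single reference translation, N-grams up to length 4, no smoothing, and the standard brevity penalty from the original paper. The names `ngram_counts` and `simple_bleu` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(generated, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (illustrative only)."""
    gen_tokens = generated.split()   # assumption: whitespace tokenization
    ref_tokens = reference.split()
    if not gen_tokens:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        gen_counts = ngram_counts(gen_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        total = sum(gen_counts.values())
        # Clip each n-gram's count by how often it appears in the reference,
        # so repeating a matching word cannot inflate the score.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in gen_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any empty n-gram level gives a score of 0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize generated text shorter than the reference.
    if len(gen_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(gen_tokens))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))
    print(simple_bleu("the cat sat on the mat today", reference))
    print(simple_bleu("a dog ran in a park", reference))
```

An exact match scores 1.0, text sharing no words with the reference scores 0.0, and partial matches fall in between. Production evaluations usually compute BLEU over a whole test corpus with standardized tokenization, so scores from this sketch are not comparable to published figures.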
The original paper on this metric is "BLEU: a Method for Automatic Evaluation of Machine Translation."
See also BLEURT.
Created for this library
A translation vendor uses BLEU as a fast offline metric to compare candidate models before paying for human translator evaluation on key language pairs.
A localization team at a software company tracks its BLEU score weekly on a fixed test set to detect translation quality drift after each model update.
A multilingual customer support team uses BLEU on machine-translated agent replies to monitor quality across markets between human review cycles.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License