Glossary term
Evaluation and Benchmarks
BLEU (Bilingual Evaluation Understudy): a metric that scores generated text by its n-gram overlap with one or more reference texts, used mainly to evaluate machine translation.
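As a rough illustration of what "n-gram overlap" means here, the sketch below computes an unsmoothed, sentence-level BLEU score: the geometric mean of clipped n-gram precisions for n = 1 to 4, multiplied by a brevity penalty that punishes candidates shorter than the reference. The function names (`ngrams`, `bleu`) and the whitespace tokenization are assumptions made for this example, not part of any particular toolkit. Published scores are normally computed at corpus level with a standard tool such as sacreBLEU and reported on a 0-100 scale, so this should be read as a sketch of the idea rather than a reference implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] with uniform n-gram weights.

    `candidate` is a list of tokens; `references` is a list of token lists.
    Illustrative helper only: no smoothing, so any n-gram order with zero
    matches makes the whole score 0.
    """
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:  # candidate is shorter than n tokens
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference,
        # so repeating a correct word cannot inflate precision.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)

    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    candidate = "the cat sat on a mat".split()
    print(f"BLEU = {bleu(candidate, [reference]):.3f}")
```

Multiplying the result by 100 gives the scale on which BLEU scores are usually quoted.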
Google Translate reports BLEU scores internally to track translation quality improvements across 133 language pairs; a sustained improvement of 2 or more BLEU points typically triggers a production model update.
The WMT shared task, an annual machine translation competition, uses BLEU to rank system outputs; the 2023 WMT English-German winner achieved a BLEU score of 38.4, setting a new state-of-the-art (SOTA) baseline.
Amazon Translate's quality metric dashboard exposes BLEU scores to enterprise customers comparing custom and general translation models, so they can validate domain-specific fine-tuning before deployment.