Glossary term
Evaluation and Benchmarks
Semantic similarity metric that uses BERT embeddings to compare generated and reference text, capturing meaning beyond n-gram overlap.
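The matching step behind this definition can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the three-dimensional vectors in TOY_EMBEDDINGS are invented for the example (real BERTScore uses contextual embeddings from a BERT-family model, and optionally IDF weighting, which is omitted here), and the function names greedy_match_score and unigram_overlap are hypothetical. Each candidate token is matched to its most similar reference token by cosine similarity to give precision, the reverse direction gives recall, and F1 combines the two.

```python
import math

# Hypothetical 3-dimensional token vectors, invented for illustration only.
# Real BERTScore uses contextual embeddings produced by a BERT-family model.
TOY_EMBEDDINGS = {
    "the": [1.0, 0.0, 0.0],
    "film": [0.0, 1.0, 0.1],
    "movie": [0.0, 0.9, 0.2],
    "was": [0.5, 0.0, 0.5],
    "great": [0.0, 0.2, 1.0],
    "excellent": [0.0, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference, embeddings=TOY_EMBEDDINGS):
    """Return (precision, recall, F1) from greedy cosine matching of tokens."""
    cand = [embeddings[tok] for tok in candidate.split()]
    ref = [embeddings[tok] for tok in reference.split()]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def unigram_overlap(candidate, reference):
    """Fraction of candidate tokens that appear verbatim in the reference."""
    cand = candidate.split()
    ref = set(reference.split())
    return sum(tok in ref for tok in cand) / len(cand)


if __name__ == "__main__":
    reference = "the film was great"
    paraphrase = "the movie was excellent"
    p, r, f1 = greedy_match_score(paraphrase, reference)
    print(f"embedding-based F1: {f1:.3f}")
    print(f"exact unigram overlap: {unigram_overlap(paraphrase, reference):.2f}")
```

With these toy vectors the paraphrase shares only half of its words with the reference, yet the embedding-based F1 stays close to 1 because "movie" and "film", and "excellent" and "great", sit near each other in the vector space. In practice the embeddings would come from a pretrained model rather than a hand-written table.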
Google's translation team uses BERTScore to evaluate long-form generation quality where BLEU fails: a paraphrase that preserves meaning but uses different words scores high on BERTScore and low on BLEU.
AllenNLP's EvalAll library integrates BERTScore for model comparison, and research teams at AI21 Labs use BERTScore alongside ROUGE to evaluate their summarisation and rewriting products.
GPT-4-based paraphrase generation achieves a BERTScore F1 of 0.93 on the PAWS dataset. Data augmentation pipelines use BERTScore in this way to validate that generated paraphrases preserve semantic meaning before adding them to training data.