Glossary term
Evaluation and Benchmarks
Metric measuring how well a language model predicts a held-out text corpus; lower perplexity indicates better language modelling.
One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.
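Perplexity is related to cross-entropy as follows:
$$P = 2^{H}$$
where $H$ is the cross-entropy measured in bits, that is, the average negative log2-probability the model assigns to each held-out token.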
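A minimal Python sketch of this relation, assuming the standard definition in which perplexity is 2 raised to the cross-entropy in bits. The function names and the example probabilities are illustrative assumptions, not part of the glossary source.

```python
import math

def cross_entropy_bits(token_probs):
    """Average negative log2-probability the model assigned to each observed token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity as 2 raised to the cross-entropy measured in bits."""
    return 2 ** cross_entropy_bits(token_probs)

# Hypothetical probabilities a model assigned to four held-out tokens.
# A model that is always torn between 4 equally likely options has perplexity 4,
# which matches the "number of guesses" intuition in the keyboard example.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))  # 4.0
```

When the loss is logged in nats rather than bits, as is common in training frameworks, the same quantity would be `math.exp(loss)` instead.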
GPT-3 reports a zero-shot perplexity of 20.5 on the Penn Treebank corpus, versus 35.76 for the largest GPT-2 model. Model developers use perplexity as a fast training signal before running expensive downstream benchmarks.
Llama 3.1 8B achieves lower perplexity than Llama 2 7B on the same validation set. Meta uses this comparison to confirm language model quality improvements before running full MMLU and HumanEval evaluations.
Perplexity is used as a filtering metric in dataset curation: web-crawled text whose perplexity falls outside a normal range is removed from training data, since very low scores often indicate duplicated or boilerplate text and very high scores often indicate random noise.
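The filtering idea can be sketched as follows. This is a toy illustration only: it scores documents with an add-one-smoothed unigram model built from a small reference text, standing in for the real language model a curation pipeline would use. The function names, the reference text, and the `low` and `high` thresholds are all hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total, vocab_size):
    """Perplexity of `text` under an add-one-smoothed unigram model (toy stand-in for a real LM)."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    # The extra +1 in the denominator reserves probability mass for unseen tokens.
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size + 1)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))

def filter_by_perplexity(docs, reference_text, low, high):
    """Keep documents whose perplexity lies in [low, high]; both bounds are hypothetical."""
    counts = Counter(reference_text.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts)
    return [
        d for d in docs
        if low <= unigram_perplexity(d, counts, total, vocab_size) <= high
    ]

reference = "the cat sat on the mat and the dog sat on the rug"
docs = [
    "the the the the",         # repetitive: perplexity below the range
    "the dog sat on the mat",  # ordinary text: perplexity inside the range
    "xq zzv qpl wrk",          # noise: perplexity above the range
]
kept = filter_by_perplexity(docs, reference, low=5.0, high=15.0)
print(kept)  # ['the dog sat on the mat']
```

In practice the thresholds would be chosen from the perplexity distribution of the crawled corpus itself rather than fixed in advance.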
Created for this library
An NLP team reports perplexity on a held-out corpus to compare candidate language models during pretraining.
A research team monitors perplexity during pretraining to gauge how quickly the model learns the underlying data distribution.
An LLM team uses perplexity as a diagnostic during pretraining and complementary task-specific metrics during fine-tuning.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License