Glossary term
Foundations
Process of removing duplicate or near-duplicate documents from training corpora to improve model quality and reduce memorisation.
MinHash deduplication of Common Crawl removes 70% of documents as near-duplicates. Llama 2's training corpus was reduced from 4T raw tokens to 2T after aggressive deduplication, which improved generation diversity.
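To make the MinHash idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the pipeline used for any particular corpus: the function names, the 3-word shingle size, the 64 hash functions, and the 0.5 similarity threshold are all hypothetical choices, and the pairwise comparison loop stands in for the locality-sensitive hashing index that a production system would normally use.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (k is an illustrative choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.5, num_perm=64):
    """Keep a document only if it is not similar to any document already kept.

    Quadratic in the number of kept documents; assumed acceptable for a demo only.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc), num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Two documents with identical shingle sets always produce identical signatures, so exact duplicates are removed deterministically. For near-duplicates the estimate is probabilistic, and its variance shrinks as `num_perm` grows.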
DataComp (Gadre et al. 2024) shows that deduplication and quality filtering of LAION-2B improve CLIP model quality more than scaling up training compute, which motivates data-centric AI over compute scaling.
Exact substring deduplication using suffix arrays (Lee et al. 2022) removes verbatim repeated text from training data. Google has used it to reduce memorisation of sensitive personal information in LLM outputs.
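The suffix-array approach can be sketched in a few lines. The version below is a toy illustration rather than the implementation from Lee et al.: it works on whitespace-separated tokens, builds the suffix array by naive sorting, and uses hypothetical function names and a hypothetical `min_len` threshold. Sorting all suffixes places repeated sequences next to each other, so any run of adjacent suffixes sharing at least `min_len` leading tokens marks a repeat. The earliest occurrence is kept and the later ones are dropped.

```python
def build_suffix_array(tokens):
    """Naive construction by sorting suffixes; fine for a demo, too slow for a corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(tokens, i, j):
    """Length of the shared prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(tokens) and j + n < len(tokens) and tokens[i + n] == tokens[j + n]:
        n += 1
    return n


def repeated_positions(tokens, min_len):
    """Return token positions covered by non-first occurrences of repeats >= min_len."""
    sa = build_suffix_array(tokens)
    drop = set()

    def flush(run):
        # Every suffix in a run shares at least min_len leading tokens.
        if len(run) > 1:
            first = min(run)  # keep the earliest occurrence
            for start in run:
                if start != first:
                    drop.update(range(start, start + min_len))

    run = sa[:1]
    for prev, cur in zip(sa, sa[1:]):
        if common_prefix_len(tokens, prev, cur) >= min_len:
            run.append(cur)
        else:
            flush(run)
            run = [cur]
    flush(run)
    return drop


def remove_repeats(tokens, min_len):
    """Drop later copies of any token sequence repeated verbatim for >= min_len tokens."""
    drop = repeated_positions(tokens, min_len)
    return [tok for i, tok in enumerate(tokens) if i not in drop]
```

For example, `remove_repeats("alpha SHARED BOILERPLATE TEXT beta SHARED BOILERPLATE TEXT gamma".split(), 3)` keeps the first `SHARED BOILERPLATE TEXT` and removes the second. Self-overlapping repeats (such as a long run of one token) can be trimmed more aggressively than intended in this sketch, which a real pipeline would need to handle explicitly.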