Common Crawl

Petabyte-scale dataset of web content used as a primary pre-training corpus for most large language models.

1. Common Crawl is a non-profit that publishes about 3.7 billion web pages per monthly crawl. Filtered versions of this data form the primary pre-training corpus for GPT-3, Llama, Falcon, and most open-source LLMs, typically after deduplication and quality filtering.
2. C4 (Colossal Clean Crawled Corpus) is a 750 GB filtered version of Common Crawl used to pre-train T5 and many academic LLMs. Its filtering removes boilerplate, duplicate text, and adult content from the raw crawl.
3. FineWeb (Hugging Face, 2024) is a 15-trillion-token filtered Common Crawl dataset with per-page quality scores. It is reported to outperform Dolma and RefinedWeb as a training corpus for Llama-scale models.
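
The three filtering steps named above (boilerplate removal, adult-content removal, deduplication) can be illustrated with a toy sketch. This is not the actual C4 or FineWeb pipeline: the function names, the one-word `BLOCKLIST`, and the thresholds are all hypothetical, and the deduplication is exact-match only, whereas production pipelines usually add fuzzy matching.

```python
import hashlib

# Hypothetical placeholder; real pipelines use long curated word lists.
BLOCKLIST = {"badword"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3  # assumed threshold, chosen for illustration only


def clean_page(text):
    """Keep sentence-like lines; return None if the whole page should be dropped."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Boilerplate heuristic: menus and buttons tend to be short
        # and to lack terminal punctuation.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        kept.append(line)
    if not kept:
        return None
    cleaned = "\n".join(kept)
    # Content filter: drop the page if any word is on the blocklist.
    words = {w.strip(".,!?\"'").lower() for w in cleaned.split()}
    if words & BLOCKLIST:
        return None
    return cleaned


def filter_corpus(pages):
    """Clean each page, then drop exact duplicates (case and whitespace ignored)."""
    seen = set()
    result = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None:
            continue
        normalized = " ".join(cleaned.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        result.append(cleaned)
    return result


if __name__ == "__main__":
    pages = [
        "Home | About | Contact\nCommon Crawl publishes web archives every month.\nClick here",
        "common crawl publishes web archives every month.",
        "This page contains a badword in the text.",
        "Login",
    ]
    for page in filter_corpus(pages):
        print(page)
```

Of the four sample pages, only the first survives, reduced to its single sentence-like line: the second is a case-insensitive duplicate, the third hits the blocklist, and the fourth is pure boilerplate.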
