The Pile

An 825 GiB open-source pre-training dataset curated by EleutherAI from 22 diverse text sources.

1.
The Pile (EleutherAI, 2021) combines GitHub, arXiv, PubMed, Wikipedia, Books3, and 17 other sources into a diverse 825 GiB training corpus. It was used to pre-train the GPT-Neo, GPT-NeoX-20B, and Pythia models.
2.
Pythia (EleutherAI) is a suite of LLMs in 8 sizes, from 70M to 12B parameters, trained on The Pile. Public checkpoints are available every 1,000 training steps, plus log-spaced early checkpoints up to step 512. Alignment researchers use them to study memorisation, learning dynamics, and bias evolution.
3.
The Pile's Books3 component became the subject of copyright litigation in 2023. Meta, Google, and EleutherAI have all faced suits over book datasets, which has helped motivate a shift toward licensed or synthetic training data.
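
To make the Pythia checkpoint schedule concrete, here is a minimal Python sketch that enumerates the training steps at which checkpoints are assumed to exist. It assumes the schedule described in the Pythia release: step 0, log-spaced steps 1 to 512, then every 1,000 steps up to step 143,000. The function names `pythia_checkpoint_steps` and `revision_name` are hypothetical helpers invented for illustration, not part of any library.

```python
def pythia_checkpoint_steps(final_step: int = 143_000, interval: int = 1_000) -> list[int]:
    """Return the training steps assumed to have a saved checkpoint.

    Assumed schedule: step 0, powers of two from 1 to 512,
    then every `interval` steps up to and including `final_step`.
    """
    early = [0] + [2 ** i for i in range(10)]  # 0, 1, 2, 4, ..., 512
    regular = list(range(interval, final_step + 1, interval))  # 1000, 2000, ..., final_step
    return early + regular


def revision_name(step: int) -> str:
    """Format a step as a revision label such as 'step3000' (assumed naming)."""
    return f"step{step}"


if __name__ == "__main__":
    steps = pythia_checkpoint_steps()
    print(len(steps))                                # total number of checkpoints
    print([revision_name(s) for s in steps[:12]])    # the early, log-spaced part
```

With the Hugging Face transformers library, a label like this is typically passed as the `revision` argument of `from_pretrained`, for example `GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m", revision="step3000")`. Check the model card for the exact model identifiers and revision names before relying on them.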
