Glossary term
Training and Fine-Tuning
Training data is the material an AI system learns from: text, documents, conversations, or any other examples that show the model how language works and what to expect. In general, higher-quality and more representative training data produces a more accurate and reliable model.
Common Crawl (a public web-crawl archive), The Pile, and RedPajama are large open text collections widely used to train LLMs.
LAION-5B is a public training-data set of image-text pairs used to train models like Stable Diffusion.
Scale AI, Surge AI, and Toloka are commercial vendors that supply human-labeled and curated training data to AI developers, reportedly including OpenAI, Anthropic, and Meta.