Pre-Training

Initial large-scale training of a model on broad data (typically next-token prediction or masked-language modelling) before any task-specific fine-tuning.

The initial training of a model on a large dataset. Some pre-trained models are clumsy giants and must typically be refined through additional training. For example, ML experts might pre-train a large language model on a vast text dataset, such as all the English pages in Wikipedia. Following pre-training, the resulting model might be further refined through any of the following techniques:

distillation

fine-tuning

instruction tuning

parameter-efficient tuning

prompt-tuning

Examples

1.
GPT-4's pre-training on 13 trillion tokens from web crawls, books, and code took approximately $100M in compute - establishing the broad world knowledge that downstream fine-tuning and prompting build on.
2.
Meta pre-trained Llama 3.1 405B on 15 trillion tokens of curated multilingual text and code - the pre-training run used 16,384 H100 GPUs for 77 days, producing the open-weight model released publicly.
3.
MedPaLM 2 (Google) is pre-trained on a broad corpus then specialised with medical text fine-tuning - pre-training provides general language understanding while subsequent steps add clinical knowledge.

Real-world uses

Created for this library

1.
An LLM team pre-trains a foundation model on a broad web corpus before fine-tuning it for specific use cases.
2.
A medical NLP team pre-trains a transformer on a clinical notes corpus before fine-tuning task-specific heads for billing codes.
3.
A research lab pre-trains a base model on a curated multilingual corpus before downstream alignment training.

Back to glossary

Initial large-scale training of a model on broad data (typically next-token prediction or masked-language modelling) before any task-specific fine-tuning.

distillation

fine-tuning

instruction tuning

parameter-efficient tuning

prompt-tuning

Examples

1.
GPT-4's pre-training on 13 trillion tokens from web crawls, books, and code took approximately $100M in compute - establishing the broad world knowledge that downstream fine-tuning and prompting build on.
2.
Meta pre-trained Llama 3.1 405B on 15 trillion tokens of curated multilingual text and code - the pre-training run used 16,384 H100 GPUs for 77 days, producing the open-weight model released publicly.
3.
MedPaLM 2 (Google) is pre-trained on a broad corpus then specialised with medical text fine-tuning - pre-training provides general language understanding while subsequent steps add clinical knowledge.

Real-world uses

Created for this library

1.
An LLM team pre-trains a foundation model on a broad web corpus before fine-tuning it for specific use cases.
2.
A medical NLP team pre-trains a transformer on a clinical notes corpus before fine-tuning task-specific heads for billing codes.
3.
A research lab pre-trains a base model on a curated multilingual corpus before downstream alignment training.

Back to glossary

Pre-Training

Examples

Real-world uses

Related terms

Loading…

Pre-Training

Examples

Real-world uses

Related terms