Glossary term
Training and Fine-Tuning
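Distributed training technique that partitions a model's layers into sequential stages placed on different GPUs or nodes. Each batch is split into micro-batches that flow through the stages in turn, so different devices work on different micro-batches at the same time.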
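To make the overlap between stages concrete, here is a minimal sketch of a fill-and-drain schedule, forward pass only. It assumes every stage takes one equal time unit per micro-batch (a batch split into smaller chunks), which real workloads only approximate. The function names `fill_drain_schedule` and `bubble_fraction` are hypothetical and not taken from any library.

```python
def fill_drain_schedule(num_stages, num_microbatches):
    """Return one row per clock tick; row[s] is the micro-batch on stage s, or None if idle.

    Assumption: each stage takes exactly one tick per micro-batch (forward pass only).
    """
    total_ticks = num_stages + num_microbatches - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage in range(num_stages):
            mb = tick - stage  # micro-batch mb reaches stage s at tick mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


def bubble_fraction(num_stages, num_microbatches):
    """Fraction of (tick, stage) slots in which a stage sits idle."""
    schedule = fill_drain_schedule(num_stages, num_microbatches)
    slots = sum(len(row) for row in schedule)
    idle = sum(cell is None for row in schedule for cell in row)
    return idle / slots


if __name__ == "__main__":
    for tick, row in enumerate(fill_drain_schedule(4, 8)):
        cells = " ".join("--" if mb is None else f"m{mb}" for mb in row)
        print(f"t={tick:2d}  {cells}")
    print("idle fraction:", bubble_fraction(4, 8))
```

Under these assumptions the idle fraction equals (p - 1) / (m + p - 1) for p stages and m micro-batches, which is why a pipeline needs many more micro-batches than stages to keep every device busy.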
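GPipe (Google, 2019) popularized pipeline parallelism with micro-batching for training models too large for a single accelerator. Its authors trained a 557M-parameter AmoebaNet and a 6B-parameter, 128-layer Transformer, and reported scaling a Transformer to 83.9B parameters across 128 pipeline partitions.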
NVIDIA Megatron-LM combines pipeline parallelism with tensor parallelism and data parallelism (3D parallelism). Together with Microsoft DeepSpeed, it was used to train Megatron-Turing NLG 530B across thousands of A100 GPUs.
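In 3D parallelism the total GPU count factors as data-parallel size times pipeline-parallel size times tensor-parallel size, and each GPU has one coordinate along each axis. The sketch below shows one way to map a global rank to those coordinates. The axis ordering (tensor index varying fastest, so tensor-parallel peers are neighbouring ranks) is an assumption chosen for illustration and is not claimed to match Megatron-LM's actual rank layout. The names `Coord` and `rank_to_coord` are hypothetical.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # which model replica this rank belongs to
    pipeline: int  # which pipeline stage within the replica
    tensor: int    # which tensor shard within the stage


def rank_to_coord(rank, dp, pp, tp):
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Assumed layout: tensor index varies fastest, then pipeline, then data.
    """
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipeline, tensor)


if __name__ == "__main__":
    # Toy sizes for illustration only: 2 replicas x 3 stages x 4 shards = 24 ranks.
    for rank in (0, 5, 23):
        print(rank, rank_to_coord(rank, dp=2, pp=3, tp=4))
```

With this layout, ranks that share a pipeline stage and replica are contiguous, which suits placing tensor-parallel groups inside a single node where interconnect bandwidth is highest.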
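DeepSeek-V3 was trained on 2,048 H800 GPUs using 16-way pipeline parallelism (with the DualPipe schedule), 64-way expert parallelism, and ZeRO-1 data parallelism, without tensor parallelism. Training the 671B-parameter MoE model took approximately 2.8 million GPU hours at an estimated cost of about $5.6M.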