Glossary term
Glossary term
Infrastructure and Serving
Microsoft open-source deep learning optimisation library providing ZeRO, 3D parallelism, and training efficiency techniques for large-scale model training.
DeepSpeed's ZeRO-2 enables training of models up to 170 billion parameters and set the BERT training record of 44 minutes on 1,024 NVIDIA V100 GPUs. It has been used to train Bloom-176B, MPT-7B, and Microsoft Turing-NLG-17B.
DeepSpeed ZeRO-3 partitions model weights, gradients, and optimiser states across all GPUs, enabling a single GPU to train models with up to 13 billion parameters - a workload that exhausts memory in standard PyTorch DDP at 1.4 billion parameters.
DeepSpeed ZeRO++ (2023) reduces communication volume by 4x compared to ZeRO, enabling up to 2.25x better RLHF generation throughput - adopted by Databricks, Hugging Face, and academic labs for aligning Llama 3 and Mistral models.