Glossary term
Infrastructure and Serving
Zero Redundancy Optimizer (ZeRO) - DeepSpeed memory optimisation technique that removes the redundant copies of model states held by every GPU in data-parallel training by partitioning those states across GPUs.
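ZeRO Stage 3 partitions weights, gradients, and optimiser states across all N data-parallel processes, reducing the per-GPU memory for these model states to roughly 1/N of the unpartitioned total. Activations and temporary buffers are not covered by this partitioning. Stage 1 partitions only the optimiser states, and Stage 2 partitions the optimiser states and gradients. BLOOM-176B was trained on 384 A100 GPUs with Megatron-DeepSpeed, which combined ZeRO Stage 1 with tensor and pipeline parallelism rather than relying on Stage 3 alone.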
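To make the 1/N figure concrete, the sketch below estimates model-state memory per GPU for each ZeRO stage (0 meaning plain data parallelism, 1 partitioning optimiser states, 2 adding gradients, 3 adding weights). It assumes the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for the fp32 optimiser states. The function name and defaults are illustrative, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage, k=12):
    """Estimate model-state bytes per GPU for a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k bytes/param for optimiser states.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params  # fp16 weights
    grads = 2 * num_params   # fp16 gradients
    optim = k * num_params   # fp32 weights copy, momentum, variance
    if stage >= 1:
        optim = optim / num_gpus   # Stage 1: partition optimiser states
    if stage >= 2:
        grads = grads / num_gpus   # Stage 2: also partition gradients
    if stage >= 3:
        params = params / num_gpus # Stage 3: also partition weights
    return params + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this reproduces the figures quoted in the ZeRO paper: about 120 GB without partitioning, 31.4 GB at Stage 1, 16.6 GB at Stage 2, and 1.9 GB at Stage 3.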
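ZeRO-Infinity extends ZeRO Stage 3 by offloading parameters, gradients, and optimiser states to CPU memory and NVMe storage, so that model size is no longer bounded by aggregate GPU memory. Microsoft's ZeRO-Infinity work projects that this makes models with tens of trillions of parameters trainable on existing GPU clusters, and Microsoft used it to explore the feasibility of trillion-parameter training.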
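As a rough illustration of how NVMe offload is requested, the sketch below builds the `zero_optimization` fragment of a DeepSpeed configuration as a Python dict. The key names (`stage`, `offload_param`, `offload_optimizer`, `device`, `nvme_path`, `pin_memory`) follow the DeepSpeed configuration reference as best understood here and should be checked against the installed version. The helper name and the `/local_nvme` path are hypothetical, and a complete configuration would also need batch size, optimiser, and precision settings.

```python
import json


def zero_infinity_config(nvme_path):
    """Return a partial DeepSpeed config enabling ZeRO Stage 3 with NVMe offload.

    nvme_path is a placeholder for a directory on a local NVMe device.
    """
    return {
        "zero_optimization": {
            "stage": 3,
            # Offload partitioned parameters to NVMe.
            "offload_param": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
            # Offload optimiser states to NVMe as well.
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
        }
    }


if __name__ == "__main__":
    # "/local_nvme" is an assumed mount point, not a required value.
    print(json.dumps(zero_infinity_config("/local_nvme"), indent=2))
```

Setting `device` to `"cpu"` instead would keep the offloaded states in host memory, which is usually faster but limited by CPU RAM.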
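PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO Stage 3 style sharding of weights, gradients, and optimiser states natively in PyTorch, giving comparable memory savings without a DeepSpeed dependency. Meta reports using FSDP in its Llama training, including as the data-parallel dimension of Llama 3 pretraining alongside tensor, pipeline, and context parallelism.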
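A minimal sketch of how the ZeRO stages map onto FSDP's sharding strategies, assuming the classic `torch.distributed.fsdp.FullyShardedDataParallel` wrapper. The mapping table and the `wrap_like_zero` helper are illustrative names, not part of PyTorch. The helper expects to run inside a job launched with `torchrun`, after `torch.distributed.init_process_group` has been called and the module has been placed on the local device.

```python
# Names of torch.distributed.fsdp.ShardingStrategy members that play the
# role of each ZeRO stage. FSDP has no direct Stage 1 equivalent.
ZERO_STAGE_TO_FSDP_STRATEGY = {
    2: "SHARD_GRAD_OP",  # shard gradients and optimiser states
    3: "FULL_SHARD",     # also shard weights (FSDP's default)
}


def wrap_like_zero(module, zero_stage=3):
    """Wrap a module with FSDP using the strategy matching a ZeRO stage.

    Assumes the default process group is already initialised.
    """
    # Imported lazily so the mapping above can be used without PyTorch.
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        ShardingStrategy,
    )

    strategy = getattr(ShardingStrategy, ZERO_STAGE_TO_FSDP_STRATEGY[zero_stage])
    return FSDP(module, sharding_strategy=strategy)
```

In practice an auto-wrap policy is usually supplied as well, so that each transformer block becomes its own sharding unit and only one block's full weights need to be gathered at a time.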