Glossary term
Glossary term
Infrastructure and Serving
Fully Sharded Data Parallel - PyTorch's native implementation of ZeRO-style sharding for distributed model training without external libraries.
Meta used FSDP to train Llama 2 and Llama 3 across thousands of GPUs, sharding the 70B model's parameters, gradients, and optimiser states across all GPU ranks without requiring DeepSpeed.
Hugging Face's Accelerate library wraps FSDP to provide a simple API for multi-GPU and multi-node training of models up to 405B parameters, used by researchers fine-tuning Llama 3.1 405B on custom datasets.
FSDP + QLoRA enables fine-tuning of 70B models on 8xA100 nodes in research labs - the combination uses FSDP to shard frozen base weights and LoRA adapters while QLoRA reduces the frozen weight memory footprint.