Gradient Checkpointing

Memory-efficiency technique that recomputes intermediate activations during the backward pass instead of storing them, trading compute for memory.
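
The recompute-instead-of-store idea can be illustrated with a minimal pure-Python sketch. It is not any framework's implementation: the toy scalar layer `tanh(w * x)`, the helper names, and the `segment` parameter are all assumptions made for illustration. The checkpointed version keeps only the input to every `segment`-th layer and rebuilds the other activations, one segment at a time, during the backward pass.

```python
import math


def layer(w, x):
    """One toy 'layer' (hypothetical): out = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grads(w, x):
    """Return (d out / d x, d out / d w) for out = tanh(w * x)."""
    s = 1.0 - math.tanh(w * x) ** 2
    return s * w, s * x


def backward_full(weights, x0):
    """Baseline: store every layer input during the forward pass."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grads = [0.0] * len(weights)
    upstream = 1.0  # gradient of the output with respect to itself
    for i in reversed(range(len(weights))):
        dx, dw = layer_grads(weights[i], inputs[i])
        grads[i] = upstream * dw
        upstream *= dx
    return grads, len(inputs)  # gradients, number of stored activations


def backward_checkpointed(weights, x0, segment):
    """Store only segment-boundary inputs; recompute the rest on demand."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # keep the input to the first layer of a segment
        x = layer(w, x)
    grads = [0.0] * n
    upstream = 1.0
    peak = len(checkpoints)
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        # Re-run the forward pass for this segment only.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(weights[i], inputs[-1]))
        # Checkpoints plus the activations recomputed for this segment.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            dx, dw = layer_grads(weights[i], inputs[i - start])
            grads[i] = upstream * dw
            upstream *= dx
    return grads, peak  # gradients, peak number of stored activations


weights = [0.5 + 0.1 * i for i in range(16)]
full_grads, full_stored = backward_full(weights, 0.3)
ckpt_grads, ckpt_peak = backward_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_peak)
```

Both functions return the same gradients. The checkpointed one holds fewer activations at its peak and pays for that by running most layers forward a second time. For simplicity the sketch never frees used checkpoints, so a real implementation could hold slightly less.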

1. Gradient checkpointing (also called activation recomputation) reduces GPU memory usage during training by 60-70% at the cost of a 20-33% increase in computation. Hugging Face Transformers supports it to enable fine-tuning of 7B+ models on a single 24GB GPU.
2. LLaVA-1.5 is trained with gradient checkpointing enabled, which allows full fine-tuning of the 13B multimodal model on a single 8xA100 node in approximately 1 day without running out of GPU memory.
3. Production fine-tuning pipelines for 70B models combine DeepSpeed's gradient checkpointing with ZeRO-3, enabling Llama 3.1 70B instruction tuning on 8xH100 configurations that would otherwise require 16xH100.
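
The 20-33% compute figure in the first item can be sanity-checked with a rough cost model. The sketch below rests on a common rule of thumb that is an assumption here, not something the entry states: a backward pass costs about twice a forward pass. The function name and its parameters are hypothetical.

```python
def recompute_overhead(backward_to_forward=2.0, recomputed_fraction=1.0):
    """Fractional extra compute per training step from recomputation.

    backward_to_forward: assumed cost of the backward pass in units of one
        forward pass (2.0 is a rule of thumb, not a measured value).
    recomputed_fraction: share of the forward pass that is re-run during
        the backward pass (1.0 means every layer is recomputed).
    """
    baseline = 1.0 + backward_to_forward  # one forward plus one backward
    extra = recomputed_fraction           # additional forward work
    return extra / baseline


print(f"{recompute_overhead():.0%}")
print(f"{recompute_overhead(recomputed_fraction=0.6):.0%}")
```

Under this model, recomputing the whole forward pass adds one third to the step cost, and recomputing about 60% of it adds one fifth, which is consistent with the quoted range. Measured overhead depends on the model, the hardware, and which layers are checkpointed.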
