Glossary term
Glossary term
Training and Fine-Tuning
Memory-efficiency technique that recomputes intermediate activations during the backward pass instead of storing them, trading compute for memory.
Gradient checkpointing (also called activation recomputation) reduces GPU memory usage by 60-70% during training at the cost of a 20-33% increase in computation. Used by Hugging Face Transformers to enable fine-tuning of 7B+ models on single 24GB GPUs.
LLaVA-1.5 is trained with gradient checkpointing enabled, allowing full fine-tuning of a 13B multimodal model on a single 8xA100 node in approximately 1 day without running out of GPU memory.
DeepSpeed's gradient checkpointing is combined with ZeRO-3 in production fine-tuning pipelines for 70B models, enabling Llama 3.1 70B instruction tuning on 8xH100 configurations that would otherwise require 16xH100.