Mixed Precision Training

Training technique that uses lower-precision floating-point formats (FP16, BF16) for the forward and backward passes while keeping FP32 master weights for numerical stability.
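
To make the role of the FP32 master weights concrete, here is a minimal sketch using only the Python standard library. It emulates FP16 and FP32 rounding with `struct`; the helper names (`to_fp16`, `to_fp32`, `run_updates`) and the chosen weight, update size, and step count are illustrative assumptions, not part of any framework. It shows that a small update applied directly to an FP16 weight can be rounded away on every step, while the same updates accumulate correctly in an FP32 copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (FP16) and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to IEEE 754 single precision (FP32) and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def run_updates(steps: int = 1000, update: float = 1e-4):
    """Apply the same small update `steps` times to two copies of a weight.

    Hypothetical setup: the weight starts at 1.0 and each step adds `update`.
    """
    fp16_only = to_fp16(1.0)  # weight stored and updated purely in FP16
    master = to_fp32(1.0)     # FP32 master copy of the same weight
    for _ in range(steps):
        # Near 1.0 the FP16 grid spacing is about 9.8e-4, so adding 1e-4
        # rounds straight back to the previous value.
        fp16_only = to_fp16(fp16_only + update)
        # FP32 has enough mantissa bits to keep the small update.
        master = to_fp32(master + update)
    # The low-precision working copy is re-derived from the master weight.
    working = to_fp16(master)
    return fp16_only, master, working


fp16_only, master, working = run_updates()
print("FP16-only weight:      ", fp16_only)
print("FP32 master weight:    ", master)
print("FP16 cast of the master:", working)
```

In this sketch the FP16-only weight never moves from 1.0, whereas the master weight ends close to 1.1 and its FP16 cast reflects the accumulated change.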

1. Automatic Mixed Precision (AMP) in PyTorch can deliver roughly 2-3x training speedups on V100/A100 GPUs by running Tensor Core FP16 operations in the forward and backward passes while keeping FP32 master weights for the optimizer update. With FP16, loss scaling keeps small gradients from underflowing to zero.
2. Llama 3.1 was pre-trained in BF16 (bfloat16, the 16-bit "brain float" format) rather than FP16. BF16 keeps FP32's 8-bit exponent and therefore its dynamic range, so it avoids the loss scaling and the overflow/underflow instabilities that affect FP16 training of large models.
3. NVIDIA's Hopper GPUs (H100) introduced native FP8 Tensor Core support. DeepSeek V3 used FP8 mixed precision for its full 671B-parameter training run, with a reported 2x memory reduction and 1.6x training speedup over BF16.
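
The range difference between FP16 and BF16, and the reason FP16 needs loss scaling, can be sketched with the Python standard library alone. This is an emulation under stated assumptions, not how any training framework implements it: `to_fp16`, `to_bf16`, and `fp16_with_loss_scaling` are hypothetical helpers, the BF16 conversion handles finite inputs only, and the scale factor of 65536 is an arbitrary illustrative choice.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to FP16 (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 maximum (65504);
        # model that overflow as infinity.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round a finite x to BF16 (8 exponent bits, 7 mantissa bits).

    BF16 is the top 16 bits of the FP32 encoding, so we round the FP32
    bit pattern to nearest-even at bit 16 and clear the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fp16_with_loss_scaling(grad: float, scale: float = 65536.0) -> float:
    """Store a scaled gradient in FP16, then unscale it in higher precision."""
    scaled = to_fp16(grad * scale)  # value an FP16 backward pass would hold
    return scaled / scale           # unscaling happens outside FP16


tiny_grad = 1e-8       # below the smallest FP16 subnormal (about 6e-8)
large_value = 70000.0  # above the FP16 maximum of 65504

print(to_fp16(tiny_grad))    # 0.0: the gradient underflows in FP16
print(to_bf16(tiny_grad))    # nonzero: BF16 shares FP32's exponent range
print(to_fp16(large_value))  # inf: overflow in FP16
print(to_bf16(large_value))  # 70144.0: finite, but coarsely rounded
print(fp16_with_loss_scaling(tiny_grad))  # close to 1e-8 again
```

The trade-off is visible in the last two BF16 results: the range matches FP32, but with only 7 mantissa bits the values are rounded more coarsely than in FP16.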
