Gradient Accumulation

A backpropagation technique that updates the parameters only once every several mini-batches rather than once per mini-batch. After processing each mini-batch, gradient accumulation simply adds that mini-batch's gradients to a running total. Then, after processing the last mini-batch in the group (typically a fixed number of accumulation steps), the system updates the parameters based on the accumulated gradients, usually averaged over the group, and resets the running total to zero.

Gradient accumulation is useful when the desired batch size is too large to fit in the memory available for training. When memory is an issue, the natural tendency is to reduce the batch size. However, in normal backpropagation a smaller batch means more parameter updates per epoch, each computed from a noisier gradient estimate. Gradient accumulation keeps each forward and backward pass small enough to fit in memory while still applying updates that correspond to the larger batch.
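The mechanism can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the one-variable linear model, the function names (grad_mse, step_full_batch, step_accumulated), the synthetic data, and the learning rate. The sketch compares one update computed from a full batch of 32 examples with one update built from four accumulated micro-batches of 8.

```python
import random


def grad_mse(w, b, batch):
    """Mean gradient of the squared error for y ~ w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def step_full_batch(w, b, batch, lr):
    """Ordinary update: one gradient over the whole batch, then one step."""
    gw, gb = grad_mse(w, b, batch)
    return w - lr * gw, b - lr * gb


def step_accumulated(w, b, batch, lr, micro_batch_size):
    """One update built from gradients accumulated over micro-batches."""
    acc_w = acc_b = 0.0
    for i in range(0, len(batch), micro_batch_size):
        micro = batch[i:i + micro_batch_size]
        # Gradients are computed with the SAME parameters for every
        # micro-batch; nothing is updated inside this loop.
        gw, gb = grad_mse(w, b, micro)
        # Weight each micro-batch by its share of the full batch so the
        # running total ends up equal to the full-batch mean gradient.
        share = len(micro) / len(batch)
        acc_w += gw * share
        acc_b += gb * share
    # A single parameter update after the last micro-batch.
    return w - lr * acc_w, b - lr * acc_b


# Hypothetical synthetic data: y = 3x + 1 plus a little noise.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(32)]
data = [(x, 3.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]

full = step_full_batch(0.0, 0.0, data, lr=0.1)
accum = step_accumulated(0.0, 0.0, data, lr=0.1, micro_batch_size=8)
print(full, accum)  # the two updates agree up to floating-point rounding
```

Only one micro-batch needs to be held in memory at a time, yet the resulting step matches the full-batch step. The match is exact (up to rounding) for a simple model like this one; layers whose output depends on batch statistics, such as batch normalization, see only the micro-batch and can behave differently.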

Real-world uses

Created for this library

1. A vision team uses gradient accumulation to mimic a large effective batch size when its GPUs cannot fit the full batch in memory.
2. A speech recognition team uses gradient accumulation over four steps to reach an effective batch size that produced better convergence in earlier experiments.
3. An NLP team uses gradient accumulation to simulate large-batch training on a small cluster without renting larger machines.
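The arithmetic behind these scenarios is simple enough to sketch. The helper names and the numbers below (a target batch of 256, a micro-batch of 64) are hypothetical and chosen only for illustration; none of them come from the teams described above.

```python
import math


def accumulation_steps(target_batch_size, micro_batch_size, num_devices=1):
    """Accumulation steps needed to reach at least the target batch size."""
    per_step = micro_batch_size * num_devices  # examples seen per step
    return math.ceil(target_batch_size / per_step)


def effective_batch_size(micro_batch_size, steps, num_devices=1):
    """Examples that contribute to each parameter update."""
    return micro_batch_size * steps * num_devices


# Hypothetical case: the GPU fits 64 examples, the target batch is 256.
steps = accumulation_steps(target_batch_size=256, micro_batch_size=64)
print(steps, effective_batch_size(64, steps))  # prints: 4 256
```

When the target is not an exact multiple of the per-step count, rounding up means the effective batch size slightly exceeds the target.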
