Model Parallelism

A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.
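A rough calculation shows why a model might not fit: compare the memory its weights need with the memory on one device. The helper below is hypothetical. The figures in the usage line (70 billion parameters, 2 bytes per parameter, 80 GB per device) are illustrative assumptions and do not describe any particular model or chip.

```python
# Hypothetical helper: a back-of-the-envelope lower bound on device count.
def min_devices_for_weights(num_params: int, bytes_per_param: int,
                            device_memory_bytes: int) -> int:
    """Fewest devices whose combined memory can hold the weights alone.

    Ignores activations, gradients, and optimizer state, so real
    requirements are higher.
    """
    total_bytes = num_params * bytes_per_param
    return -(-total_bytes // device_memory_bytes)  # integer ceiling division


# Assumed figures: 70B parameters stored in 2-byte (bf16) format,
# on accelerators with 80 GB of memory each.
print(min_devices_for_weights(70_000_000_000, 2, 80_000_000_000))  # 2
```

The result is only a floor. Training also needs memory for activations, gradients, and optimizer state, so a real deployment usually spreads the model over more devices than this estimate suggests.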

To implement model parallelism, a system typically does the following:

Shards (divides) the model into smaller parts.

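Distributes the training of those smaller parts across multiple processors. Each processor trains its own part of the model and exchanges intermediate results (such as activations and gradients) with the processors that hold the other parts.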

Combines the results to create a single model.
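The three steps above can be sketched as a toy in plain Python. The assumptions are as follows. The "model" is a chain of scalar linear layers y = w * x + b. The "devices" are ordinary objects named `device0`, `device1`, and so on. `Stage`, `shard`, `train_step`, and `combine` are hypothetical names. Activations flow forward from one stage to the next, gradients flow back, and each stage updates only the parameters it holds.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One simulated device holding a contiguous slice of the model's layers."""
    device: str
    params: list  # [w, b] pairs, one per scalar linear layer y = w * x + b
    inputs: list = field(default_factory=list)  # cached for the backward pass

    def forward(self, x):
        self.inputs = []
        for w, b in self.params:
            self.inputs.append(x)
            x = w * x + b
        return x  # activation sent to the next device

    def backward(self, grad_out, lr):
        # Visit this stage's layers in reverse, updating only local parameters.
        for i in reversed(range(len(self.params))):
            w, b = self.params[i]
            x = self.inputs[i]
            grad_w, grad_b = grad_out * x, grad_out
            grad_out = grad_out * w  # gradient w.r.t. this layer's input
            self.params[i] = [w - lr * grad_w, b - lr * grad_b]
        return grad_out  # gradient sent to the previous device


def shard(layers, num_devices):
    """Split a list of (w, b) layers into contiguous per-device stages."""
    per = -(-len(layers) // num_devices)  # ceiling division
    return [Stage(f"device{d}", [list(p) for p in layers[d * per:(d + 1) * per]])
            for d in range(num_devices)]


def train_step(stages, x, target, lr=0.01):
    """One SGD step on squared error; returns the loss before the update."""
    act = x
    for stage in stages:            # forward: device0 -> last device
        act = stage.forward(act)
    loss = (act - target) ** 2
    grad = 2 * (act - target)       # dLoss/dOutput
    for stage in reversed(stages):  # backward: last device -> device0
        grad = stage.backward(grad, lr)
    return loss


def combine(stages):
    """Gather every stage's trained parameters into one list of (w, b) layers."""
    return [tuple(p) for stage in stages for p in stage.params]
```

Consider `stages = shard([(1.0, 0.0)] * 4, 2)`, which places two layers on each of two simulated devices. Calling `train_step(stages, 1.0, 3.0)` repeatedly drives the output toward 3.0. Afterward, `combine(stages)` returns the four trained `(w, b)` pairs as one model. A real system would run each stage on separate hardware and send tensors between them, but the data flow is the same.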

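Model parallelism slows training relative to an equivalent single-device setup. The processors spend time exchanging intermediate results and may sit idle while waiting for one another.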
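One common source of this slowdown appears when layers are split into pipeline stages. This example assumes a simple GPipe-style forward schedule, in which stage s can start micro-batch j only at time step s + j. Under that schedule, stages sit idle while the pipeline fills and drains. The hypothetical functions below count the idle slots in that schedule and compare the count with the closed form (p - 1) / (m + p - 1) for p stages and m micro-batches.

```python
from fractions import Fraction


def simulate_idle_fraction(num_stages: int, num_micro_batches: int) -> Fraction:
    """Idle share of (stage, time-step) slots in a forward-only pipeline.

    Assumes stage s processes micro-batch j at time step s + j.
    """
    total_steps = num_micro_batches + num_stages - 1
    busy = {(s, s + j) for s in range(num_stages) for j in range(num_micro_batches)}
    total_slots = num_stages * total_steps
    return Fraction(total_slots - len(busy), total_slots)


def bubble_fraction(num_stages: int, num_micro_batches: int) -> Fraction:
    """Closed form for the same schedule: (p - 1) / (m + p - 1)."""
    return Fraction(num_stages - 1, num_micro_batches + num_stages - 1)


print(simulate_idle_fraction(4, 1), bubble_fraction(4, 1))    # 3/4 3/4
print(simulate_idle_fraction(4, 16), bubble_fraction(4, 16))  # 3/19 3/19
```

Splitting each batch into more micro-batches shrinks the idle share but does not remove it. Communication between devices adds further cost on top of this.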

Real-world uses

Created for this library

1. An LLM training team uses model parallelism to fit its largest model across multiple TPU devices.
2. A research lab uses tensor and pipeline model parallelism together to scale training of very large language models.
3. An ML platform team uses model parallelism when individual layers exceed the memory of a single accelerator chip.
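Tensor model parallelism, mentioned in the second example, splits a single layer rather than assigning whole layers to devices. The pure-Python sketch below assumes a linear layer Y = XW whose weight matrix W is split by columns. Each simulated device multiplies the full input by its own column shard. Concatenating the partial outputs, which an all-gather would do in a real system, reproduces the unsharded result. `matmul`, `split_columns`, and `column_parallel_linear` are hypothetical names.

```python
def matmul(a, b):
    """Plain matrix product of list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(w, num_devices):
    """Give each device an equal, contiguous block of W's columns."""
    cols = len(w[0])
    if cols % num_devices:
        raise ValueError("column count must divide evenly across devices")
    per = cols // num_devices
    return [[row[d * per:(d + 1) * per] for row in w] for d in range(num_devices)]


def column_parallel_linear(x, w_shards):
    # Each shard's product would run on its own device; here they run in turn.
    partials = [matmul(x, shard) for shard in w_shards]
    # Concatenate partial outputs column-wise (an all-gather in a real system).
    return [[v for p in partials for v in p[i]] for i in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 0], [1, 1, 0, 2]]
print(column_parallel_linear(x, split_columns(w, 2)))  # [[4, 5, 4, 7], [10, 11, 13, 16]]
```

Each device stores only its share of W, which is what lets a layer larger than one chip's memory be trained at all. The price is the extra communication step that gathers the pieces.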
