Glossary term
Infrastructure and Serving
A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.
To implement model parallelism, a system typically does the following:
Shards (divides) the model into smaller parts.
Distributes the training of those smaller parts across multiple processors. Each processor trains its own part of the model.
Combines the results to create a single model.
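The shard, distribute, and combine steps above can be sketched in plain Python. This is a minimal, hypothetical sketch. The "devices" are simulated objects in one process, a "layer" is any function on a list of floats, and the names `Device`, `shard`, `distribute`, `forward`, and `combine` are illustrative rather than any library's API. It covers only the forward pass of layer-wise (pipeline-style) sharding. Real training would also send gradients back through the devices in reverse order and update each shard's parameters on its own device.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical simulation: no real accelerators are involved.
# A "layer" is just a function from a list of floats to a list of floats.
Layer = Callable[[List[float]], List[float]]


@dataclass
class Device:
    """Stand-in for one accelerator; it holds only its own shard of layers."""
    name: str
    layers: List[Layer] = field(default_factory=list)

    def run(self, activations: List[float]) -> List[float]:
        # Apply this device's layers in order.
        for layer in self.layers:
            activations = layer(activations)
        return activations


def shard(layers: List[Layer], num_devices: int) -> List[List[Layer]]:
    """Split the layers into num_devices contiguous, roughly equal shards."""
    size, extra = divmod(len(layers), num_devices)
    shards, start = [], 0
    for i in range(num_devices):
        end = start + size + (1 if i < extra else 0)
        shards.append(layers[start:end])
        start = end
    return shards


def distribute(shards: List[List[Layer]]) -> List[Device]:
    """Place each shard on its own simulated device."""
    return [Device(f"device{i}", s) for i, s in enumerate(shards)]


def forward(devices: List[Device], x: List[float]) -> List[float]:
    """Run the model: each device's output activations feed the next device."""
    for device in devices:
        x = device.run(x)  # in a real system, this hop is a cross-device transfer
    return x


def combine(devices: List[Device]) -> List[Layer]:
    """Reassemble the shards into a single model (an ordered list of layers)."""
    return [layer for device in devices for layer in device.layers]


# Usage: a toy five-layer "model" split across two simulated devices.
def scale(k: float) -> Layer:
    return lambda v: [k * a for a in v]


def shift(b: float) -> Layer:
    return lambda v: [a + b for a in v]


model = [scale(2.0), shift(1.0), scale(3.0), shift(-4.0), scale(0.5)]
devices = distribute(shard(model, 2))
print([len(d.layers) for d in devices])  # [3, 2]
print(forward(devices, [1.0, 2.0]))      # [2.5, 5.5]
```

Splitting a sequential model into contiguous blocks of consecutive layers keeps cross-device traffic to one activation transfer per shard boundary, which is why layer-wise sharding is a common starting point.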
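Model parallelism slows training because devices must exchange intermediate results (activations during the forward pass and gradients during the backward pass) at the boundaries between model parts, and some devices may sit idle while they wait for others.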
See also data parallelism.
Created for this library
An LLM training team uses model parallelism to fit its largest model across multiple TPU devices.
A research lab uses tensor and pipeline model parallelism together to scale training of very large language models.
An ML platform team uses model parallelism when individual layers exceed the memory of a single accelerator chip.
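When a single layer is too large for one chip, as in the last example above, the layer's weight matrix itself can be split across devices. The following minimal sketch assumes a column-wise split of one dense layer, which is one common form of tensor parallelism. The devices are simulated in a single process, and `matvec`, `split_columns`, and `tensor_parallel_matvec` are hypothetical helper names, not a library API.

```python
from typing import List

# Row-major weights: w[i][j] connects input i to output j.
Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute y = x @ w for a 1-D input x and a (len(x) x out) matrix w."""
    out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(out)]


def split_columns(w: Matrix, num_devices: int) -> List[Matrix]:
    """Give each simulated device a contiguous block of output columns."""
    cols = len(w[0])
    assert cols % num_devices == 0, "this sketch assumes an even split"
    step = cols // num_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]


def tensor_parallel_matvec(x: List[float], shards: List[Matrix]) -> List[float]:
    """Each device computes its slice of the output; the slices are concatenated.

    The concatenation stands in for the all-gather a real system would perform.
    """
    partial_outputs = [matvec(x, s) for s in shards]  # one result per device
    return [value for part in partial_outputs for value in part]


# Usage: a 2-input, 4-output layer split across two simulated devices.
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 2.0]
print(matvec(x, w))                                  # [11.0, 14.0, 17.0, 20.0]
print(tensor_parallel_matvec(x, split_columns(w, 2)))  # [11.0, 14.0, 17.0, 20.0]
```

Each device stores only its block of columns, so per-device weight memory shrinks roughly in proportion to the number of devices, at the cost of gathering the partial outputs after every such layer.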
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License