Glossary term
Infrastructure and Serving
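Splitting a model's computation, and the weights behind it, across multiple GPUs so that models too large for one device's memory can still be trained or served.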
Megatron-LM (NVIDIA) uses tensor parallelism to split GPT-3-scale attention layers across 8 GPUs: each GPU computes a shard of the attention heads, which makes it possible to train and serve models too large for a single GPU.
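The sharding idea can be illustrated without any GPU. The sketch below is a pure-Python simulation, not Megatron-LM code: the helper names (`split_columns`, `split_rows`, `tensor_parallel_forward`) are hypothetical, the "devices" are just list entries, and the final loop stands in for an all-reduce. It splits the first weight matrix by columns (analogous to giving each GPU its own group of attention heads) and the second by rows, then checks that summing the per-device partial outputs reproduces the unsharded result.

```python
# Illustrative simulation only: "devices" are list entries, not real GPUs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise nonlinearity; safe to apply per shard because it is elementwise."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def add(a, b):
    """Elementwise sum of two matrices (one step of the simulated all-reduce)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tensor_parallel_forward(x, w1, w2, num_devices):
    """Compute relu(x @ w1) @ w2 with w1 column-sharded and w2 row-sharded."""
    w1_shards = split_columns(w1, num_devices)
    w2_shards = split_rows(w2, num_devices)
    # Each device uses only its own shards and needs no communication here.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # Simulated all-reduce: sum the partial outputs from every device.
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)
    return out

x = [[1, 2], [3, 4]]
w1 = [[1, 0, -2, 1], [0, 1, 1, -3]]      # 2x4: four columns, e.g. four heads' worth
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]    # 4x2
reference = matmul(relu(matmul(x, w1)), w2)
sharded = tensor_parallel_forward(x, w1, w2, num_devices=2)
assert sharded == reference
```

The only cross-device step is the final sum, which is why this column-then-row layout is attractive: one collective per block instead of one per matrix multiply.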
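vLLM's tensor-parallel serving mode can split Llama 3.1 405B across 8 H100-80GB GPUs, letting enterprise teams serve the model on a single DGX H100 node. Because the BF16 weights alone take roughly 810 GB, more than the node's 640 GB of GPU memory, a single-node deployment relies on FP8-quantized weights (about 405 GB). Throughput, cited here as 50 tokens/sec, varies with batch size and context length.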
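A quick back-of-the-envelope check shows why numeric precision matters for a single-node deployment of a 405B-parameter model. The sketch below is a rough estimate under stated assumptions: it counts weight memory only, ignoring the KV cache, activations, and framework overhead, and the function names are hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def weights_fit(num_params, bytes_per_param, num_gpus, gpu_memory_gb):
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(num_params, bytes_per_param) <= num_gpus * gpu_memory_gb

PARAMS = 405_000_000_000   # 405B parameters
NUM_GPUS = 8               # one 8-GPU node
GPU_MEMORY_GB = 80         # H100-80GB

bf16_gb = weight_memory_gb(PARAMS, 2)   # 2 bytes per parameter
fp8_gb = weight_memory_gb(PARAMS, 1)    # 1 byte per parameter

print(f"BF16 weights: {bf16_gb:.0f} GB, fits: {weights_fit(PARAMS, 2, NUM_GPUS, GPU_MEMORY_GB)}")
print(f"FP8 weights:  {fp8_gb:.0f} GB, fits: {weights_fit(PARAMS, 1, NUM_GPUS, GPU_MEMORY_GB)}")
print(f"FP8 shard per GPU: {fp8_gb / NUM_GPUS:.1f} GB of {GPU_MEMORY_GB} GB")
```

With 8-way tensor parallelism, each GPU holds about one eighth of the weights, and whatever memory remains on each device is what is available for the KV cache and activations.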
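DeepSeek V3's training ran on 2,048 H800 GPUs and combined pipeline, expert, and data parallelism, deliberately avoiding tensor parallelism during training. Splitting the 671B MoE model across nodes this way let it train in roughly 2.8 million GPU hours, a fraction of the reported training cost of comparable models.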