
Distributed Training: DDP, FSDP & DeepSpeed
Learn how to train large models across multiple GPUs. Master data parallelism, the ZeRO optimizer stages, FSDP, and when to use each strategy.
Why Distributed Training?
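A Llama-3 70B model trained with Adam in BF16 mixed precision needs roughly 1,120 GB of GPU memory for model state alone: about 140 GB of BF16 weights, 140 GB of BF16 gradients, and 840 GB of FP32 optimizer state (master weights, momentum, and variance at 4 bytes per parameter each). Activations come on top of that. No single GPU holds this much, so you need a strategy to distribute it across multiple GPUs.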
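The arithmetic behind such an estimate can be sketched in a few lines. This is a rough calculator, not a profiler: it assumes the common mixed-precision accounting of 2 bytes for weights, 2 bytes for gradients, and 12 bytes of FP32 optimizer state per parameter, and it ignores activations and fragmentation. Setups that keep optimizer state in lower precision give smaller totals. The function name `training_memory_gb` is made up for illustration.

```python
def training_memory_gb(num_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Estimate model-state memory in GB (1 GB = 1e9 bytes), excluding activations.

    Defaults assume BF16 weights and gradients plus FP32 Adam state
    (master weights, momentum, variance). These are assumptions, not measurements.
    """
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return {
        "weights": num_params * bytes_weights / 1e9,
        "gradients": num_params * bytes_grads / 1e9,
        "optimizer": num_params * bytes_optimizer / 1e9,
        "total": num_params * per_param / 1e9,
    }


est = training_memory_gb(70e9)  # a 70B-parameter model
print(est["total"])  # 1120.0
```

Changing `bytes_optimizer` lets you explore other optimizer configurations under the same accounting.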
Data Parallelism: DDP
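Data parallelism is the simplest distributed strategy. Each GPU holds a full copy of the model, and the dataset is split across GPUs. During each backward pass, gradients are averaged across GPUs with an all-reduce (NCCL typically uses a ring or tree algorithm), so every replica applies the same update. DDP is only viable when the model, its gradients, and its optimizer state fit on a single GPU.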
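To make the all-reduce step concrete, here is a toy, single-process simulation of a ring all-reduce in pure Python. It is only a sketch of the idea: real libraries such as NCCL choose their own algorithm and run on actual devices, and the function name `ring_all_reduce` is hypothetical. Each "worker" holds a gradient vector split into as many chunks as there are workers (one number per chunk here).

```python
def ring_all_reduce(grads):
    """Simulate a sum all-reduce over n workers arranged in a ring.

    grads[i] is worker i's gradient, a list of n chunks (one number per chunk).
    Returns the per-worker data after the reduce; every row holds the full sum.
    """
    n = len(grads)
    data = [list(g) for g in grads]

    # Phase 1 (reduce-scatter): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot what each worker sends so all sends happen "simultaneously".
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value

    return data


summed = ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(summed[0])  # [12, 15, 18]
```

Dividing each entry by the number of workers gives the averaged gradient that every replica then applies.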
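
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Initialize the process group; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node, not the global rank
    torch.cuda.set_device(local_rank)

    # MyModel, loss_fn, batch, and targets are placeholders for your own code
    model = MyModel().cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # example optimizer

    # Train normally: gradients sync automatically during backward()
    optimizer.zero_grad()
    output = model(batch)
    loss = loss_fn(output, targets)
    loss.backward()
    optimizer.step()
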

    # Single node, 8 GPUs
    torchrun --nproc_per_node=8 train.py

    # Multi-node: node 0 (master)
    torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 \
        --master_addr="192.168.1.1" --master_port=29500 train.py

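Inside `train.py`, each process can discover its position in the job from the environment variables that torchrun exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The helper below is a small sketch; the function name and the example values are illustrative. On a multi-node job the local rank (GPU index within a node) differs from the global rank, which is why the two are read separately.

```python
import os


def read_torchrun_env(env=os.environ):
    """Return (rank, local_rank, world_size) from torchrun's environment variables."""
    rank = int(env["RANK"])              # global index of this process
    local_rank = int(env["LOCAL_RANK"])  # index of this process on its node
    world_size = int(env["WORLD_SIZE"])  # total number of processes
    return rank, local_rank, world_size


# Hypothetical values for the fourth worker on node 1 of a 4-node x 8-GPU launch:
# global rank = node_rank * 8 + local_rank = 1 * 8 + 3 = 11
example_env = {"RANK": "11", "LOCAL_RANK": "3", "WORLD_SIZE": "32"}
print(read_torchrun_env(example_env))  # (11, 3, 32)
```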
ZeRO: Zero Redundancy Optimizer
ZeRO (from DeepSpeed, now also in FSDP) eliminates the memory redundancy in DDP. Instead of each GPU holding a full copy of everything, ZeRO partitions optimizer states, gradients, and parameters across GPUs, so each GPU stores only its own shard.
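| Stage | What's Sharded | Communication | Per-GPU Memory Reduction (Model State) |
|---|---|---|---|
| ZeRO-1 | Optimizer states | Same volume as DDP's gradient all-reduce | Up to 4× (about 2.9× on 8 GPUs) |
| ZeRO-2 | Optimizer states + Gradients | Reduce-scatter gradients, all-gather updated parameters (same volume as DDP) | Up to 8× (about 4.3× on 8 GPUs) |
| ZeRO-3 | Optimizer + Gradients + Parameters | All-gather parameters in forward and backward, reduce-scatter gradients (about 1.5× DDP volume) | N× on N GPUs (8× on 8 GPUs, 64× on 64 GPUs) |
| ZeRO-Infinity | ZeRO-3 + CPU and NVMe offload | PCIe/NVMe I/O | Bounded by CPU RAM and NVMe capacity rather than GPU memory |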
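The effect of each stage on per-GPU memory can be estimated with a short calculator. This sketch follows the accounting used in the ZeRO paper (2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter) and ignores activations, buffers, and offloading; the function name `zero_memory_per_gpu_gb` is made up for illustration.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DDP) through 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params     # BF16/FP16 parameters
    grads = 2.0 * num_params       # BF16/FP16 gradients
    optimizer = 12.0 * num_params  # FP32 master weights, momentum, variance
    if stage >= 1:
        optimizer /= num_gpus      # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus          # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= num_gpus        # ZeRO-3 also shards parameters
    return (weights + grads + optimizer) / 1e9


# The ZeRO paper's running example: 7.5B parameters on 64 GPUs
for stage in range(4):
    print(stage, round(zero_memory_per_gpu_gb(7.5e9, 64, stage), 1))
```

Running the same function with `num_gpus=8` shows how the savings of stages 1 and 2 depend on the number of GPUs, while stage 3 scales linearly with it.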
PyTorch FSDP
FSDP (Fully Sharded Data Parallel) is PyTorch's native implementation of ZeRO-3 semantics. It first shipped in PyTorch 1.11 and is now a common choice for large-model training when you want to avoid external dependencies.
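
    import functools

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    # Assumes a Hugging Face Transformers Llama model
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    # Mixed precision policy: compute, gradient reduction, and buffers in BF16.
    # The sharded parameters keep the dtype the model was loaded in (typically FP32).
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # FSDP expects a callable policy, so bind the layer class with partial
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
        auto_wrap_policy=wrap_policy,
        mixed_precision=bf16_policy,
        device_id=torch.cuda.current_device(),
    )
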
QLoRA: Fine-tuning Without Full Sharding
For fine-tuning (as opposed to pre-training), you often don't need full FSDP. QLoRA (Quantized Low-Rank Adaptation) keeps the base weights frozen in 4-bit precision and trains only small low-rank adapter matrices. Storing the base weights in 4 bits takes about a quarter of the memory of BF16, and the QLoRA authors report quality close to full 16-bit fine-tuning on the benchmarks they tested.
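A back-of-the-envelope sketch shows where the savings come from. It assumes a LoRA adapter of rank r on a d_out × d_in weight matrix adds r × (d_in + d_out) trainable parameters, and it ignores quantization constants, activations, and optimizer state for the adapters. The function names and the example dimensions are illustrative, not taken from any specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def base_weight_gb(num_params, bits):
    """Memory for frozen base weights at a given bit width, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


# Hypothetical 8192 x 8192 projection with a rank-16 adapter
full = 8192 * 8192
adapter = lora_trainable_params(8192, 8192, rank=16)
print(adapter, adapter / full)  # 262144 0.00390625

# Base weights of a 70B-parameter model at 16 bits versus 4 bits
print(base_weight_gb(70e9, 16), base_weight_gb(70e9, 4))  # 140.0 35.0
```

Because only the adapters receive gradients and optimizer state, the memory for those two components shrinks in proportion to the trainable fraction.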
DeepSpeed
DeepSpeed is Microsoft's distributed training framework and the one that pioneered ZeRO. It is more feature-rich than FSDP, but it adds an external dependency and is configured through a JSON file.

    {
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu" },
        "offload_param": { "device": "cpu" },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e8
      },
      "bf16": { "enabled": true },
      "train_micro_batch_size_per_gpu": 2,
      "gradient_accumulation_steps": 8
    }

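The two batch settings in this config determine the global batch size together with the number of GPUs: DeepSpeed's `train_batch_size` equals `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × world size. The sketch below parses a trimmed copy of the config and computes that product; the 8-GPU world size and the helper name `global_batch_size` are assumptions for illustration.

```python
import json

# Trimmed copy of the config above; only the batch-related keys matter here
config_text = """
{
  "zero_optimization": { "stage": 3 },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8
}
"""


def global_batch_size(cfg, world_size):
    """Global batch = micro batch per GPU x accumulation steps x number of GPUs."""
    return (
        cfg["train_micro_batch_size_per_gpu"]
        * cfg["gradient_accumulation_steps"]
        * world_size
    )


config = json.loads(config_text)
print(global_batch_size(config, world_size=8))  # 128
```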
Tensor Parallelism & Pipeline Parallelism
When data parallelism alone isn't enough, these strategies split the model itself:
Tensor Parallelism
- Splits weight matrices across GPUs
- Each GPU computes part of each layer
- Requires all-reduce communication in every layer
- Needs a high-bandwidth interconnect such as NVLink
- Good for single-node multi-GPU
Pipeline Parallelism
- Splits model layers across GPUs
- GPU 0 runs layers 1–8, GPU 1 runs 9–16...
- Uses micro-batching to hide idle time
- Works over slower inter-node links
- Good for multi-node deployments
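Both ideas can be illustrated with a small pure-Python sketch. The first part splits a weight matrix by columns across simulated GPUs and checks that concatenating the partial results reproduces the full matrix product. The second part assigns contiguous layers to pipeline stages and computes the idle ("bubble") fraction of a simple GPipe-style schedule, (p − 1) / (m + p − 1) for p stages and m micro-batches. All function names are hypothetical, and real frameworks implement these steps with collective communication on actual devices.

```python
def matmul(x, w):
    """Plain matrix product of two nested lists."""
    return [[sum(xi * w[k][j] for k, xi in enumerate(row)) for j in range(len(w[0]))]
            for row in x]


def column_parallel_matmul(x, w, num_gpus):
    """Split w by columns, multiply each shard separately, then concatenate."""
    per_gpu = len(w[0]) // num_gpus  # assumes the column count divides evenly
    shards = [[row[g * per_gpu:(g + 1) * per_gpu] for row in w] for g in range(num_gpus)]
    partials = [matmul(x, shard) for shard in shards]  # one partial result per "GPU"
    # Concatenating columns stands in for the gather a real system would perform
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def assign_layers(num_layers, num_stages):
    """Give each pipeline stage a contiguous, equal-sized range of 1-indexed layers."""
    per_stage = num_layers // num_stages  # assumes an even split
    return [(s * per_stage + 1, (s + 1) * per_stage) for s in range(num_stages)]


def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style schedule with equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, num_gpus=2) == matmul(x, w))  # True
print(assign_layers(32, 4))  # [(1, 8), (9, 16), (17, 24), (25, 32)]
print(bubble_fraction(4, 1), round(bubble_fraction(4, 16), 3))  # 0.75 0.158
```

The bubble numbers show why micro-batching matters: with one micro-batch, three of four stages sit idle most of the time, while sixteen micro-batches cut the idle share to about 16%.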
The Decision Matrix
| Model Size | Fits on 1 GPU? | Recommended Strategy |
|---|---|---|
| < 7B | Often (smaller models on 40–80 GB GPUs) | DDP when weights, gradients, and optimizer state all fit; simplest and most efficient |
| 7B–30B | Weights only; full training state does not fit | FSDP with SHARD_GRAD_OP (ZeRO-2) or FULL_SHARD (ZeRO-3) |
| 70B | No | FSDP FULL_SHARD (ZeRO-3) or DeepSpeed ZeRO-3 |
| 200B+ | No | 3D parallelism (Data + Tensor + Pipeline) |
| Trillion+ | No | DeepSpeed ZeRO-Infinity + specialized hardware |
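The table can be approximated by a memory-based rule of thumb. The sketch below is deliberately crude: it assumes about 16 bytes of model state per parameter (mixed-precision Adam), ignores activations and communication cost, and returns coarse labels rather than a tuned recommendation. The function name and thresholds are illustrative assumptions, not a rule from any framework.

```python
def recommend_strategy(num_params, gpu_mem_gb, num_gpus, bytes_per_param=16):
    """Pick a coarse strategy from whether the model state fits in GPU memory."""
    state_gb = num_params * bytes_per_param / 1e9
    if state_gb <= gpu_mem_gb:
        return "DDP"  # the whole training state fits on one GPU
    if state_gb <= gpu_mem_gb * num_gpus:
        return "FSDP or DeepSpeed ZeRO-3"  # fits only when sharded across GPUs
    return "ZeRO-3 with offload, or 3D parallelism"  # exceeds aggregate GPU memory


print(recommend_strategy(1e9, gpu_mem_gb=80, num_gpus=8))    # DDP
print(recommend_strategy(70e9, gpu_mem_gb=80, num_gpus=64))  # FSDP or DeepSpeed ZeRO-3
print(recommend_strategy(70e9, gpu_mem_gb=80, num_gpus=8))   # ZeRO-3 with offload, or 3D parallelism
```

In practice you would leave headroom for activations, so treat a result near a boundary as a signal to move to the next strategy.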
Gradient Accumulation & Effective Batch Size

    # Effective batch = micro_batch × grad_accum × num_gpus
    # Example: micro=2, accum=16, gpus=8 → effective=256
    accumulation_steps = 16

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        output = model(inputs)
        # Normalize so the accumulated gradient matches one large batch
        loss = loss_fn(output, targets) / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

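Why dividing the loss by the number of accumulation steps is the right normalization can be checked on a toy problem. The sketch below uses a one-parameter model with a mean-squared-error loss and compares the gradient of one full batch against the sum of normalized micro-batch gradients. It assumes equal-sized micro-batches; the data, the weight value, and the function names are made up for illustration.

```python
def effective_batch_size(micro_batch, accumulation_steps, num_gpus):
    """Effective batch = micro batch x accumulation steps x number of GPUs."""
    return micro_batch * accumulation_steps * num_gpus


def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / number of accumulation steps."""
    steps = len(xs) // micro_batch  # assumes len(xs) is a multiple of micro_batch
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch, (s + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(abs(full - accumulated) < 1e-9)  # True
print(effective_batch_size(2, 16, 8))  # 256
```

Without the division, the accumulated gradient would be larger by a factor equal to the number of accumulation steps, which acts like an unintended learning-rate increase.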
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (arxiv.org)
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (arxiv.org)
- PyTorch FSDP: Official Documentation (pytorch.org)
- Hugging Face Accelerate: FSDP Guide (huggingface.co)
- DeepSpeed Training: Overview and Features (deepspeed.ai)