The Engineering Codex/LLM Systems Engineering
DAY 4
05 / 09

Distributed Training: DDP, FSDP & DeepSpeed

schedule5 minsignal_cellular_altIntermediate1,036 words
Learn how to train large models across multiple GPUs. Master data parallelism, the ZeRO optimizer stages, FSDP, and when to use each strategy.

What you will learn

01Why Distributed Training?
02Data Parallelism: DDP
03ZeRO: Zero Redundancy Optimizer
04PyTorch FSDP
05QLoRA: Fine-tuning Without Full Sharding
06DeepSpeed

Why Distributed Training?

A Llama-3 70B model with Adam optimizer in BF16 needs roughly 420 GB of GPU memory for training (weights 140GB + gradients 140GB + optimizer states 140GB). You need a strategy to distribute this across multiple GPUs.

70B
Llama-3 params
420 GB
Full training memory
H100 80GB GPUs needed (no sharding)
H100s with ZeRO-3 sharding

Data Parallelism: DDP

The simplest distributed strategy. Each GPU gets a full copy of the model. The dataset is split across GPUs. After each backward pass, gradients are synchronized via all-reduce (ring-allreduce algorithm). Only viable when the model fits on a single GPU.

Python · DDP setup
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group("nccl")
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)

model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Train normally — gradients sync automatically
optimizer.zero_grad()
output = model(batch)
loss.backward()
optimizer.step()
Shell · Launch DDP with torchrun
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node: node 0 (master)
torchrun --nproc_per_node=8          --nnodes=4          --node_rank=0          --master_addr="192.168.1.1"          --master_port=29500          train.py

ZeRO: Zero Redundancy Optimizer

ZeRO (from DeepSpeed, now also in FSDP) eliminates the memory redundancy in DDP. Instead of each GPU holding a full copy of everything, ZeRO shards across GPUs.

StageWhat's ShardedCommunicationMemory Reduction (8 GPUs)
ZeRO-1Optimizer statesAll-reduce gradients4× reduction
ZeRO-2Optimizer states + GradientsAll-reduce gradients8× reduction
ZeRO-3Optimizer + Gradients + ParametersAll-gather parameters64× reduction
ZeRO-InfinityZeRO-3 + NVMe offloadPCIe/NVMe I/OVirtually unlimited
DDP (none) ZeRO-1 ZeRO-2 ZeRO-3 Params (full) Grads (full) Opt states (full) Params (full) Grads (full) opt/N Params (full) g/N o/N p/N g/N o/N 64× N = number of GPUs (8 shown here) · numbers show memory reduction factor shaded cells = shard held by one GPU only; full bars = each GPU holds full copy
ZeRO stages trade communication for memory. Stage 3 shards everything — each GPU holds 1/N of parameters, gradients, and optimizer state.
💡
ZeRO-3 Trade-off
ZeRO-3 shards parameters across GPUs. This means before each layer's forward/backward pass, GPUs must all-gather that layer's parameters. This adds communication overhead (~30% slower than ZeRO-2 on slow inter-GPU links). But it's what enables training models 8–64× larger than what fits on one GPU.

PyTorch FSDP

FSDP (Fully Sharded Data Parallel) is PyTorch's native implementation of ZeRO-3 semantics. Introduced in PyTorch 1.12, it's now the recommended approach for large model training without external dependencies.

Python · FSDP wrapping
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Mixed precision policy (weights: BF16, params: FP32)
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
    auto_wrap_policy=transformer_auto_wrap_policy(
        transformer_layer_cls={LlamaDecoderLayer}
    ),
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
)

QLoRA: Fine-tuning Without Full Sharding

For fine-tuning (not pre-training), you often don't need full FSDP. QLoRA (Quantized Low-Rank Adaptation) combines 4-bit quantized base weights with small trainable adapter matrices — reducing fine-tuning memory by 4–5× while matching full fine-tune quality on most tasks.

💡
How LoRA Works
Instead of updating all weight parameters W (huge), LoRA adds small matrices A and B where rank r ≪ d: ΔW = A × B. For a 4096×4096 attention weight with r=16, LoRA updates 2×4096×16 = 131k parameters instead of 16.7M — a 128× reduction in trainable params. At inference, you merge: W_final = W_base + A×B. QLoRA extends this by keeping W_base in INT4 (frozen) and training only the LoRA adapters in BF16.

DeepSpeed

Microsoft's distributed training framework that pioneered ZeRO. More feature-rich than FSDP but requires external dependency and JSON config files.

JSON · DeepSpeed ZeRO-3 config
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8
}

Tensor Parallelism & Pipeline Parallelism

When data parallelism alone isn't enough, these strategies split the model itself:

Tensor Parallelism
  • Split weight matrices across GPUs
  • Each GPU computes part of each layer
  • Requires all-reduce per layer
  • Needs NVLink (high bandwidth)
  • Good for single-node multi-GPU
Pipeline Parallelism
  • Split model layers across GPUs
  • GPU 0 runs layers 1–8, GPU 1 runs 9–16...
  • Micro-batching to hide idle time
  • Works over slower inter-node links
  • Good for multi-node deployments

The Decision Matrix

Model SizeFits on 1 GPU?Recommended Strategy
< 7BYes (40GB+)DDP — simplest, most efficient
7B–30BPartialFSDP ZeRO-2 or ZeRO-3
70BNoFSDP ZeRO-3 or DeepSpeed ZeRO-3
200B+No3D parallelism (Data + Tensor + Pipeline)
Trillion+NoDeepSpeed ZeRO-Infinity + specialized hardware
🎯
The 2026 Default Stack
For most fine-tuning (7B–70B): use Hugging Face Accelerate + FSDP. It integrates cleanly with Transformers, supports both FSDP and DeepSpeed backends, and has excellent documentation. Add QLoRA for even smaller memory footprint. DDP still works great for <7B models. Only reach for DeepSpeed when you specifically need ZeRO-Infinity (CPU/NVMe offloading).

Gradient Accumulation & Effective Batch Size

Python · Gradient accumulation
# Effective batch = micro_batch × grad_accum × num_gpus
# Example: micro=2, accum=16, gpus=8 → effective=256

accumulation_steps = 16

for step, batch in enumerate(dataloader):
    output = model(batch)
    loss = loss_fn(output) / accumulation_steps  # normalize!
    loss.backward()
    
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
🔑
Key Takeaways
1. DDP = model fits on one GPU, simple and fast. FSDP = model doesn't fit, native PyTorch, recommended default. DeepSpeed = need advanced offloading. 2. ZeRO-3 can reduce per-GPU memory by 64× with 8 GPUs — this is what makes training 70B models on consumer hardware possible. 3. Always use gradient accumulation + mixed precision (BF16) together — they're the low-hanging fruit before touching parallelism strategies.

Finished reading?