DAY 4

05 / 09

Distributed Training: DDP, FSDP & DeepSpeed

schedule5 minsignal_cellular_altIntermediate1,036 words

Learn how to train large models across multiple GPUs. Master data parallelism, the ZeRO optimizer stages, FSDP, and when to use each strategy.

What you will learn

01Why Distributed Training?

02Data Parallelism: DDP

03ZeRO: Zero Redundancy Optimizer

04PyTorch FSDP

05QLoRA: Fine-tuning Without Full Sharding

06DeepSpeed

Why Distributed Training?

A Llama-3 70B model with Adam optimizer in BF16 needs roughly 420 GB of GPU memory for training (weights 140GB + gradients 140GB + optimizer states 140GB). You need a strategy to distribute this across multiple GPUs.

70B

Llama-3 params

420 GB

Full training memory

6×

H100 80GB GPUs needed (no sharding)

2×

H100s with ZeRO-3 sharding

Data Parallelism: DDP

The simplest distributed strategy. Each GPU gets a full copy of the model. The dataset is split across GPUs. After each backward pass, gradients are synchronized via all-reduce (ring-allreduce algorithm). Only viable when the model fits on a single GPU.

Python · DDP setup

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group("nccl")
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)

model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Train normally — gradients sync automatically
optimizer.zero_grad()
output = model(batch)
loss.backward()
optimizer.step()

Shell · Launch DDP with torchrun

# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node: node 0 (master)
torchrun --nproc_per_node=8          --nnodes=4          --node_rank=0          --master_addr="192.168.1.1"          --master_port=29500          train.py

ZeRO: Zero Redundancy Optimizer

ZeRO (from DeepSpeed, now also in FSDP) eliminates the memory redundancy in DDP. Instead of each GPU holding a full copy of everything, ZeRO shards across GPUs.

Stage	What's Sharded	Communication	Memory Reduction (8 GPUs)
ZeRO-1	Optimizer states	All-reduce gradients	4× reduction
ZeRO-2	Optimizer states + Gradients	All-reduce gradients	8× reduction
ZeRO-3	Optimizer + Gradients + Parameters	All-gather parameters	64× reduction
ZeRO-Infinity	ZeRO-3 + NVMe offload	PCIe/NVMe I/O	Virtually unlimited

ZeRO stages trade communication for memory. Stage 3 shards everything — each GPU holds 1/N of parameters, gradients, and optimizer state.

💡

ZeRO-3 Trade-off

ZeRO-3 shards parameters across GPUs. This means before each layer's forward/backward pass, GPUs must all-gather that layer's parameters. This adds communication overhead (~30% slower than ZeRO-2 on slow inter-GPU links). But it's what enables training models 8–64× larger than what fits on one GPU.

PyTorch FSDP

FSDP (Fully Sharded Data Parallel) is PyTorch's native implementation of ZeRO-3 semantics. Introduced in PyTorch 1.12, it's now the recommended approach for large model training without external dependencies.

Python · FSDP wrapping

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Mixed precision policy (weights: BF16, params: FP32)
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
    auto_wrap_policy=transformer_auto_wrap_policy(
        transformer_layer_cls={LlamaDecoderLayer}
    ),
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
)

QLoRA: Fine-tuning Without Full Sharding

For fine-tuning (not pre-training), you often don't need full FSDP. QLoRA (Quantized Low-Rank Adaptation) combines 4-bit quantized base weights with small trainable adapter matrices — reducing fine-tuning memory by 4–5× while matching full fine-tune quality on most tasks.

💡

How LoRA Works

Instead of updating all weight parameters W (huge), LoRA adds small matrices A and B where rank r ≪ d: ΔW = A × B. For a 4096×4096 attention weight with r=16, LoRA updates 2×4096×16 = 131k parameters instead of 16.7M — a 128× reduction in trainable params. At inference, you merge: W_final = W_base + A×B. QLoRA extends this by keeping W_base in INT4 (frozen) and training only the LoRA adapters in BF16.

DeepSpeed

Microsoft's distributed training framework that pioneered ZeRO. More feature-rich than FSDP but requires external dependency and JSON config files.

JSON · DeepSpeed ZeRO-3 config

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8
}

Tensor Parallelism & Pipeline Parallelism

When data parallelism alone isn't enough, these strategies split the model itself:

Tensor Parallelism

Split weight matrices across GPUs
Each GPU computes part of each layer
Requires all-reduce per layer
Needs NVLink (high bandwidth)
Good for single-node multi-GPU

Pipeline Parallelism

Split model layers across GPUs
GPU 0 runs layers 1–8, GPU 1 runs 9–16...
Micro-batching to hide idle time
Works over slower inter-node links
Good for multi-node deployments

The Decision Matrix

Model Size	Fits on 1 GPU?	Recommended Strategy
< 7B	Yes (40GB+)	DDP — simplest, most efficient
7B–30B	Partial	FSDP ZeRO-2 or ZeRO-3
70B	No	FSDP ZeRO-3 or DeepSpeed ZeRO-3
200B+	No	3D parallelism (Data + Tensor + Pipeline)
Trillion+	No	DeepSpeed ZeRO-Infinity + specialized hardware

🎯

The 2026 Default Stack

For most fine-tuning (7B–70B): use Hugging Face Accelerate + FSDP. It integrates cleanly with Transformers, supports both FSDP and DeepSpeed backends, and has excellent documentation. Add QLoRA for even smaller memory footprint. DDP still works great for <7B models. Only reach for DeepSpeed when you specifically need ZeRO-Infinity (CPU/NVMe offloading).

Gradient Accumulation & Effective Batch Size

Python · Gradient accumulation

# Effective batch = micro_batch × grad_accum × num_gpus
# Example: micro=2, accum=16, gpus=8 → effective=256

accumulation_steps = 16

for step, batch in enumerate(dataloader):
    output = model(batch)
    loss = loss_fn(output) / accumulation_steps  # normalize!
    loss.backward()
    
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

🔑

Key Takeaways

1. DDP = model fits on one GPU, simple and fast. FSDP = model doesn't fit, native PyTorch, recommended default. DeepSpeed = need advanced offloading. 2. ZeRO-3 can reduce per-GPU memory by 64× with 8 GPUs — this is what makes training 70B models on consumer hardware possible. 3. Always use gradient accumulation + mixed precision (BF16) together — they're the low-hanging fruit before touching parallelism strategies.

📚 Further reading

ZeRO: Memory Optimizations Toward Training Trillion Parameter Modelsarxiv.org
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learningarxiv.org
PyTorch FSDP — Official Documentationpytorch.org
Hugging Face Accelerate — FSDP Guidehuggingface.co
DeepSpeed Training — Overview and Featuresdeepspeed.ai

Finished reading?