
Model Serving & Autoscaling
Build production-grade LLM serving infrastructure. Learn multi-replica deployments, load balancing strategies, Kubernetes-native serving, and GPU autoscaling.
From Single Server to Production Serving
Running vLLM on one server is the easy part. The hard part is making it handle variable traffic, recover from failures, scale economically, and maintain SLOs. This is where AI infrastructure diverges from standard web serving.
Core Serving Architecture
Load Balancing for LLMs
Standard round-robin load balancing doesn't work well for LLMs. Why? LLM requests have wildly different processing times (a 10-token completion vs a 2000-token completion). You need smarter strategies:
| Strategy | How It Works | Best For | Issue |
|---|---|---|---|
| Round-Robin | Rotate through replicas | Uniform request sizes | Skews under variable loads |
| Least Connections | Route to replica with fewest active requests | General purpose | Ignores GPU utilization |
| Least Queued | Route to replica with shortest queue depth | LLM serving | Requires queue metrics |
| KV Cache Affinity | Route same prefix to same replica (cache hit) | Shared system prompts | Complex routing logic |
| GPU Utilization | Route to replica with lowest GPU util | Variable workloads | Metric lag |
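As a rough illustration of two rows in the table, the sketch below picks a replica by shortest queue, and separately maps a prompt prefix to a stable replica by hashing. The `Replica` class, its fields, and the 64-character prefix length are assumptions made for illustration; they are not part of vLLM or any particular router's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one serving replica as seen by the router."""
    name: str
    queued: int  # requests waiting, e.g. scraped from the replica's metrics


def pick_least_queued(replicas: list[Replica]) -> Replica:
    """Least Queued: choose the replica with the shortest waiting queue."""
    return min(replicas, key=lambda r: r.queued)


def pick_by_prefix(replicas: list[Replica], prompt: str, prefix_chars: int = 64) -> Replica:
    """KV Cache Affinity: hash the prompt prefix so identical prefixes
    land on the same replica (as long as the replica list is unchanged)."""
    prefix = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = [Replica("vllm-0", queued=7), Replica("vllm-1", queued=2), Replica("vllm-2", queued=5)]
system_prompt = "You are a helpful assistant for the billing team. " * 2

print(pick_least_queued(replicas).name)  # vllm-1

# Two different questions behind the same system prompt share a replica.
a = pick_by_prefix(replicas, system_prompt + "How do I get a refund?")
b = pick_by_prefix(replicas, system_prompt + "Where is my invoice?")
print(a.name == b.name)  # True
```

Plain modulo hashing remaps most prefixes whenever the replica count changes, so a router that scales frequently would likely want consistent hashing instead.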
Kubernetes-Native LLM Serving
Modern production LLM deployments are Kubernetes-native. Key components:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3.1-8B-Instruct"
            - "--enable-prefix-caching"
            - "--gpu-memory-utilization=0.90"
          resources:
            limits:
              nvidia.com/gpu: 1  # 1 GPU per replica
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
```
GPU Autoscaling
By default, the Horizontal Pod Autoscaler (HPA) scales on CPU and memory metrics, which say little about load on a GPU-bound inference server. LLM autoscaling needs GPU-specific and request-level metrics.
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA extends HPA to scale on custom metrics like queue depth, GPU utilization, or request latency from Prometheus.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_request_queue_depth
        threshold: "10"  # scale up if queue > 10
        query: |
          avg(vllm:num_requests_waiting)
```
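KEDA feeds the metric to the HPA, whose documented scaling rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured bounds. The sketch below works through that arithmetic with the threshold and replica bounds from the manifest above. The function name is hypothetical, and the sketch ignores the HPA tolerance band, stabilization windows, and the way KEDA's metric type setting changes how the value is averaged.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """General HPA scaling rule, clamped to the min/max replica counts."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 2 replicas averaging 25 waiting requests each, target of 10 per replica
print(desired_replicas(2, 25.0, 10.0))   # 5
# Load drops to 2 waiting requests on average across 5 replicas
print(desired_replicas(5, 2.0, 10.0))    # 1
# A very large spike is capped at maxReplicaCount
print(desired_replicas(4, 100.0, 10.0))  # 8
```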
Key Autoscaling Metrics for LLM
- Queue depth: `vllm:num_requests_waiting`, the primary trigger for scale-up
- GPU KV cache utilization: `vllm:gpu_cache_usage_perc`; scale up before the cache fills and requests start waiting or being preempted
- TTFT P95: scale if the 95th-percentile time to first token exceeds the SLA
- Tokens/sec: capacity planning and scale target
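To make these signals concrete, the sketch below pulls two gauges out of Prometheus text-format output and turns them into a scale-up hint. The sample text, label values, the 0.9 cache threshold, and the helper names are illustrative assumptions; real `/metrics` output contains many more series, and a production setup would normally leave this to Prometheus and KEDA.

```python
import re

# Hypothetical excerpt of a replica's /metrics output (Prometheus text format).
SAMPLE_METRICS = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-3.1-8b"} 14.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama-3.1-8b"} 0.93
"""

# Metric name, optional {labels}, then the sample value.
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def parse_gauges(text: str) -> dict[str, float]:
    """Return {metric_name: value}, ignoring comment lines and labels.
    If a name appears more than once, the last sample wins."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


def should_scale_up(gauges: dict[str, float], queue_limit: float = 10,
                    cache_limit: float = 0.9) -> bool:
    """Scale-up hint: the queue is too deep or the KV cache is nearly full."""
    waiting = gauges.get("vllm:num_requests_waiting", 0.0)
    cache = gauges.get("vllm:gpu_cache_usage_perc", 0.0)
    return waiting > queue_limit or cache > cache_limit


gauges = parse_gauges(SAMPLE_METRICS)
print(gauges["vllm:num_requests_waiting"], should_scale_up(gauges))  # 14.0 True
```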
Cold Start Problem
The biggest challenge in LLM autoscaling is cold start: loading a 70B model from disk can take 5–15 minutes, depending on storage throughput. By the time new replicas are ready, the traffic spike may already have passed.
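A back-of-the-envelope estimate shows where the minutes go. The sketch assumes 2 bytes per parameter (FP16/BF16 weights) and treats storage read throughput as the only bottleneck. The throughput figures are illustrative assumptions, and real startup also includes image pulls, CUDA initialization, and warm-up.

```python
def weight_size_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def load_time_minutes(size_gb: float, read_gb_per_s: float) -> float:
    """Time to read the weights at a sustained read throughput."""
    return size_gb / read_gb_per_s / 60


size = weight_size_gb(70)  # 70B parameters at 2 bytes each -> 140 GB

# Assumed sustained read speeds in GB/s; measure your own storage.
for label, throughput in [("network volume", 0.25), ("local NVMe", 2.0)]:
    print(f"{label}: {load_time_minutes(size, throughput):.1f} min")
# network volume: 9.3 min
# local NVMe: 1.2 min
```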
Production Stacks: llm-d and vLLM Production
Kubernetes + vLLM + KEDA is the foundation, but production teams often need higher-level abstractions for prefix-aware scheduling, disaggregated prefill/decode, and multi-node coordination. Three projects address this:
- llm-d (Red Hat / community): Kubernetes-native LLM serving stack with prefix-aware routing, disaggregated prefill/decode nodes, and KEDA integration baked in
- vLLM Production Stack: A router layer for vLLM that adds KV cache affinity routing, request queuing, and multi-replica coordination — vLLM's official answer to production-scale routing
- NVIDIA Dynamo: NVIDIA's disaggregated serving framework optimized for H100/H200 with NVLink; splits prefill and decode onto separate GPU pools
Multi-Model Serving & Router Pattern
Production systems often need multiple models: a small fast model for simple queries, a large model for complex ones. Implement a router that classifies complexity and routes accordingly.
```python
async def route_request(prompt: str, client_config: dict):
    # llm_8b_client and llm_70b_client are assumed to be async callables defined elsewhere.
    # Simple heuristic: estimated token count (about 1.3 tokens per word) determines the model
    token_estimate = len(prompt.split()) * 1.3
    if token_estimate < 500 and not client_config.get('high_quality'):
        # Route to fast 8B model
        return await llm_8b_client(prompt)
    else:
        # Route to powerful 70B model
        return await llm_70b_client(prompt)
```
Failover & Fallback Strategy
- Health checks: vLLM exposes `/health` and `/metrics` endpoints; use `/health` in readiness probes and scrape `/metrics` for monitoring
- Circuit breaker: if a replica fails N requests in M seconds, stop sending it traffic
- Fallback chain: Self-hosted → backup cloud GPU → OpenAI API
- Graceful drain: don't kill a replica mid-generation; wait for in-flight requests to complete
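The circuit breaker rule ("N failures in M seconds") can be sketched with a sliding window of failure timestamps. The class name, default values, and the fixed cooldown before traffic resumes are assumptions added for illustration; the sketch also skips the half-open probing state that fuller implementations usually include.

```python
import time
from collections import deque
from typing import Optional


class CircuitBreaker:
    """Open the circuit after `max_failures` failures within `window_s` seconds."""

    def __init__(self, max_failures: int = 5, window_s: float = 30.0,
                 cooldown_s: float = 60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s  # assumed: how long the replica stays out of rotation
        self.clock = clock            # injectable so the demo below can fake time
        self.failures = deque()       # timestamps of recent failures
        self.opened_at: Optional[float] = None

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now

    def allow_request(self) -> bool:
        """False while the circuit is open; it closes again after the cooldown."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self.opened_at = None
            self.failures.clear()
            return True
        return False


fake_now = [0.0]
breaker = CircuitBreaker(max_failures=3, window_s=10.0, cooldown_s=20.0,
                         clock=lambda: fake_now[0])

for t in (1.0, 2.0, 3.0):  # three failures within 10 seconds
    fake_now[0] = t
    breaker.record_failure()
blocked = breaker.allow_request()
print(blocked)  # False: the replica is taken out of rotation

fake_now[0] = 25.0  # 22 seconds after the circuit opened
resumed = breaker.allow_request()
print(resumed)  # True: cooldown elapsed, traffic resumes
```

A router would keep one breaker per replica and skip any replica whose `allow_request()` returns False, falling through to the next entry in the fallback chain.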