
Model Serving & Autoscaling

5 min · Intermediate · 996 words
Build production-grade LLM serving infrastructure. Learn multi-replica deployments, load balancing strategies, Kubernetes-native serving, and GPU autoscaling.

What you will learn

01 From Single Server to Production Serving
02 Core Serving Architecture
03 Load Balancing for LLMs
04 Kubernetes-Native LLM Serving
05 GPU Autoscaling
06 Cold Start Problem

From Single Server to Production Serving

Running vLLM on one server is the easy part. The hard part is making it handle variable traffic, recover from failures, scale economically, and maintain SLOs. This is where AI infrastructure diverges from standard web serving.

Core Serving Architecture

🏗️
The Reference Architecture
Client → API Gateway (rate limiting, auth) → Load Balancer → vLLM Replicas (N GPUs each) → Monitoring. For multi-region: add a global load balancer and per-region replica pools. Always have a fallback: if self-hosted fails → OpenAI API as safety net.
[Figure: Client request → API Gateway (auth, rate limit) → Load Balancer (queue-depth aware) → vLLM replica 1 (GPU ×N), vLLM replica 2 (GPU ×N), … replica N (autoscaled by KEDA)]
Production LLM serving stack. The load balancer routes on queue depth rather than round-robin, and KEDA scales replicas on GPU-specific metrics.
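
The fallback rule in the reference architecture can be sketched as an ordered list of backends that are tried in turn. This is a minimal sketch, not a prescribed implementation: `complete_with_fallback`, the `Backend` type, and the backend names in the usage note are hypothetical, and each backend is assumed to be an async callable that takes a prompt and returns text.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple

# Hypothetical type: any async function that maps a prompt to a completion.
Backend = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str,
    backends: Sequence[Tuple[str, Backend]],
    timeout_s: float = 30.0,
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, completion)."""
    errors = []
    for name, call in backends:
        try:
            # A slow backend counts as a failed one, so bound each attempt.
            result = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, result
        except Exception as exc:  # timeouts, connection errors, HTTP errors
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

A caller would pass something like `[("self_hosted", call_vllm), ("hosted_api", call_hosted)]`, where both functions are wrappers you write around the respective clients. Keeping the chain as data makes it easy to add a backup GPU pool between the two later.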

Load Balancing for LLMs

Standard round-robin load balancing doesn't work well for LLMs. Why? LLM requests have wildly different processing times (a 10-token completion vs a 2000-token completion). You need smarter strategies:

| Strategy | How It Works | Best For | Issue |
|---|---|---|---|
| Round-Robin | Rotate through replicas | Uniform request sizes | Skews under variable loads |
| Least Connections | Route to replica with fewest active requests | General purpose | Ignores GPU utilization |
| Least Queued | Route to replica with shortest queue depth | LLM serving | Requires queue metrics |
| KV Cache Affinity | Route same prefix to same replica (cache hit) | Shared system prompts | Complex routing logic |
| GPU Utilization | Route to replica with lowest GPU util | Variable workloads | Metric lag |
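
As an illustration of the "Least Queued" row, here is a minimal sketch of the selection step. The `Replica` class and `pick_least_queued` function are hypothetical names; the sketch assumes the router already has fresh per-replica counts, for example scraped from each replica's `vllm:num_requests_waiting` and `vllm:num_requests_running` metrics.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Replica:
    name: str
    waiting: int  # requests queued but not yet scheduled
    running: int  # requests currently being processed


def pick_least_queued(replicas: Sequence[Replica]) -> Replica:
    """Return the replica with the shortest queue; break ties on running load."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: (r.waiting, r.running))
```

Breaking ties on running requests is a design choice in this sketch: two replicas with empty queues are not equally loaded if one is already decoding twice as many sequences.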
💡
KV Cache Affinity Routing
Hash the request's system prompt prefix and always route requests with that prefix to the same replica. After the first request, that replica's KV cache holds the prefix, so later requests sharing it skip recomputing those tokens during prefill. This is what NVIDIA Dynamo, vLLM Production Stack, and llm-d implement as "prefix-aware scheduling." It can improve TTFT by 60–80% for chatbot workloads.
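
The hashing idea can be sketched in a few lines. This is an illustrative sketch only, not how Dynamo, vLLM Production Stack, or llm-d implement prefix-aware scheduling. The function name `pick_replica_for_prefix` and the `prefix_chars` cutoff are assumptions, and the sketch uses rendezvous (highest-random-weight) hashing rather than a plain modulo so that adding or removing a replica only moves the prefixes that lived on it.

```python
import hashlib
from typing import Sequence


def pick_replica_for_prefix(
    system_prompt: str,
    replicas: Sequence[str],
    prefix_chars: int = 512,  # hypothetical cutoff for "the shared prefix"
) -> str:
    """Map a prompt prefix to a replica using rendezvous hashing."""
    if not replicas:
        raise ValueError("no replicas available")
    prefix = system_prompt[:prefix_chars]

    def score(replica: str) -> int:
        # Each (replica, prefix) pair gets a stable pseudo-random score.
        digest = hashlib.sha256(f"{replica}|{prefix}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The replica with the highest score for this prefix wins.
    return max(replicas, key=score)
```

In practice a router would combine this with a load check, since a single very popular system prompt would otherwise pin all of its traffic to one replica.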

Kubernetes-Native LLM Serving

Modern production LLM deployments are typically Kubernetes-native. A minimal vLLM Deployment looks like this:

YAML · vLLM Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B-Instruct"
        - "--enable-prefix-caching"
        - "--gpu-memory-utilization=0.90"
        resources:
          limits:
            nvidia.com/gpu: 1  # 1 GPU per replica
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token

GPU Autoscaling

Out of the box, the HPA (Horizontal Pod Autoscaler) scales on CPU and memory metrics, which say little about how loaded a GPU inference server is. LLM autoscaling needs GPU-specific and queue-specific metrics.

KEDA (Kubernetes Event-Driven Autoscaling)

KEDA extends HPA to scale on custom metrics like queue depth, GPU utilization, or request latency from Prometheus.

YAML · KEDA ScaledObject for vLLM
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_request_queue_depth
      threshold: "10"  # target of about 10 waiting requests per replica
      query: |
        sum(vllm:num_requests_waiting)

Key Autoscaling Metrics for LLM

  • Queue depth (vllm:num_requests_waiting): primary trigger for scale-up
  • KV cache utilization (vllm:gpu_cache_usage_perc): scale up before the cache fills and requests start being preempted
  • TTFT P95: scale up if 95th-percentile TTFT exceeds the SLO
  • Tokens/sec: capacity planning and scale target
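
To see how a queue-depth trigger turns into a replica count, here is a small sketch of the arithmetic. It assumes the Prometheus query returns the total number of waiting requests across all replicas and that the threshold acts as a per-replica target, which is how an average-value external metric behaves in the HPA. The function name is hypothetical, and the sketch ignores HPA tolerance and stabilization windows.

```python
import math


def desired_replicas(
    total_waiting: float,
    target_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Replicas needed so each one holds about `target_per_replica` waiting requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    # Clamp to the configured bounds, as minReplicaCount/maxReplicaCount do.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(35, 10))   # 4
print(desired_replicas(0, 10))    # 1 (floor at min_replicas)
print(desired_replicas(500, 10))  # 8 (capped at max_replicas)
```

The same arithmetic shows why the query should aggregate with a sum: an average over replicas stays flat as replicas are added, so the computed replica count no longer tracks total load.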

Cold Start Problem

The biggest challenge in LLM autoscaling: loading a 70B model from disk can take 5–15 minutes. By the time new replicas are ready, the traffic spike may already have passed.

⚠️
Cold Start Mitigations
1. Keep a minimum of 1–2 warm replicas always running.
2. Use faster storage (with NVMe RAID, an 8B model loads in 60–90 seconds).
3. Pre-pull container images and pre-warm nodes.
4. Scale predictively based on traffic patterns (time of day).
5. Keep weights resident: hold model weights in GPU memory across scheduled downtime instead of reloading them.
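
A back-of-the-envelope estimate shows why storage speed dominates cold start. The sketch below computes only the time to read the weights, assuming 2 bytes per parameter (FP16/BF16). The throughput figures are hypothetical examples, not measurements, and real cold starts also include image pulls, process startup, and warm-up, so treat the result as a lower bound.

```python
def estimate_load_seconds(
    params_billion: float,
    bytes_per_param: float,
    read_gb_per_s: float,
) -> float:
    """Lower bound on weight load time: weight size divided by read throughput."""
    if read_gb_per_s <= 0:
        raise ValueError("read_gb_per_s must be positive")
    # 1e9 params * bytes per param / 1e9 bytes per GB = GB of weights
    weights_gb = params_billion * bytes_per_param
    return weights_gb / read_gb_per_s


# 70B in FP16 is about 140 GB of weights.
print(estimate_load_seconds(70, 2, 0.25))  # 560.0 s at a hypothetical 0.25 GB/s network disk
print(estimate_load_seconds(70, 2, 5.0))   # 28.0 s at a hypothetical 5 GB/s NVMe array
```

Running the same numbers for your own storage tier gives a quick sense of whether reactive autoscaling can ever keep up, or whether warm replicas and predictive scaling have to carry the load.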

Production Stacks: llm-d, vLLM Production Stack, and NVIDIA Dynamo

Kubernetes + vLLM + KEDA is the foundation, but production teams often need higher-level abstractions for prefix-aware scheduling, disaggregated prefill/decode, and multi-node coordination. Three projects address this:

  • llm-d (Red Hat / community): Kubernetes-native LLM serving stack with prefix-aware routing, disaggregated prefill/decode nodes, and KEDA integration baked in
  • vLLM Production Stack: A router layer for vLLM that adds KV cache affinity routing, request queuing, and multi-replica coordination — vLLM's official answer to production-scale routing
  • NVIDIA Dynamo: NVIDIA's disaggregated serving framework optimized for H100/H200 with NVLink; splits prefill and decode onto separate GPU pools
✂️
Disaggregated Prefill / Decode
Prefill (processing the prompt) is compute-bound. Decode (generating tokens) is memory-bandwidth-bound. Because the two phases have opposing resource profiles, running them on the same GPUs leaves one resource underused at any given moment. Disaggregated serving runs prefill on compute-optimized nodes and decode on bandwidth-optimized nodes, then transfers the KV cache between them. This can improve GPU utilization by 30–50% at large scale but adds architectural complexity.

Multi-Model Serving & Router Pattern

Production systems often need multiple models: a small fast model for simple queries, a large model for complex ones. Implement a router that classifies complexity and routes accordingly.

Python · Simple model router
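# Assumes llm_8b_client and llm_70b_client are async callables defined
# elsewhere, e.g. thin wrappers around each model's serving endpoint.
async def route_request(prompt: str, client_config: dict):
    # Simple heuristic: estimate ~1.3 tokens per whitespace-separated word
    token_estimate = len(prompt.split()) * 1.3

    # Short prompts go to the fast 8B model unless the caller asks for quality
    if token_estimate < 500 and not client_config.get("high_quality"):
        return await llm_8b_client(prompt)

    # Everything else goes to the more capable 70B model
    return await llm_70b_client(prompt)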

Failover & Fallback Strategy

  • Health checks: vLLM exposes /health and /metrics endpoints; point readiness and liveness probes at /health and scrape /metrics with Prometheus
  • Circuit breaker: if a replica fails N requests in M seconds, stop sending it traffic
  • Fallback chain: Self-hosted → backup cloud GPU → OpenAI API
  • Graceful drain: don't kill a replica mid-generation — wait for in-flight requests to complete
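
The circuit-breaker rule above ("N failures in M seconds") can be sketched as a small per-replica object. This is a minimal sketch under assumptions: the class name, default thresholds, and the cooldown behavior (reopen to traffic after a fixed pause) are illustrative choices, not part of vLLM or any specific router.

```python
import time
from collections import deque


class CircuitBreaker:
    """Stop routing to a replica after `max_failures` failures in `window_s` seconds."""

    def __init__(self, max_failures=5, window_s=30.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock          # injectable so tests can control time
        self._failures = deque()     # timestamps of recent failures
        self._opened_at = None       # None means the circuit is closed

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and let traffic probe again.
            self._opened_at = None
            self._failures.clear()
            return True
        return False
```

A router would keep one breaker per replica, call `record_failure()` on errors or timeouts, and skip any replica whose `allow_request()` returns False when choosing a target.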
🔑
Key Takeaways
1. Round-robin load balancing is wrong for LLMs; use queue depth or KV cache affinity.
2. KEDA + Prometheus is the right autoscaling stack for GPU workloads.
3. Cold start is your biggest scaling challenge; plan for it with minimum replicas and predictive scaling.
4. Always build a fallback chain: self-hosted models can fail and you need a safety net.
