DAY 5

06 / 09

Model Serving & Autoscaling

5 min · Intermediate · 996 words

Build production-grade LLM serving infrastructure. Learn multi-replica deployments, load balancing strategies, Kubernetes-native serving, and GPU autoscaling.

What you will learn

01 From Single Server to Production Serving
02 Core Serving Architecture
03 Load Balancing for LLMs
04 Kubernetes-Native LLM Serving
05 GPU Autoscaling
06 Cold Start Problem

From Single Server to Production Serving

Running vLLM on one server is the easy part. The hard part is making it handle variable traffic, recover from failures, scale economically, and maintain SLOs. This is where AI infrastructure diverges from standard web serving.

Core Serving Architecture

🏗️

The Reference Architecture

Client → API Gateway (rate limiting, auth) → Load Balancer → vLLM replicas (N GPUs each), with monitoring observing every layer rather than sitting in the request path. For multi-region deployments, add a global load balancer and per-region replica pools. Always have a fallback: if the self-hosted stack fails, route to a hosted API such as OpenAI as a safety net.

Production LLM serving stack. The load balancer routes on queue depth — not round-robin — and KEDA scales replicas on GPU-specific metrics.
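The fallback idea from the reference architecture can be sketched in a few lines. This is a minimal illustration, not a production client: the `Backend` callables, the backend names, and the 30-second timeout are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical backend signature: prompt in, completion text out.
Backend = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str,
    backends: Sequence[tuple[str, Backend]],
    timeout_s: float = 30.0,
) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in backends:
        try:
            result = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, result
        except Exception as exc:  # timeouts and connection errors both land here
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Ordering the list as self-hosted first and the hosted API last keeps the paid fallback off the hot path until the earlier backends actually fail.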

Load Balancing for LLMs

Standard round-robin load balancing doesn't work well for LLMs. Why? LLM requests have wildly different processing times (a 10-token completion vs a 2000-token completion). You need smarter strategies:

Strategy	How It Works	Best For	Issue
Round-Robin	Rotate through replicas	Uniform request sizes	Skews under variable loads
Least Connections	Route to replica with fewest active requests	General purpose	Ignores GPU utilization
Least Queued	Route to replica with shortest queue depth	LLM serving	Requires queue metrics
KV Cache Affinity	Route same prefix to same replica (cache hit)	Shared system prompts	Complex routing logic
GPU Utilization	Route to replica with lowest GPU util	Variable workloads	Metric lag
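As a rough sketch of the "Least Queued" row, a router can pick the replica with the shortest queue and break ties on requests already running. The `Replica` type and its fields are hypothetical; in practice the numbers would come from each replica's metrics.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    url: str
    waiting: int  # requests queued on this replica
    running: int  # requests currently being processed


def pick_least_queued(replicas: list[Replica]) -> Replica:
    """Return the replica with the shortest queue, breaking ties on running requests."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: (r.waiting, r.running))
```

Because queue metrics are sampled periodically, a real router would also need to account for requests it has dispatched since the last sample.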

💡

KV Cache Affinity Routing

Hash the request's system prompt prefix and route every request with that prefix to the same replica. After the first request, that replica holds the prefix in its KV cache, so later requests skip recomputing it during prefill. NVIDIA Dynamo, vLLM Production Stack, and llm-d implement this idea as "prefix-aware scheduling." It can cut TTFT by 60–80% for chatbot workloads that share a long system prompt.
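The hashing step can be illustrated with a short sketch. It is a simplification and not how Dynamo, vLLM Production Stack, or llm-d implement prefix-aware scheduling; the function name and the 512-character prefix length are assumptions.

```python
import hashlib


def replica_for_prefix(system_prompt: str, replicas: list[str], prefix_chars: int = 512) -> str:
    """Map a prompt prefix to a replica so identical prefixes always land on the same one."""
    if not replicas:
        raise ValueError("no replicas available")
    prefix = system_prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    # Use the first 8 bytes of the digest as a stable integer key.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

Plain modulo hashing remaps most prefixes whenever the replica count changes, which discards warm caches during scale events. Consistent or rendezvous hashing reduces that churn, and a production router would also cap load per replica so one popular prefix cannot create a hot spot.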

Kubernetes-Native LLM Serving

Most modern production LLM deployments run on Kubernetes. The core building block is a Deployment that runs one vLLM server per replica, each with its own GPU:

YAML · vLLM Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B-Instruct"
        - "--enable-prefix-caching"
        - "--gpu-memory-utilization=0.90"
        resources:
          limits:
            nvidia.com/gpu: 1  # 1 GPU per replica
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token

GPU Autoscaling

The standard HPA (Horizontal Pod Autoscaler) scales on CPU and memory metrics by default, and those correlate poorly with GPU load. LLM autoscaling needs GPU- and queue-specific metrics.

KEDA (Kubernetes Event-Driven Autoscaling)

KEDA extends HPA to scale on custom metrics like queue depth, GPU utilization, or request latency from Prometheus.

YAML · KEDA ScaledObject for vLLM

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_request_queue_depth
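      # The threshold is a per-replica average target, so query the total:
      # KEDA adds replicas until total waiting / replicas <= 10.
      threshold: "10"
      query: |
        sum(vllm:num_requests_waiting)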
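To see what a threshold of 10 implies, the arithmetic can be sketched as below. This assumes the trigger is evaluated as a per-replica average target and takes the total number of waiting requests across all replicas as input. It ignores the HPA's tolerance band and stabilization windows, so treat it as an approximation of the steady-state result only.

```python
import math


def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Approximate the replica count an average-value target would ask for."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    raw = math.ceil(total_waiting / threshold)
    # Clamp to the configured minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(35, 10))   # 4: ceil(35 / 10)
print(desired_replicas(0, 10))    # 1: clamped to the minimum
print(desired_replicas(500, 10))  # 8: clamped to the maximum
```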

Key Autoscaling Metrics for LLM

Queue depth: vllm:num_requests_waiting — primary trigger for scale-up
KV cache utilization: vllm:gpu_cache_usage_perc, the fraction of KV cache blocks in use; scale up before the cache saturates and requests start queuing or being preempted
TTFT P95: scale if 95th percentile TTFT exceeds SLA
Tokens/sec: capacity planning and scale target
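For quick experiments without Prometheus, these gauges can be read straight from a replica's /metrics output. The parser below is a minimal sketch that assumes samples carry no trailing timestamp; the sample text and its `model_name` label are illustrative, not captured from a real server.

```python
def read_gauge(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one metric from Prometheus text exposition output."""
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample looks like: name{label="x"} 3.0   or   name 3.0
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
            found = True
    if not found:
        raise KeyError(metric_name)
    return total


sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 7.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""
print(read_gauge(sample, "vllm:num_requests_waiting"))  # 7.0
```

Metric names can differ between vLLM versions, so check the names your deployment actually exports before wiring them into a scaler.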

Cold Start Problem

The biggest challenge in LLM autoscaling: loading a 70B model from disk takes 5–15 minutes. By the time new replicas are ready, the traffic spike has passed.

⚠️

Cold Start Mitigations

1. Keep at least 1–2 warm replicas running at all times.
2. Use faster storage: with NVMe RAID, an 8B model loads in 60–90 seconds.
3. Pre-pull container images and pre-warm nodes.
4. Scale predictively from traffic patterns such as time of day.
5. Keep weights resident: leave the model loaded in GPU memory through scheduled downtime instead of unloading it.
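A back-of-the-envelope estimate shows why storage speed matters. The sketch below computes only the time to read the weights at a sustained throughput; the throughput figures are assumptions, and real cold starts add image pulls, scheduling, and engine warm-up on top.

```python
def estimate_load_seconds(params_billion: float, bytes_per_param: float,
                          read_gb_per_s: float) -> float:
    """Lower-bound time to read model weights at a sustained read throughput."""
    if read_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB
    return weights_gb / read_gb_per_s


# 70B parameters at 2 bytes each (16-bit weights) is 140 GB.
print(estimate_load_seconds(70, 2, 0.5))  # 280.0 s at an assumed 0.5 GB/s
print(estimate_load_seconds(8, 2, 2.0))   # 8.0 s for 8B at an assumed 2 GB/s
```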

Production Stacks: llm-d, vLLM Production Stack, and NVIDIA Dynamo

Kubernetes + vLLM + KEDA is the foundation, but production teams often need higher-level abstractions for prefix-aware scheduling, disaggregated prefill/decode, and multi-node coordination. Three projects address this:

llm-d (Red Hat / community): Kubernetes-native LLM serving stack with prefix-aware routing, disaggregated prefill/decode nodes, and KEDA integration baked in
vLLM Production Stack: A router layer for vLLM that adds KV cache affinity routing, request queuing, and multi-replica coordination — vLLM's official answer to production-scale routing
NVIDIA Dynamo: NVIDIA's disaggregated serving framework optimized for H100/H200 with NVLink; splits prefill and decode onto separate GPU pools

✂️

Disaggregated Prefill / Decode

Prefill (processing the prompt) is compute-bound. Decode (generating tokens) is memory-bandwidth-bound. These two phases have opposing resource profiles. Disaggregated serving runs prefill on compute-optimized nodes and decode on bandwidth-optimized nodes, then transfers the KV cache between them. This can improve GPU utilization by 30–50% at large scale but adds architectural complexity.
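The cost of the KV cache transfer is easy to estimate. The sketch below uses the standard per-token size formula (keys plus values, for every layer and KV head). The example numbers assume Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) with 16-bit cache values.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for num_tokens: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
prompt_2k = kv_cache_bytes(2000, num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)             # 131072 bytes, i.e. 128 KiB per token
print(prompt_2k / 2**20)     # 250.0 MiB for a 2000-token prompt
```

At roughly 250 MiB per 2000-token prompt, the interconnect between prefill and decode nodes becomes part of the latency budget, which is one source of the added complexity.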

Multi-Model Serving & Router Pattern

Production systems often need multiple models: a small fast model for simple queries, a large model for complex ones. Implement a router that classifies complexity and routes accordingly.

Python · Simple model router

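from typing import Awaitable, Callable

LLMClient = Callable[[str], Awaitable[str]]  # prompt in, completion out


async def route_request(prompt: str, client_config: dict,
                        llm_8b_client: LLMClient, llm_70b_client: LLMClient) -> str:
    """Send short, non-premium prompts to the 8B model and everything else to 70B."""
    # Simple heuristic: estimate ~1.3 tokens per whitespace-separated word
    token_estimate = len(prompt.split()) * 1.3

    if token_estimate < 500 and not client_config.get("high_quality"):
        # Route to fast 8B model
        return await llm_8b_client(prompt)
    # Route to powerful 70B model
    return await llm_70b_client(prompt)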

Failover & Fallback Strategy

Health checks: vLLM exposes a /health endpoint for liveness and readiness probes and a /metrics endpoint for Prometheus scraping
Circuit breaker: if a replica fails N requests in M seconds, stop sending it traffic
Fallback chain: Self-hosted → backup cloud GPU → OpenAI API
Graceful drain: don't kill a replica mid-generation — wait for in-flight requests to complete
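The circuit breaker rule ("N failures in M seconds") can be sketched with a sliding window of failure timestamps. The class name, defaults, and cooldown behavior are assumptions for illustration; the injectable `clock` exists only to make the logic testable.

```python
import time
from collections import deque
from typing import Callable, Optional


class CircuitBreaker:
    """Open the circuit after max_failures failures within window_s seconds."""

    def __init__(self, max_failures: int = 5, window_s: float = 30.0,
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None

    def allow_request(self) -> bool:
        """False while open; after the cooldown, let traffic probe the replica again."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            self._opened_at = None
            self._failures.clear()
            return True
        return False
```

A router would keep one breaker per replica, call `allow_request()` before dispatching, and report each outcome with `record_success()` or `record_failure()`.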

🔑

Key Takeaways

1. Round-robin load balancing is wrong for LLMs; use queue depth or KV cache affinity.
2. KEDA + Prometheus is the right autoscaling stack for GPU workloads.
3. Cold start is your biggest scaling challenge; plan for it with minimum replicas and predictive scaling.
4. Always build a fallback chain: self-hosted models can fail and you need a safety net.

📚 Further reading

KEDA: Prometheus Scaler Documentation (keda.sh)
vLLM: Deploying with Kubernetes (vllm.ai)
KubeRay: Kubernetes Operator for Ray (Ray Serve) (github.com)
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (usenix.org)
Google GKE: GPU Node Pools Documentation (cloud.google.com)
