vLLM, TensorRT-LLM & Inference Engines

5 min · Intermediate · 1,088 words
Deep dive into the two dominant LLM inference engines. Understand PagedAttention, Flash Attention, tensor parallelism, and how to deploy models for maximum performance.

What you will learn

01 The Inference Engine Landscape
02 vLLM Deep Dive
03 Flash Attention
04 TensorRT-LLM
05 SGLang: The Third Option
06 Deploying vLLM Behind an API Gateway

The Inference Engine Landscape

You could run inference with plain PyTorch and Hugging Face Transformers, but you'd typically see very low GPU utilization, on the order of 5–10%. Production inference engines exist to solve the hard systems problems: memory management, scheduling, parallelism, and kernel optimization.

Engines in this landscape: vLLM, TensorRT-LLM, SGLang, Ollama (dev), llama.cpp (edge)

vLLM Deep Dive

PagedAttention — The Core Innovation

vLLM's key insight: KV cache memory waste was the #1 bottleneck before vLLM. Traditional systems pre-allocate one contiguous chunk per request, sized for the maximum possible sequence length, to hold that request's KV cache. According to the vLLM authors, this wastes 60–80% of KV cache memory through fragmentation and over-reservation.

📄
PagedAttention: Inspired by OS Virtual Memory
PagedAttention breaks the KV cache into fixed-size blocks (16 tokens each by default) and maintains a block table per request, much like OS page tables. Memory is allocated on demand, not upfront. Different requests can share physical blocks (this is what enables prefix caching). Memory waste drops from 60–80% to under 4%, because only the last block of each sequence can be partly empty.
[Figure: logical KV blocks per request (R1: B0, B1, B2; R2: B0, B1) mapped to physical VRAM pages pg 0 to pg 5. The prefix shared by R1 and R2 maps to a single shared page; two pages are free.]
PagedAttention maps logical KV blocks to non-contiguous physical pages — like OS virtual memory. Shared prefix pages are referenced by multiple requests, eliminating redundant computation.
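The block-table idea can be sketched in a few lines of Python. This is a toy model, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `fork_prefix` are hypothetical, and real engines track much more (eviction, copy-on-write, GPU tensors).

```python
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (the default cited above)


class BlockAllocator:
    """Toy allocator: hands out physical block IDs and reference-counts them."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = defaultdict(int)

    def allocate(self):
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A second request points at the same physical block.
        self.refcount[block] += 1
        return block


class Sequence:
    """One request: block_table[i] is the physical block for logical block i."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork_prefix(self, other, num_full_blocks):
        # Reuse another sequence's full prefix blocks instead of recomputing them.
        for block in other.block_table[:num_full_blocks]:
            self.block_table.append(self.allocator.share(block))
        self.num_tokens = num_full_blocks * BLOCK_SIZE


allocator = BlockAllocator(num_blocks=6)

r1 = Sequence(allocator)
for _ in range(40):              # 40 tokens need 3 blocks (16 + 16 + 8)
    r1.append_token()

r2 = Sequence(allocator)
r2.fork_prefix(r1, num_full_blocks=1)   # share r1's first block
for _ in range(10):              # 10 more tokens need 1 new block
    r2.append_token()

print(r1.block_table)            # [0, 1, 2]
print(r2.block_table)            # [0, 3]
print(len(allocator.free))       # 2
```

Both requests reference physical block 0, so its KV entries exist once in memory, and the only wasted space is the unfilled tail of each request's last block.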
Shell · vLLM server startup
# Start vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --tensor-parallel-size 1 \
    --port 8000

Key vLLM Configuration Parameters

Parameter | What It Does | Default
gpu-memory-utilization | Fraction of VRAM reserved for vLLM (weights + KV cache) | 0.90
max-num-seqs | Maximum concurrent sequences in flight | 256
max-model-len | Maximum total context length (prompt + output) | Model's max
enable-prefix-caching | Cache KV for shared prompt prefixes (huge win for chat) | False
tensor-parallel-size | Number of GPUs for tensor parallelism | 1
block-size | PagedAttention block size in tokens | 16
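To see how these parameters interact, it helps to estimate how many KV blocks fit in the memory budget. The sketch below is back-of-the-envelope arithmetic, not how vLLM profiles memory: the model shape (32 layers, 8 KV heads, head dimension 128) is the published Llama-3.1-8B configuration, while the 80 GiB GPU, the 16 GiB weight footprint, and the 4 GiB overhead are assumptions chosen for illustration.

```python
# Model shape for Llama-3.1-8B (grouped-query attention).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2            # bfloat16
BLOCK_SIZE = 16               # tokens per KV block

# Assumed hardware and footprint numbers, for illustration only.
GPU_MEMORY_GIB = 80
GPU_MEMORY_UTILIZATION = 0.90
WEIGHTS_GIB = 16              # roughly 8B parameters * 2 bytes
OVERHEAD_GIB = 4              # activations and runtime buffers (a guess)

GIB = 1024 ** 3

# K and V each store one vector per layer, per KV head, per token.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
kv_bytes_per_block = kv_bytes_per_token * BLOCK_SIZE

budget_bytes = GPU_MEMORY_GIB * GPU_MEMORY_UTILIZATION * GIB
kv_budget_bytes = budget_bytes - (WEIGHTS_GIB + OVERHEAD_GIB) * GIB

num_blocks = int(kv_budget_bytes // kv_bytes_per_block)
max_cached_tokens = num_blocks * BLOCK_SIZE

print(f"KV bytes per token: {kv_bytes_per_token}")
print(f"KV blocks available: {num_blocks}")
print(f"Tokens that fit in the KV cache: {max_cached_tokens}")
```

Under these assumptions each token costs 128 KiB of KV cache, so the budget holds 26,624 blocks, or about 426,000 tokens. That is the pool that max-num-seqs and max-model-len compete for: at the full 8,192-token context, only about 52 sequences would fit at once.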

Tensor Parallelism in vLLM

For models too large for one GPU, vLLM supports tensor parallelism (splitting weight matrices across GPUs). Each GPU computes a subset of the attention heads and MLP outputs, communicating via all-reduce operations.

Shell · vLLM with 4-GPU tensor parallelism
# --tensor-parallel-size 4 splits the model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92
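The weight-splitting scheme described above can be checked numerically. The following NumPy sketch simulates four "GPUs" on one machine for a single MLP block; the dimensions and the ReLU activation are arbitrary choices for illustration, and a plain sum stands in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
TP = 4                        # number of simulated GPUs
d_model, d_ff = 8, 32

x = rng.standard_normal((2, d_model))          # (batch, d_model)
w_up = rng.standard_normal((d_model, d_ff))    # first MLP matrix
w_down = rng.standard_normal((d_ff, d_model))  # second MLP matrix


def relu(a):
    return np.maximum(a, 0.0)


# Reference: the whole MLP on a single device.
reference = relu(x @ w_up) @ w_down

# Column-parallel: each rank holds a slice of w_up's output columns.
w_up_shards = np.split(w_up, TP, axis=1)
# Row-parallel: each rank holds the matching slice of w_down's input rows.
w_down_shards = np.split(w_down, TP, axis=0)

# Each rank computes a partial output using only its own shards.
partials = [relu(x @ wu) @ wd for wu, wd in zip(w_up_shards, w_down_shards)]

# All-reduce (sum) combines the partial outputs into the full result.
tp_output = np.sum(partials, axis=0)

print(np.allclose(reference, tp_output))
```

Because the activation is elementwise, each rank can apply it to its own columns without communication, so the block needs only one all-reduce at the end.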

Automatic Prefix Caching

When multiple requests share the same system prompt prefix (common in chatbots), vLLM can reuse the KV cache blocks for that prefix. Hash-based matching identifies shared prefixes. Cache hit = skip the entire prefill for that prefix = huge TTFT reduction.
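One way to picture hash-based matching is a chain of block hashes, where each block's hash also covers everything before it. The sketch below illustrates that idea only; the function names are hypothetical and vLLM's actual hashing scheme differs in detail.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids):
    """Hash each full block together with the hash of the blocks before it."""
    hashes = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(token_ids, cache):
    """Count leading tokens whose KV blocks are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits += 1
    return hits * BLOCK_SIZE


system_prompt = list(range(100, 148))          # 48 tokens = 3 full blocks
request_a = system_prompt + [1, 2, 3, 4, 5]
request_b = system_prompt + [9, 8, 7]
request_c = [0] + system_prompt                # same tokens, shifted by one

cache = set(block_hashes(request_a))           # request A has been prefilled

print(cached_prefix_tokens(request_b, cache))  # 48
print(cached_prefix_tokens(request_c, cache))  # 0
```

Request C shows why the shared text must sit at the very start of the prompt: shifting it by a single token changes every block hash, so nothing is reused.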

💰
Prefix Caching ROI
If your system prompt is 2,000 tokens and 80% of requests hit the cached prefix, prefix caching skips an average of 2,000 × 0.80 = 1,600 prefill tokens per request. At high traffic, this can cut TTFT by 60–80%, with a corresponding drop in prefill GPU time; decode cost is unchanged.
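The arithmetic can be written out as a short calculation. The 500-token average user message is an assumption added here, and the model treats prefill time as proportional to tokens processed, which ignores attention's superlinear cost on long prompts.

```python
SYSTEM_PROMPT_TOKENS = 2_000
SHARE_RATE = 0.80             # fraction of requests that hit the cached prefix
USER_TOKENS = 500             # assumed average user-message length


def expected_prefill_tokens(cache_enabled):
    """Average number of tokens actually processed in prefill, per request."""
    total = SYSTEM_PROMPT_TOKENS + USER_TOKENS
    if not cache_enabled:
        return total
    saved = SYSTEM_PROMPT_TOKENS * SHARE_RATE
    return total - saved


baseline = expected_prefill_tokens(cache_enabled=False)
cached = expected_prefill_tokens(cache_enabled=True)
reduction = 1 - cached / baseline

print(f"Prefill tokens without caching: {baseline:.0f}")
print(f"Prefill tokens with caching:    {cached:.0f}")
print(f"Reduction in prefill work:      {reduction:.0%}")
```

With these inputs, average prefill work falls from 2,500 to 900 tokens per request, a 64% reduction. The saving shrinks as user messages grow relative to the system prompt.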

Flash Attention

Flash Attention is an algorithmic improvement to the attention mechanism that's now standard in production inference engines. Traditional attention materializes the full N×N score matrix (QKᵀ) in VRAM; Flash Attention computes attention tile by tile in on-chip SRAM and never writes the full matrix out, dramatically reducing memory bandwidth usage.

  • Flash Attention 1 (2022): 2–4x speedup on attention, O(N) memory instead of O(N²)
  • Flash Attention 2 (2023): Better GPU utilization, improved parallelism across sequence length
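  • Flash Attention 3 (2024): H100-specific optimizations with an asynchronous pipeline and FP8 support; the authors report a 1.5–2.0x speedup vs FA2 in FP16, and 2.6x lower numerical error for FP8 than a baseline FP8 attention
  • vLLM, SGLang, and TRT-LLM all ship Flash Attention-style fused kernels by default (FA2/FA3 or equivalents such as FlashInfer)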
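The tiling trick rests on an "online softmax" that keeps a running maximum and running sum per query row. The NumPy sketch below demonstrates the math only: real Flash Attention also tiles the queries and runs as a fused GPU kernel, and the tile size and shapes here are arbitrary.

```python
import numpy as np


def naive_attention(q, k, v):
    """Materializes the full (N x N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, tile=4):
    """Processes K/V in tiles, keeping only running statistics per query row."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)
    row_sum = np.zeros((n, 1))
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        scores = q @ k_t.T / np.sqrt(d)            # (N x tile), never N x N
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)          # rescale earlier accumulators
        p = np.exp(scores - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_t
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))

print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The tiled version produces the same output while holding only an N × tile slice of scores at any moment, which is what lets the working set fit in SRAM.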

TensorRT-LLM

NVIDIA's own inference engine, built for maximum performance on NVIDIA hardware. It is less flexible than vLLM but often faster on NVIDIA GPUs in production deployments, largely thanks to ahead-of-time compilation and custom CUDA kernels.

TRT-LLM Architecture

  • Compilation step: Model is compiled ahead of time into a serialized TensorRT engine (also called a plan); this adds a build step before serving, but inference is highly optimized
  • In-flight batching: NVIDIA's name for the same continuous batching technique vLLM uses
  • Custom kernels: Heavily optimized CUDA kernels for attention, layer norms, activations
  • FP8 first-class support: NVIDIA's Transformer Engine with FP8 is best in class on H100
  • Paged KV cache: Similar to vLLM's PagedAttention
Python · TRT-LLM basic inference
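import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner, SamplingConfig

# Assumption: the engine in ./engine was built from this model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Load the prebuilt engine from disk.
runner = ModelRunner.from_dir(
    engine_dir="./engine",
    rank=0,
    max_output_len=512,
)

# generate() expects a list of 1-D token-ID tensors, one per request.
input_ids = [torch.tensor(tokenizer.encode("Explain KV caching"), dtype=torch.int32)]
sampling_config = SamplingConfig(
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    max_new_tokens=512,
)
outputs = runner.generate(
    batch_input_ids=input_ids,
    sampling_config=sampling_config,
)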
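In-flight (continuous) batching, listed in the architecture above, is easiest to appreciate with a toy simulation. The sketch below counts decode steps only and ignores prefill; the function names and the request lengths are made up for illustration and do not reflect either engine's scheduler.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished requests leave after every step; waiting ones fill the slots."""
    waiting = deque(output_lens)
    running = []
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


output_lens = [2, 8, 2, 8, 2, 2]   # tokens each request still has to generate

print(static_batching_steps(output_lens, batch_size=2))      # 18
print(continuous_batching_steps(output_lens, batch_size=2))  # 12
```

With static batching, short requests sit idle in their slot until the longest request in the batch finishes. Refilling slots after every step serves the same six requests in 12 steps instead of 18.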

vLLM vs TensorRT-LLM

vLLM
  • Open source, easy to deploy
  • OpenAI-compatible API out of box
  • 100+ model architectures
  • Multi-hardware (AMD, Intel)
  • Faster iteration on new models
  • PyTorch-based, easier to debug
TensorRT-LLM
  • NVIDIA-only but higher throughput
  • Compilation overhead at startup
  • Best FP8/INT8 kernel optimization
  • Triton integration built-in
  • More complex setup + debugging
  • Best for latency-critical NVIDIA deployments
When to Use Each
Start with vLLM. It works on more hardware, is easier to operate, and handles the vast majority of production use cases. Move to TRT-LLM only when you need to squeeze the last 20–30% of performance out of an NVIDIA-specific deployment and you can afford the operational complexity.

SGLang — The Third Option

SGLang (Structured Generation Language) is a newer inference engine from LMSYS (the team behind Vicuna and Chatbot Arena). It's increasingly popular for workloads that involve structured outputs, multi-call programs, or complex sampling patterns.

  • RadixAttention: SGLang's answer to PagedAttention — uses a radix tree to share KV cache across requests with common prefixes more aggressively than vLLM
  • Structured decoding: native support for JSON schema-constrained generation (faster than outlines/guidance)
  • Multi-call programs: chain multiple LLM calls in a single request without round-trip overhead
  • Comparable throughput to vLLM on most benchmarks, sometimes faster for prefix-heavy workloads
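The prefix-sharing idea behind RadixAttention can be sketched with a token-level trie. This is a simplification: it uses a plain trie rather than a compressed radix tree, the class names are hypothetical, and SGLang's real cache also handles eviction and stores KV tensors at the nodes.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}


class PrefixCache:
    """Token-level trie; a radix tree would compress single-child chains."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, PrefixNode())

    def match_prefix(self, token_ids):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])    # e.g. system prompt + first user turn
cache.insert([1, 2, 3, 9, 9])    # same system prompt, different turn

print(cache.match_prefix([1, 2, 3, 4, 7]))   # 4
print(cache.match_prefix([1, 2, 3, 9]))      # 4
print(cache.match_prefix([8, 1, 2]))         # 0
```

Because the tree branches wherever two requests diverge, a new request can reuse the longest prefix it shares with any earlier request, not just a fixed system prompt.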
🔀
When to Use SGLang
If your workload involves heavy structured output (e.g., extracting JSON from every response), or you make many LLM calls per user request, SGLang's structured generation and multi-call support may outperform vLLM. For standard open-ended generation, vLLM is still the safer default with a larger community and more model support.

Deploying vLLM Behind an API Gateway

Python · OpenAI-compatible client
from openai import OpenAI

# vLLM exposes OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://your-vllm-host:8000/v1",
    api_key="not-needed"
)

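response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching"}],
    max_tokens=512,
    temperature=0.1,
    stream=True,
)

# With stream=True the call returns an iterator of chunks; print tokens as they arrive.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)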
🔑
Key Takeaways
1. PagedAttention nearly eliminated KV cache memory waste. This is a large part of why the vLLM authors report up to 24x higher throughput than Hugging Face Transformers.
2. Enable prefix caching whenever you have a shared system prompt; it is close to free throughput.
3. Flash Attention is now table stakes: every serious engine uses it or an equivalent fused kernel.
4. vLLM for flexibility, TRT-LLM for last-mile NVIDIA optimization.
