DAY 2

03 / 09

vLLM, TensorRT-LLM & Inference Engines

5 min · Intermediate · 1,088 words

Deep dive into the two dominant LLM inference engines. Understand PagedAttention, Flash Attention, tensor parallelism, and how to deploy models for maximum performance.

What you will learn

01 The Inference Engine Landscape
02 vLLM Deep Dive
03 Flash Attention
04 TensorRT-LLM
05 SGLang — The Third Option
06 Deploying vLLM Behind an API Gateway

The Inference Engine Landscape

You could theoretically run inference with plain PyTorch and Hugging Face Transformers, but naive serving often leaves the GPU at only 5–10% utilization. Production inference engines exist to solve the hard systems problems: memory management, scheduling, parallelism, and kernel optimization.

vLLM · TensorRT-LLM · SGLang · Ollama (dev) · llama.cpp (edge)

vLLM Deep Dive

PagedAttention — The Core Innovation

vLLM's key insight: KV cache memory waste, not compute, was the main limit on how many requests a GPU could serve concurrently. Traditional systems pre-allocate one contiguous chunk per request, sized for the maximum possible sequence length. Fragmentation and over-allocation waste 60–80% of KV cache memory this way.

📄

PagedAttention: Inspired by OS Virtual Memory

PagedAttention breaks the KV cache into fixed-size blocks (16 tokens each by default) and maintains a block table per request, much like OS page tables. Memory is allocated on demand, not upfront, so the only unused space is the tail of each request's last block. Different requests can share physical blocks (for prefix caching). Wasted KV cache memory drops from 60% or more to under 4%.

PagedAttention maps logical KV blocks to non-contiguous physical pages — like OS virtual memory. Shared prefix pages are referenced by multiple requests, eliminating redundant computation.
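The block-table idea can be sketched in a few lines. This is a toy model, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the pool size of 8 blocks are hypothetical names and values chosen for illustration. It shows two requests growing on demand and ending up with non-contiguous physical blocks.

Python · PagedAttention block table (illustrative sketch)

```python
BLOCK_SIZE = 16  # tokens per KV block, matching the default cited above


class BlockAllocator:
    """Hands out physical block IDs from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index: int):
        # Translate a token position to (physical block, offset),
        # the same way a page table translates an address.
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    seq_a.append_token()
for _ in range(5):
    seq_b.append_token()
for _ in range(15):
    seq_a.append_token()  # seq_a now holds 35 tokens in 3 blocks

print(seq_a.block_table, seq_b.block_table)
print(seq_a.physical_slot(33))
```

Because `seq_b` claimed a block while `seq_a` was still growing, `seq_a`'s blocks are not adjacent in the pool, yet lookups through the block table still resolve correctly. The only unused space is the tail of each request's last block.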

Shell · vLLM server startup

# Start vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --tensor-parallel-size 1 \
    --port 8000

Key vLLM Configuration Parameters

Parameter	What It Does	Default
`gpu-memory-utilization`	Fraction of VRAM reserved for vLLM (weights + KV cache)	0.90
`max-num-seqs`	Maximum concurrent sequences in flight	256
`max-model-len`	Maximum total context length (prompt + output)	Model's max
`enable-prefix-caching`	Cache KV for shared prompt prefixes (huge win for chat)	False (on by default in the newer V1 engine)
`tensor-parallel-size`	Number of GPUs for tensor parallelism	1
`block-size`	PagedAttention block size in tokens	16
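To see how `gpu-memory-utilization` and `max-model-len` interact, it helps to estimate the KV cache budget by hand. The sketch below is back-of-the-envelope arithmetic, not vLLM's own profiler. The layer and head counts are Llama 3.1 8B's published values, while the 80 GB GPU, the ~16 GB weight footprint, and the omission of activation memory are simplifying assumptions.

Python · KV cache capacity estimate (illustrative sketch)

```python
# Llama 3.1 8B architecture values (from the published model config)
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_value = 2   # bfloat16

# Both K and V are stored for every layer and every KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumptions for illustration: one 80 GB GPU, ~16 GB of bf16 weights,
# and no allowance for activations or CUDA graph overhead.
gpu_bytes = 80e9
weight_bytes = 16e9
gpu_memory_utilization = 0.90

kv_budget = gpu_bytes * gpu_memory_utilization - weight_bytes
max_cached_tokens = int(kv_budget // kv_bytes_per_token)
full_length_seqs = max_cached_tokens // 8192  # sequences at --max-model-len 8192

print(f"{kv_bytes_per_token} bytes of KV cache per token")
print(f"{max_cached_tokens} tokens fit, or {full_length_seqs} full-length sequences")
```

Under these assumptions each token costs 128 KiB of KV cache, so only about 52 sequences could sit at the full 8,192-token length at once. A `max-num-seqs` of 256 is still reasonable because most requests are far shorter than the maximum.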

Tensor Parallelism in vLLM

For models too large for one GPU, vLLM supports tensor parallelism (splitting weight matrices across GPUs). Each GPU computes a subset of the attention heads and MLP outputs, communicating via all-reduce operations.

Shell · vLLM with 4-GPU tensor parallelism

# Split the model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92
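The weight-splitting described in this section can be checked numerically on the CPU. The sketch below uses NumPy arrays as stand-ins for per-GPU shards and a plain sum as a stand-in for the all-reduce. The matrix sizes and the ReLU activation are arbitrary choices for illustration, and the column-then-row split is one common layout rather than a statement of how every vLLM layer is sharded.

Python · Tensor-parallel MLP on simulated shards (illustrative sketch)

```python
import numpy as np

rng = np.random.default_rng(0)
num_gpus = 4
x = rng.standard_normal((2, 64))          # activations: (tokens, hidden)
w_up = rng.standard_normal((64, 256))     # first MLP projection
w_down = rng.standard_normal((256, 64))   # second MLP projection

# Reference: the unsharded MLP.
expected = np.maximum(x @ w_up, 0) @ w_down

# Column-split the first matrix and row-split the second, one shard per "GPU".
up_shards = np.split(w_up, num_gpus, axis=1)
down_shards = np.split(w_down, num_gpus, axis=0)

# Each shard computes a partial output without talking to the others...
partials = [np.maximum(x @ up, 0) @ down for up, down in zip(up_shards, down_shards)]

# ...and summing the partials plays the role of the all-reduce.
result = np.sum(partials, axis=0)

print(np.allclose(result, expected))
```

Splitting the first matrix by columns keeps the elementwise activation local to each shard, so a single all-reduce after the second matrix is the only communication this block needs.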

Automatic Prefix Caching

When multiple requests share the same system prompt prefix (common in chatbots), vLLM can reuse the KV cache blocks for that prefix. Hash-based matching identifies shared prefixes. Cache hit = skip the entire prefill for that prefix = huge TTFT reduction.
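Hash-based prefix matching can be sketched as a chain of block hashes, where each block's hash also covers everything before it. This is a simplified model, not vLLM's code: `block_hashes`, `cached_prefix_tokens`, and the use of SHA-256 over a string are hypothetical choices for illustration.

Python · Hash-based prefix matching (illustrative sketch)

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids):
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids, cache):
    """Count leading tokens whose blocks are already cached, then cache the rest."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * BLOCK_SIZE


cache = set()
system_prompt = list(range(40))           # 40 shared tokens -> 2 full blocks
request_1 = system_prompt + [900, 901]
request_2 = system_prompt + [950, 951, 952]

print(cached_prefix_tokens(request_1, cache))  # first request: nothing cached yet
print(cached_prefix_tokens(request_2, cache))  # second request reuses the shared blocks
```

Only full blocks are hashed, so the second request reuses 32 of the 40 shared tokens. Matching at block granularity is the price of making lookups a simple hash-table probe.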

💰

Prefix Caching ROI

If your system prompt is 2,000 tokens and 80% of requests share it, prefix caching skips an average of 2,000 × 0.80 = 1,600 prefill tokens per request. At high traffic, this can cut TTFT by 60–80% when the shared prefix dominates the prompt, and it lowers prefill GPU cost accordingly.
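The arithmetic above can be extended to estimate the share of prefill work saved. The 500-token user message is a hypothetical figure, and the sketch assumes prefill cost scales linearly with token count, which ignores the quadratic attention term.

Python · Prefix caching savings estimate (illustrative sketch)

```python
system_prompt_tokens = 2000
share_pct = 80        # percent of requests that reuse the cached prefix
user_tokens = 500     # hypothetical average length of the non-shared part

prefill_without_cache = system_prompt_tokens + user_tokens
saved_per_request = system_prompt_tokens * share_pct // 100
prefill_with_cache = prefill_without_cache - saved_per_request
reduction = saved_per_request / prefill_without_cache

print(f"{saved_per_request} tokens saved per request on average")
print(f"prefill work drops from {prefill_without_cache} to {prefill_with_cache} tokens ({reduction:.0%})")
```

With these inputs the reduction is 64%, inside the 60–80% range quoted for TTFT. A longer user message or a lower share rate pulls the figure down quickly.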

Flash Attention

Flash Attention is an algorithmic improvement to the attention mechanism that's now standard in production inference engines. Traditional attention materializes the full N×N score matrix (QKᵀ) in GPU main memory (HBM) and reads it back for the softmax. Flash Attention tiles the computation so each tile stays in fast on-chip SRAM, dramatically reducing memory bandwidth usage while producing the same result.

Flash Attention 1 (2022): 2–4x speedup on attention, O(N) memory instead of O(N²)
Flash Attention 2 (2023): Better GPU utilization, improved parallelism across sequence length
Flash Attention 3 (2024): Hopper (H100)-specific optimizations using FP8 and an asynchronous pipeline, roughly 1.5–2x speedup vs FA2 in FP16
vLLM, SGLang, and TRT-LLM all ship Flash Attention or equivalent fused attention kernels by default
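The core trick behind the tiling is an online softmax: keep a running maximum and running sum per query row, and rescale earlier partial results whenever a new tile raises the maximum. The NumPy sketch below illustrates only that math. It tiles over keys but not queries, runs on the CPU, and says nothing about SRAM or kernel fusion. The tile size of 32 is arbitrary.

Python · Tiled attention with online softmax (illustrative sketch)

```python
import numpy as np


def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])          # full N x N matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, tile=32):
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    for start in range(0, k.shape[0], tile):
        k_tile, v_tile = k[start:start + tile], v[start:start + tile]
        scores = q @ k_tile.T / np.sqrt(d)           # only N x tile at a time
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)      # corrects earlier partial results
        p = np.exp(scores - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ v_tile
        running_max = new_max
    return out / running_sum


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((100, 16)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The tiled version never holds more than an N × tile slice of scores, which is where the O(N) memory figure comes from, and it still matches the naive result to floating-point tolerance.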

TensorRT-LLM

TensorRT-LLM is NVIDIA's own inference engine, built for maximum performance on NVIDIA hardware. It is less flexible than vLLM but often faster on NVIDIA GPUs in production, largely thanks to its custom CUDA kernels.

TRT-LLM Architecture

Compilation step: Model is compiled ahead of time into a serialized TensorRT engine (e.g., rank0.engine). This adds a build cost before serving, but inference is highly optimized
In-flight batching: NVIDIA's name for the same continuous batching technique vLLM uses
Custom kernels: Heavily optimized CUDA kernels for attention, layer norms, activations
FP8 first-class support: NVIDIA's Transformer Engine with FP8 is best in class on H100
Paged KV cache: Similar to vLLM's PagedAttention
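The benefit of in-flight (continuous) batching is easiest to see in a small simulation. The sketch below is not TRT-LLM or vLLM code. It counts decode steps for a made-up list of output lengths under two hypothetical schedulers: one that waits for each batch to drain, and one that refills a slot as soon as a request finishes.

Python · Static vs. in-flight batching (illustrative sketch)

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(
        max(output_lens[i:i + batch_size])
        for i in range(0, len(output_lens), batch_size)
    )


def continuous_batching_steps(output_lens, batch_size):
    queue, running, steps = deque(output_lens), [], 0
    while queue or running:
        # Refill free slots at every step instead of waiting for the batch to drain.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # Every running request emits one token; finished requests leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


lens = [10, 200, 15, 180, 20, 12, 190, 8]  # hypothetical output lengths in tokens
print(static_batching_steps(lens, 4), continuous_batching_steps(lens, 4))
```

With these lengths the static scheduler needs 390 decode steps and the in-flight one needs 217, because short requests no longer hold a slot hostage until the longest request in their batch completes.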

Python · TRT-LLM basic inference

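import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner, SamplingConfig

# Assumption: this tokenizer matches the model the engine was built from
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Load a prebuilt engine; rank 0 is the only rank on a single GPU
runner = ModelRunner.from_dir(
    engine_dir="./engine",
    rank=0,
    max_output_len=512,
)

# generate() takes a list of 1-D token-ID tensors, one per request
input_ids = [torch.tensor(tokenizer.encode("Explain KV caching"), dtype=torch.int32)]
sampling_config = SamplingConfig(
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    max_new_tokens=512,
)

outputs = runner.generate(
    batch_input_ids=input_ids,
    sampling_config=sampling_config,
)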

vLLM vs TensorRT-LLM

vLLM

Open source, easy to deploy
OpenAI-compatible API out of the box
100+ model architectures
Multi-hardware (AMD, Intel)
Faster iteration on new models
PyTorch-based, easier to debug

TensorRT-LLM

NVIDIA-only but higher throughput
Engine build step before serving
Best FP8/INT8 kernel optimization
Triton Inference Server integration built-in
More complex setup + debugging
Best for latency-critical NVIDIA deployments

⚡

When to Use Each

Start with vLLM. It works on more hardware, is easier to operate, and handles the large majority of production use cases. Move to TRT-LLM only when you need to squeeze the last 20–30% of performance out of an NVIDIA-specific deployment and you can afford the operational complexity.

SGLang — The Third Option

SGLang (Structured Generation Language) is a newer inference engine from LMSYS (the team behind Vicuna and Chatbot Arena). It's increasingly popular for workloads that involve structured outputs, multi-call programs, or complex sampling patterns.

RadixAttention: SGLang's answer to PagedAttention — uses a radix tree to share KV cache across requests with common prefixes more aggressively than vLLM
Structured decoding: native support for JSON schema-constrained generation (faster than outlines/guidance)
Multi-call programs: chain multiple LLM calls in a single request without round-trip overhead
Comparable throughput to vLLM on most benchmarks, sometimes faster for prefix-heavy workloads
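The prefix-sharing idea behind RadixAttention can be sketched with a plain token-level trie. This is a simplification, not SGLang's data structure: a real radix tree compresses chains of single-child nodes and also tracks KV memory and eviction, and `PrefixTreeNode` and `PrefixCache` are hypothetical names.

Python · Prefix tree for KV reuse (illustrative sketch)

```python
class PrefixTreeNode:
    def __init__(self) -> None:
        self.children = {}  # token ID -> PrefixTreeNode


class PrefixCache:
    """Token-level trie; a real radix tree would compress single-child chains."""

    def __init__(self) -> None:
        self.root = PrefixTreeNode()

    def match_and_insert(self, token_ids) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node, matched = self.root, 0
        for token in token_ids:
            child = node.children.get(token)
            if child is None:
                child = node.children[token] = PrefixTreeNode()
            else:
                matched += 1  # only reachable before the first miss
            node = child
        return matched


cache = PrefixCache()
print(cache.match_and_insert([1, 2, 3, 4, 5]))  # nothing cached yet
print(cache.match_and_insert([1, 2, 3, 9, 9]))  # shares the first three tokens
print(cache.match_and_insert([1, 2, 3, 9, 7]))  # shares four tokens with the previous request
```

Unlike block-hash matching, a tree like this can share prefixes at any token boundary and across branching conversations, which is one reason tree-based sharing can reuse more of the cache on prefix-heavy workloads.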

🔀

When to Use SGLang

If your workload involves heavy structured output (e.g., extracting JSON from every response), or you make many LLM calls per user request, SGLang's structured generation and multi-call support may outperform vLLM. For standard open-ended generation, vLLM is still the safer default with a larger community and more model support.

Deploying vLLM Behind an API Gateway

Python · OpenAI-compatible client

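from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://your-vllm-host:8000/v1",
    api_key="not-needed",  # any placeholder works unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching"}],
    max_tokens=512,
    temperature=0.1,
    stream=True,
)

# With stream=True the call returns an iterator of chunks; print tokens as they arrive
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)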

🔑

Key Takeaways

1. PagedAttention all but eliminated KV cache memory waste. This is a large part of why vLLM reports up to 24x higher throughput than naive Hugging Face Transformers serving.
2. Enable prefix caching whenever you have a shared system prompt. It's close to free throughput.
3. Flash Attention is now table stakes. Every serious engine uses it or an equivalent fused kernel.
4. vLLM for flexibility, TRT-LLM for last-mile NVIDIA optimization.
