
vLLM, TensorRT-LLM & Inference Engines
Deep dive into the two dominant LLM inference engines. Understand PagedAttention, Flash Attention, tensor parallelism, and how to deploy models for maximum performance.
The Inference Engine Landscape
You could run inference with plain PyTorch and Hugging Face Transformers, but under concurrent serving load GPU utilization is often as low as 5–10%. Production inference engines exist to solve the hard systems problems: memory management, scheduling, parallelism, and kernel optimization.
vLLM Deep Dive
PagedAttention — The Core Innovation
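vLLM's key insight: before vLLM, KV cache memory management was the main bottleneck on serving throughput. Traditional systems pre-allocate one contiguous chunk per request, sized for the maximum possible sequence length. The vLLM authors report that such systems waste 60–80% of KV cache memory through fragmentation and over-allocation. PagedAttention borrows virtual-memory paging from operating systems: the KV cache is split into fixed-size blocks (16 tokens by default) that are allocated on demand and need not be contiguous, and a per-request block table maps logical token positions to physical blocks. Waste is then limited to the last, partially filled block of each sequence.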
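The block-table idea can be sketched in a few lines of plain Python. This is an illustrative toy, not vLLM's allocator: `BlockAllocator` and `Sequence` are hypothetical names, and real engines also handle preemption, copy-on-write, and GPU tensors.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (matches vLLM's documented default)


class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks left")
        return self.free.pop()

    def release(self, blocks: list) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One request: a token count plus its block table."""
    num_tokens: int = 0
    block_table: list = field(default_factory=list)  # logical index -> physical block id

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Waste is bounded by one partially filled block per sequence.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=64)
seq = Sequence()
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token(allocator)

print(len(seq.block_table), seq.wasted_slots())  # 3 8
```

Because blocks are claimed one at a time, a request that stops early never holds memory for tokens it did not generate, and released blocks can be reused by any other request.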
```bash
# Start vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --tensor-parallel-size 1 \
  --port 8000
```
Key vLLM Configuration Parameters
| Parameter | What It Does | Default |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of VRAM reserved for vLLM (weights + KV cache) | 0.90 |
| `--max-num-seqs` | Maximum concurrent sequences in flight | 256 |
| `--max-model-len` | Maximum total context length (prompt + output) | Model's max |
| `--enable-prefix-caching` | Cache KV for shared prompt prefixes (huge win for chat) | False |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | 1 |
| `--block-size` | PagedAttention block size in tokens | 16 |
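To see how these settings interact, here is a back-of-the-envelope estimate of KV cache capacity. The model shape (32 layers, 8 KV heads, head dimension 128) is that of Llama 3.1 8B. The 80 GiB GPU, roughly 15 GiB of bf16 weights, and 2 GiB of runtime overhead are illustrative assumptions. In practice vLLM measures the available memory at startup rather than computing it this way.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(vram_gib: float, gpu_memory_utilization: float,
                      weights_gib: float, overhead_gib: float,
                      bytes_per_token: int) -> int:
    # Memory left for the KV cache after weights and (assumed) overhead.
    budget_gib = vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    return int(budget_gib * 1024**3) // bytes_per_token


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
capacity = kv_token_capacity(vram_gib=80, gpu_memory_utilization=0.90,
                             weights_gib=15, overhead_gib=2,
                             bytes_per_token=per_token)

print(per_token)         # 131072 bytes = 128 KiB per token
print(capacity)          # 450560 tokens of KV cache
print(capacity // 8192)  # 55 sequences at the full 8192-token length
```

Under these assumptions only about 55 requests could sit at the full 8192-token context at once, so a `max-num-seqs` of 256 relies on most requests being much shorter than `max-model-len`.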
Tensor Parallelism in vLLM
For models too large for one GPU, vLLM supports tensor parallelism (splitting weight matrices across GPUs). Each GPU computes a subset of the attention heads and MLP outputs, communicating via all-reduce operations.
```bash
# --tensor-parallel-size 4 splits the model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92
```
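The arithmetic behind tensor parallelism can be checked with a toy two-layer MLP in pure Python. This is a sketch of the common column-parallel then row-parallel layout, with lists standing in for GPU shards and a plain sum standing in for the all-reduce. All function names are hypothetical and nothing here reflects vLLM's actual code.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    # Element-wise sum of one partial result per "GPU".
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2, 3, 4], [0, -1, 2, 1]]
w1 = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1], [1, 1, -2, 0]]
w2 = [[1, 2], [0, 1], [-1, 0], [2, -1]]

# Single-device reference: relu(x @ w1) @ w2
reference = matmul(relu(matmul(x, w1)), w2)

# Two "GPUs": w1 is split by columns, w2 by rows, so the activation
# function runs locally and only one all-reduce is needed at the end.
num_gpus = 2
partials = [
    matmul(relu(matmul(x, w1_shard)), w2_shard)
    for w1_shard, w2_shard in zip(split_columns(w1, num_gpus), split_rows(w2, num_gpus))
]
parallel = all_reduce_sum(partials)

print(parallel)               # [[17, 22], [11, 7]]
print(parallel == reference)  # True
```

Each shard holds half of both weight matrices, which is why the per-GPU memory footprint drops while the communication cost is one all-reduce per MLP block.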
Automatic Prefix Caching
When multiple requests share the same prompt prefix, such as a common system prompt in a chatbot, vLLM can reuse the KV cache blocks already computed for that prefix. Shared prefixes are identified by hashing block contents. On a cache hit, the prefill for the cached prefix is skipped, which can substantially reduce time to first token (TTFT).
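A minimal sketch of hash-based prefix matching, assuming each full block is hashed together with the hash of everything before it so that equal hashes imply equal prefixes. `PrefixCache` and its method are hypothetical names, and the block size is shrunk to 4 tokens for readability. Eviction is left out.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for the demo


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of the preceding blocks."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> physical block id

    def lookup_and_insert(self, token_ids):
        """Return how many leading prompt tokens already have cached KV."""
        hashes = block_hashes(token_ids)
        cached_tokens = 0
        for h in hashes:
            if h not in self.blocks:
                break  # first miss ends the reusable prefix
            cached_tokens += BLOCK_SIZE
        for h in hashes:
            self.blocks.setdefault(h, len(self.blocks))
        return cached_tokens


system_prompt = list(range(10))  # 10 shared tokens
cache = PrefixCache()
print(cache.lookup_and_insert(system_prompt + [100, 101, 102]))  # 0
print(cache.lookup_and_insert(system_prompt + [200, 201]))       # 8
```

Only whole blocks are shared: the second request reuses 8 of the 10 common tokens because the third block mixes shared and request-specific tokens.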
Flash Attention
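Flash Attention is an exact reformulation of the attention computation that is now standard in production inference engines. Standard attention materializes the full N×N score matrix (QKᵀ) in GPU high-bandwidth memory and reads it back for the softmax. Flash Attention instead processes Q, K, and V in tiles that fit in on-chip SRAM and uses an online softmax, so the full matrix is never written out and memory traffic drops sharply.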
- Flash Attention 1 (2022): 2–4x speedup on attention, O(N) memory instead of O(N²)
- Flash Attention 2 (2023): Better GPU utilization, improved parallelism across sequence length
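- Flash Attention 3 (2024): H100-specific optimizations using an asynchronous pipeline and FP8; the paper reports a 1.5–2.0x speedup over FA2 in FP16
- vLLM, SGLang, and TRT-LLM all use Flash Attention-style fused attention kernels by default (FA2/FA3 or equivalent implementations)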
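The tiling trick rests on the online softmax: a running maximum and running sum let each tile of keys be folded in without ever holding all scores at once. The sketch below shows this for a single query vector in pure Python and checks it against the naive computation. It illustrates the numerics only, not the GPU kernel, and the function names are hypothetical.

```python
import math


def score(q, k):
    # Scaled dot product, as in standard attention.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    scores = [score(q, k) for k in keys]  # all N scores held at once
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [score(q, k) for k in k_tile]  # only `tile` scores held
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [a * rescale + sum(p * v[d] for p, v in zip(probs, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

exact = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(exact, tiled)))  # True
```

The tiled version produces the same result up to floating-point rounding, which is why Flash Attention is described as exact rather than approximate.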
TensorRT-LLM
NVIDIA's own inference engine, built for maximum performance on NVIDIA hardware. It is less flexible than vLLM but, thanks to its custom CUDA kernels, often faster on NVIDIA GPUs in production deployments.
TRT-LLM Architecture
- Compilation step: Model is compiled into a TensorRT engine (`.plan` file); startup cost, but inference is highly optimized
- In-flight batching: Same continuous batching as vLLM, implemented as "inflight batching"
- Custom kernels: Heavily optimized CUDA kernels for attention, layer norms, activations
- FP8 first-class support: NVIDIA's Transformer Engine with FP8 is best in class on H100
- Paged KV cache: Similar to vLLM's PagedAttention
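```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

# Load a prebuilt engine from disk.
runner = ModelRunner.from_dir(
    engine_dir="./engine",
    rank=0,
    max_output_len=512,
)

# `input_ids` (a list of token-ID tensors, one per prompt) and
# `sampling_config` (a tensorrt_llm.runtime.SamplingConfig) are assumed
# to have been prepared by the caller.
outputs = runner.generate(
    batch_input_ids=input_ids,
    sampling_config=sampling_config,
)
```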
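The benefit of in-flight (continuous) batching over static batching can be shown with a small simulation. This is a deliberately simplified sketch: it counts decode iterations only, ignores prefill, and uses hypothetical function names rather than anything from TensorRT-LLM or vLLM.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    # Each batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def inflight_batching_steps(output_lens, max_batch):
    waiting = deque(output_lens)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill free slots before every iteration, not once per batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration = one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


output_lens = [2, 8, 3, 8, 2, 3]  # tokens to generate per request
print(static_batching_steps(output_lens, max_batch=2))    # 19
print(inflight_batching_steps(output_lens, max_batch=2))  # 13
```

With static batching, short requests leave their slot idle until the longest request in the batch is done. Refilling slots every iteration keeps both slots busy, so the 26 tokens finish in the minimum 13 iterations.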
vLLM vs TensorRT-LLM
vLLM:
- Open source, easy to deploy
- OpenAI-compatible API out of box
- 100+ model architectures
- Multi-hardware (AMD, Intel)
- Faster iteration on new models
- PyTorch-based, easier to debug

TensorRT-LLM:
- NVIDIA-only but higher throughput
- Compilation overhead at startup
- Best FP8/INT8 kernel optimization
- Triton Inference Server integration built-in
- More complex setup + debugging
- Best for latency-critical NVIDIA deployments
SGLang — The Third Option
SGLang (Structured Generation Language) is a newer inference engine from LMSYS (the team behind Vicuna and Chatbot Arena). It's increasingly popular for workloads that involve structured outputs, multi-call programs, or complex sampling patterns.
- RadixAttention: SGLang's answer to PagedAttention — uses a radix tree to share KV cache across requests with common prefixes more aggressively than vLLM
- Structured decoding: native support for JSON schema-constrained generation (faster than outlines/guidance)
- Multi-call programs: chain multiple LLM calls in a single request without round-trip overhead
- Comparable throughput to vLLM on most benchmarks, sometimes faster for prefix-heavy workloads
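The prefix-sharing idea behind RadixAttention can be sketched with a token-level prefix tree. This is a simplification under stated assumptions: it uses an uncompressed trie with one token per node, whereas a real radix tree stores multi-token edges, and it omits eviction and the KV tensors themselves. `PrefixTree` is a hypothetical name, not SGLang's API.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode


class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return the number of leading tokens whose KV is already cached."""
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())


tree = PrefixTree()

# Turn 1 of a chat: prompt [1, 2, 3, 4, 5] plus generated tokens [9, 9].
tree.insert([1, 2, 3, 4, 5, 9, 9])

# Turn 2 extends the whole conversation so far, so all 7 tokens are reused.
print(tree.match_prefix([1, 2, 3, 4, 5, 9, 9, 6, 7]))  # 7

# A different request that shares only the first three tokens.
print(tree.match_prefix([1, 2, 3, 8]))  # 3
```

Matching at token granularity, including generated tokens, is what lets multi-turn and branching workloads reuse more of the cache than matching on whole prompt blocks alone.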
Deploying vLLM Behind an API Gateway
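```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://your-vllm-host:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching"}],
    max_tokens=512,
    temperature=0.1,
    stream=True,
)

# With stream=True the response is an iterator of chunks; print tokens as they arrive.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```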
Further Reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM), arxiv.org
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, arxiv.org
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, arxiv.org
- vLLM Official Documentation, vllm.ai
- NVIDIA TensorRT-LLM on GitHub, github.com