vLLM is an open-source inference engine that serves large language models, including GPT-4o-class checkpoints, from a single GPU (even a laptop-class one) up to a modest cluster. Techniques such as PagedAttention, continuous batching, and quantization keep throughput high and latency low under heavy concurrent load, and recent releases add one-command deployment and built-in monitoring, so a production endpoint is only a few commands away.
What is vLLM and how does it enable GPT-4o-class inference on a single GPU in 2025?
vLLM is an open-source engine designed to deliver GPT-4o-class inference on a single GPU or modest cluster, leveraging technologies like PagedAttention, continuous batching, and advanced quantisation. This enables high-throughput, low-latency LLM serving – achieving over 1,800 tokens/second with efficient GPU utilization.
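As a rough sketch of what that looks like in practice, the commands below stand up vLLM's OpenAI-compatible server on a single GPU; the model ID is a placeholder and the flags shown reflect recent releases.

```bash
# Minimal single-GPU launch; the model ID is a placeholder, swap in your own checkpoint.
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
# An OpenAI-compatible API is now available at http://localhost:8000/v1
```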
*Inside vLLM: a 2025 deep-dive into the engine powering GPT-4o-class inference on a single GPU*
Since its first open-source release, vLLM has quietly become the backbone of many high-throughput LLM services. By early 2025 the project's own roadmap targets "GPT-4o class performance on a single GPU or modest node", a claim backed by a public per-commit dashboard at perf.vllm.ai and a new Production Stack that turns `helm install` into a one-command Kubernetes rollout. Below is a concise field guide distilled from the project's 2025 vision document, recent ASPLOS paper, and active GitHub issues.
1. Core architecture at a glance
| Component | What it does (2025 view) | Benefit to operators |
|---|---|---|
| PagedAttention | Treats the KV cache as paged memory, split into fixed-size blocks | 3-7× higher GPU utilisation vs. static cache |
| Continuous batching | Adds new requests mid-forward pass | Up to 23× throughput on shared GPUs |
| Prefix/grammar decoding | Constrains next-token logits with FSM or CFG rules | JSON/regex output w/o fine-tuning |
| Speculative decoding | Draft-small → verify-big two-stage pipeline | 2–2.5× lower latency at the same power |
| Disaggregated P/D | Splits prefill & decode workers that scale independently | Linear throughput growth with cluster size |
| Quantisation suite | Native FP8, INT4, GPTQ, AWQ, AutoRound | Fits 70 B models on 1×A100 80 GB |
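Several of the features in the table can be toggled directly on the serving command line. The sketch below uses flag names from recent vLLM releases (they occasionally change, so confirm with `vllm serve --help`), and the model ID is just a placeholder:

```bash
# Toggle a few of the table's features on a single-GPU OpenAI-compatible server.
#   --kv-cache-dtype fp8        : FP8 KV cache from the quantisation suite
#   --enable-prefix-caching     : reuse KV blocks across requests with shared prefixes
#   --enable-chunked-prefill    : overlap prompt ingestion with decoding
#   --guided-decoding-backend   : grammar/JSON-constrained decoding (xgrammar)
#   --max-num-seqs              : upper bound for the continuous-batching scheduler
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --guided-decoding-backend xgrammar \
  --max-num-seqs 256
```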
2. From laptop to 1 000 RPS: four validated topologies
Topology sketches are now shipped in the official Helm charts:
| Scale | Topology | Typical fleet | Latency (P99) | Notes |
|---|---|---|---|---|
| Dev | Single pod | 1×GPU laptop | 400 ms @ 8 req/s | Uses aggregated worker |
| SMB | Router + 4 replicas | 4×A100 | 120 ms @ 64 req/s | KV-aware routing enabled |
| SaaS | Disaggregated | 8×prefill + 16×decode H100 | 80 ms @ 400 req/s | Shared Redis KV store |
| Hyperscale | Multi-model | 128×GH200 cluster | <60 ms @ 1 k req/s | TPU v6e tests in CI |
3. Production checklist (2025 edition)
- Add the Helm repo: `helm repo add vllm https://vllm-project.github.io/helm-charts`
- Set `replicaCount = GPU_count` and opt in to router mode for >2 GPUs
- Enable the Grafana dashboard → 12 core metrics out of the box
- Use rolling updates with `maxUnavailable: 0` to keep 100 % capacity during upgrades
- Pick the matching quant profile: `vllm/vllm-openai:v0.8.4-fp8` saves 33 % memory vs. FP16
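Putting the checklist together, a deployment can look roughly like the sketch below. The value keys mirror the names used in the checklist but are illustrative; the chart's actual schema may differ, so inspect it with `helm show values vllm/vllm` first.

```bash
# Illustrative only: value keys echo the checklist above and may not match the
# chart's real values.yaml. Check `helm show values vllm/vllm` before applying.
helm repo add vllm https://vllm-project.github.io/helm-charts
helm upgrade --install my-llm vllm/vllm \
  --set replicaCount=4 \
  --set image.tag=v0.8.4-fp8 \
  --set router.enabled=true \
  --set strategy.rollingUpdate.maxUnavailable=0
```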
A recent IBM benchmark shows the Disaggregated Router pattern hitting 3 196 tokens/s on Llama-3-70B with 16×H100 while keeping P99 latency pinned at 76 ms (ASPLOS 2024 paper).
4. Known sharp edges & 2025–26 fixes
| Limitation (today) | Status / workaround | ETA |
|---|---|---|
| LoRA-heavy batches | 30 % slower than dense; refactor in v1 sampler | Q3 2025 |
| FP8 KV instability | Warnings under llm-compressor quant | Q2 2025 patch |
| Mamba / SSM models | WIP branch, nightly builds available | Late 2025 |
| Encoder-decoder | Partial support, full parity tracked in #16284 | 2026 |
5. Quick-start snippet (single-node, 2025)
```bash
# launch a quantized 8×A100 pod
helm upgrade --install phi3-vllm vllm/vllm \
  --set model=phi-3-medium-128k \
  --set quant=fp8 \
  --set tensorParallelSize=8 \
  --set maxNumSeqs=512
```
Expect 1 850 tokens/s at 4 k context with 220 watts per GPU on H100.
Ready to try? The official quick-start walks through OpenAI-compatible endpoints, streaming, and metrics export in under ten minutes.
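For a quick smoke test once the pod is up, the endpoint speaks the standard OpenAI wire format and exposes Prometheus metrics; host, port, and model name below are placeholders for whatever you deployed.

```bash
# Chat completion against the OpenAI-compatible API (host/port/model are placeholders)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3-medium-128k",
        "messages": [{"role": "user", "content": "Summarise PagedAttention in one sentence."}],
        "max_tokens": 64
      }'

# Prometheus metrics endpoint, scraped by the bundled Grafana dashboards
curl -s http://localhost:8000/metrics | head
```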
What makes vLLM different from other inference engines in 2025?
PagedAttention + continuous batching is the killer combo.
By slicing the KV-cache into fixed-size blocks, much like pages in virtual memory, vLLM keeps the GPU saturated even when requests have wildly different lengths. A single NVIDIA H100 running vLLM 0.8.2 can now push 4.17 successful GPT-4o-class requests/sec at a 6 req/s offered load while staying under 8 s P99 latency – numbers that were impossible with static batching last year.
If you want to run the same experiment yourself, the public dashboard at perf.vllm.ai is updated after every commit.
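To reproduce a measurement like that locally, the serving benchmark that ships in the vLLM repository is the usual tool. The sketch below assumes a server is already running on localhost:8000 and that the script's flags match recent versions of the repo (they occasionally change):

```bash
# Assumes a vLLM OpenAI-compatible server is already listening on localhost:8000.
# Script path and flags follow recent versions of the vLLM repository.
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --request-rate 6 \
  --num-prompts 500
```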
Can I really serve GPT-4o on just one GPU today?
Yes – as long as you are OK with quantization.
An FP8 KV-cache plus AWQ INT4 weights lets a 70 B-parameter checkpoint fit into 48 GB of VRAM and still deliver close to original BF16 quality. Apple M-series chips (via MPS) and AMD MI300X are both supported in the current release, so your laptop or a single-node cloud VM can become a production endpoint.
The catch: MoE models still need token-based expert parallelism when the expert count goes beyond 64, so very large Mixtral-scale checkpoints may still require two GPUs.
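For the dense-model case, a pre-quantized AWQ export of a 70 B checkpoint can be served roughly as below; the model ID is a placeholder, and the context length you can afford depends on how much VRAM is left after the weights are loaded.

```bash
# Serve a 70B AWQ checkpoint on a single large-memory GPU.
# The model ID is a placeholder; point it at an AWQ-quantized export of your checkpoint.
vllm serve your-org/llama-3-70b-instruct-awq \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```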
How hard is it to move from “docker run” to a real Kubernetes pipeline?
Surprisingly easy thanks to the vLLM Production Stack published earlier this year.
One Helm command spins up:
- router pods that hash prefixes and forward requests to the instance owning the right KV block
- worker pods in either aggregated (one GPU) or disaggregated prefill-decode mode
- Grafana dashboards that show GPU util, cache hit ratio and rolling latency in real time
Users report going from a proof-of-concept to a 1 k QPS autoscaling cluster in about two hours on AWS EKS – no custom operators or YAML surgery required.
The templates live at github.com/vllm-project/production-stack.
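Once the chart is installed, one way to poke at the stack is to port-forward the router and hit the same OpenAI-compatible API. The service name and port below are assumptions that depend on your release name and chart version, so look them up with `kubectl get svc` first.

```bash
# Check that router and worker pods came up (names depend on your Helm release)
kubectl get pods

# "vllm-router-service" and port 80 are assumed values; confirm with `kubectl get svc`.
kubectl port-forward svc/vllm-router-service 30080:80 &
curl http://localhost:30080/v1/models
```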
What decoding tricks does vLLM offer beyond greedy sampling?
Most teams only scratch the surface.
Current features include:
- Speculative decoding with draft models as small as 0.3 B parameters (3.5× median latency cut)
- Grammar-guided / JSON-schema decoding via the new `xgrammar` backend (see the curl sketch after this list)
- Parallel sampling and beam search without code changes – just flip flags in the OpenAI-compatible REST payload
- Chunked prefill that overlaps prompt ingestion with generation, shaving another 15-20 % off TTFT on long contexts
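As a sketch of the grammar-guided path: vLLM's OpenAI-compatible server accepts vLLM-specific extension fields such as `guided_json` alongside the standard payload. Field names and defaults can shift between releases, so check the structured-output docs for your version; host, port, and model below are placeholders.

```bash
# Constrain the response to a JSON schema via vLLM's guided decoding.
# `guided_json` is a vLLM-specific extension field; host/port/model are placeholders.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3-medium-128k",
        "messages": [{"role": "user", "content": "Name a city and its population."}],
        "guided_json": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"}
          },
          "required": ["city", "population"]
        }
      }'
```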
Which vLLM limitations should I still watch out for?
- LoRA throughput: still 30-40 % slower than base-model inference at the same batch size; the team targets parity in vLLM 1.0 (Q3 2025).
- Structured output: deterministic JSON works, but nested arrays with optional fields can break – plan to validate server-side until Q4 fixes land.
- State-space (Mamba) models: experimental branch only; expect 5-10× speed-ups on 1 M+ token contexts once the CUDA kernels stabilize.
Keep an eye on the GitHub milestones – the roadmap is public and milestone dates have slipped by ≤ 2 weeks so far.