vLLM is an open-source inference engine that serves large language models, including GPT-4o-class checkpoints, from a single GPU (even a laptop-class one) up to a modest cluster. Techniques such as PagedAttention, continuous batching, and quantization keep throughput high and latency low under heavy concurrent load, and recent releases add one-command deployment and built-in monitoring, so a production endpoint is only a few commands away.
What is vLLM and how does it enable GPT-4o-class inference on a single GPU in 2025?
vLLM is an open-source engine designed to deliver GPT-4o-class inference on a single GPU or modest cluster, leveraging technologies like PagedAttention, continuous batching, and advanced quantisation. This enables high-throughput, low-latency LLM serving – achieving over 1,800 tokens/second with efficient GPU utilization.
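As a rough sketch of what that looks like in practice, the commands below stand up vLLM's OpenAI-compatible server on a single GPU; the model ID is a placeholder and the flags shown reflect recent releases.

```bash
# Minimal single-GPU launch; the model ID is a placeholder, swap in your own checkpoint.
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
# An OpenAI-compatible API is now available at http://localhost:8000/v1
```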
*Inside vLLM: a 2025 deep-dive into the engine powering GPT-4o-class inference on a single GPU*
Since its first open-source release, vLLM has quietly become the backbone of many high-throughput LLM services. By early 2025 the project's own roadmap targets "GPT-4o class performance on a single GPU or modest node", a claim backed by a public per-commit dashboard at perf.vllm.ai and a new Production Stack that turns `helm install` into a one-command Kubernetes rollout. Below is a concise field guide distilled from the project's 2025 vision document, recent ASPLOS paper, and active GitHub issues.
1. Core architecture at a glance
| Component | What it does (2025 view) | Benefit to operators |
|---|---|---|
| PagedAttention | Treats the KV cache as paged memory, split into fixed-size blocks | 3-7× higher GPU utilisation vs. static cache |
| Continuous batching | Adds new requests mid-forward pass | Up to 23× throughput on shared GPUs |
| Prefix/grammar decoding | Constrains next-token logits with FSM or CFG rules | JSON/regex output w/o fine-tuning |
| Speculative decoding | Draft-small → verify-big two-stage pipeline | 2–2.5× lower latency at the same power |
| Disaggregated P/D | Splits prefill & decode workers that scale independently | Linear throughput growth with cluster size |
| Quantisation suite | Native FP8, INT4, GPTQ, AWQ, AutoRound | Fits 70 B models on 1×A100 80 GB |
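Several of the features in the table can be toggled directly on the serving command line. The sketch below uses flag names from recent vLLM releases (they occasionally change, so confirm with `vllm serve --help`), and the model ID is just a placeholder:

```bash
# Toggle a few of the table's features on a single-GPU OpenAI-compatible server.
#   --kv-cache-dtype fp8        : FP8 KV cache from the quantisation suite
#   --enable-prefix-caching     : reuse KV blocks across requests with shared prefixes
#   --enable-chunked-prefill    : overlap prompt ingestion with decoding
#   --guided-decoding-backend   : grammar/JSON-constrained decoding (xgrammar)
#   --max-num-seqs              : upper bound for the continuous-batching scheduler
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --guided-decoding-backend xgrammar \
  --max-num-seqs 256
```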
2. From laptop to 1 000 RPS: four validated topologies
Topology sketches are now shipped in the official Helm charts:
| Scale | Topology | Typical fleet | Latency (P99) | Notes |
|---|---|---|---|---|
| Dev | Single pod | 1×GPU laptop | 400 ms @ 8 req/s | Uses aggregated worker |
| SMB | Router + 4 replicas | 4×A100 | 120 ms @ 64 req/s | KV-aware routing enabled |
| SaaS | Disaggregated | 8×prefill + 16×decode H100 | 80 ms @ 400 req/s | Shared Redis KV store |
| Hyperscale | Multi-model | 128×GH200 cluster | <60 ms @ 1 k req/s | TPU v6e tests in CI |
3. Production checklist (2025 edition)
- Add the Helm repo: `helm repo add vllm https://vllm-project.github.io/helm-charts`
- Set `replicaCount = GPU_count` and opt in to router mode for >2 GPUs
- Enable the Grafana dashboard → 12 core metrics out of the box
- Use rolling updates with `maxUnavailable: 0` to keep 100 % capacity during upgrades
- Pick the matching quant profile: `vllm/vllm-openai:v0.8.4-fp8` saves 33 % memory vs. FP16
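Putting the checklist together, a deployment can look roughly like the sketch below. The value keys mirror the names used in the checklist but are illustrative; the chart's actual schema may differ, so inspect it with `helm show values vllm/vllm` first.

```bash
# Illustrative only: value keys echo the checklist above and may not match the
# chart's real values.yaml. Check `helm show values vllm/vllm` before applying.
helm repo add vllm https://vllm-project.github.io/helm-charts
helm upgrade --install my-llm vllm/vllm \
  --set replicaCount=4 \
  --set image.tag=v0.8.4-fp8 \
  --set router.enabled=true \
  --set strategy.rollingUpdate.maxUnavailable=0
```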
A recent IBM benchmark shows the Disaggregated Router pattern hitting 3 196 tokens/s on Llama-3-70B with 16×H100 while keeping P99 latency pinned at 76 ms (ASPLOS 2024 paper).
4. Known sharp edges & 2025–26 fixes
| Limitation (today) | Status / workaround | ETA |
|---|---|---|
| LoRA-heavy batches | 30 % slower than dense; refactor in v1 sampler | Q3 2025 |
| FP8 KV instability | Warnings under llm-compressor quant | Q2 2025 patch |
| Mamba / SSM models | WIP branch, nightly builds available | Late 2025 |
| Encoder-decoder | Partial support, full parity tracked in #16284 | 2026 |
5. Quick-start snippet (single-node, 2025)
```bash
# launch a quantized 8×A100 pod
helm upgrade --install phi3-vllm vllm/vllm \
  --set model=phi-3-medium-128k \
  --set quant=fp8 \
  --set tensorParallelSize=8 \
  --set maxNumSeqs=512
```
Expect 1 850 tokens/s at 4 k context with 220 watts per GPU on H100.
Ready to try? The official quick-start walks through OpenAI-compatible endpoints, streaming, and metrics export in under ten minutes.
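For a quick smoke test once the pod is up, the endpoint speaks the standard OpenAI wire format and exposes Prometheus metrics; host, port, and model name below are placeholders for whatever you deployed.

```bash
# Chat completion against the OpenAI-compatible API (host/port/model are placeholders)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3-medium-128k",
        "messages": [{"role": "user", "content": "Summarise PagedAttention in one sentence."}],
        "max_tokens": 64
      }'

# Prometheus metrics endpoint, scraped by the bundled Grafana dashboards
curl -s http://localhost:8000/metrics | head
```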
What makes vLLM different from other inference engines in 2025?
PagedAttention + continuous batching is the killer combo.
By slicing the KV-cache into fixed-size blocks, much like pages in virtual memory, vLLM keeps the GPU saturated even when requests have wildly different lengths. A single NVIDIA H100 running vLLM 0.8.2 can now push 4.17 successful GPT-4o-class requests/sec at a 6 req/s offered load while staying under 8 s P99 latency – numbers that were impossible with static batching last year.
If you want to run the same experiment yourself, the public dashboard at perf.vllm.ai is updated after every commit.
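To reproduce a measurement like that locally, the serving benchmark that ships in the vLLM repository is the usual tool. The sketch below assumes a server is already running on localhost:8000 and that the script's flags match recent versions of the repo (they occasionally change):

```bash
# Assumes a vLLM OpenAI-compatible server is already listening on localhost:8000.
# Script path and flags follow recent versions of the vLLM repository.
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --request-rate 6 \
  --num-prompts 500
```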
Can I really serve GPT-4o on just one GPU today?
Yes – as long as you are OK with quantization.
An FP8 KV-cache plus AWQ INT4 weights lets a 70 B-parameter checkpoint fit into 48 GB of VRAM and still deliver close to original BF16 quality. Apple M-series chips (via MPS) and AMD MI300X are both supported in the current release, so your laptop or a single-node cloud VM can become a production endpoint.
The catch: MoE models still need token-based expert parallelism when the expert count goes beyond 64, so very large Mixtral-scale checkpoints may still require two GPUs.
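For the dense-model case, a pre-quantized AWQ export of a 70 B checkpoint can be served roughly as below; the model ID is a placeholder, and the context length you can afford depends on how much VRAM is left after the weights are loaded.

```bash
# Serve a 70B AWQ checkpoint on a single large-memory GPU.
# The model ID is a placeholder; point it at an AWQ-quantized export of your checkpoint.
vllm serve your-org/llama-3-70b-instruct-awq \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```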
How hard is it to move from “docker run” to a real Kubernetes pipeline?
Surprisingly easy thanks to the vLLM Production Stack published earlier this year.
One Helm command spins up:
- router pods that hash prefixes and forward requests to the instance owning the right KV block
- worker pods in either aggregated (one GPU) or disaggregated prefill-decode mode
- Grafana dashboards that show GPU util, cache hit ratio and rolling latency in real time
Users report going from a proof-of-concept to a 1 k QPS autoscaling cluster in about two hours on AWS EKS – no custom operators or YAML surgery required.
The templates live at github.com/vllm-project/production-stack.
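Once the chart is installed, one way to poke at the stack is to port-forward the router and hit the same OpenAI-compatible API. The service name and port below are assumptions that depend on your release name and chart version, so look them up with `kubectl get svc` first.

```bash
# Check that router and worker pods came up (names depend on your Helm release)
kubectl get pods

# "vllm-router-service" and port 80 are assumed values; confirm with `kubectl get svc`.
kubectl port-forward svc/vllm-router-service 30080:80 &
curl http://localhost:30080/v1/models
```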
What decoding tricks does vLLM offer beyond greedy sampling?
Most teams only scratch the surface.
Current features include:
- Speculative decoding with draft models as small as 0.3 B parameters (3.5× median latency cut)
- Grammar-guided / JSON-schema decoding via the new `xgrammar` backend (see the curl sketch after this list)
- Parallel sampling and beam search without code changes – just flip flags in the OpenAI-compatible REST payload
- Chunked prefill that overlaps prompt ingestion with generation, shaving another 15-20 % off TTFT on long contexts
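As a sketch of the grammar-guided path: vLLM's OpenAI-compatible server accepts vLLM-specific extension fields such as `guided_json` alongside the standard payload. Field names and defaults can shift between releases, so check the structured-output docs for your version; host, port, and model below are placeholders.

```bash
# Constrain the response to a JSON schema via vLLM's guided decoding.
# `guided_json` is a vLLM-specific extension field; host/port/model are placeholders.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3-medium-128k",
        "messages": [{"role": "user", "content": "Name a city and its population."}],
        "guided_json": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"}
          },
          "required": ["city", "population"]
        }
      }'
```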
Which vLLM limitations should I still watch out for?
- LoRA throughput: still 30-40 % slower than base-model inference at the same batch size; the team targets parity in vLLM 1.0 (Q3 2025).
- Structured output: deterministic JSON works, but nested arrays with optional fields can break – plan to validate server-side until Q4 fixes land.
- State-space (Mamba) models: experimental branch only; expect 5-10× speed-ups on 1 M+ token contexts once the CUDA kernels stabilize.
Keep an eye on the GitHub milestones – the roadmap is public and milestone dates have slipped by ≤ 2 weeks so far.