DoorDash adopts LLM simulation to cut chatbot hallucinations by 90%

DoorDash built a simulation-and-evaluation system to test its chatbots after manual quality checks could not keep up with changing customer conversations. The company uses an offline simulator to create realistic chat conversations and an automated evaluator that checks each conversation against specific rules. After targeted fixes, DoorDash reportedly saw a 90% drop in chatbot hallucinations, though coverage gaps and compute costs remain. The team keeps adjusting the system and says early results suggest offline testing can help predict how the chatbot will perform with real customers.

DoorDash has built a simulation-and-evaluation flywheel to test its LLM chatbots, because manual QA could not keep up with dynamic customer conversations. According to an InfoQ write-up, the system is designed to cut chatbot hallucinations and relies on two core components: an offline simulator that generates realistic conversation transcripts and an automated evaluator that grades them against strict, binary policies.

This framework enables engineers to run hundreds of synthetic conversations in minutes. The system automatically scores each interaction for issues like hallucinations or brand tone violations, gating software releases behind a suite of over 50 automated checks. The approach prioritizes scalable, data-driven testing while retaining human oversight for final calibration.

Inside DoorDash's simulation-and-evaluation flywheel

The flywheel works in two steps. It first generates hundreds of synthetic, multi-turn conversations based on historical data, then automatically grades each one against 50+ binary rules, including checks for factual accuracy, policy compliance, and brand tone.

The simulator is trained on historical chat logs to construct realistic user scenarios, combining customer intents, user profiles, and probable escalation paths. These synthetic "customers" are then used to test the production chatbot. DoorDash reports the ability to run over 200 complete conversations in less than five minutes, providing rapid feedback to developers on prompt and retrieval adjustments.
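The scenario-building step can be pictured with a small sketch. This is not DoorDash's simulator, which reportedly derives scenarios from historical chat logs and drives them with an LLM. It only illustrates how intents, profiles, and escalation paths might be combined into test cases. Every name and value below (Scenario, INTENTS, PROFILES, ESCALATION_PATHS, build_scenarios) is a hypothetical placeholder.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One synthetic customer setup (hypothetical structure)."""
    intent: str
    profile: str
    escalation_path: tuple


# Hypothetical axes; a real system would mine these from chat history.
INTENTS = ["missing_item", "late_delivery", "refund_request"]
PROFILES = ["new_customer", "frequent_customer"]
ESCALATION_PATHS = [
    ("self_serve",),
    ("self_serve", "ask_for_agent"),
    ("self_serve", "repeat_complaint", "ask_for_agent"),
]


def build_scenarios(n, seed=0):
    """Sample up to n distinct scenarios from the cross product of the axes."""
    rng = random.Random(seed)  # seeded so a test batch is reproducible
    combos = list(itertools.product(INTENTS, PROFILES, ESCALATION_PATHS))
    picked = rng.sample(combos, k=min(n, len(combos)))
    return [Scenario(*combo) for combo in picked]


scenarios = build_scenarios(5)
for s in scenarios:
    print(s.intent, s.profile, " -> ".join(s.escalation_path))
```

Seeding the sampler keeps a batch stable across builds, so a change in pass rate reflects the chatbot rather than a different draw of scenarios.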

The conversation outputs are passed to an LLM-based "judge" for evaluation. To ensure accuracy and reduce scoring noise, this judge was calibrated against human raters before deployment. Each test yields a simple pass/fail result for critical policies, including:

Factual grounding and hallucination
Refund policy compliance
Issue classification accuracy
Courtesy and brand tone
Escalation timing

This binary scoring system allows for clear, aggregate performance tracking across software builds. DoorDash documented multiple tuning iterations, using regressions to identify and quickly correct system weaknesses.
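A minimal sketch of how pass/fail results might be rolled up per build and compared, assuming each conversation yields a dictionary of check name to boolean. The check names and the helper functions pass_rates and regressions are illustrative assumptions, not DoorDash's code.

```python
from collections import defaultdict


def pass_rates(results):
    """Aggregate per-conversation pass/fail dicts into a pass rate per check."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for conversation in results:
        for check, ok in conversation.items():
            total[check] += 1
            passed[check] += int(ok)
    return {check: passed[check] / total[check] for check in total}


def regressions(baseline, candidate, tolerance=0.0):
    """List checks whose pass rate dropped by more than `tolerance`."""
    return sorted(
        check
        for check, base_rate in baseline.items()
        if candidate.get(check, 0.0) < base_rate - tolerance
    )


# Hypothetical judge output for two builds, two conversations each.
build_a = [
    {"factual_grounding": True, "refund_policy": True},
    {"factual_grounding": True, "refund_policy": False},
]
build_b = [
    {"factual_grounding": False, "refund_policy": True},
    {"factual_grounding": True, "refund_policy": True},
]

rates_a = pass_rates(build_a)
rates_b = pass_rates(build_b)
regressed = regressions(rates_a, rates_b)
print(rates_a, rates_b, regressed)
```

Because every check is binary, the aggregate is a plain proportion, which makes build-to-build comparisons easy to read.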

Early quantitative signals

Following targeted fixes in context engineering, DoorDash reported a 90% reduction in simulated chatbot hallucinations. The company noted a "good" correlation between these offline results and live traffic performance. However, engineers acknowledge limitations, including potential gaps in simulator coverage and the compute cost of every run. Each test run consumes LLM tokens, and expanding coverage to include more edge cases or language variations increases this expense.

The system's design aligns with key industry best practices for LLM evaluation. The approach reflects several core principles:

Conversation, not single-prompt, testing.
Metric gates that block deploys.
Rapid, synthetic user generation to surface rare failures.
Continuous calibration between LLM judges and human reviewers.
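The calibration principle can be made concrete with a standard agreement measure. The sketch below computes raw agreement and Cohen's kappa between judge verdicts and human verdicts on the same conversations. The article does not say which statistic DoorDash uses, so the choice of kappa and the sample labels are assumptions for illustration.

```python
def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two lists of pass/fail booleans."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of conversations the judge passed
    p_human = sum(human) / n   # share of conversations the human passed
    # Agreement expected by chance if the two raters were independent.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return observed, 1.0   # both raters gave every item the same single label
    return observed, (observed - expected) / (1 - expected)


# Hypothetical verdicts on eight conversations (True = pass).
judge_labels = [True, True, False, False, True, False, True, True]
human_labels = [True, True, False, True, True, False, True, False]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Kappa corrects raw agreement for chance, which matters when most conversations pass: a judge that passes everything can agree with humans often while adding no signal.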

What the numbers mean for teams in 2025

The ability to run 200 simulated conversations in five minutes turns traditional, time-consuming regression testing into a near-real-time feedback loop, and the automated suite of over 50 checks means that speed does not come at the expense of quality gates. The "flywheel" model shifts cost from manual QA labor to LLM compute. For large-scale support organizations the tradeoff is favorable, since the cost of a single refund error can far exceed that of thousands of test tokens.

A key architectural innovation highlighted by DoorDash engineers is the use of a structured "case state" object. This object condenses raw data and tool logs into a concise, relevant context for the LLM. The team credits this approach with a significant portion of the hallucination reduction, as it prevents the model from reasoning over unstructured and unvetted data.

DoorDash continues to refine its flywheel, currently expanding its capabilities to include more nuanced brand tones and multilingual intents. The team acknowledges that each new dimension adds to both token consumption and the need for judge calibration. Nevertheless, early production data reportedly mirrors the gains seen in simulation, suggesting that this controlled, offline testing method can accurately predict real-world performance without exposing customers to developmental risks.

What exactly did DoorDash build to slash hallucinations?

DoorDash rolled out a two-part offline testing rig:
1. LLM simulator - ingests historical chat transcripts to spin up multi-turn conversations with diverse customer personas and realistic escalation patterns.
2. LLM evaluator - automatically scores every simulated chat against binary policy checks (e.g., no refunds above policy limit, neutral tone).

The pipeline now runs hundreds of simulated conversations in under five minutes and gates every code change behind 50+ automated checks.
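A release gate of this kind could look like the sketch below, which fails closed when a check is missing or under its threshold. The check names, thresholds, and the functions failed_checks and exit_code are hypothetical. The article does not describe how DoorDash wires its checks into CI.

```python
def failed_checks(rates, thresholds):
    """Return checks that are missing from `rates` or below their minimum pass rate."""
    failures = []
    for check, minimum in thresholds.items():
        rate = rates.get(check)
        if rate is None or rate < minimum:  # a missing result blocks the release
            failures.append(check)
    return failures


def exit_code(rates, thresholds):
    """Map the gate result to a process exit code a CI job could return."""
    failures = failed_checks(rates, thresholds)
    for check in failures:
        print(f"BLOCKED by {check}: {rates.get(check)} < {thresholds[check]}")
    return 1 if failures else 0


# Hypothetical thresholds and simulated pass rates for one build.
THRESHOLDS = {"factual_grounding": 0.98, "refund_policy": 1.0, "brand_tone": 0.95}
build_rates = {"factual_grounding": 0.99, "refund_policy": 0.97}

code = exit_code(build_rates, THRESHOLDS)
print("exit code:", code)
```

Treating a missing check as a failure keeps a broken evaluator from silently approving a release.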

How big was the actual hallucination drop?

In repeated simulation cycles, hallucinations fell by a reported 90%. The same changes were then released to real traffic, where internal telemetry showed a correlated production improvement with no new regressions.

What is the "case state" architecture?

Instead of handing raw API logs to the model, DoorDash compresses them into a curated "case state" - a structured snapshot of order, issue, previous turns, and tool results. This single context window cuts irrelevant noise and keeps the model focused on verifiable facts, a pattern now echoed across industry guides on grounded chatbots.
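As a rough illustration of the idea, the sketch below builds a compact snapshot from noisier inputs. The four top-level fields follow the article (order, issue, previous turns, tool results). The class name CaseState, the whitelist ORDER_FIELDS, the helper build_case_state, and the input shapes are all assumptions, not DoorDash's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseState:
    """Curated context handed to the model (hypothetical schema)."""
    order: dict
    issue: str
    previous_turns: list = field(default_factory=list)
    tool_results: dict = field(default_factory=dict)

    def to_context(self, max_turns=4):
        """Serialize the snapshot, keeping only the most recent turns."""
        snapshot = asdict(self)
        snapshot["previous_turns"] = self.previous_turns[-max_turns:]
        return json.dumps(snapshot, indent=2, sort_keys=True)


# Hypothetical whitelist of order fields worth showing to the model.
ORDER_FIELDS = ("order_id", "status", "total")


def build_case_state(raw_order, issue, turns, tool_calls):
    """Drop unlisted order fields and failed tool calls before prompting."""
    order = {k: raw_order[k] for k in ORDER_FIELDS if k in raw_order}
    tool_results = {c["tool"]: c["result"] for c in tool_calls if c.get("ok")}
    return CaseState(order, issue, list(turns), tool_results)


raw_order = {"order_id": "A1", "status": "delivered", "total": 23.5,
             "internal_trace": "debug-blob"}
tool_calls = [
    {"tool": "refund_eligibility", "ok": True, "result": {"eligible": True}},
    {"tool": "courier_location", "ok": False, "result": None},
]
turns = [f"turn {i}" for i in range(6)]

state = build_case_state(raw_order, "missing_item", turns, tool_calls)
context = state.to_context()
print(context)
```

The point of the pattern is that filtering happens in ordinary code before the prompt is assembled, so the model only sees fields someone has chosen to expose.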

Does the simulator really save money?

Yes. Replacing human QA scripts with synthetic runs moved the marginal cost per test conversation from ~$6 (human) to < $0.05 (LLM tokens). For an average release cycle that needed multiple iterations before passing the suite, simulation is estimated to have saved significant engineering hours and a substantial share of the manual QA budget.
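To put the per-conversation figures in context, here is the arithmetic for one batch. The batch size of 200 conversations is an assumption borrowed from the throughput figure reported for the simulator, and the LLM figure is treated as an upper bound because the article gives it as "< $0.05".

```python
HUMAN_COST_PER_CONVERSATION = 6.00  # "~$6 (human)" per the article
LLM_COST_PER_CONVERSATION = 0.05    # upper bound; article says "< $0.05"
CONVERSATIONS_PER_RUN = 200         # assumption: one simulated batch


def run_cost(cost_per_conversation, conversations=CONVERSATIONS_PER_RUN):
    """Marginal cost of one test batch at a given per-conversation price."""
    return cost_per_conversation * conversations


human_run = run_cost(HUMAN_COST_PER_CONVERSATION)
llm_run = run_cost(LLM_COST_PER_CONVERSATION)
cost_ratio = human_run / llm_run

print(f"human batch: ${human_run:,.2f}")
print(f"LLM batch:   under ${llm_run:,.2f}")
print(f"ratio:       at least {cost_ratio:.0f}x")
```

Under these assumptions a 200-conversation batch costs about $1,200 with human testers and under $10 in tokens, a ratio of at least 120 to 1, before counting engineering time or judge calibration.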

What limitations remain?

Coverage gaps: rare edge cases still slip through; DoorDash supplements simulation with live shadow traffic sampling.
Compute bill: even optimized runs consume considerable resources per release cycle. The team budgets accordingly and is exploring open-source inference stacks to drive costs down further.