Google Gemini-2.0-Flash achieves 0.7% AI hallucination rate
Serge Bulaev
Google's Gemini-2.0-Flash model has the lowest AI hallucination rate at just 0.7%, beating other top models. Hallucinations occur when an AI asserts facts not supported by real data. Legal and medical questions are harder for AI and produce more errors. Researchers use purpose-built benchmarks and tools to measure and catch these mistakes. Continuously tracking hallucination rates is essential, because models often fail more in real use than in tests.

Understanding AI hallucination rates is critical for teams deploying large language models (LLMs). While Google Gemini-2.0-Flash's 0.7% AI hallucination rate sets a new benchmark, consistent cross-model audits still find significant factual errors. This guide details how researchers measure LLM hallucinations and presents key performance data every developer should consider.
Current AI Hallucination Leaderboard
Google's Gemini-2.0-Flash currently leads with a 0.7% hallucination rate on grounded summarization tasks, followed closely by other top models from OpenAI and Anthropic in the 0.8% to 1.5% range. However, these rates can increase significantly depending on the complexity and domain of the prompt.
On the public Vectara summarization leaderboard, Google's Gemini-2.0-Flash recorded the lowest hallucination rate at approximately 0.7 percent. A 2026 review by Scott Graffius corroborates this, placing three other leading models between 0.8 and 1.5 percent on identical grounded tasks. In contrast, open-domain reasoning proves less reliable; OpenAI's o3 series exhibited a 33-51 percent hallucination rate on SimpleQA and PersonQA benchmarks. The prompt's domain is also a major factor: top models average 6.4 percent hallucinations for legal questions versus just 0.8 percent for general trivia, with medical queries registering around 4.3 percent.
Defining and Measuring AI Hallucinations
Researchers classify an AI output as a hallucination when it includes a claim not supported by provided source documents or established public facts. To quantify this, enterprise teams use three primary metrics:
- Response-level rate: The percentage of responses containing at least one fabricated assertion.
- Assertion-level rate: The number of false claims per 1,000 tokens, sometimes weighted by confidence scores.
- Groundedness score: The proportion of sentences not substantiated by the provided context.
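The three metrics above can be computed directly from judged outputs. The sketch below assumes a hypothetical `JudgedResponse` record populated by a verifier; the field names are illustrative, not from any specific tool.

```python
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    token_count: int            # tokens in the model's response
    false_claims: int           # fabricated assertions found by the judge
    sentences: int              # total sentences in the response
    unsupported_sentences: int  # sentences not grounded in the context

def response_level_rate(responses):
    """Share of responses containing at least one fabricated assertion."""
    flagged = sum(1 for r in responses if r.false_claims > 0)
    return flagged / len(responses)

def assertion_level_rate(responses):
    """False claims per 1,000 generated tokens."""
    claims = sum(r.false_claims for r in responses)
    tokens = sum(r.token_count for r in responses)
    return 1000 * claims / tokens

def groundedness_gap(responses):
    """Proportion of sentences not substantiated by the provided context."""
    unsupported = sum(r.unsupported_sentences for r in responses)
    total = sum(r.sentences for r in responses)
    return unsupported / total
```

Note that a model can look strong on the response-level rate while still scoring poorly on the assertion-level rate if its errors cluster in a few long, claim-dense answers.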
Automated verification is now standard, with LLM-as-a-judge pipelines like W&B Weave achieving up to 91 percent detection accuracy using a combination of entailment and consistency checks.
How to Design a Reproducible Hallucination Benchmark
A robust benchmark for measuring hallucinations should incorporate a mix of open and grounded tasks, employ adversarial prompts, and report confidence intervals. A reliable process includes these steps:
- Curate a diverse prompt set using established benchmarks like TruthfulQA, FEVER, and SimpleQA, plus a custom domain-specific corpus (e.g., legal or clinical).
- Include a grounded task set, such as news summarization or Retrieval-Augmented Generation (RAG) Q&A, where answers must cite the provided source documents.
- Develop adversarial prompt variations that introduce subtle falsehoods or require strict citations to probe for speculative outputs.
- Evaluate results using an LLM-as-a-judge system alongside manual spot-checks, recording both response-level and assertion-level scores.
- Publish all evaluation scripts, raw model outputs, and an interactive dashboard to enable others to reproduce the tests as models evolve.
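The steps above can be sketched as a minimal harness. The `generate` and `judge` callables are assumptions standing in for a model API and an LLM-as-a-judge verifier; publishing the `results` list alongside the score covers the raw-output requirement.

```python
def run_benchmark(prompts, generate, judge):
    """Reproducible hallucination benchmark (sketch).

    generate(prompt) -> str          : wraps the model under test
    judge(prompt, answer) -> int     : number of fabricated assertions
    Both are caller-supplied stubs here, not a real API.
    """
    results = []
    for p in sorted(prompts):  # deterministic ordering for reproducibility
        answer = generate(p)
        results.append({
            "prompt": p,
            "answer": answer,
            "false_claims": judge(p, answer),
        })
    flagged = sum(1 for r in results if r["false_claims"] > 0)
    return {
        "response_level_rate": flagged / len(results),
        "results": results,  # publish raw outputs, not just the headline score
    }
```

Swapping in adversarial prompt variants is then just a matter of extending the `prompts` list, with no change to the scoring logic.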
The Importance of Continuous Hallucination Monitoring
Benchmark scores are not static. Impressive sub-1 percent hallucination rates achieved on a static leaderboard can obscure double-digit failure rates in live production environments. For this reason, methodologies like Sparkco's 2026 framework recommend implementing telemetry that continuously streams hallucination rates, confidence-weighted errors, and spike alerts correlated with specific input features. Without this live data, teams risk deploying models that perform significantly worse in real-world scenarios than they did during testing.
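A minimal version of such telemetry is a rolling-window rate with a spike alert. The window size and threshold below are illustrative assumptions, not values from the Sparkco framework.

```python
from collections import deque

class HallucinationMonitor:
    """Rolling-window hallucination telemetry (sketch; parameters are
    illustrative assumptions, not a published specification)."""

    def __init__(self, window=1000, alert_threshold=0.02):
        self.window = deque(maxlen=window)   # most recent judged responses
        self.alert_threshold = alert_threshold

    def record(self, flagged: bool) -> bool:
        """Record one judged response; return True when the rate spikes."""
        self.window.append(1 if flagged else 0)
        return self.rate() > self.alert_threshold

    def rate(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0
```

In production, each `record` call would be fed by the same judge used offline, so the live rate stays directly comparable to the benchmark score.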
How low is Google Gemini-2.0-Flash's hallucination rate on grounded tasks?
On the Vectara HHEM summarization leaderboard, the model achieves a 0.7% hallucination rate. This means fewer than one in a hundred summaries introduces information contradicted by the source document. This score places Gemini-2.0-Flash at the top of 2025 public rankings for this task, just ahead of other leading models in the 0.8-1.5% range.
Why does the same model score much higher on open-domain benchmarks?
The task design is the critical factor. When given unconstrained factual questions from benchmarks like SimpleQA or PersonQA, the hallucination rate can jump into the 33-51% range, similar to OpenAI's o3 series. This disparity demonstrates that "hallucination rate" is not a single, universal metric; it is meaningless unless quoted alongside the specific test set and prompting method.
What evaluation protocol produced the 0.7% figure?
Vectara's open and reproducible pipeline is as follows:
1. Provide the model with a news article from the CNN/Daily Mail (CNN/DM) dataset.
2. Request a concise summary of the article.
3. Use an LLM-as-a-judge verifier to label each sentence in the summary as "supported," "contradicted," or "neutral" based on the source text.
4. Calculate the final score as the percentage of summaries containing one or more contradicted sentences.
The dataset and evaluation code are publicly available, allowing teams to reproduce the test or adapt it for their own documents.
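The scoring step of this protocol reduces to a few lines. The sketch below is not Vectara's code; `label_sentence` is a caller-supplied stand-in for the LLM-as-a-judge verifier.

```python
def hhem_style_score(summaries, label_sentence):
    """Score summaries as in the four-step protocol above (sketch).

    summaries: list of (source_text, [summary_sentences]) pairs.
    label_sentence(source, sentence) -> "supported" | "contradicted"
        | "neutral"; in practice an LLM-as-a-judge, stubbed here.
    """
    flagged = 0
    for source, sentences in summaries:
        labels = [label_sentence(source, s) for s in sentences]
        if any(label == "contradicted" for label in labels):
            flagged += 1
    # Fraction of summaries with at least one contradicted sentence.
    return flagged / len(summaries)
```

Because the score counts whole summaries rather than individual sentences, a single contradicted sentence flags the entire summary.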
Is 0.7% "good enough" for production RAG systems?
For high-volume, customer-facing systems, a 0.7% error rate can still be too high, translating to 7 incorrect answers for every 1,000 interactions. This can pose significant compliance or reputational risks. Therefore, production teams typically add layers of retrieval-evaluation filters, confidence scoring thresholds, and human-in-the-loop review to push the effective error rate below 0.1%. The 0.7% figure is best used as a comparative benchmark, not a direct indicator of production readiness.
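The effect of those layered defenses can be estimated with a simple model. The independence assumption and the recall figures below are illustrative, not measured values.

```python
def effective_error_rate(base_rate, filter_recalls):
    """Residual hallucination rate after layered filters (sketch).

    Each filter catches a fraction `recall` of the hallucinations that
    survive the previous layer; treating layers as independent is a
    simplifying assumption.
    """
    rate = base_rate
    for recall in filter_recalls:
        rate *= (1 - recall)
    return rate
```

Under these assumptions, a 0.7% base rate with a retrieval filter catching 80% of errors and a human-review layer catching half of the remainder would land below the 0.1% target the text mentions.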
Which other models sit in the "sub-1%" club in 2025?
On the same grounded summarization benchmark, four frontier models currently score at or below a 1% hallucination rate:
- Google Gemini-2.0-Flash (≈0.7%)
- OpenAI GPT-4-Turbo-2025 (≈0.8%)
- Anthropic Claude-3-Opus (≈0.9%)
- Google Gemini-1.5-Pro (≈1.0%)
It is crucial to remember these figures are dataset-specific. Rates often climb to 2-6% on specialized legal, medical, or financial documents and can surpass 15% with adversarial or ambiguous prompts.