Google DeepMind unveils SimpleQA Verified, a new LLM factuality benchmark

Serge Bulaev

Google DeepMind has launched SimpleQA Verified, a new test to see how well AI models answer short, factual questions. It uses 1,000 tough questions from different topics, and an improved AI checks every answer. The latest models, like Gemini 3 Pro, score highest, but some struggle with numbers or certain topics. There's a public scoreboard so everyone can see how the models do. This test is fast, open, and helps show if an AI model really knows its facts.

Google DeepMind has released SimpleQA Verified, a new LLM factuality benchmark that tests how accurately large language models answer short, factual questions. Early results show clear separation between leading AI systems, offering developers concrete insight into current model performance and reliability.

The project refines the original SimpleQA dataset to meet modern standards. It features 1,000 curated prompts that models must answer without external tools like web search. An upgraded GPT-4.1 autorater evaluates each response, and researchers can track live results on a public leaderboard via the alphaXiv benchmark page.
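The autorater's job is mechanical: given the question, a gold answer, and the model's response, it emits one of three verdicts. A minimal sketch of how such a prompt-based grader might be assembled; the template wording, label set, and helper names are illustrative assumptions, not DeepMind's actual grading prompt:

```python
# Sketch of a prompt-based autorater in the SimpleQA style.
# Template text and label names are illustrative assumptions.

GRADER_TEMPLATE = """You are grading a model's answer to a factual question.

Question: {question}
Gold answer: {gold}
Model answer: {prediction}

Reply with exactly one label: CORRECT, INCORRECT, or NOT_ATTEMPTED."""


def build_grader_prompt(question: str, gold: str, prediction: str) -> str:
    """Fill the grading template for one (question, answer) pair."""
    return GRADER_TEMPLATE.format(
        question=question, gold=gold, prediction=prediction
    )


def parse_verdict(raw: str) -> str:
    """Map the grader model's raw reply onto one of the three labels.

    NOT_ATTEMPTED and INCORRECT are checked before CORRECT because
    "CORRECT" is a substring of "INCORRECT"."""
    text = raw.strip().upper()
    for label in ("NOT_ATTEMPTED", "INCORRECT", "CORRECT"):
        if label in text:
            return label
    return "NOT_ATTEMPTED"  # conservative default on unparsable output
```

The prompt would be sent to the grader model (GPT-4.1, in the published setup) and the reply parsed with `parse_verdict`; everything else in the pipeline is bookkeeping over the three verdict counts.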

Methodology upgrades

SimpleQA Verified evaluates the parametric knowledge of large language models: each of the 1,000 curated, fact-based questions must be answered from memory, without access to external tools, providing a pure measure of internal factual accuracy.

The creation process was rigorous: engineers refined the initial 4,326-question SimpleQA pool down to 1,000 distinct, challenging prompts through deduplication, reference checking, and an adversarial filtering pass. The final dataset is balanced across topics: Date (22%), Person (20%), Number (19%), Place (15%), and Other (25%). For numeric questions, acceptable answer ranges were defined to allow for fairer grading of near-correct responses.
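Range-based grading for the numeric categories can be sketched as follows. The actual per-question ranges ship with the dataset; the symmetric 1% relative tolerance below is purely an illustrative assumption:

```python
def within_range(prediction: float, low: float, high: float) -> bool:
    """Grade a numeric answer against a per-question acceptable range."""
    return low <= prediction <= high


def range_from_tolerance(target: float, rel_tol: float = 0.01) -> tuple[float, float]:
    """Derive a symmetric acceptable range from a relative tolerance.

    The 1% default is an illustrative assumption, not the benchmark's
    rule; SimpleQA Verified defines ranges per question."""
    delta = abs(target) * rel_tol
    return target - delta, target + delta
```

This is what makes grading "fairer" for near-correct responses: an answer of 8,850 m for Everest's height would pass a 1% band around 8,848 m, where exact string matching would mark it wrong.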

To isolate a model's inherent knowledge, evaluations are text-only, as allowing tools like search inflates scores toward 100% and masks true capabilities. The autorater classifies answers as correct, incorrect, or unattempted, using the F1 score to balance precision with the model's willingness to attempt difficult questions.
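The F1 score is computed from the three verdict counts. In the original SimpleQA formulation it is the harmonic mean of overall accuracy and accuracy on attempted questions only; a sketch under the assumption that SimpleQA Verified keeps this definition:

```python
def simpleqa_f1(correct: int, incorrect: int, unattempted: int) -> float:
    """Harmonic mean of overall accuracy and accuracy-given-attempted,
    following the original SimpleQA metric (assumed to carry over)."""
    total = correct + incorrect + unattempted
    attempted = correct + incorrect
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0
    overall = correct / total              # penalizes refusing to answer
    given_attempted = correct / attempted  # penalizes wrong guesses
    return 2 * overall * given_attempted / (overall + given_attempted)
```

Note the trade-off this encodes: declining to answer lowers overall accuracy, while a wrong guess lowers accuracy-given-attempted, so a model can inflate its score neither by blanket refusal nor by blind guessing.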

What the leaderboard shows

According to the latest Kaggle leaderboard, Gemini 3 Pro Preview leads with an F1 score of 72.1%, significantly outperforming Gemini 2.5 Pro (55.6%), GPT-5, and Claude Opus 4. Curation preserved the dataset's difficulty: top models achieved similar raw accuracy on the original and verified versions, confirming the test's integrity.

The results also reveal specific model weaknesses. Even top-performing models struggle with numeric questions, which lag behind person and date queries by approximately 18 accuracy points. Furthermore, the balanced topic distribution highlights knowledge gaps, with models showing a higher error rate on sports and geography questions compared to science.

Why this matters for AI reliability

SimpleQA Verified provides a clean, reliable measure of a model's parametric knowledge - the facts stored in its parameters. This is crucial for academic research and commercial product safety. Its open-source nature, with public prompts and labels, allows for complete transparency and auditing, and future versions will include methods to detect training data contamination.

The benchmark is already being integrated into larger evaluation suites like Google's FACTS and Epoch AI's dashboards. Its transparent scoring has also been noted by policymakers as a model for industry reporting standards. For developers, its fast runtime - under 15 minutes on a single A100 GPU - makes it practical for frequent regression testing.

The benchmark does have limitations. Its format cannot assess multi-step reasoning, visual understanding, or knowledge of fast-changing information. The public nature of the dataset also creates a risk of future training contamination, which DeepMind aims to address with a private follow-up set in 2026. Despite these constraints, SimpleQA Verified currently offers the clearest available signal of an LLM's core factual recall.

Key Takeaways

  • Pure Factual Recall: Tests models on 1,000 challenging, balanced questions without the use of external tools.
  • Current Leader: Gemini 3 Pro Preview sets the standard with a 72.1% F1 score.
  • Automated & Transparent: Uses a GPT-4.1 autorater, with a public leaderboard and open-source code for full auditability.
  • Identifies Weaknesses: Reveals specific gaps in model knowledge, particularly in numeric reasoning.
  • Future-Proofing: Plans include addressing multi-step reasoning and preventing benchmark contamination with a private follow-up set.