DeepSeekMath-V2 scores 118/120 on Putnam, achieves IMO Gold

DeepSeekMath-V2 has reached gold-medal level at the International Mathematical Olympiad (IMO) and scored 118/120 on the Putnam exam, setting a new frontier in AI-driven mathematical reasoning. Developed by DeepSeek AI, the model is all the more significant because its weights are openly released, giving researchers a transparent blueprint for large language models that prioritize verifiable proofs over mere answers.

Competition scores that outpace humans

DeepSeekMath-V2 posts competition results beyond the best human contestants, securing a near-perfect 118/120 on the 2024 Putnam exam and meeting the gold-medal standard at the 2025 IMO. These results, achieved by an open-weights model, rival those of leading closed, proprietary AI systems.

The model's 118/120 on the 2024 Putnam exam far exceeds the top human score of 90, as reported by Marktechpost. It also achieved a 99% success rate on the IMO-ProofBench Basic subset, outperforming Google's Gemini DeepThink by 10 points. On the more challenging Advanced subset, it scored 62%, as cited by Apidog.

A summary of its achievements:

IMO 2025: Gold medal standard, 5 of 6 full solutions
Putnam 2024: 118/120
IMO-ProofBench Basic: 99% success rate
Parameters: 685 billion (mixture-of-experts)

Self-verifiable architecture drives accuracy

The model's high accuracy stems from a novel self-verifiable architecture. It pairs a powerful proof generator with a lightweight verifier that systematically checks each logical step by parsing it into an abstract syntax tree. The verifier serves as the reward model during training, pushing the generator to find and correct its own errors before producing a final output. At inference, DeepSeek scales this process by generating up to 64 candidate proofs, running 64 parallel verifications, and repeating the loop up to 16 times, a method reported to reduce error rates by 40% relative to baseline models. This dynamic closes the "generation-verification gap" that limited previous systems, while a sparse attention architecture lets the 685B-parameter model maintain context over long, complex derivations.
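The generate-and-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: `generate`, `verify`, `solve`, and the `quality` field are hypothetical stand-ins (a real system would call the proof generator and verifier models), and the sketch assumes that each candidate receives its own batch of verification votes. Only the loop sizes (64 candidates, 64 verifications, 16 rounds) come from the article.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    proof: str
    quality: float  # toy "soundness" in [0, 1]; a stand-in for a real proof


def generate(problem, rng, hint=None):
    """Hypothetical generator stub: refining a prior candidate nudges quality up."""
    base = hint.quality if hint is not None else 0.0
    quality = min(1.0, base + rng.uniform(0.0, 0.3))
    return Candidate(proof=f"proof of {problem} (q={quality:.2f})", quality=quality)


def verify(candidate, rng):
    """Hypothetical verifier stub: one noisy pass/fail vote on a candidate."""
    return rng.random() < candidate.quality


def solve(problem, n_candidates=64, n_verifications=64, max_rounds=16, seed=0):
    """Generate candidates, score each by verifier votes, refine the best, repeat."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        # Each round refines from the best candidate found so far (if any).
        pool = [generate(problem, rng, hint=best) for _ in range(n_candidates)]
        for cand in pool:
            # Assumption: every candidate gets its own set of verification votes.
            votes = sum(verify(cand, rng) for _ in range(n_verifications))
            score = votes / n_verifications
            if score > best_score:
                best, best_score = cand, score
        if best_score == 1.0:  # every verification passed; stop early
            break
    return best, best_score, rounds_used


if __name__ == "__main__":
    best, score, rounds = solve("toy problem", n_candidates=8, n_verifications=16)
    print(best.proof, score, rounds)
```

The design point the sketch tries to capture is that the verifier's aggregate vote, not the generator's own confidence, decides which candidate survives into the next round, and that the loop stops early once a candidate passes every check.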

Open weights reshape research landscape

In a significant move for the AI community, DeepSeek released the model's weights on Hugging Face under the Apache 2.0 license. This decision challenges the trend of closed, proprietary development for frontier-scale systems. Academics and independent researchers can now attempt to reproduce the landmark Olympiad results, conduct ablation studies on the verifier-first pipeline, and fine-tune specialist models without relying on pay-per-token APIs. Although an open model now rivals top systems from Google and OpenAI on formal proof tasks, practical deployment requires significant hardware, such as eight A100 GPUs. The model also shows room for improvement, trailing Gemini DeepThink slightly on the IMO-ProofBench Advanced subset. Nonetheless, DeepSeekMath-V2 establishes a new, publicly accessible baseline: a model that outperforms elite human mathematicians and exposes its internal workings for all to scrutinize and build upon.