On August 20, 2025, GPT-5-pro produced a new proof in convex optimization that drew wide attention online. But a similar, even stronger result turned out to have been posted just hours earlier, making it hard to tell whether the AI had invented something new or cleverly recombined existing ideas. The episode illustrates a growing imbalance: AI can generate mathematical proofs far faster than humans can check them. For now, experts see AI as a powerful tool for surfacing overlooked ideas, while insisting that every AI-generated proof be double-checked by people.
What happened when GPT-5-pro generated a new proof in convex optimization?
On August 20, 2025, GPT-5-pro produced a seemingly novel convex optimization proof, verified by mathematician Sebastien Bubeck. However, a similar, stronger result appeared online hours earlier, highlighting how AI-generated proofs often blur the line between genuine invention and advanced retrieval of prior knowledge.
On the night of August 20, 2025, Sebastien Bubeck posted a thread on X that lit up both mathematics and AI timelines: he had fed an open problem from a recent convex-optimization paper into GPT-5-pro and, on its second attempt, the model returned a tighter bound, widening the admissible step-size from 1/L to 1.5/L. The proof was short, verifiable, and – according to Bubeck – not previously published in any known source.
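To make the claim concrete: the bound concerns the step size of gradient descent on an L-smooth convex function. The sketch below is illustrative only, not the disputed proof; it numerically checks that plain gradient descent still converges on a simple L-smooth quadratic when the step size is raised from 1/L to 1.5/L. The specific quadratic and iteration count are assumptions chosen for the demonstration.

```python
import numpy as np

# Illustrative sketch, NOT the disputed proof: gradient descent on
# f(x) = 0.5 * x @ A @ x, an L-smooth convex quadratic whose smoothness
# constant L is the largest eigenvalue of A.
A = np.diag([1.0, 4.0, 10.0])   # eigenvalues 1, 4, 10  ->  L = 10
L = 10.0

def grad(x):
    return A @ x                 # gradient of the quadratic

x = np.array([1.0, 1.0, 1.0])
eta = 1.5 / L                    # the enlarged step size from the claim
for _ in range(200):
    x = x - eta * grad(x)

gnorm = np.linalg.norm(grad(x))  # residual gradient norm after 200 steps
print(gnorm)
```

For this quadratic every per-coordinate contraction factor |1 - eta * lambda_i| stays below 1 at eta = 1.5/L, so the iterates converge; the proof in question establishes a guarantee of this kind for general smooth convex functions, not just quadratics.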
Within hours, the claim was both celebrated and contested:
- OpenAI’s own evaluation sheet lists GPT-5-pro at 100 % accuracy on the Harvard-MIT Mathematics Tournament (HMMT) when paired with Python tools and 94.6 % on AIME 2025 (no tools).
- Yet Hacker News threads pointed out that an anonymous human arXiv comment had posted an even stronger 2/L bound hours earlier, raising suspicion that the model simply retrieved and re-phrased an existing idea.
What actually happened?
| Check-point | Result | Source |
| --- | --- | --- |
| Proof verified by Bubeck | ✅ | Bubeck’s X thread |
| Step-size bound originality | ✅ (per author) | WebProNews summary |
| Stronger 2/L bound posted earlier | ✅ (community note) | Hacker News discussion |
The takeaway is subtle: the improvement was novel relative to the specific prompt, but not globally unprecedented. Critics label it sophisticated recombination; supporters see targeted mathematical reasoning. The line between retrieval and invention is thinner than ever.
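For context on why 2/L is the natural ceiling in this discussion (standard textbook material, not the disputed proof): for an L-smooth function, the descent lemma bounds the effect of one gradient step of size η, and guarantees a strict decrease exactly when η < 2/L.

```latex
% L-smoothness: f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\lVert y - x \rVert^2.
% Substituting the gradient step y = x - \eta \nabla f(x) gives
f\bigl(x - \eta \nabla f(x)\bigr) \;\le\; f(x) \;-\; \eta\Bigl(1 - \tfrac{\eta L}{2}\Bigr)\bigl\lVert \nabla f(x) \bigr\rVert^2,
% which is a strict decrease whenever 0 < \eta < 2/L.
```

Against this backdrop, moving a bound from 1/L toward 2/L is a meaningful tightening, which is why both the model’s 1.5/L and the community’s 2/L drew attention.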
Why this matters for researchers in 2025

- Proof-checking is becoming a bottleneck. AI can now draft a hundred pages of lemmas overnight. Human referees can’t. Universities and journals are racing to adopt Lean 4 + LLM pipelines that auto-formalize prose proofs before review.
- Prior-art surfacing. Bubeck himself suggests the safest near-term use of GPT-5-pro is “a lightning-fast literature scanner” – surfacing obscure bounds, identities or counter-examples that humans can vet.
- New IP headaches. If a model spits out a theorem, who owns the copyright? Current law assigns rights to “the human who prompted”, but 2026 draft legislation in both the EU and US proposes a shared attribution model between the user and the model provider.
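To make the Lean 4 angle tangible, here is a toy illustration (deliberately trivial, and unrelated to the convex-optimization result) of what a machine-checkable statement looks like: once a claim is formalized, Lean verifies the proof mechanically, with no human referee in the loop.

```lean
-- Toy illustration only: a formalized statement that Lean 4 checks
-- mechanically. `Nat.add_comm` is a core library lemma.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Auto-formalization pipelines aim to produce statements and proofs in this form directly from prose, so that acceptance means “the checker elaborated it” rather than “a referee believed it”.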
Bottom line for now
Until provenance tools mature, the community consensus is clear: treat every AI-generated proof as a conjecture with an invisible asterisk. Fast, helpful – and still under human audit.
GPT-5-pro reportedly produced a verified, unpublished improvement to a convex-optimization theorem earlier this year, pushing the safe step-size bound from 1/L to 1.5/L. Almost overnight, a debate erupted: did the model invent new mathematics, or did it simply retrieve an obscure but pre-existing idea? Below are the five questions mathematicians and AI researchers are asking loudest right now, along with the clearest answers we can give – without stepping beyond what has actually been documented.
What exactly did GPT-5-pro generate, and was it truly new?
Sebastien Bubeck prompted the model with an open problem from a July 2025 arXiv preprint on convex optimization. The model returned a tighter bound that Bubeck himself checked and confirmed correct. In his words, this was “math that didn’t exist before” – verified as absent from the literature and not previously posted online. Whether it constitutes deep novelty is still being discussed.
How does this compare to human-generated progress?
Within hours of Bubeck’s tweet, mathematicians noted that a human had posted an even stronger bound on the same problem. The timing suggests GPT-5-pro’s result may have been retrieved or recombined rather than independently invented. This single observation fuels most of the “retrieval vs. invention” suspicion.
What do current benchmarks say about GPT-5-pro’s creative capacity?
- 100 % accuracy on the Harvard-MIT Mathematics Tournament when paired with code tools.
- 94.6 % on AIME 2025 without tools.
- Yet the DeepMath initiative found that even the best LLMs score only ~70 % on undergraduate-level problems that require genuine creative leaps. The gap highlights the boundary between sophisticated recombination and true creativity.
Are tools emerging to verify AI-produced proofs?
Yes. In 2025-2026 we are seeing:
- Autoformalization workflows that translate human proofs into machine-checkable Lean 4 code within minutes.
- DeepSeek-Prover-V2, an open-source model built specifically for Lean 4, tackling competition-level problems.
- Some journals and conferences now requiring formal verification or step-by-step audits for AI-generated claims.
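The verification step in such a pipeline can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any project’s actual implementation: it assumes a Lean 4 toolchain (`lean`) is on `PATH`, and real pipelines would instead run inside a Lake project so Mathlib imports resolve. The function name `check_lean_proof` is hypothetical.

```python
import pathlib
import shutil
import subprocess
import tempfile

def check_lean_proof(lean_source: str) -> bool:
    """Ask a locally installed Lean 4 toolchain to elaborate the given
    candidate source. Returns True iff Lean accepts it; returns False if
    Lean rejects it or no `lean` executable is on PATH. Sketch only."""
    if shutil.which("lean") is None:
        return False                     # no toolchain available
    with tempfile.TemporaryDirectory() as d:
        path = pathlib.Path(d) / "Candidate.lean"
        path.write_text(lean_source)
        result = subprocess.run(["lean", str(path)], capture_output=True)
        return result.returncode == 0
```

The design point is that the LLM’s output is never trusted directly: a candidate proof only counts once an independent checker, here Lean’s elaborator, returns success.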
What near-term value can researchers safely extract?
Experts agree the lowest-risk, highest-value role for AI today is surfacing prior art. GPT-5-pro can rapidly flag obscure but relevant theorems, allowing humans to verify, extend, or cite them. One immediate metric: teams using the model this way report up to 40 % faster literature reviews with no increase in citation errors, according to unpublished feedback gathered by OpenAI and shared at recent workshops.
Until provenance and verification pipelines mature, most mathematicians advise treating AI outputs as “hypothesis generators” rather than accepted truths.