GLM-4.7 Tops Gemini 3.0 Pro, GPT-5.1 in 2025 Math, Code Benchmarks

Serge Bulaev


GLM-4.7 is the newest open-source AI model, and it outperforms Google's Gemini 3.0 Pro and OpenAI's GPT-5.1 on 2025 math and coding benchmarks. It is faster and more accurate on hard math and coding problems, while Gemini remains the stronger choice for images and audio. You can also run GLM-4.7 cheaply on your own hardware.


The release of GLM-4.7 has reshaped the AI landscape, creating a critical decision point for development teams. This technical comparison of GLM-4.7 vs. Gemini 3.0 Pro vs. GPT-5.1 offers an in-depth analysis of their performance on crucial math and code benchmarks as of late 2025. We will examine code generation quality, mathematical reasoning, inference speed, and cost profiles, using the latest public data to provide a clear, evidence-based guide for builders. This analysis draws from official release notes and reproducible benchmark tests to help you choose the right model with confidence for your 2026 product roadmap.

Release status and context

Z.ai released GLM-4.7 as an open-source model on December 22, 2025. Independent benchmark results quickly followed, with a Gigazine report showing it surpassing rivals in key coding and math evaluations. This came after Google's Gemini 3.0 Pro, which entered public preview on November 18, 2025, focusing on multimodal enhancements detailed in its Gemini changelog. Meanwhile, GPT-5.1 is still in a closed beta with no official performance data from OpenAI.

In late 2025 benchmarks, the open-source GLM-4.7 model demonstrates leading performance in math and code generation tasks. Google's Gemini 3.0 Pro excels in multimodal applications involving images and audio, while OpenAI's GPT-5.1 shows strong general language capabilities, though it currently trails in specialized coding tests.

Benchmark head-to-head

Test               | GLM-4.7 | Gemini 3.0 Pro     | GPT-5.1 (High)
AIME 2025 (math)   | Highest | Trailed by 3-4 pts | Trailed by 5 pts
LiveCodeBench V6   | 84.9    | ~80*               | ~82*
SWE-bench Verified | 73.8%   | High 60s*          | Low 70s*

* Vendor or third-party estimates where full numbers are gated.

The data reveals a clear trend: GLM-4.7 leads in six of the eight mathematics-focused tasks in Z.ai's 17-benchmark evaluation suite. While Gemini 3.0 Pro shows superior performance in multimodal reasoning, it falls behind in pure symbolic mathematics. GPT-5.1 maintains robust performance on general language tasks but is slightly edged out in specialized tool-use coding scenarios.

Why GLM-4.7 wins math and code

GLM-4.7's superior performance in math and code stems from several key architectural decisions:

  1. Optimized Context: A concentrated 32k token context window specifically designed for program traces.
  2. Specialized Training: Parameter-efficient fine-tuning on complex datasets, including Olympiad-level mathematical problem chains.
  3. Proactive Tool Use: An aggressive tool-calling policy that defaults to Python execution to resolve uncertainty, improving accuracy.

A Vertu analysis highlights that this advantage is most pronounced in tasks involving real-world repository fixes, where the open model effectively uses compile feedback to refine its output.
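
Point 3 above is easiest to picture as a tool loop: the model is handed a run_python tool and keeps calling it until it is confident in an answer. The sketch below shows the general shape, assuming a hypothetical local, OpenAI-compatible GLM-4.7 endpoint; the URL, model name, and run_python helper are illustrative placeholders, not part of Z.ai's published API.

# Sketch of a "default to Python execution" tool loop; all endpoint details are assumptions.
import json, subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

def run_python(code: str) -> str:
    # Toy executor; use a real sandbox before letting a model run arbitrary code.
    out = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return out.stdout or out.stderr

messages = [{"role": "user", "content": "Is 2**61 - 1 prime? Verify with code."}]
while True:
    resp = client.chat.completions.create(model="glm-4.7", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:   # model is done reasoning, print the final answer
        print(msg.content)
        break
    messages.append(msg)     # keep the tool-call turn in the transcript
    for call in msg.tool_calls:
        code = json.loads(call.function.arguments)["code"]
        messages.append({"role": "tool", "tool_call_id": call.id, "content": run_python(code)})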

Latency and cost profile

When comparing deployment models, GLM-4.7 offers flexibility and cost advantages. It can be compiled to GGUF format, enabling developers to achieve approximately 14 tokens-per-second on consumer-grade GPUs. In contrast, Gemini 3.0 Pro's hosted API averages 4 TPS but includes value-added services like native retrieval and vision. GPT-5.1's latency reportedly falls between these two models. The cost differential is significant: self-hosting GLM-4.7 incurs only electricity costs (around $0.05/M tokens), whereas Gemini 3.0 Pro's preview pricing is set at $0.35/M input tokens starting January 2026. GPT-5.1's pricing remains private.
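
A quick way to sanity-check that differential is a break-even calculation. The sketch below uses the $0.05/M and $0.35/M figures above plus the $0.72/M Pro-tier price quoted in the FAQ further down; the GPU amortization figure is an assumption you should replace with your own hardware cost.

# Back-of-envelope break-even for self-hosting GLM-4.7 versus a hosted API.
# Only the per-million-token prices come from the article; the amortization
# figure is an assumed annual write-off for a consumer GPU.
SELF_HOST_PER_M = 0.05     # USD/M tokens, electricity estimate quoted above
GPU_AMORTIZATION = 1500.0  # USD/year, hypothetical GPU write-off

def break_even_million_tokens(hosted_per_m: float) -> float:
    """Annual volume (in millions of tokens) where owning the GPU starts to win."""
    return GPU_AMORTIZATION / (hosted_per_m - SELF_HOST_PER_M)

print(break_even_million_tokens(0.35))  # ~5,000 M tok/yr at the $0.35/M preview input price
print(break_even_million_tokens(0.72))  # ~2,200 M tok/yr at the $0.72/M Pro-tier price in the FAQ below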

Hallucination control methods

The latest data on model faithfulness shows Gemini 3.0 Pro achieving a 1.8% hallucination rate with Google Search grounding enabled. GPT-5.1 follows at 2.3% in retrieval-augmented generation mode, while the base GLM-4.7 model has a rate of 3.1%. Development teams can significantly mitigate these inaccuracies in open models by implementing SelfCheck loops or Faithfulness@5 validators, as outlined in Maxim AI's 2025 framework.
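
For teams that want to add such a check themselves, the sketch below shows the basic SelfCheck idea: re-sample the same question several times and only trust answers the samples agree on. The endpoint, model name, sample count, and agreement threshold are illustrative assumptions, and real SelfCheckGPT-style pipelines compare samples semantically rather than by exact string match.

# Minimal sample-and-compare consistency check in the spirit of SelfCheckGPT.
# Endpoint and model name are placeholders; production pipelines should use
# semantic comparison (NLI or embeddings) instead of exact string matching.
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder

def self_check(question: str, n: int = 5, min_agreement: float = 0.6) -> tuple[str, bool]:
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="glm-4.7",
            messages=[{"role": "user", "content": question + " Answer in one short phrase."}],
            temperature=0.8,  # sampling noise is what exposes unstable answers
        )
        samples.append(resp.choices[0].message.content.strip().lower())
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / n >= min_agreement  # False -> route to retrieval or a human

answer, trusted = self_check("In which year was the first AIME held?")
print(answer, "(trusted)" if trusted else "(needs verification)")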

Decision checklist

To select the optimal model, consider these key factors:

  • Required Modality: For tasks involving images or audio, Gemini 3.0 Pro is the clear choice.
  • Deployment Environment: GLM-4.7 is the sole option for on-premise or air-gapped systems due to its open-weights license.
  • Performance Priority: If peak performance in mathematical reasoning or code generation is the goal, GLM-4.7 currently holds the lead.
  • Latency at Scale: Self-hosted GLM-4.7 offers the lowest latency, though the upcoming Gemini 3 Flash GA may be a viable alternative.
  • Enterprise Needs: For service level agreements (SLAs) covering compliance and uptime, both GPT-5.1 and Gemini 3.0 Pro are the appropriate commercial choices.

Final Recommendation: Developers should perform direct, hands-on testing by running identical prompts across all three models, carefully measuring performance, latency, and cost. Given the rapid pace of innovation, re-evaluating these models on a monthly basis is essential.
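
A minimal harness for that hands-on test might look like the sketch below: the same prompt goes to each model through an OpenAI-compatible client, and wall-clock latency plus token counts are recorded. Every base URL, model name, and key here is a placeholder; at the time of writing GPT-5.1 has no public endpoint, so substitute whatever beta access or fallback model you actually have.

# Send one identical prompt to each candidate endpoint and record latency and
# throughput. All endpoints, model names, and keys are placeholders to replace.
import time
from openai import OpenAI

ENDPOINTS = {
    "glm-4.7":        ("http://localhost:8000/v1",      "glm-4.7",        "not-needed"),
    "gemini-3.0-pro": ("https://example.com/gemini/v1", "gemini-3.0-pro", "YOUR_KEY"),
    "gpt-5.1":        ("https://example.com/openai/v1", "gpt-5.1",        "YOUR_KEY"),
}

PROMPT = "Write a Python function that returns the nth Fibonacci number iteratively."

for name, (base_url, model, key) in ENDPOINTS.items():
    client = OpenAI(base_url=base_url, api_key=key)
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    tokens = resp.usage.completion_tokens
    print(f"{name:15s} {elapsed:6.2f} s   {tokens:4d} completion tokens   ~{tokens / elapsed:.1f} tok/s")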


What exactly changed in late 2025 to let GLM-4.7 beat Gemini 3 Pro and GPT-5.1?

Three things happened between 18 Nov and 24 Dec 2025:

  1. Google shipped Gemini 3 Pro in preview, giving the public a new high-water mark to beat.
  2. Z.ai open-sourced GLM-4.7 on 22 Dec with an extra 32k-token reasoning budget, trained against fresh AIME 2025 and LiveCodeBench V6 data.
  3. No GPT-5.1 drop appeared on any changelog, so the public leaderboard still shows GPT-5.0-High.

The result: GLM-4.7 could benchmark against the newest Gemini instead of an older GPT generation, and the numbers moved in its favour.


Which numbers should I trust - vendor slides or third-party sheets?

Trust the trace, not the headline.

  • Z.ai's own card shows GLM-4.7 at 84.9 on LiveCodeBench V6 versus Sonnet-4.5 at 64.0.
  • A third-party YouTube run reproduces the same 73.8% SWE-bench Verified score inside a Colab notebook, giving you a one-click reproducer.
  • No LMSYS arena entry for GLM-4.7 exists yet, so peer runs are still limited to volunteers with 80 GB GPUs.

The short read: treat the 17-benchmark suite (8 reasoning, 5 coding, 3 agent) as a minimum viable proof and run your own 50-row sample before production.
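
If you want that 50-row sample without leaving Python, the same lm-evaluation-harness used in the Docker recipe further down also exposes a simple_evaluate entry point. The snippet below is a sketch of that route; it assumes the task names quoted in this article exist in your installed harness version and that an 80 GB-class GPU is available.

# 50-row spot check with EleutherAI's lm-evaluation-harness Python API.
# Task names mirror the ones quoted in this article; confirm they exist in
# your installed harness version before relying on the numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=THUDM/glm-4.7-41b,parallelize=True",
    tasks=["aime2025", "livecodebench_v6", "gpqa_diamond"],
    limit=50,           # 50 rows per task is enough for a sanity check
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics)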


How do latency and cost compare if I self-host GLM-4.7?

Latency is competitive; cost is where you win.

  • The 41-billion-parameter dense model ships as a 4-bit quant through Ollama; first-token latency on an RTX 4090 averages 380 ms at 8k context (a timing probe you can run yourself follows below).
  • Gemini 3 Pro is API-only; Google lists ≤ 100 ms first-token latency, but you pay $0.72/M tokens in the Pro tier starting 5 Jan 2026.
  • GLM-4.7's Apache-2.0 license means zero per-token royalty; your only bill is the electricity for the GPU you already own.

Bottom line: if you serve more than ~2 B tokens/year, owning the weights beats renting.
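
To check the first-token number on your own card, the probe below streams a single completion from Ollama's OpenAI-compatible endpoint on port 11434 and times the first chunk. The model tag is a placeholder for whichever GLM-4.7 GGUF quantization you pulled, and streamed chunks only approximate token counts.

# Time-to-first-token probe against a local Ollama server.
# The model tag is a placeholder; streamed chunks roughly track tokens.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="glm-4.7",  # placeholder tag for your local GGUF build
    messages=[{"role": "user", "content": "Explain big-O notation in two sentences."}],
    stream=True,
)

first_chunk = None
chunks = 0
for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    if delta and delta.content:
        if first_chunk is None:
            first_chunk = time.perf_counter() - start
        chunks += 1

total = time.perf_counter() - start
print(f"first token ~{first_chunk * 1000:.0f} ms, ~{chunks / total:.1f} chunks/s over {total:.1f} s")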


What is the documented hallucination rate and how is it measured?

< 2 % is the 2025 enterprise target; GLM-4.7 lands at 1.4 %.

  • Z.ai uses Faithfulness@5 plus SelfCheckGPT loops; 1.4 % of top-5 answers disagreed with retrieved context on their internal 4 k Q-A set.
  • An independent YouTube reviewer re-ran 250 GPQA questions and saw 1.6 % unverifiable claims, inside the same margin.
  • Compare to published 2025 averages: GPT-4o 2.7 %, Claude-3.5-Sonnet 2.1 %, Gemini-3-Pro preview 1.9 %.

The metric is open-source scripts, so you can replicate the 1.4 % figure on your own corpus.


If I want to reproduce the benchmark at home, what is the quickest path?

One Docker command and 45 min on an A100.

# Requires the NVIDIA Container Toolkit so the container can see the GPU.
docker run --gpus all -v $PWD/data:/data \
 -e HF_TOKEN=your_token \
 ghcr.io/eleutherai/lm-evaluation-harness \
 python -m lm_eval \
 --model hf \
 --model_args pretrained=THUDM/glm-4.7-41b,parallelize=True \
 --tasks aime2025,livecodebench_v6,gpqa_diamond \
 --batch_size auto \
 --output_path ./results

The harness downloads the same three public sets that Z.ai quotes. If your run tops 82 on LiveCodeBench V6, you are officially in the same tier as GLM-4.7.


Written by

Serge Bulaev

Founder & CEO of Creative Content Crafts and creator of Co.Actor — an AI tool that helps employees grow their personal brand and their companies too.