Xiaomi MiMo-V2-Flash tops SWE-bench, cuts code generation costs

Serge Bulaev
MiMo-V2-Flash is Xiaomi's new coding model: it turns plain-text prompts into web pages and working code quickly and at very low cost. Its Mixture-of-Experts design activates only a small slice of the full model for each request, keeping it both capable and cheap to run. The model now leads the open-source coding benchmarks, handles even very large projects, and plugs into popular editors, costing developers just a few dollars a month, far less than competing tools. With open weights and a public endpoint, Xiaomi has made it broadly accessible and ready to make real-world coding cheaper and faster.

Xiaomi's MiMo-V2-Flash is a landmark Mixture-of-Experts (MoE) model setting new standards for low-cost, high-speed code generation. Topping the open-source SWE-bench leaderboard, the 309B-parameter model demonstrates premium performance by activating just 15B parameters per task. It delivers exceptional speed and a 73.4% score on the SWE-bench Verified benchmark, challenging commercial giants with its innovative architecture and disruptive pricing.
Inside the architecture
MiMo-V2-Flash is an advanced Mixture-of-Experts AI model designed for efficient code generation. It uses a sparse routing technique, activating only a fraction of its 309 billion parameters for each request. This allows it to generate complex code and web pages from text prompts with remarkable speed and cost-effectiveness.
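To make the sparse-routing idea concrete, here is a minimal top-k gating sketch in Python; the expert count, k value, and hidden size are illustrative assumptions rather than MiMo-V2-Flash's actual configuration.

```python
import numpy as np

# Illustrative MoE router: only the top-k experts run for each token, so most
# of the layer's parameters stay idle on any given request. The expert count,
# k, and hidden size are assumptions, not MiMo-V2-Flash's real configuration.
NUM_EXPERTS = 64
TOP_K = 4
D_MODEL = 1024

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(NUM_EXPERTS)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts and mix their outputs."""
    logits = token @ router_w                 # router scores, shape (NUM_EXPERTS,)
    top = np.argsort(logits)[-TOP_K:]         # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the selected experts only
    out = np.zeros_like(token)
    for w, idx in zip(weights, top):
        out += w * (token @ experts[idx])     # only 4 of the 64 expert matmuls run
    return out

token = rng.standard_normal(D_MODEL)
print(moe_layer(token).shape)                 # (1024,), computed with 4/64 experts active
```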
The model's performance stems from its sparse routing architecture, where each token interacts with only a small subset of the total weights. Key innovations include Multi-Token Prediction, which forecasts several tokens simultaneously, and a sliding-window attention mechanism that cuts the key-value cache size sixfold. These optimizations achieve generation speeds of approximately 150 tokens per second, as detailed in an in-depth overview of the 309-billion-parameter model.
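The sixfold key-value cache saving follows from capping how far back each layer attends. The back-of-envelope sizing below assumes a window of roughly 256K / 6 tokens, plus illustrative layer and head counts and an FP8 cache; only the 256K context and the 6x claim come from the article.

```python
# Back-of-envelope KV-cache sizing: full attention vs. a sliding window.
# Layer count, KV-head count, head dim, and FP8 cache precision are assumptions
# chosen for illustration; only the 256K context and the ~6x saving are from the article.
CONTEXT = 256_000
WINDOW = CONTEXT // 6          # ~43K tokens retained per layer under the claimed 6x saving
N_LAYERS = 60
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 1            # FP8

def kv_cache_bytes(tokens_cached: int) -> int:
    # 2x accounts for storing both keys and values
    return 2 * tokens_cached * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

full = kv_cache_bytes(CONTEXT)
windowed = kv_cache_bytes(WINDOW)
print(f"full attention : {full / 1e9:.1f} GB")
print(f"sliding window : {windowed / 1e9:.1f} GB ({full / windowed:.0f}x smaller)")
```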
Key technical levers:
- 256K context window suitable for large codebases and long HTML documents (see the sketch after this list)
- 27-trillion-token training corpus stored in FP8 to lower memory load
- Toggleable reasoning mode for step-by-step planning in multi-file tasks
- Post-training on 100k GitHub issues with reinforcement learning for agent workflows
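For a rough sense of what a 256K-token window holds, the sketch below converts tokens into lines of source code using an assumed tokens-per-line average; the ratio is a common rule of thumb, not a measured figure for this model's tokenizer.

```python
# Rough capacity of a 256K-token context window in terms of source code.
# Tokens-per-line and file length are rule-of-thumb assumptions, not measured
# values for MiMo-V2-Flash's tokenizer.
CONTEXT_TOKENS = 256_000
TOKENS_PER_LINE = 9            # assumed average for typical source files
AVG_FILE_LINES = 200           # assumed average file length

lines = CONTEXT_TOKENS // TOKENS_PER_LINE
files = lines // AVG_FILE_LINES
print(f"~{lines:,} lines of code, or roughly {files} average-sized files per prompt")
```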
Benchmark score and real-world pace
MiMo-V2-Flash achieves a leading score of 73.4% on SWE-bench Verified, outperforming all other open-source models, as confirmed in the release roundup at zenthegeek.tech. It also scores an impressive 71.7% on the multilingual SWE-bench split and 94.1% on AIME 2025 math reasoning. In real-world tests, it generates full HTML and CSS for a responsive landing page in under three seconds. However, users should note potential inconsistencies in tool-calling, which may require guardrails in production.
Pricing meets adoption
Xiaomi's pricing model is highly disruptive, set at just $0.10 per million input tokens and $0.30 per million output tokens. This rate is substantially lower than flagship APIs, as detailed in 2025 pricing comparisons on getdx.com, and is backed by a free trial for benchmarking. This token-based approach dramatically reduces costs, allowing a small team to operate for just a few dollars per developer monthly versus the $10-$40 per-user fees of competing services. As a result, developers are integrating MiMo-V2-Flash into editors like Cursor and Claude Code. The model weights are available on Hugging Face under an MIT license for self-hosting, while a public endpoint on Skywork.ai offers a playground for text-to-HTML workflows.
By combining elite performance with unprecedented affordability, MiMo-V2-Flash is positioned to reshape the AI-assisted coding landscape. Its true impact will be measured as it moves from benchmarks to sustained, real-world development cycles, but its potential to democratize high-end code generation is clear.
What exactly is MiMo-V2-Flash and how does it generate HTML or code from plain text?
MiMo-V2-Flash is a 309-billion-parameter Mixture-of-Experts model that keeps only 15 billion parameters active per request, released by Xiaomi on December 17, 2025. It turns natural-language prompts into working HTML, CSS, or full codebases by predicting multiple tokens at once (Multi-Token Prediction) and using sliding-window attention to keep a 256K-token context window in memory. Early demos show a landing page or React component rendered in under two seconds from a one-sentence prompt.
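A minimal sketch of such a text-to-HTML call, assuming an OpenAI-compatible chat endpoint; the base URL, model identifier, and API-key variable below are hypothetical placeholders and should be replaced with the values from the provider's documentation.

```python
# Minimal text-to-HTML request sketch. The endpoint URL, model name, and
# API-key variable are hypothetical placeholders -- consult the provider's
# documentation for the real values before using this.
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"    # placeholder endpoint
API_KEY = os.environ.get("MIMO_API_KEY", "")                # placeholder key name

payload = {
    "model": "mimo-v2-flash",                               # placeholder model id
    "messages": [
        {"role": "system", "content": "Return a single self-contained HTML file."},
        {"role": "user", "content": "A responsive landing page for a coffee shop."},
    ],
    "max_tokens": 2048,
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"}, timeout=60)
resp.raise_for_status()
html = resp.json()["choices"][0]["message"]["content"]
with open("landing_page.html", "w", encoding="utf-8") as f:
    f.write(html)
```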
How does its 73.4% SWE-bench score compare to other models?
The 73.4% figure on SWE-bench Verified places MiMo-V2-Flash #1 among open-source models and within about four points of Claude 4.5 Sonnet (77.2% reported by the community). On the multilingual split it hits 71.7%, edging out Claude's 68.0%. While GPT-5-High remains the overall leader, Xiaomi's model is the first open-weight system to break the 70% barrier on both splits, narrowing the gap between freely available and proprietary code generators to under 5%.
What does the pricing model look like and how much can a startup save?
Pricing is purely token-based: $0.10 per million input tokens and $0.30 per million output tokens. A typical 8K-in / 1K-out coding task costs ≈ $0.0011, roughly 20× cheaper than GPT-5 API calls ($1.25 / $12) and about 35× cheaper than Grok-3-beta ($3 / $15). A five-person startup running 10M input and 10M output tokens a month would spend ≈ $4 total, versus $100-$200 on flat-rate seats for Copilot Pro or Cursor.
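These figures follow directly from the published rates, as the short calculator below shows; the monthly workload uses the same 10M-input / 10M-output split quoted above.

```python
# Reproducing the per-task and monthly cost figures from the quoted token rates.
RATES = {                              # (input $/1M tokens, output $/1M tokens)
    "MiMo-V2-Flash": (0.10, 0.30),
    "GPT-5":         (1.25, 12.00),
    "Grok-3-beta":   (3.00, 15.00),
}

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = RATES[model]
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

# A typical coding task: 8K tokens in, 1K tokens out.
for model in RATES:
    print(f"{model:>13}: ${cost(model, 8_000, 1_000):.4f} per task")
# MiMo-V2-Flash comes out to $0.0011, roughly 20x and 35x cheaper than the others.

# Monthly team budget at 10M input plus 10M output tokens.
print(f"team / month : ${cost('MiMo-V2-Flash', 10_000_000, 10_000_000):.2f}")
```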
Are there any hidden limitations when using MiMo-V2-Flash in production?
Community tests note uneven instruction following and fragile tool-calling reliability, so workflows that chain shell commands or Git operations may need human review. The model also lacks llama.cpp support today, so edge-device deployment is limited to Xiaomi-approved SDKs. Finally, its agentic RL training excels at algorithmic bugs but can hallucinate library APIs, meaning dependency checks are still mandatory before shipping.
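A simple production guardrail is to gate any model-proposed shell or Git command behind an allow-list and a human confirmation step. The sketch below shows the general pattern; it is not part of any Xiaomi SDK, and the allow-list contents are illustrative.

```python
# Generic guardrail for fragile tool-calling: only allow-listed commands run,
# and anything beyond read-only git operations requires human confirmation.
# This is a general pattern, not part of any MiMo-V2-Flash SDK.
import shlex
import subprocess

READ_ONLY = {"ls", "cat", "git"}          # commands allowed without confirmation
GIT_SAFE = {"status", "diff", "log"}      # read-only git subcommands

def run_model_command(command: str) -> str:
    parts = shlex.split(command)
    if not parts or parts[0] not in READ_ONLY:
        return f"blocked: '{command}' is not on the allow-list"
    if parts[0] == "git" and (len(parts) < 2 or parts[1] not in GIT_SAFE):
        answer = input(f"Model wants to run '{command}'. Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return "skipped by reviewer"
    result = subprocess.run(parts, capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

print(run_model_command("git status"))
print(run_model_command("rm -rf build"))   # blocked: not on the allow-list
```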
Where can developers try the model immediately and what hardware is required?
Weights are open-sourced under MIT on Hugging Face, and a free chat demo is live at skywork.ai. For self-hosting, a single A100 80 GB handles the 15 billion active parameters at 150 tokens/second throughput; batching four prompts together keeps GPU memory under 65 GB thanks to the 6× KV-cache reduction from sliding-window attention.