Claude Opus 4.5 reaches new 4-hour task horizon on METR benchmark

Serge Bulaev
Claude Opus 4.5 just set a new record on the METR benchmark, showing it can work on tough tasks for almost five hours straight - longer than any model before. This is a big deal because it proves the AI can finish really long, complicated jobs, not just short tasks. In tests, Claude beat every other top model on this long-horizon measure.

The new Claude Opus 4.5 model from Anthropic has achieved a groundbreaking milestone, becoming the first model to cross the 4-hour task horizon on the METR benchmark. This record demonstrates its ability to sustain complex reasoning for extended periods, marking a pivotal shift from experimental AI to practical enterprise applications.
This new record is significant due to how the METR benchmark measures performance. Instead of simple success rates, METR calculates a "task horizon": the length of a task, measured by how long a human expert would need to complete it, that a model can finish at a given success rate. According to the 2025 METR report, Claude Opus 4.5 achieved a 50% success horizon of 4 hours and 49 minutes. This is more than double its predecessor's capability and surpasses all other models in the study. An in-depth LessWrong post corroborates these findings, noting the model's unique success on longer jobs.
Why the METR Horizon Matters
The METR benchmark evaluates an AI's ability to complete complex tasks that would take a skilled human a specific amount of time. Claude Opus 4.5's new record indicates it can successfully handle assignments requiring nearly five hours of continuous, unsupervised work, demonstrating unprecedented long-term reasoning and reliability.
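To make the "50% success horizon" concrete, it can be read off a curve of success rate versus human task length. The sketch below uses entirely hypothetical per-task data; METR fits a logistic curve, but a simple interpolation in log-time conveys the same idea.

```python
import math

# Hypothetical per-task results: (human_expert_minutes, model_success_rate).
# These numbers are illustrative, not METR's actual data.
results = [
    (4, 0.98), (15, 0.95), (60, 0.80),
    (240, 0.55), (480, 0.30), (960, 0.10),
]

def fifty_percent_horizon(points):
    """Return the human-time task length at which the success rate
    crosses 50%, interpolating linearly in log2(minutes)."""
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p0 >= 0.5 >= p1:
            x0, x1 = math.log2(t0), math.log2(t1)
            frac = (p0 - 0.5) / (p0 - p1)
            return 2 ** (x0 + frac * (x1 - x0))
    raise ValueError("success rate never crosses 50%")

horizon = fifty_percent_horizon(results)
print(f"50% success horizon ~ {horizon / 60:.1f} hours")  # ~4.6 hours
```

The key point is that the metric is denominated in human time: a longer horizon means the model stays coherent across tasks that would occupy a skilled person for hours.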
METR's creators contend that standard benchmarks, focused on percentage correctness, don't capture an AI's practical value. The METR time-horizon curve offers a better metric by plotting the model's probability of successfully completing a task against the time a human professional would need for it. Historical data reveals an exponential trend, with the task horizon doubling roughly every seven months. If this trajectory continues, AI agents could manage week-long projects before the end of the decade, signaling immense productivity gains alongside concerns about job automation.
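That projection is simple compounding. The back-of-the-envelope sketch below uses the figures cited in this article (a 4h49m horizon, doubling every seven months); it is an illustration of the trend, not METR's official forecast.

```python
# Starting point and doubling period, taken from the figures cited above.
current_horizon_min = 4 * 60 + 49   # 4h49m = 289 minutes
doubling_months = 7
target_min = 7 * 24 * 60            # a "week-long" task: 10,080 minutes

months = 0
horizon = current_horizon_min
while horizon < target_min:
    horizon *= 2                    # one doubling per period
    months += doubling_months

print(f"~{months} months (~{months / 12:.1f} years) to week-long tasks")
# -> ~42 months (~3.5 years), i.e. before the end of the decade if the
#    trend from 2025 holds.
```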
Competitive Context
In the competitive landscape, Claude Opus 4.5 shows a distinct advantage in specific areas. A comparative analysis from a Vellum AI benchmarking post places it against top rivals like Gemini 3 Pro and GPT-5.1. Claude excels in coding, achieving 80.9% on SWE-bench for bug resolution and leading Terminal-Bench at 59.3%. While competitors may perform better on certain benchmarks, neither can match Claude's standout 4-hour, 49-minute 50% success horizon, the key indicator of reliability for long, complex workflows.
Early Enterprise Uptake
This long-horizon reliability is already driving adoption in key enterprise sectors. Businesses are leveraging Claude Opus 4.5 for:
- Autonomous Software Engineering: Performing full-stack refactors and complex debugging sessions that last over 30 minutes, saving significant development time.
- Advanced Financial Modeling: Automating the consolidation of regulatory filings, market data, and internal reports into comprehensive, day-long forecast pipelines.
- Cybersecurity Operations: Creating integrated playbooks for incident response that chain log analysis, threat intelligence, and report generation.
To facilitate this adoption, major cloud platforms have made Claude Opus 4.5 widely available through Amazon Bedrock, Azure AI Foundry, and Vertex AI, as well as Anthropic's own API. This gives development teams a unified SDK for managing complex workflows, with pricing tiers designed to reserve its most powerful reasoning for the most critical tasks.
What Happens Next
Looking ahead, the AI community is closely watching two key developments: whether the exponential growth in task horizon continues and if open-weight models can catch up to this new standard of performance. For now, Claude Opus 4.5 provides definitive proof that AI agents can reliably perform multi-hour tasks in real-world production environments. This shifts the focus of AI development from short-term accuracy to long-term endurance and coherence.