Anthropic updates Claude Opus 4.8 with improved honesty, effort controls
Serge Bulaev
Anthropic released Opus 4.8 for Claude in May 2026, focusing on better honesty and more control over how much effort the AI puts into tasks. Early feedback suggests the model now flags uncertainty more often, which may help catch errors and avoid overconfident mistakes. Testers report improvements in catching bugs and handling vague requests, and the model seems less wordy. Benchmarks show Opus 4.8 is competitive, especially in some coding tasks, and it may be cheaper to use. However, adoption still appears limited to early testing, so it is not clear yet how widely it will be used.

Anthropic's update to Claude Opus 4.8 emphasizes improved honesty and new developer effort controls, positioning the model as a more reliable and cautious collaborator. While the update is not a massive capability leap, early evaluations and developer feedback suggest it marks a shift from chasing raw benchmarks to prioritizing calibrated risk management and reliability.
What's New in Claude Opus 4.8?
The Claude Opus 4.8 update is designed to make the model more honest and significantly less likely to approve flawed code. It adds two developer controls, an "Effort" selector and a "Thinking" toggle, to balance latency against reasoning depth, and early feedback describes a less verbose and more cost-effective model.
Anthropic states the new release is substantially less likely than its predecessor to let flawed code slip through review. Industry reports suggest this is linked to the model's new habit of flagging uncertainty instead of guessing, a behavior that engineers have described as "proactive issue spotting" during testing.
New Controls, Benchmarks, and Pricing
The most significant interface change is a new "Effort" selector with five levels (Low, Medium, High, Extra, Max), allowing developers to tune reasoning depth against latency. This is complemented by an optional "Thinking" toggle that surfaces the model's chain-of-thought process. According to the Anthropic launch page, these features aim to make the speed-versus-deliberation trade-off explicit.
Early performance numbers show Opus 4.8 is highly competitive:
- Code Performance: It achieves a 69.2% pass rate on the difficult SWE-bench Pro benchmark, surpassing GPT-5.5's 58.6%.
- Agent Tasks: According to industry reports, it performs strongly on browser-agent benchmarks, with scores competitive with other leading models.
- Pricing: A Lushbinary comparison shows Opus 4.8 undercutting GPT-5.5's input token cost by roughly half, at approximately $5 per million input tokens and $25 per million output tokens.
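To make the pricing figures concrete, the arithmetic can be sketched as below. This is a minimal illustration that assumes the approximate prices quoted above ($5 per million input tokens, $25 per million output tokens); the function name and the sample token counts are hypothetical, and actual billing may differ.

```python
# Approximate prices quoted in the article (USD per million tokens).
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200k tokens in, 50k tokens out.
    print(f"${estimate_cost_usd(200_000, 50_000):.2f}")
```

For the sample request, the input side comes to $1.00 and the output side to $1.25, so output tokens dominate the bill even though there are four times fewer of them. That is why the reported reduction in verbosity matters for cost.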
Early Developer Feedback
Early reviews praise the model's practical improvements. Simon Willison noted a "modest but tangible improvement," highlighting the model's tendency to ask for clarification rather than generate potentially incorrect output. Common observations from early access forums include:
- Flags uncertainty instead of "hallucinating" confident but wrong answers.
- More reliably catches cross-file bugs in complex coding tasks.
- Produces less verbose, more concise output, leading to lower token usage and cost.
- Pushes back on vague requests, asking for more specific instructions.
Adoption is still in the "testing" and "early access" phase, with no verifiable public metrics on broad production rollout.
Competitive Landscape
Public comparisons position Opus 4.8 against OpenAI's GPT-5.5 and Google's Gemini 3.x series. While GPT-5.5 maintains a slight edge on specific benchmarks like Terminal-Bench 2.1, Opus 4.8's main advantages are in browser-agent and end-to-end workflow tasks. Anthropic claims the model completes every case in its internal Super-Agent benchmark, which the company presents as evidence of its strength in complex, multi-step processes.
Why Honesty Matters: The Shift to Accountable AI
The focus on honesty in Opus 4.8 reflects a broader industry trend. Industry analysis notes that model incentives are shifting away from confident guessing. By foregrounding hesitation and uncertainty, Opus 4.8 encourages UI patterns that display confidence scores and evidence, and many predict that procurement teams will prioritize audit trails over raw benchmark scores.
This evolution frames a move toward accountable AI. As organizations adopt calibration-first models, application development may shift from single-shot generation to integrated workflows with built-in validation steps. The market's response in the coming months will determine if this design philosophy leads to widespread adoption or remains a niche for risk-sensitive industries.
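One way to picture a workflow with a built-in validation step is the sketch below. It is purely illustrative: `ModelReply`, `ReviewResult`, `review_reply`, and the sample validator are hypothetical names, and the assumption that a reply carries a list of uncertainty notes is not a documented Anthropic response format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ModelReply:
    """Hypothetical reply shape: text plus any uncertainty the model flagged."""
    text: str
    uncertainty_notes: List[str] = field(default_factory=list)


@dataclass
class ReviewResult:
    accepted: bool
    reasons: List[str]


# A validator returns None when the text passes, or a reason string when it fails.
Validator = Callable[[str], Optional[str]]


def no_todo_markers(text: str) -> Optional[str]:
    """Sample validator: reject output that still contains TODO markers."""
    return "contains TODO marker" if "TODO" in text else None


def review_reply(reply: ModelReply, validators: List[Validator]) -> ReviewResult:
    """Accept a reply only if the model flagged nothing and every validator passes."""
    reasons = [f"model flagged: {note}" for note in reply.uncertainty_notes]
    for validate in validators:
        problem = validate(reply.text)
        if problem is not None:
            reasons.append(problem)
    return ReviewResult(accepted=not reasons, reasons=reasons)
```

The design choice here is that a flagged uncertainty blocks acceptance in the same way a failed check does, and the collected reasons can double as a simple audit trail.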
What exactly changed in Claude Opus 4.8 compared to Opus 4.7?
Anthropic upgraded Opus 4.8 to flag uncertainty significantly more often and catch substantially more silent code flaws than its predecessor. The model now ships with two new user-facing levers:
- Effort control (Low, Medium, High, Extra, Max)
- Thinking toggle that surfaces internal deliberation when enabled
Early testers also report a noticeably less chatty tone while output quality stays the same or improves.
How do the new Effort and Thinking controls affect latency?
Each Effort level trades speed for depth of reasoning.
- Low - fastest response, minimal reflection
- Medium/High - balanced default
- Extra/Max - roughly 3-5× slower, with deeper chain-of-thought reasoning that the Thinking toggle can reveal
Developers running browser-agent or long-horizon tasks can now choose whether to favor real-time interactivity or maximum thoroughness.
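The trade-off described above can be sketched as a small selection helper. This is not the Anthropic API: the `Effort` enum, the `pick_effort` function, and the latency multipliers are all hypothetical. Only the five level names and the rough 3-5× range for Extra/Max come from the article, and treating Low as the baseline for that range is an assumption.

```python
from enum import Enum


class Effort(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    EXTRA = "extra"
    MAX = "max"


# Illustrative latency multipliers relative to LOW, ordered from fastest to
# slowest. These are placeholder assumptions, not measured values; only the
# 3x and 5x ends loosely echo the article's "~3-5x slower" for Extra/Max.
ASSUMED_LATENCY_MULTIPLIER = {
    Effort.LOW: 1.0,
    Effort.MEDIUM: 1.5,
    Effort.HIGH: 2.0,
    Effort.EXTRA: 3.0,
    Effort.MAX: 5.0,
}


def pick_effort(low_latency_s: float, budget_s: float) -> Effort:
    """Return the deepest effort level whose estimated latency fits the budget.

    Falls back to LOW when even the fastest level exceeds the budget.
    """
    best = Effort.LOW
    for level, multiplier in ASSUMED_LATENCY_MULTIPLIER.items():
        if low_latency_s * multiplier <= budget_s:
            best = level
    return best
```

With a measured Low-effort latency of 2 seconds and a 7-second budget, for example, the helper settles on `Effort.EXTRA`. In practice the multipliers would need to be replaced with latencies measured on the actual workload.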
What real-world feedback has emerged from early testing?
According to industry reports and early access feedback:
- Many developers have found Opus 4.8 competitive for code review and strong on cross-file reasoning
- A significant number of testers report the model proactively flags questionable inputs more frequently than previous versions
- Independent tester Simon Willison labeled it "a modest but tangible improvement," especially for catching unsupported claims
No broad adoption numbers are yet public, but anecdotal uptake appears concentrated among coding teams and agentic-workflow builders.
How does Opus 4.8 benchmark against GPT-5.5 and Google's latest models?
| Benchmark | Claude Opus 4.8 | GPT-5.5 | Gemini 3.5 Flash |
|---|---|---|---|
| SWE-bench Pro | 69.2% | 58.6% | 54.2% |
| Online-Mind2Web | Strong performance | not given | not given |
| Terminal-Bench 2.1 | 74.6% | 78.2% | 76.2% |
| Artificial Analysis Index | 61.4 | 60.2 | 55.3 |
(Source: Anthropic release post and Lushbinary comparison)
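These numbers show Opus 4.8 leading on SWE-bench Pro and the Artificial Analysis Index but trailing both rivals on Terminal-Bench 2.1. The table gives no comparable Online-Mind2Web scores, so the browser-agent result cannot be ranked here. On price, the Lushbinary comparison suggests the model remains cost-competitive.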
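The numeric rows of the table above can be checked with a few lines of code. The sketch below copies the reported scores as-is and skips the Online-Mind2Web row, which has no numbers. The helper names are hypothetical, and it assumes a higher score is better on all three benchmarks.

```python
# Scores copied from the comparison table; the non-numeric row is omitted.
SCORES = {
    "SWE-bench Pro": {
        "Claude Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.5 Flash": 54.2,
    },
    "Terminal-Bench 2.1": {
        "Claude Opus 4.8": 74.6, "GPT-5.5": 78.2, "Gemini 3.5 Flash": 76.2,
    },
    "Artificial Analysis Index": {
        "Claude Opus 4.8": 61.4, "GPT-5.5": 60.2, "Gemini 3.5 Flash": 55.3,
    },
}


def leaders(scores: dict) -> dict:
    """Map each benchmark to the model with the highest reported score."""
    return {bench: max(row, key=row.get) for bench, row in scores.items()}


def lead_margin(row: dict) -> float:
    """Return the gap between the top score and the runner-up, in points."""
    top, second = sorted(row.values(), reverse=True)[:2]
    return round(top - second, 1)


if __name__ == "__main__":
    for bench, model in leaders(SCORES).items():
        print(f"{bench}: {model} (+{lead_margin(SCORES[bench])})")
```

On these figures the picture is mixed: Opus 4.8 tops two of the three numeric rows, and GPT-5.5 tops Terminal-Bench 2.1.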
Why does "honesty" and uncertainty signaling matter for future AI products?
Industry analysis suggests a clear shift:
- Reward functions are moving from "maximize confidence" to "maximize calibrated uncertainty disclosure"
- Enterprise buyers increasingly demand audit trails, evidence links, and refusal states baked into the UI
- Many forecasts predict that output validation, rather than raw capability, will be a significant procurement differentiator
In short, the product roadmap appears to be moving from persuasive AI to accountable AI, and Claude Opus 4.8's honesty layer is one of the first concrete implementations of this approach.