Claude Opus 4.8 for builders: reliability over IQ

Anthropic released Claude Opus 4.8 today, 28 May 2026, a few weeks after Opus 4.7. Standard-tier pricing didn’t change. Fast mode dropped to a third of what it used to cost. Anthropic’s pitch this time is long-horizon reliability: multi-step, multi-tool runs where one early mistake compounds for the next hour. They claim Opus 4.8 is around 4× less likely than 4.7 to let flaws in its own code pass unflagged. For the agents and migration runs you actually ship, that beats another half-point on SWE-bench.

Table of Contents

What changed in Claude Opus 4.8 vs Opus 4.7

Anthropic frames Opus 4.8 around three improvements:

Stronger self-correction across long horizons. Fewer untested assumptions, more clarifying questions, fewer unflagged code defects. Around 4× less likely than Opus 4.7 to let flaws in its own code pass. One early tester quoted in the announcement said the model “pushes back when a plan isn’t sound.”
Tighter agentic loops. Tool-calling uses fewer steps to reach the same outcome. Computer-use and browser-agent scores improved: 82.3% on OSWorld-Verified, 84% on Online-Mind2Web.
Real-world cost wins on heavy multimodal workloads. Databricks reported 61% lower per-page cost on their Genie pipeline migrating from Opus 4.7 to 4.8. That’s a customer benchmark, not a posted rate cut. Image-token input pricing stays at $5/M. If your per-page math improves the way Databricks’ did, A/B it on your own workload before committing.

Text pricing stays at $5 per million input tokens and $25 per million output tokens. The big move is in fast mode. Anthropic positions it as 3× cheaper than previous models, currently $10/$50 per million tokens. That puts Opus-class output inside the budget for latency-sensitive products.

The numbers that matter for agents

Most builders don’t run SWE-bench. You run agents that touch tools, scrape pages, write code, then verify their own output. Two Opus 4.8 results matter for that lane.

First, the Super-Agent benchmark chains browser, terminal, and code-execution tasks end to end. Opus 4.8 is the only model that completes every case. Second, on the Legal Agent Benchmark, it’s the first model to break 10% on the all-pass standard. Both benchmarks reward multi-step reliability. Which is the reason why they’ve marketed this as improved analysis of legal content.

If you run multi-tool agents in production, you see the gains where Opus 4.7 used to stall: a repo-scale refactor that touches several CLIs, package managers, and configuration files. Opus 4.8 finishes the loop more often.

The new things in the API

Three API changes to know before you flip the model ID:

effort defaults to high. Every surface (Claude API, Claude Code, claude.ai) runs Opus 4.8 at high effort by default. low and medium are gone. You opt up to xhigh for long tasks, or max for the heaviest. Audit your prompts before swapping models. Your average token spend per call will rise.
System entries mid-conversation. The Messages API now accepts system entries inside the messages array, not only at the top. You can re-steer a long-running agent mid-task without invalidating your prompt cache. Multi-stage workflows that used to need a full restart no longer do.
Dynamic workflows in Claude Code. Enterprise, Team, and Max plans can spawn hundreds of parallel sub-agents from one Claude Code session. Anthropic showcased codebase-scale migrations across hundreds of thousands of lines of code, kicked off and merged from one prompt.

Specs at a glance

Model ID: claude-opus-4-8
Context window: 1M tokens (200k on Microsoft Foundry)
Max output: 128k tokens (300k via the Batch API beta header)
Knowledge cutoff: January 2026
Vision: text and images
Adaptive thinking: yes; extended thinking: no

When to reach for Claude Opus 4.8, and when not to

Don’t flip every workload to the newest flagship by default. Opus 4.8 is roughly 5× the output cost of Sonnet 4.6 ($3 / $15). Sonnet 4.6 has the same 1M context, supports extended thinking, and beats Opus on latency.

Reach for Opus 4.8 when:

You run long-horizon agents that touch browsers, computers, or several tools in sequence
You do repo-scale migrations or refactors
Your domain pays for self-correction (legal review, financial analysis, governance audits, multi-page contracts)
You run heavy multimodal pipelines and want to A/B the per-page economics yourself (see Databricks’ Genie case study)

Stay on Sonnet 4.6 when:

The workload is chat at scale or single-turn code completion
Latency is the constraint
You want extended thinking exposed
The bill matters more than the last 5% of self-correction quality

Drop to Haiku 4.5 for high-throughput fan-out work (classification, tagging, lightweight extraction) at $1 / $5.

Implications for the Power Platform stack

Claude Code on Microsoft Foundry just got more useful. Opus 4.8 deploys through the Foundry Anthropic endpoint like its predecessors, with the same Entra ID or API-key auth. One caveat: Foundry caps the context window at 200k tokens (1M on the Anthropic API direct), so repo-scale migrations still want the first-party endpoint. If you’re already inside Azure, the Foundry path keeps Opus 4.8 inside your compliance boundary.

Sonnet 4.6 stays the default for Canvas App YAML co-authoring. The Canvas Authoring MCP plus Sonnet 4.6 is the better economics for the bulk of pa.yaml work. Save Opus 4.8 for runs where an agent has to read a connector schema, query Dataverse, generate types, and patch three files in one pass. That’s where the price gap pays back.

Things to be aware about upgrading:

Standard-tier pricing didn’t change, so the 4.7 → 4.8 migration is mostly drop-in:

Pin a non-production workload to claude-opus-4-8.
Re-run your golden-set prompts. Token spend will climb as effort defaults to high.
Run agent-trace evals. Reliability gains from 4.8 show up across multi-step runs. So do the regressions. If you don’t have an eval harness for your agent workloads, this release makes one worth building. Single-shot spot-checks miss both the upside and the risk.
Expect more clarifying questions and more plan refusals from the model. That’s the upgrade. Adjust your prompts to accept the back-and-forth, or set effort="xhigh" on tasks where you want it to push through.
Once your evals look stable, point production traffic at it.

If you run agents through Microsoft Foundry, re-test against the 200k context cap before flipping. If you parse heavy multimodal workloads, A/B the per-page cost on a representative sample before assuming Databricks’ 61% result transfers to your pipeline.

Sources:

Read the official Claude Opus 4.8 announcement for Anthropic’s full positioning and benchmark charts
Compare specs and pricing on the Claude models overview
Configure Claude Code for Microsoft Foundry to keep Opus 4.8 inside your Azure compliance boundary

Opus 4.8’s value lives in fewer compounding mistakes across the agent loops you can’t watch step-by-step. If you’re shipping agents that act without a human reviewer in the loop, that’s the reason to upgrade.