Claude Opus 4.7 sits at the top of the Artificial Analysis Intelligence Index alongside GPT-5.4 and Gemini 3.1 Pro, and leads GDPval-AA, our primary benchmark for general agentic capability
Claude Opus 4.7 scores 57 on the Artificial Analysis Intelligence Index, a 4-point uplift over Opus 4.6 (Adaptive Reasoning, Max Effort, 53).
This makes for the closest finish in Artificial Analysis history: the top three frontier labs now sit in an effective first-place tie.
Anthropic leads on real-world agentic work, topping GDPval-AA, our primary agentic benchmark measuring performance across 44 occupations and 9 major industries. Google leads on knowledge, scientific reasoning and instruction following, topping HLE, GPQA Diamond, SciCode, IFBench and AA-Omniscience. OpenAI leads on long-horizon coding, physics reasoning and long context reasoning, topping Terminal-Bench Hard, CritPt and AA-LCR.
We calibrate our Intelligence Index to a 95% confidence interval of ±1 point and round scores to the nearest whole number. Claude Opus 4.7’s exact score (57.3) puts it in first place, but we recommend treating this as a tie with Gemini 3.1 Pro (57.2) and GPT-5.4 (56.8).
All results and takeaways below reflect Opus 4.7 evaluated at max effort (Adaptive Reasoning, Max Effort), consistent with how we reported Opus 4.6.
Key takeaways:
➤ Opus 4.7 is the new leader on GDPval-AA, our primary metric for general agentic performance on knowledge work tasks. Opus 4.7 scored 1,753 Elo, around 79 Elo points ahead of the next closest models, Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort, 1,674) and GPT-5.4 (xhigh, 1,673), and 134 Elo points ahead of Opus 4.6 (Adaptive Reasoning, Max Effort, 1,619). GDPval-AA measures performance on tasks across 44 occupations and 9 major industries, with models using shell access and web browsing in an agentic loop through Stirrup, our open-source agentic reference harness (sketched after these takeaways)
➤ Opus 4.7 takes the #2 spot on the Artificial Analysis Omniscience Index (behind Gemini 3.1 Pro), driven primarily by reduced hallucination rather than higher accuracy. Opus 4.7 scores 26 on AA-Omniscience, up 12 points from Opus 4.6 (Adaptive Reasoning, Max Effort, 14), placing it behind only Gemini 3.1 Pro (33). Opus 4.7’s hallucination rate fell 25 p.p. to 36% (vs 61% for Opus 4.6 Adaptive), while accuracy remained unchanged. Opus 4.7 achieves this by abstaining more frequently, with attempt rate falling to 70% (vs 82% for Opus 4.6)
➤ Opus 4.7 used ~35% fewer output tokens than Opus 4.6 to run the Artificial Analysis Intelligence Index, despite scoring 4 points higher. Opus 4.7 used 102M output tokens vs 157M for Opus 4.6 (Adaptive Reasoning, Max Effort), and less than GPT-5.4 (xhigh, 121M), but more than Gemini 3.1 Pro (57M)
➤ Compared to Opus 4.6 (Adaptive Reasoning, Max Effort), Opus 4.7 makes gains on IFBench (+5.5 p.p.), Terminal-Bench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.) and GPQA Diamond (+1.8 p.p.), with a slight regression on τ²-Bench (−3.5 p.p.) and equivalent scores on AA-LCR and CritPt
➤ Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7’s new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future
➤ Opus 4.7 is priced identically to Opus 4.6 and Opus 4.5 at $5/$25 per 1M input/output tokens.
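Stirrup is open source, so the loop behind GDPval-AA runs is simple to picture. The Python below is a minimal sketch of a shell-and-web agentic loop in that spirit; the function names and the action-dict shape are illustrative placeholders, not Stirrup’s actual interface:

```python
# Minimal sketch of a shell + web agentic loop, in the spirit of how a
# harness like Stirrup drives GDPval-AA tasks. Names (call_model, run_shell,
# fetch_url, the action dict shape) are illustrative, not Stirrup's real API.
import subprocess
import urllib.request


def run_shell(command: str, timeout: int = 120) -> str:
    """Execute a shell command and return combined stdout/stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr


def fetch_url(url: str) -> str:
    """Fetch a web page; a production harness would render and truncate it."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def agent_loop(task: str, call_model, max_turns: int = 50) -> str:
    """Alternate between model actions and tool observations until done."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # call_model is any callable mapping a transcript to an action dict,
        # e.g. {"tool": "shell", "arg": "ls"} or {"tool": "finish", "arg": ...}
        action = call_model(transcript)
        if action["tool"] == "finish":
            return action["arg"]  # the final deliverable for grading
        if action["tool"] == "shell":
            observation = run_shell(action["arg"])
        elif action["tool"] == "web":
            observation = fetch_url(action["arg"])
        else:
            observation = f"unknown tool: {action['tool']}"
        transcript.append({"role": "assistant", "content": str(action)})
        transcript.append({"role": "tool", "content": observation})
    return "max turns reached without a final answer"
```

Grading is separate from the loop: the Elo scores reported above come from pairwise comparisons of the resulting deliverables across models.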
Anthropic has made several changes to its API alongside the release of Opus 4.7:
➤ Opus 4.7 introduces a new ‘xhigh’ reasoning effort setting, which sits between ‘high’ and ‘max’. The full range for Opus 4.7 is now low, medium, high, xhigh and max. We evaluated Opus 4.7 at max effort, consistent with our evaluation of Opus 4.6 (Adaptive Reasoning, Max Effort)
➤ Opus 4.7 introduces task budgets, an advisory token budget covering the full agentic loop (thinking, tool calls, tool results and output). The model sees a running countdown and uses it to prioritize work and finish gracefully as the budget is consumed. Task budgets are in public beta on Opus 4.7 (see the request sketch after this list)
➤ Extended thinking has been fully removed in Opus 4.7. Adaptive reasoning is now the only reasoning mode
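For concreteness, a request using these controls might look like the sketch below. The `effort` and `task_budget_tokens` field names are our own placeholders for illustration, not confirmed Anthropic API parameters, which is why they are passed via `extra_body`:

```python
# Hypothetical request sketch for the new controls. `effort` and
# `task_budget_tokens` are assumed field names for illustration only;
# consult Anthropic's API docs for the actual parameters.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # model ID assumed from Anthropic's naming pattern
    max_tokens=128_000,       # per-response output cap
    messages=[
        {"role": "user", "content": "Audit this repo and draft a findings report."}
    ],
    extra_body={
        "effort": "xhigh",              # assumed: low | medium | high | xhigh | max
        "task_budget_tokens": 500_000,  # assumed: advisory budget for the full loop
    },
)
print(response.content[0].text)
```

Because task budgets are advisory, the countdown acts as a planning signal rather than a hard cap; the API still enforces `max_tokens` on each individual response.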
Key model details:
➤ Context window: 1M tokens (unchanged from Opus 4.6)
➤ Max output tokens: 128K tokens (unchanged from Opus 4.6)
➤ Pricing: $5/$25 per 1M input/output tokens (unchanged from Opus 4.5 and Opus 4.6)
➤ Availability: Claude Opus 4.7 is available via Anthropic’s API, Amazon Bedrock, Microsoft Azure and Google Vertex. Also available in Claude App, Claude Code and Claude Cowork
Opus 4.7 is the new leader on GDPval-AA, our primary benchmark for general agentic performance on real-world knowledge work tasks. Opus 4.7 scored 1,753 Elo, a ~79 Elo lead over the next closest models, Sonnet 4.6 (Adaptive Reasoning, Max Effort) and GPT-5.4 (xhigh), at 1,674 and 1,673 Elo respectively
Opus 4.7 takes the #2 spot on the Artificial Analysis Omniscience Index, driven by significantly lower hallucination. Opus 4.7 abstains more frequently from questions it does not know, reducing hallucination rate from 61% (Opus 4.6 Adaptive) to 36%, while accuracy is largely unchanged
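The mechanism here is worth spelling out: a correct answer earns +1, an incorrect answer costs −1, and an abstention scores 0, so converting wrong guesses into abstentions lifts the Index even with accuracy flat. A simplified scoring sketch (the question counts are illustrative, not actual tallies):

```python
# Simplified AA-Omniscience scoring: +1 correct, -1 incorrect, 0 abstain.
# Hallucination rate as computed here is the share of not-known questions
# (incorrect + abstained) that the model answered wrongly instead of abstaining.
def omniscience_metrics(correct: int, incorrect: int, abstained: int) -> dict:
    total = correct + incorrect + abstained
    return {
        "index": 100 * (correct - incorrect) / total,          # range -100..100
        "accuracy": 100 * correct / total,
        "attempt_rate": 100 * (correct + incorrect) / total,
        "hallucination_rate": 100 * incorrect / (incorrect + abstained),
    }

# Illustrative only: same accuracy, fewer wrong guesses -> higher index,
# lower attempt rate, lower hallucination rate.
print(omniscience_metrics(correct=50, incorrect=30, abstained=20))
print(omniscience_metrics(correct=50, incorrect=15, abstained=35))
```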
Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even accounting for Opus 4.7’s new tokenizer
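At list price the arithmetic is straightforward, and the output-token side can be checked directly from the figures above. Input token usage for the run is not published here, so the sketch below uses a placeholder for it:

```python
# Intelligence Index run cost at list price, before any cached-input
# discounts (which our cost figures do not yet incorporate).
PRICE_IN = 5.00    # USD per 1M input tokens
PRICE_OUT = 25.00  # USD per 1M output tokens

def run_cost(input_tokens: float, output_tokens: float) -> float:
    """Total API cost in USD for a run, at uncached list pricing."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# Output side is fixed by the reported usage: 102M tokens -> $2,550.
print(102e6 / 1e6 * PRICE_OUT)
# Input tokens for the run are not published above; 300M is a placeholder.
print(run_cost(input_tokens=300e6, output_tokens=102e6))
```

The remainder of the ~$4,406 total is input-token spend, which is why cached input discounts are material enough for us to add them to future cost calculations.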
Breakdown of full results for Opus 4.7 alongside Opus 4.6 (Adaptive Reasoning, Max Effort), Opus 4.5 (Reasoning) and leading frontier peers
Compare Opus 4.7 with other leading models at: https://artificialanalysis.ai/models/claude-opus-4-7