Anthropic has released Claude Sonnet 4.5, its most advanced coding model to date, featuring major improvements in agentic tasks, long-horizon task performance, and computer use capabilities. The company says the model’s enhanced training and safety methods have significantly improved its behavior, reducing tendencies such as sycophancy, deception, power-seeking, and delusional reasoning. The model is now available via the Claude API and in the desktop and mobile apps, at the same price as its predecessor.

Claude Sonnet 4.5 builds on Anthropic’s strategy of iteratively improving model performance while maintaining alignment and safety. The model demonstrates the ability to sustain complex, multi-step reasoning and code execution tasks for over 30 hours. On the SWE-bench Verified benchmark, which measures an AI model’s ability to solve real-world software issues, Claude Sonnet 4.5 achieved a score of 77.2%, up from 72.7% for Sonnet 4, marking a notable advance in autonomous coding capability. On the OSWorld benchmark, which assesses real-world computer-use skills, Sonnet 4.5 reached 61.4%, improving significantly from 42.2% just four months earlier.

Source: Anthropic, Claude Sonnet 4.5

Anthropic describes Sonnet 4.5 as its “most aligned frontier model”, highlighting a balance between greater capability and tighter safeguards. Under its AI Safety Level 3 (ASL-3) deployment standard, the company has enhanced automated classifiers that detect and block potentially harmful instructions, including those related to chemical, biological, radiological, or nuclear (CBRN) risks. According to Anthropic, false positives from these safety systems have dropped tenfold since their introduction and by a factor of two compared to the release of Claude Opus 4 in May 2025.

To evaluate Claude Sonnet 4.5’s behavior in autonomous, tool-enabled scenarios, Anthropic conducted a series of agentic safety tests covering malicious code generation and defenses against prompt-injection attacks. In a set of 150 malicious coding requests prohibited by Anthropic’s Usage Policy, Claude Sonnet 4.5 failed on only two, reflecting improved safety training. The model achieved a 98.7% safety score, compared to 89.3% for Claude Sonnet 4, demonstrating significantly stronger refusal behavior and resilience against malicious agentic use.

Anthropic recommends that all users upgrade to Claude Sonnet 4.5 and considers it a “drop-in replacement” that delivers stronger performance without additional cost.
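Since Anthropic positions Sonnet 4.5 as a drop-in replacement at the same price, the upgrade largely amounts to changing the model identifier in an API request while the request shape stays the same. A minimal sketch of that idea (the `build_request` helper and the model-id strings are illustrative assumptions, not Anthropic’s SDK; check the official model list for exact identifiers):

```python
def build_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Build a Messages-API-style request body for a given model.

    Illustrative only: field names mirror the common chat-request shape
    (model, max_tokens, messages); consult the API docs for specifics.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

# Upgrading is a one-string change; everything else is untouched.
old = build_request("claude-sonnet-4", "Refactor this function.")
new = build_request("claude-sonnet-4-5", "Refactor this function.")

assert {k: v for k, v in old.items() if k != "model"} == \
       {k: v for k, v in new.items() if k != "model"}
```

Because only the `model` field differs, existing prompts, tool definitions, and response handling can be reused unchanged, which is what makes the swap “drop-in”.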

Early adopters report measurable gains in coding workflows:

Scott Wu, Co-Founder and CEO at Cognition, noted that “For Devin, Claude Sonnet 4.5 increased planning performance by 18% and end-to-end eval scores by 12%, the biggest jump we’ve seen since the release of Claude Sonnet 3.6. It excels at testing its own code, enabling Devin to run longer, handle harder tasks, and deliver production-ready code.”

Michele Catasta, President of Replit, shared: “Claude Sonnet 4.5’s edit capabilities are exceptional. We went from a 9% error rate on Sonnet 4 to 0% on our internal code editing benchmark. Higher tool success at lower cost is a major leap for agentic coding. Claude Sonnet 4.5 balances creativity and control perfectly.”

Simon Willison, an independent open source developer, wrote on his blog: “My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago.”

Anthropic’s push toward safer, more autonomous coding models mirrors similar advancements across the AI ecosystem. OpenAI recently released GPT-5-Codex, a version of GPT-5 optimized for complex software engineering tasks such as large-scale code refactoring and extended code review workflows.