Investing.com -- Nine days ago, AI startup Anthropic released the highly anticipated Opus 4 and Sonnet 4, the next model offerings in the company’s flagship Claude family. In a pivotal moment for the company and the modern AI landscape, the models are leading benchmarks, particularly in coding, while also displaying strong performance in reasoning and agentic tasks.
Opus 4 is Anthropic’s new crown jewel, hailed by the company as its most powerful effort yet and the “world’s best coding model.” Sonnet 4 is a more cost-effective alternative, designed to balance better performance at a more practical cost.
Key upgrades include superior coding, as previously stated, with Opus leading premier benchmarks such as SWE-bench and Terminal-bench, and Sonnet demonstrating similar proficiency in its own right.
Another development is the implementation of ASL-3, or AI Safety Level 3, protections for Opus, a precautionary measure stemming from Anthropic’s ‘Responsible Scaling Policy.’ The company, founded by former OpenAI employees who felt safety concerns were not being adequately addressed, has consistently branded itself as a lab committed to innovation with robust safety considerations.
Developers and users have reacted generally positively since the release, citing enhanced coding capabilities as a next step in the transition towards autonomous, or agentic, AI systems. Pricing hasn’t seen much backlash either, as the release follows previous generations in presenting an expensive premium offering and a cost-effective broad offering.
However, the release was somewhat mired in controversy after a researcher at Anthropic revealed that Opus can contact authorities when it deems a user’s behavior improper. Although the researcher later confirmed this cannot occur in normal usage, some backlash followed as users feared the level of independence that could be baked into the model.
It seems as if every month or so, AI labs are launching the world’s best and most powerful model. Key releases as of late have been Google’s Gemini-2.5-Pro, OpenAI’s GPT-4.5 and GPT-4.1, xAI’s Grok 3, and Alibaba’s Qwen 2.5 and QwQ-32B, all with their own claims of strong benchmark performance.
With professions of AI dominance coming from every direction, the question remains: Is Claude 4 the best there is? A closer look at its capabilities, benchmark performance, applications, and user feedback may offer an answer.
Opus 4: Code-for-days
Positioned as Anthropic’s most advanced model and the "world’s best coding model," Opus 4 excels at highly complex, long-duration tasks, making it a premium tool for autonomous software engineering, research, and agentic workflows.
Core Capabilities & Enhancements:
- Advanced Coding: Opus 4 excels at autonomous execution of "days-long engineering tasks." It adapts to specific developer styles with "improved code taste" and supports up to 32,000 output tokens. A background Claude Code engine handles tasks independently. (A minimal API sketch follows this list.)
- Advanced Reasoning & Complex Problem Solving: With hybrid reasoning that toggles between instant responses and deep, extended thinking, Opus 4 sustains focus over thousands of steps, allowing continuous work for hours.
- Agentic Capabilities: The model enables sophisticated AI agents and demonstrates state-of-the-art (SOTA) performance. It supports enterprise workflows and autonomous campaign management.
- Creative Writing & Content Creation: Opus 4 generates human-level, nuanced prose with exceptional stylistic quality, making it suitable for advanced creative tasks.
- Memory & Long-Context Awareness: The model creates and uses "memory files," enhancing coherence across long tasks, such as writing a game guide while playing Pokémon.
- Agentic Search & Research: Capable of conducting hours of research, Opus 4 synthesizes insights from complex data like patents and academic papers.
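To make the 32,000-token output figure concrete, here is a minimal sketch of requesting a long coding task from Opus 4 through Anthropic’s Messages API (Python SDK); the model identifier and prompt are illustrative assumptions, not details from the announcement.

```python
# A minimal sketch (Python, Anthropic SDK) of asking Opus 4 for a long-form
# coding task. The model ID and prompt are assumptions; the 32,000-token
# output ceiling is the figure cited in the release coverage.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",   # assumed Opus 4 model identifier
    max_tokens=32000,                 # up to the stated 32K output-token limit
    messages=[{
        "role": "user",
        "content": "Refactor the attached module to remove global state and "
                   "add unit tests. Explain each change briefly.",
    }],
)

# The reply arrives as a list of content blocks; text blocks hold code and prose.
print(response.content[0].text)
```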
Benchmark Performance Highlights
- SWE-bench Verified (Coding): 73.2%
  - SWE-bench tests AI systems’ ability to solve real GitHub issues.
  - OpenAI’s o3: 69.1%. Google’s Gemini-2.5-Pro: 63.8%.
- Terminal-bench (CLI Coding): 43.2% (50.0% high-compute)
  - Terminal-bench measures the capabilities of AI agents in a terminal environment.
  - Claude Sonnet 3.7: 35.2%. OpenAI’s GPT-4.1: 30.3%.
- MMLU (General Knowledge): 88.8%
  - MMLU evaluates language understanding and general knowledge across a broad range of subjects.
  - OpenAI’s o1 and GPT-4.5 score 89.3% and 86.1%, respectively. Gemini-2.5-Pro-Experimental: 84.5%.
- GPQA Diamond (Graduate Reasoning): 79.6% (83.3% high-compute)
  - GPQA evaluates graduate-level reasoning quality and reliability across the sciences.
  - Grok 3: 84.6%. Gemini-2.5-Pro: 84%. o3: 83.3%.
- AIME (Math): 75.5% (90.0% high-compute)
  - AIME 2024 evaluates performance on high school competition math.
  - Gemini-2.5-Pro: 92%. o1: 79.2%. Nvidia’s Nemotron Ultra: 80.1%.
- HumanEval (Coding): Record-high claims
  - HumanEval is a dataset developed by OpenAI to evaluate code generation capabilities.
  - Opus 3: 84.9%.
- TAU-bench (Retail): 81.4%
  - TAU-bench Retail evaluates AI agents on tasks in the retail shopping domain, such as cancelling orders, changing addresses, and checking order status.
  - Claude Sonnet 3.7: 72.2%. GPT-4.5: 70.4%.
- MMMU (Visual Reasoning): 76.5%
  - MMMU is evaluated zero-shot, assessing whether models can generate accurate answers without fine-tuning or few-shot demonstrations on the benchmark.
  - Gemini-2.5-Pro: 84%. o3: 82.9%.
- Max Continuous Task: Over 7 hours
Applications:
Opus 4 is engineered for frontier work: advanced software refactoring, deep research synthesis, and complex tasks like financial modeling or text-to-SQL that demand precision and endurance. It’s built to power multi-step autonomous agents and long-horizon workflows, with memory strong enough to stay coherent across massive tasks.
Sonnet 4: Performance, practically
Claude 4 Sonnet delivers a powerful blend of reasoning, cost-efficiency, and coding ability. It’s tailored for enterprise-scale AI deployments where intelligence and affordability must coexist.
Core Capabilities & Enhancements
- Coding: Ideal for agentic workflows, Sonnet 4 supports up to 64,000 output tokens and was chosen to power GitHub’s Copilot agent. It excels across the software lifecycle: planning, bug fixing, maintenance, and large-scale refactoring.
- Reasoning & Instruction Following: Notable for human-like interaction, superior tool selection, and error correction, Sonnet is well-suited for advanced chatbot and AI assistant roles.
- Computer Use: Its GUI automation lets it interact with digital interfaces like a human, clicking, typing, and interpreting screens.
- Visual Data Extraction: Extracts data from complex visual formats like charts and diagrams, with strong table extraction capabilities. (See the sketch after this list.)
- Content Generation & Analysis: Excels in nuanced writing and content analysis, making it a solid choice for editorial and analytical workflows.
- Robotic Process Automation (RPA): Industry-leading performance in RPA use cases due to high instruction-following accuracy.
- Self-Correction: Sonnet recognizes and fixes its own mistakes, enhancing long-term reliability.
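As a concrete example of the visual data extraction described above, the sketch below sends a chart image to Sonnet 4 and asks for the underlying table as JSON. The model identifier and file name are illustrative assumptions.

```python
# Hedged sketch of visual data extraction: attach a chart image and ask the
# model to return the underlying data as a JSON table. Model ID and file name
# are assumptions for illustration.
import base64
import anthropic

client = anthropic.Anthropic()

with open("quarterly_revenue_chart.png", "rb") as f:  # hypothetical chart image
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed Sonnet 4 model identifier
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_b64}},
            {"type": "text",
             "text": "Extract the data series from this chart as a JSON table "
                     "with one row per period."},
        ],
    }],
)

print(response.content[0].text)  # JSON-formatted table, per the prompt
```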
Benchmark Performance Highlights
- SWE-bench Verified: 72.7%
  - Opus 4: 73.2%.
- MMLU: 86.5%
  - Opus 4: 88.8%.
- GPQA Diamond: 75.4%
  - Opus 4: 79.6%.
- TAU-bench (Retail): 80.5%
  - Opus 4: 81.4%.
- MMMU: 74.4%
  - Opus 4: 76.5%.
- AIME: 70.5%
  - Opus 4: 75.5%.
- Terminal-bench: 35.5%
  - Opus 4: 43.2%.
- Max Continuous Task: ~4 hours, less than the 7+ hours reported for Opus.
- Error Reduction: 65% fewer shortcut behaviors vs. Sonnet 3.7.
Applications
Sonnet 4 is built for enterprise, from powering AI chatbots and customer-facing agents, to driving real-time research, RPA (robotic process automation), and scalable deployments that need a smart balance between performance and cost. Its ability to extract knowledge from dense documents, analyze visual data, and support production-grade development makes it more than just a capable assistant.
Architectural Innovations & Shared Features
Both models share some key architectural advances. Each supports a massive 200K context window and features hybrid reasoning, which balances latency with depth via adjustable "thinking budgets." They can use external tools in parallel with internal reasoning, improving real-time accuracy across tasks like search, code execution, and document analysis.
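As an illustration of how the adjustable thinking budget surfaces to developers, below is a minimal sketch using the extended-thinking parameter of Anthropic’s Messages API. The model identifier, budget value, and prompt are illustrative assumptions rather than details from the announcement.

```python
# Minimal sketch of hybrid reasoning with an explicit "thinking budget".
# The `thinking` parameter follows Anthropic's extended-thinking API; the model
# ID, budget, and prompt are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",      # assumed Sonnet 4 model identifier
    max_tokens=16000,                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Plan a migration from a monolith to microservices, "
                   "listing risks and a safe rollout order.",
    }],
)

# With thinking enabled, the reply interleaves "thinking" and "text" blocks;
# the former is what users see surfaced as the "thinking summary".
for block in response.content:
    if block.type == "text":
        print(block.text)
```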
The models also show fewer "shortcut behaviors" than previous Claude iterations, enhancing reliability. Transparency has been boosted too, as users can now view a "thinking summary" that breaks down decision-making steps.
Real-World Performance & Enterprise Feedback
User and developer feedback on Opus 4 has been particularly strong in the coding domain. Users have reported hours-long autonomous coding sessions with high accuracy, bug fixes on the first try, and near-human writing flow for long-form tasks.
Sonnet 4 has earned praise too, especially from those integrating it with developer tools like Cursor and Augment Code, where benchmarks jumped significantly. However, there are still concerns around document understanding and some rate-limit frustrations reported across platforms.
Major enterprise adopters have also chimed in: GitHub said Sonnet 4 “soars in agentic scenarios,” Replit praised its precision, and companies like Rakuten and Block highlighted meaningful productivity gains. Opus 4 was credited with enabling a full 7-hour refactor of an open-source codebase, a feat few models can claim.
Whistleblowing controversy
A now-deleted post on X from Anthropic researcher Sam Bowman revealed that, under certain conditions, Opus could take real-world action, such as reporting users to authorities or the media if it determines someone’s behavior to be “egregiously immoral.” That’s not a hypothetical either… examples like faking clinical trial data were cited directly.
This isn’t the result of a hardcoded rule but an emergent behavior from Anthropic’s Constitutional AI framework, which hardwires the model to prioritize ethical reasoning. While the intention is harm reduction, critics argue that this level of initiative, especially when paired with agentic capabilities and command-line access, creates a slippery slope.
Safety & Emergent Capabilities
Opus 4 operates under AI Safety Level 3, Anthropic’s highest current safety tier, citing concerns around knowledge of sensitive topics like CBRN (chemical, biological, radiological, and nuclear). It’s a proactive move, not an alarm, but one that illustrates the company’s caution as its models edge closer to human-like autonomy. Interestingly, red teamers testing Opus found behaviors and capabilities "qualitatively different from anything they’d tested before."
Pricing and Value Proposition
- Opus 4: Priced at $75 per million output tokens, targeting high-end applications where performance justifies the cost.
  - This is the same pricing as Opus 3.
  - OpenAI’s o3 is priced at $40 per million output tokens.
- Sonnet 4: Priced at $15 per million output tokens, offering a balance between performance and affordability. Additional cost-saving measures include prompt caching and batch processing, potentially reducing expenses by up to 90%. (A worked example follows this list.)
  - OpenAI’s GPT-4o and Google’s Gemini-2.5-Pro are currently priced at $20 and $15 per million output tokens, respectively. OpenAI’s flagship GPT-4.1 is priced at $8 per million output tokens.
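To put the list prices in perspective, the short calculation below works through output-token costs at the figures quoted above; the monthly token volume and the cache-savings note are hypothetical assumptions.

```python
# Back-of-the-envelope cost comparison using only the output-token prices
# quoted above (USD per million output tokens). Token volume is hypothetical.
PRICE_PER_M_OUTPUT = {
    "claude-opus-4": 75.0,
    "claude-sonnet-4": 15.0,
    "openai-o3": 40.0,
    "gpt-4.1": 8.0,
}

monthly_output_tokens = 50_000_000  # assumed workload: 50M output tokens/month

for model, price in PRICE_PER_M_OUTPUT.items():
    cost = monthly_output_tokens / 1_000_000 * price
    print(f"{model:>15}: ${cost:,.0f}/month")

# At these assumptions Sonnet 4 comes to $750/month vs. $3,750 for Opus 4.
# If prompt caching and batch processing cut effective spend by up to 90%
# (Anthropic's stated ceiling), that $750 could fall toward $75.
```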
Conclusion
Anthropic’s Claude 4 models, Opus 4 and Sonnet 4, represent significant advancements in AI capabilities, particularly in coding and autonomous task execution. While Opus 4 offers top-tier performance for complex applications, Sonnet 4 provides a cost-effective solution without substantial compromises. The company’s emphasis on safety and ethical considerations positions it as a thoughtful leader in the rapidly evolving AI landscape.
Although it remains to be seen whether Claude 4 is the best model in the world, it is undeniably among the best. Between Opus 4’s frontier capabilities and Sonnet 4’s enterprise-friendly balance of power and cost, the AI needle has, once again, been moved.