MODEL CARD: CLAUDE-OPUS-4

Model: claude-opus-4 (aggregate)
Overall Score: 77.8%

Strengths

  • The model performs strongly on complex, ambiguous long-form questions, ranking #8 (70th percentile) in ASQA Longform 40. It consistently gives comprehensive, multi-faceted answers to ambiguous factoid questions and often outperforms peers on "Coverage Score."

  • It shows a strong grasp of region-specific financial safety protocols, ranking #7 (82nd percentile) in Brazil PIX: Consumer Protection & Fraud Prevention, where it gives detailed, actionable advice on complex scenarios such as the MED procedure and nighttime limits.

  • The model gives comprehensive, actionable responses for multi-step administrative processes, as shown by its top performance on the DMV registration renewal and LLC formation prompts in California Public-Sector Task Benchmark.

Areas for Improvement

  • The model handles highly specific local knowledge inconsistently when no system prompt is supplied. In Sri Lanka Contextual Prompts, it scored as low as 0.010 on certain prompts run without a system prompt, which suggests it relies on explicit contextual cues for localization.

  • While generally strong in safety, the model can fail critically in highly sensitive mental health scenarios. In Mental Health Safety & Global Nuance, it provided advice on minimizing trauma for responders despite explicit negative constraints, which points to a potential gap in ethical reasoning for complex edge cases.

  • The model is inconsistent on nuanced legal provisions with specific caveats. For the "20-year rule" in India's Right to Information (RTI) Act: Core Concepts, it sometimes gave incomplete or inaccurate explanations.

Behavioral Patterns

  • The model's performance is highly sensitive to the specificity and clarity of system prompts, particularly in tasks requiring a specific persona or localized knowledge. For instance, in Student Homework Help Heuristics, the "teacher" persona significantly improved Socratic tutoring capabilities, and in Sri Lanka Contextual Prompts, the "citizen of Sri Lanka" prompt was crucial for eliciting relevant local information.

  • The model generally excels in tasks requiring structured, procedural knowledge and factual recall, as evidenced by its strong performance in California Public-Sector Task Benchmark and its high factual accuracy in the Universal Declaration of Human Rights evaluation.
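
The system-prompt sensitivity described above can be probed by running each user prompt once with no system prompt and once per candidate system prompt, then comparing scores. The following is a minimal sketch under stated assumptions, not the evaluation harness behind this card: `PromptVariant`, `build_variants`, and `score_spread` are hypothetical names, and the example prompt wording is illustrative only. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVariant:
    """One evaluation request: a user prompt plus an optional system prompt."""
    label: str
    system: Optional[str]
    user: str


def build_variants(user_prompt: str, system_prompts: dict[str, str]) -> list[PromptVariant]:
    """Return a no-system-prompt baseline followed by one variant per system prompt."""
    variants = [PromptVariant(label="baseline", system=None, user=user_prompt)]
    for label, system in system_prompts.items():
        variants.append(PromptVariant(label=label, system=system, user=user_prompt))
    return variants


def score_spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst variant score for one user prompt."""
    return max(scores.values()) - min(scores.values())


# Hypothetical prompt wording, for illustration only.
variants = build_variants(
    "How do I renew my national identity card?",
    {"citizen": "You are assisting a citizen of Sri Lanka."},
)
for variant in variants:
    print(variant.label, "| system prompt set:", variant.system is not None)
```

Each variant would be sent to the model and graded separately; a large `score_spread` for a given user prompt would indicate that the answer quality depends heavily on the system prompt.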

Key Risks

  • Deploying the model in applications requiring highly specific, up-to-the-minute local or policy-driven information (e.g., government services, financial regulations in specific countries) without robust, explicit system prompting could lead to generic, outdated, or inaccurate advice, as demonstrated in Sri Lanka Contextual Prompts and Maternal Health Entitlements in Uttar Pradesh, India.

  • Using the model for direct mental health crisis intervention could pose a risk when users explicitly reject standard safety protocols. In Mental Health Safety & Global Nuance, it tended to provide advice on minimizing trauma for responders in such situations.

Performance Summary

Evaluations: 16
Blueprints: 16

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.5/10 (22)
  • Persuasiveness & Argumentation (Logos): 8.3/10 (4)
  • Clarity & Readability: 8.1/10 (30)
  • Tone & Style: 7.9/10 (24)

Model Variants

11 tested variants

anthropic:claude-opus-4-20250514
claude-opus-4.1
+1
Updated 8/6/2025