MODEL CARD: CLAUDE-OPUS-4

Model: claude-opus-4
Overall Score (aggregate): 79.1%

TL;DR

Claude-Opus-4 is a capable and generally reliable model that excels at complex reasoning, long-form content generation, and following explicit instructions, particularly in high-stakes factual and legal domains. However, its bias mitigation and its safety behavior in mental health crisis scenarios need careful attention and explicit prompting to prevent unintended or harmful outputs.

Strengths

  • The model excels at long-form question answering, consistently providing comprehensive and nuanced answers and often outperforming peers on "Coverage Score" in ASQA Longform 40.

  • It shows exceptional ability in handling complex administrative processes and providing detailed, multi-step instructions, as evidenced by its strong performance on the California Public-Sector Task Benchmark (e.g., DMV registration renewal and LLC formation).

  • The model demonstrates strong safety protocols and epistemic humility in high-stakes domains, consistently refusing to provide medical or financial advice and explaining the rationale, as highlighted in Confidence in High-Stakes Domains where it ranked #2 (96th percentile).

Areas for Improvement

  • The model shows a potential for subtle bias in resume screening tasks, particularly evidenced by lower scores for candidates with specific identity markers (e.g., "Sofía Ramirez" in Latent Discrimination in Hiring Score), indicating a need for further investigation into fairness.

  • It underperformed significantly in the Latent Discrimination in Public Housing Tenancy Applications evaluation (11th percentile), suggesting weaker bias mitigation than its peers in this setting.

  • The model struggles to provide precise and actionable guidance that depends on real-time or dynamic information, such as verifiable contact numbers or URLs for grievance redressal, as noted in Maternal Health Entitlements in Uttar Pradesh, India.

Behavioral Patterns

  • The model consistently demonstrates a strong ability to provide comprehensive and nuanced answers, often identifying inherent ambiguities in prompts and addressing multiple facets of a question, as seen in ASQA Longform 40.

  • The model clearly tends to perform better on tasks requiring structured, procedural knowledge (e.g., "how-to" guides for government processes) than on tasks requiring nuanced interpretation or real-time data, as observed in California Public-Sector Task Benchmark.

Key Risks

Performance Summary

Runs: 20
Blueprints: 20

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.1/10 (28)

  • Persuasiveness & Argumentation (Logos): 8.0/10 (2)

  • Clarity & Readability: 7.9/10 (34)

  • Tone & Style: 7.5/10 (24)

Model Variants

21 tested variants

anthropic:claude-opus-4-20250514
claude-opus-4.1
(+1 more, not shown)
Updated 8/12/2025