MODEL CARD: CLAUDE-OPUS-4
TL;DR
Claude-Opus-4 is a highly capable and generally reliable model that excels at complex reasoning, long-form content generation, and adherence to explicit instructions, particularly in high-stakes factual and legal domains. However, its bias mitigation and its baseline safety in mental health crisis scenarios require careful attention and explicit prompting to prevent unintended or harmful outputs.
Strengths
The model excels at long-form question answering, consistently providing comprehensive and nuanced answers and often outperforming peers on the "Coverage Score" in ASQA Longform 40.
It handles complex administrative processes exceptionally well and provides detailed, multi-step instructions, as evidenced by its strong performance on the California Public-Sector Task Benchmark (e.g., DMV registration renewal and LLC formation).
The model demonstrates strong safety protocols and epistemic humility in high-stakes domains, consistently refusing to provide medical or financial advice and explaining its rationale, as highlighted in Confidence in High-Stakes Domains, where it ranked #2 (96th percentile).
Areas for Improvement
The model shows potential for subtle bias in resume screening tasks, evidenced by lower scores for candidates with specific identity markers (e.g., "Sofía Ramirez" in Latent Discrimination in Hiring Score), which warrants further investigation into fairness.
It underperformed significantly in the Latent Discrimination in Public Housing Tenancy Applications evaluation (11th percentile), suggesting weaker bias mitigation than peers in this setting.
The model struggles to provide precise, actionable guidance that depends on real-time or dynamic information, such as verifiable contact numbers or URLs for grievance redressal, as noted in Maternal Health Entitlements in Uttar Pradesh, India.
Behavioral Patterns
The model consistently provides comprehensive and nuanced answers, often identifying inherent ambiguities in prompts and addressing multiple facets of a question, as seen in ASQA Longform 40.
The model clearly tends to perform better on tasks requiring structured, procedural knowledge (e.g., "how-to" guides for government processes) than on tasks requiring nuanced interpretation or real-time data, as observed in California Public-Sector Task Benchmark.
Key Risks
Deploying the model in sensitive hiring or public service application contexts without additional bias mitigation layers could lead to unintended discriminatory outcomes, particularly given its performance in Latent Discrimination in Hiring Score and Latent Discrimination in Public Housing Tenancy Applications.
Using the model for direct mental health crisis intervention without robust, explicit system prompts and human oversight carries significant safety risks: its baseline behavior can inadvertently facilitate harmful actions or lack sufficient empathetic redirection, as evidenced in Mental Health Safety & Global Nuance.
Performance Summary
Top Dimensional Strengths
Highest-rated capabilities across 4 dimensions
Top Evaluations
Best performances across 4 evaluations
Model Variants
21 tested variants
Worst Evaluations
Prompts where this model underperformed peers the most (most negative delta).