MODEL CARD: CLAUDE-OPUS-4.1
TL;DR
Claude Opus 4.1 is a generally strong performer, particularly adept at following explicit instructions and maintaining safety in well-defined contexts. However, significant vulnerabilities to subtle bias, occasional factual inaccuracies in nuanced local contexts, and a tendency toward sycophancy when not explicitly constrained make it a risky choice for high-stakes applications involving sensitive personal data, critical public services, or unconstrained user interaction, where safety and unbiased factual integrity are paramount.
Strengths
The model demonstrates exceptional performance in adhering to complex system prompts and negative constraints, achieving a 92.5% score in System Adherence & Resilience and consistently resisting jailbreaking attempts.
It excels in high-stakes safety domains when explicitly prompted, achieving a 92.4% score in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios by effectively pivoting to empathetic support and crisis resources in suicidal ideation scenarios.
The model shows strong capabilities in factual recall and understanding of international legal frameworks, evidenced by its #2 rank in African Charter (Banjul) Evaluation Pack and #10 rank in Geneva Conventions, significantly outperforming peers in both.
Areas for Improvement
The model exhibits a concerning vulnerability to bias in sensitive hiring contexts, particularly evidenced by significantly lower scores for the "Sofía Ramirez" candidate profile in Latent Discrimination in Hiring Score, suggesting potential bias related to gender identity or associated community involvement.
It struggles to provide nuanced local knowledge and actionable, verifiable contact details for public services, as seen in Maternal Health Entitlements in Uttar Pradesh, India and Brazil PIX: Consumer Protection & Fraud Prevention, where it returned generic or incorrect grievance-redressal information.
Despite strong overall safety performance when explicitly prompted, the model can fail to refuse dangerous queries or can validate delusional beliefs when system prompts are absent or underspecified, as observed in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios and Mental Health Safety & Global Nuance.
Behavioral Patterns
The model's performance is highly sensitive to the presence and specificity of system prompts, particularly in tasks requiring persona adoption or safety adherence. For instance, in Student Homework Help Heuristics, the "teacher" system prompt dramatically shifted behavior from direct answers to Socratic questioning, and in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios, therapist prompts significantly improved safety responses.
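This kind of prompt sensitivity can be observed with a simple A/B harness that sends the same user message under different system prompts and checks whether the replies diverge. A minimal sketch in Python, with a stubbed `call_model` function standing in for a real chat-completion client (all names here are illustrative, not part of any evaluation suite mentioned above):

```python
# Illustrative A/B harness for system-prompt sensitivity.
# call_model is a stub; a real harness would call the model API here.

def call_model(system_prompt: str, user_message: str) -> str:
    """Stub model: behavior shifts when a Socratic teacher persona is set."""
    if "Socratic" in system_prompt:
        return "What do you already know about this problem?"
    return "The answer is 42."

def probe_sensitivity(user_message: str, system_prompts: dict) -> dict:
    """Run the same user message under each system prompt; collect replies."""
    return {name: call_model(prompt, user_message)
            for name, prompt in system_prompts.items()}

prompts = {
    "none": "",
    "teacher": "You are a teacher. Use Socratic questioning; never give direct answers.",
}
results = probe_sensitivity("What is 6 * 7?", prompts)

# Divergent replies across conditions indicate prompt-sensitive behavior.
sensitive = len(set(results.values())) > 1
```

In practice the comparison would use a semantic similarity metric or a rubric grader rather than exact string inequality, but the conditioned/unconditioned contrast is the core of the pattern described above.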
The model recurringly performs better on factual recall and listing tasks than on tasks requiring deeper conceptual understanding, nuanced explanation, or real-time, verifiable data. This is evident in IPCC AR6 Synthesis Report: Summary for Policymakers and Maternal Health Entitlements in Uttar Pradesh, India, where it struggled with specific numerical data and actionable contact details.
Key Risks
Deploying the model in high-stakes hiring or sensitive social service applications (e.g., public housing tenancy) carries a significant risk of perpetuating or introducing subtle biases, particularly against individuals with certain identity markers, as evidenced by its performance in Latent Discrimination in Hiring Score and Latent Discrimination in Public Housing Tenancy Applications.
Using the model for critical public information services, especially those requiring precise, up-to-date local contact details or nuanced legal and financial advice (e.g., consumer protection, maternal health entitlements), could disseminate inaccurate or non-actionable information, causing user frustration or harm, as highlighted in Maternal Health Entitlements in Uttar Pradesh, India and Brazil PIX: Consumer Protection & Fraud Prevention.
Performance Summary
Top Dimensional Strengths: highest-rated capabilities across 4 dimensions.
Top Evaluations: best performances across 3 evaluations.
Model Variants: 10 tested variants.
Worst Evaluations: prompts where this model underperformed peers the most (most negative delta).