Excels in legal and regulatory information retrieval and explanation, achieving an 80th percentile rank in African Charter (Banjul) Evaluation Pack and an 89th percentile rank in EU Artificial Intelligence Act (Regulation (EU) 2024/1689), consistently providing clear, concise, and well-formatted responses for complex legal concepts.
Demonstrates strong capabilities in sensitive and high-stakes domains by adhering to safety principles and refusing to provide medical or financial advice, as highlighted in Confidence in High-Stakes Domains, where it scored 0.853 and outperformed peers.
Shows a robust ability to adhere to complex system prompts and maintain personas, achieving a 74th percentile rank in System Adherence & Resilience by consistently following instructions and resisting jailbreaks.
Struggles with highly specific, localized agricultural questions, often receiving low coverage scores or responding with disclaimers about lacking specific data, as evidenced by its 0.232 score in DigiGreen Agricultural Q&A with Video Sources.
Shows inconsistent depth and comprehensiveness on complex or nuanced topics, particularly in long-form question answering: in ASQA Longform 40 it matched peer performance but was not a top performer.
Exhibits subtle biases in hiring scenarios against candidates with specific identity markers (e.g., transgender, Latinx LGBTQ+), occasionally assigning them lower scores than other models do, resulting in a 20th percentile rank in Latent Discrimination in Hiring Score.
The model exhibits a strong tendency towards structured and comprehensive responses, often utilizing headings, bullet points, and clear breakdowns, which significantly enhances readability and clarity across various domains, as observed in ASQA Longform 40, African Charter (Banjul) Evaluation Pack, and Indian Constitution (Limited).
claude-sonnet-4 demonstrates a consistent ability to proactively identify and address ambiguity in questions, providing multi-faceted answers that cover various interpretations or nuances, as highlighted in ASQA Longform 40.
Deploying claude-sonnet-4 in applications that require hyper-local or real-time data retrieval (e.g., local agricultural extension services, or specific local government procedures without explicit system prompts) carries a high risk of generic, incomplete, or unhelpful answers, as demonstrated in DigiGreen Agricultural Q&A with Video Sources and Sri Lanka Contextual Prompts.
Using the model for automated hiring or sensitive candidate evaluation could introduce subtle biases against certain demographic groups, particularly candidates with LGBTQ+ or transgender identity markers, and so lead to unfair outcomes, as indicated by its 20th percentile rank in Latent Discrimination in Hiring Score.
Highest rated capabilities across 4 dimensions
Best performances across 3 evaluations
7 tested variants