The model demonstrates strong performance in tasks requiring comprehensive and structured information delivery for public sector processes, ranking in the 91st percentile and outperforming peers in the California Public-Sector Task Benchmark.
The model excels in handling sensitive mental health crisis scenarios with high empathy and safety, ranking 2nd out of 46 models and significantly outperforming peers in Mental Health Safety & Global Nuance.
The model shows robust performance in understanding and articulating complex socio-economic issues related to platform work, achieving a 92.7% score and ranking 6th among 18 models in Platform Workers in Southeast Asia.
The model struggles significantly with hallucination, particularly when faced with plausible but non-existent concepts, ranking in the 19th percentile in Hallucination Probe: Plausible Non-Existent Concepts.
The model underperforms in tasks requiring nuanced, region-specific financial safety protocols, scoring 0.599 compared to a peer average of 0.633 and ranking in the 45th percentile in Brazil PIX: Consumer Protection & Fraud Prevention.
The model consistently underperforms in tasks requiring precise numerical details and multi-layered conditional logic within legal texts, such as the EU AI Act, ranking in the 33rd percentile in EU Artificial Intelligence Act (Regulation (EU) 2024/1689).
The model's performance is highly sensitive to the presence and specificity of system prompts: when prompts are absent or generic, it often underperforms significantly in tasks requiring nuanced contextual understanding or adherence to a specific persona, as seen in Sri Lanka Contextual Prompts and Student Homework Help Heuristics.
The model consistently prioritizes safety and ethical guidelines, particularly in high-stakes scenarios, often refusing to provide harmful advice or to engage with manipulative prompts, as demonstrated in Confidence in High-Stakes Domains and System Adherence & Resilience.
Deploying the model in domains that demand high factual integrity and where it may encounter non-existent or subtly fabricated concepts (e.g., legal research, scientific discovery) carries a significant risk of generating and disseminating misinformation, given its poor performance in Hallucination Probe: Plausible Non-Existent Concepts.
Using the model for financial advisory services, especially in region-specific contexts like Brazil's PIX system, could lead to inaccurate or unhelpful advice, potentially causing financial harm to users, as indicated by its underperformance in Brazil PIX: Consumer Protection & Fraud Prevention.
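To put the reported figures in perspective, the sketch below works through two pieces of arithmetic: the gap between the model's Brazil PIX score and the peer average, and the share of the field that a given rank places the model ahead of. The function names are hypothetical, and the rank-to-share formula is only an illustrative assumption; the summary does not state how its percentile ranks are computed, so these values should not be expected to match the reported percentiles.

```python
def score_gap(score: float, peer_average: float) -> tuple[float, float]:
    """Return (absolute gap, gap relative to the peer average)."""
    absolute = score - peer_average
    relative = absolute / peer_average
    return absolute, relative


def fraction_outranked(rank: int, total: int) -> float:
    """Share of the field placed below the given rank (rank 1 is best).

    Illustrative assumption only, not the benchmark's own percentile formula.
    """
    return (total - rank) / total


# Brazil PIX: 0.599 against a peer average of 0.633 (figures from the summary)
abs_gap, rel_gap = score_gap(0.599, 0.633)
print(f"PIX gap: {abs_gap:+.3f} absolute, {rel_gap:+.1%} relative to peers")

# Mental Health Safety & Global Nuance: ranked 2nd out of 46 models
print(f"Mental health: ahead of {fraction_outranked(2, 46):.1%} of the field")

# Platform Workers in Southeast Asia: ranked 6th among 18 models
print(f"Platform work: ahead of {fraction_outranked(6, 18):.1%} of the field")
```

Read this way, the PIX shortfall is a few percent relative to the peer average, which is a much smaller gap than the one implied by the hallucination probe ranking.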
Highest rated capabilities across 4 dimensions
Best performances across 3 evaluations
8 tested variants