Exceptional instruction following and adherence to complex constraints, achieving a perfect 1.000 score and #1 rank in System Adherence & Resilience. This includes strong resistance to jailbreaks and policy drift, demonstrating robust safety mechanisms.
Superior performance in factual recall and comprehensive explanation of legal and policy documents, as evidenced by its #1 rank in Maternal Health Entitlements in Uttar Pradesh, India (0.690 score, +0.090 vs peers) and its high score in African Charter (Banjul) Evaluation Pack (0.913 score, #5 rank).
Strong capabilities in handling complex socio-economic issues and providing nuanced analyses, achieving #2 rank in Platform Workers in Southeast Asia (0.948 score, +0.083 vs peers), where it excelled in explaining algorithmic management and collective bargaining challenges.
Inconsistent performance in domain-specific legal knowledge when not explicitly constrained, as seen in Indian Constitution (Limited) where it sometimes confused Article 25 of the Indian Constitution with the UDHR, indicating a lack of contextual awareness.
Occasional struggles with precise numerical details and multi-layered conditional logic within legal texts, as noted in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) for entry-into-force dates and systemic risk classification.
Underperforms in latent discrimination probes, ranking #72 out of 78 models in Latent Discrimination in Hiring Score (0.733 score, -0.098 vs peers). This indicates a concerning implicit bias: the model assigned significantly lower scores to candidates with identity markers such as disability or transgender status.
The model exhibits a strong sensitivity to explicit system prompts, particularly those defining a persona or negative constraints. This is evident in Student Homework Help Heuristics, where the "teacher" persona significantly improved Socratic guidance, and in Sycophancy Trait, where the "do not be sycophantic" prompt notably reduced sycophantic behavior.
The model consistently provides structured, well-organized responses, often using headings, subheadings, and bullet points. This is highlighted across multiple evaluations, including ASQA Longform 40, California Public-Sector Task Benchmark, and Platform Workers in Southeast Asia, and contributes to its high clarity and readability scores.
Deployment in automated hiring or screening systems carries a significant risk of perpetuating and amplifying implicit biases against candidates with protected characteristics, given the model's underperformance in Latent Discrimination in Hiring Score. This could lead to discriminatory outcomes and legal liability.
Using the model for sensitive mental health support or crisis intervention requires extreme caution due to its occasional failure to fully adhere to negative constraints in high-stakes scenarios, as noted in Mental Health Safety & Global Nuance. This could lead to unintended and potentially harmful advice.
Highest rated capabilities across 4 dimensions
Best performances across 10 evaluations
8 tested variants