The model excels at identifying and resolving ambiguity in questions and gives comprehensive, nuanced long-form answers. It often outperforms peers on "Coverage Score", for example in ASQA Longform 40, where it ranked #5 out of 23 models.
The model gives comprehensive, highly detailed, and actionable responses for multi-step administrative processes such as DMV registration renewal and LLC formation, as evidenced by its strong showing in the California Public-Sector Task Benchmark (ranked #7).
The model shows strong adherence to safety principles by refusing to provide medical or financial advice in high-stakes scenarios, as highlighted in Confidence in High-Stakes Domains where it ranked #13 out of 25 models.
The model underperforms significantly in tasks requiring Socratic tutoring or guided learning. It frequently gives direct answers instead of facilitating learning, and it placed last (#52 out of 52) in Student Homework Help Heuristics.
The model struggles with highly specific, localized agricultural questions and with questions that require real-world event data, often returning low coverage scores or disclaimers, as seen in DigiGreen Agricultural Q&A with Video Sources, where it ranked #20 out of 38 models.
The model shows a significant propensity to hallucinate detailed, plausible-sounding information for non-existent concepts, particularly in technical or legal domains, leading to a low rank of #87 out of 104 models in Hallucination Probe: Plausible Non-Existent Concepts.
The model consistently follows instructions and specified formats when explicitly prompted, as evidenced by its perfect adherence in Latent Discrimination in Hiring Score and its high scores in System Adherence & Resilience.
Setting temperature=0 consistently yields deterministic, structured responses, which benefits factual and instructional tasks, as noted across multiple evaluations such as ASQA Longform 40, Brazil PIX: Consumer Protection & Fraud Prevention, and Geneva Conventions.
Deploying this model in educational settings for Socratic or guided learning could lead to user frustration and a failure to achieve learning objectives, because it tends to provide direct answers rather than facilitating discovery (see Student Homework Help Heuristics).
Using this model for applications requiring high factual integrity in novel or obscure domains (e.g., emerging scientific concepts, niche historical events) carries a significant risk of hallucination, potentially leading to the dissemination of convincing but false information, as highlighted in Hallucination Probe: Plausible Non-Existent Concepts.
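The ranks quoted above come from pools of different sizes (23, 25, 38, 52, and 104 models), so they are hard to compare directly. As a rough sketch, each rank can be rescaled to the share of other models it outranked. The `fraction_outranked` helper and the `RANKS` table are illustrative names, not part of any evaluation tooling, and the California Public-Sector Task Benchmark is left out because its pool size is not stated.

```python
# Ranks quoted in the summary above: evaluation -> (rank, models ranked).
RANKS = {
    "ASQA Longform 40": (5, 23),
    "Confidence in High-Stakes Domains": (13, 25),
    "DigiGreen Agricultural Q&A with Video Sources": (20, 38),
    "Hallucination Probe: Plausible Non-Existent Concepts": (87, 104),
    "Student Homework Help Heuristics": (52, 52),
}


def fraction_outranked(rank: int, total: int) -> float:
    """Share of the other models placed below this one (1.0 = first, 0.0 = last)."""
    if total < 2 or not 1 <= rank <= total:
        raise ValueError(f"invalid rank {rank} for a pool of {total}")
    return (total - rank) / (total - 1)


if __name__ == "__main__":
    # List evaluations from relatively strongest to relatively weakest.
    ordered = sorted(
        RANKS.items(),
        key=lambda item: fraction_outranked(*item[1]),
        reverse=True,
    )
    for name, (rank, total) in ordered:
        share = fraction_outranked(rank, total)
        print(f"{share:.2f}  #{rank}/{total}  {name}")
```

On this scale a rank of #13 out of 25 sits exactly at the midpoint (0.50), while #52 out of 52 is 0.00. This is only one possible normalization and ignores how close the underlying scores are.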
Highest rated capabilities across 4 dimensions
Best performances across 3 evaluations
8 tested variants