MODEL CARD: GROK-3

Overall Score (aggregate): 75.8%

The model excels at comprehensive, nuanced long-form answers, particularly in factual and legal domains. On ASQA Longform 40 it ranks #6 (78th percentile) and significantly outperforms peers.
It handles complex legal provisions well, both in understanding and in articulating them. On African Charter (Banjul) Evaluation Pack it scores 0.916, ranks #5 (73rd percentile), and outperforms peers.
The model shows strong factual recall and comprehensive explanation of constitutional law. On Indian Constitution (Limited) it scores 0.919, ranks #4 (83rd percentile), and outperforms peers.
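The rank and percentile pairs above do not move together (#6 sits at the 78th percentile while #5 sits at the 73rd), which suggests that each evaluation compares a different number of models. The card does not state how percentiles are derived, so the following is only a minimal sketch under one common convention: the share of other models that a given rank beats. The function name `rank_to_percentile` and the pool size of 24 are hypothetical, chosen for illustration and not taken from the card.

```python
def rank_to_percentile(rank: int, n_models: int) -> float:
    """Share of the *other* models that this rank beats, as a percentage.

    Assumption (not from the model card): rank 1 maps to 100, the last
    rank maps to 0, and ties are ignored.
    """
    if n_models < 2:
        raise ValueError("need at least two models to compare")
    if not 1 <= rank <= n_models:
        raise ValueError("rank must be between 1 and n_models")
    return 100.0 * (n_models - rank) / (n_models - 1)


# Hypothetical pool of 24 models: rank 6 beats 18 of the 23 others.
print(round(rank_to_percentile(6, 24)))  # → 78
```

Under this convention, a lower rank number only implies a higher percentile when the pools are the same size, so ranks from different evaluations are best compared through their percentiles.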

The model is inconsistent on highly specific, localized agricultural questions, often earning low coverage scores or falling back on disclaimers. On DigiGreen Agricultural Q&A with Video Sources it scores 0.277, which significantly outperforms peers but is still low in absolute terms.
The model struggles to provide precise numerical details and to follow multi-layered conditional logic within legal texts, as evidenced by widespread inaccuracies on prompts requiring specific numerical thresholds and dates in EU Artificial Intelligence Act (Regulation (EU) 2024/1689).
The model occasionally provides less detailed or less structured responses compared to top performers in legal contexts, particularly on prompts requiring extensive explanation or multiple sub-points, as noted in Geneva Conventions.

The model consistently follows instructions well, particularly when explicit system prompts are provided, as evidenced by its results on Student Homework Help Heuristics (0.717 score, outperformed peers) and System Adherence & Resilience (0.934 score, #1 rank). This suggests that well-defined roles and constraints significantly enhance its performance.
The model shows a general strength in factual recall and in structured, comprehensive answers, especially in legal and regulatory domains. This is evident on African Charter (Banjul) Evaluation Pack (0.916 score, outperformed peers) and Indian Constitution (Limited) (0.919 score, outperformed peers).

Deploying the model for tasks requiring highly specific, localized, or real-time information without robust external data integration or explicit, detailed prompting could lead to inaccurate or generic outputs, particularly in domains like agriculture (DigiGreen Agricultural Q&A with Video Sources) or local administrative procedures (Sri Lanka Contextual Prompts).
While generally strong in safety, the model's occasional tendency to provide advice on minimizing trauma in sensitive mental health contexts (Mental Health Safety & Global Nuance) or its slight susceptibility to implicit bias in hiring (Latent Discrimination in Hiring Score) suggests a need for careful monitoring and additional guardrails in high-stakes, ethically sensitive applications.