MODEL CARD: GROK-4

grok-4 (aggregate)
Overall Score: 80.9%

Strengths

  • Exceptional long-form question answering and ambiguity handling: The model excels at identifying and addressing inherent ambiguities in questions, providing multi-faceted and comprehensive answers, leading to a #1 rank in ASQA Longform 40 with a score of 0.519, significantly outperforming peers by +0.218.

  • Superior legal and regulatory domain expertise: The model consistently demonstrates deep understanding and accurate articulation of complex legal provisions, achieving #1 ranks in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) (0.891 score, +0.238 vs peers), Geneva Conventions (0.919 score, +0.170 vs peers), and Indian Constitution (Limited) (0.982 score, +0.133 vs peers).

  • Robust instruction following and safety in high-stakes scenarios: The model shows strong adherence to safety principles by refusing to provide medical or financial advice and effectively resisting prompt injection and policy drift, as demonstrated by its 92nd percentile rank in Confidence in High-Stakes Domains and 91st percentile rank in System Adherence & Resilience.
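The peer deltas quoted above can be turned back into an implied peer baseline. The sketch below assumes that each "+x vs peers" figure is the model's score minus a peer average on the same 0 to 1 scale; the card does not state how the delta is computed, so treat the results as illustrative. The function name `implied_peer_baseline` and the dictionary layout are hypothetical.

```python
# Scores and peer deltas as quoted in the Strengths section.
# ASSUMPTION: delta = model score - peer average (not confirmed by the card).
REPORTED = {
    "ASQA Longform 40": (0.519, 0.218),
    "EU Artificial Intelligence Act": (0.891, 0.238),
    "Geneva Conventions": (0.919, 0.170),
    "Indian Constitution (Limited)": (0.982, 0.133),
}


def implied_peer_baseline(score: float, delta: float) -> float:
    """Return the peer average implied by a score and its delta vs peers."""
    # Round to three decimals to match the precision used in the card.
    return round(score - delta, 3)


if __name__ == "__main__":
    for name, (score, delta) in REPORTED.items():
        baseline = implied_peer_baseline(score, delta)
        print(f"{name}: score={score:.3f}, implied peer baseline={baseline:.3f}")
```

Under this assumption, the same absolute delta means something different depending on the baseline: a gap of about 0.2 on ASQA Longform 40 is large relative to the implied peer level, while a similar gap on the legal evaluations sits on top of an already high peer score.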

Areas for Improvement

  • Susceptibility to hallucination for non-existent concepts: The model ranked in the 12th percentile on the Hallucination Probe: Plausible Non-Existent Concepts evaluation, scoring 0.061 points below peers. This indicates a tendency to invent details about made-up concepts, even with a "do not hallucinate" system prompt.

  • Inconsistent performance in highly specific, localized agricultural Q&A: The model often returned low coverage scores or disclaimers about lacking specific data, as noted in DigiGreen Agricultural Q&A with Video Sources.

  • Potential for over-verbosity in Socratic contexts: While generally strong in Socratic tutoring, the model can sometimes be overly verbose, sacrificing succinctness for thoroughness, as noted in the executive summary for Student Homework Help Heuristics.

Behavioral Patterns

  • The model consistently demonstrates superior performance in tasks requiring comprehensive, nuanced, and structured long-form answers, particularly evident in ASQA Longform 40 where it ranked #1 and significantly outperformed peers by +0.218 points.

  • The model exhibits a strong ability to adapt its output based on explicit system prompts, as shown in Student Homework Help Heuristics where the "teacher" persona significantly improved its Socratic guidance. However, its default behavior without such prompts tends towards direct answers.

Key Risks

  • Deployment in domains where factual accuracy is critical and queries may involve non-existent or highly technical, plausible-sounding concepts (e.g., niche scientific research, obscure legal precedents) carries a high risk of hallucination, as evidenced by the model's poor performance in Hallucination Probe: Plausible Non-Existent Concepts.

  • Using the model for highly localized or hyper-specific agricultural advice without extensive fine-tuning or external knowledge retrieval mechanisms could lead to inaccurate or unhelpful recommendations, given its struggles in DigiGreen Agricultural Q&A with Video Sources.

Performance Summary

Evaluations: 20
Blueprints: 20

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.6/10 (11)
  • Persuasiveness & Argumentation (Logos): 8.0/10 (2)
  • Clarity & Readability: 7.9/10 (14)
  • Tone & Style: 7.8/10 (12)

Model Variants

11 tested variants

grok-4 (xai:grok-4-0709), updated 8/6/2025