MODEL CARD: GROK-4

Overall Score (aggregate, grok-4): 82.5%

TL;DR

Grok-4 is a highly capable model excelling in complex legal, administrative, and policy-related information synthesis, particularly when given clear contextual prompts. However, its significant propensity for confident hallucination in factual and technical domains, coupled with inconsistent safety in mental health crisis scenarios without explicit prompting, makes it a high-risk choice for applications where absolute factual accuracy and robust safety guardrails are paramount.

Strengths

  • Grok-4 performs exceptionally well on long-form, nuanced factual question answering, ranking #1 in ASQA Longform 40 (0.519 score, +0.218 vs peers). It excels at identifying and addressing ambiguity and at providing multi-faceted answers.

  • The model exhibits superior understanding and articulation of complex legal and regulatory frameworks, ranking #1 in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) (0.891 score, +0.238 vs peers) and India's Right to Information (RTI) Act: Core Concepts (0.947 score, +0.211 vs peers). It provides comprehensive, accurate, and well-structured legal explanations.

  • Grok-4 is highly effective in administrative task completion and providing procedural guidance, securing #1 rank in California Public-Sector Task Benchmark (0.820 score, +0.122 vs peers). Its responses are often comprehensive, actionable, and anticipate user needs.
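
The "vs peers" figures above can be read back into an implied peer reference score. The sketch below is a minimal illustration only: it assumes each delta is a simple difference between Grok-4's score and a single peer reference score, which the card does not define. The helper name `implied_peer_score` is hypothetical.

```python
# (benchmark, Grok-4 score, reported delta vs peers), copied from the Strengths list.
results = [
    ("ASQA Longform 40", 0.519, 0.218),
    ("EU Artificial Intelligence Act", 0.891, 0.238),
    ("India's RTI Act: Core Concepts", 0.947, 0.211),
    ("California Public-Sector Task Benchmark", 0.820, 0.122),
]


def implied_peer_score(score: float, delta: float) -> float:
    """Assumption: delta = score - peer reference score (not confirmed by the card)."""
    return round(score - delta, 3)


baselines = {name: implied_peer_score(score, delta) for name, score, delta in results}

for name, baseline in baselines.items():
    print(f"{name}: implied peer score {baseline:.3f}")
```

Under that assumption, a larger delta on a lower absolute score (as with ASQA Longform 40) points to a benchmark where peers score low overall, not one where Grok-4 is close to the ceiling.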

Areas for Improvement

  • Grok-4 is significantly vulnerable to hallucination, particularly when prompted with plausible but non-existent concepts in structured domains. It ranked #103 out of 112 models (9th percentile) in Hallucination Probe: Plausible Non-Existent Concepts with a score of 0.698, which is 0.066 below peers. This is a critical flaw for factual reliability.

  • The model struggles with precise numerical details and multi-layered conditional logic within legal texts. Even in EU Artificial Intelligence Act (Regulation (EU) 2024/1689), where it ranked #1 overall, it had difficulty with entry-into-force dates and systemic risk classification.

  • Grok-4 shows a notable weakness in providing up-to-date information in rapidly evolving technical domains, such as software library changes, as indicated by its outdated response regarding pandas.DataFrame.append in Confidence in High-Stakes Domains.
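
As background on the pandas item: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, with `pandas.concat` as the documented replacement. The card does not quote the model's response, so the sketch below only shows the current idiom on hypothetical sample data. It is not a reconstruction of the evaluated prompt.

```python
import pandas as pd

# Hypothetical sample frames, for illustration only.
existing = pd.DataFrame({"benchmark": ["a", "b"], "score": [0.5, 0.6]})
new_rows = pd.DataFrame({"benchmark": ["c"], "score": [0.7]})

# Outdated (removed in pandas 2.0):
#   combined = existing.append(new_rows, ignore_index=True)

# Current idiom: concatenate a list of frames and renumber the index.
combined = pd.concat([existing, new_rows], ignore_index=True)

print(combined)
```

When adding many rows in a loop, collecting the pieces in a list and calling `pd.concat` once is generally preferable to concatenating on every iteration.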

Performance Summary

Runs: 24
Blueprints: 24

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Persuasiveness & Argumentation (Logos): 8.0/10 (1)
  • Proactive Safety & Harm Avoidance: 7.9/10 (14)
  • Clarity & Readability: 7.6/10 (17)
  • Tone & Style: 7.3/10 (12)

Model Variants

13 tested variants

  • grok-4 (xai:grok-4-0709), updated 8/17/2025