MODEL CARD: GROK-4
TL;DR
Grok-4 is a highly capable model excelling in complex legal, administrative, and policy-related information synthesis, particularly when given clear contextual prompts. However, its significant propensity for confident hallucination in factual and technical domains, coupled with inconsistent safety in mental health crisis scenarios without explicit prompting, makes it a high-risk choice for applications where absolute factual accuracy and robust safety guardrails are paramount.
Strengths
Grok-4 demonstrates exceptional performance in long-form, nuanced factual question answering, achieving #1 rank in ASQA Longform 40 with a score of 0.519, significantly outperforming peers by +0.218. It excels at identifying and addressing ambiguity, providing multi-faceted answers.
The model exhibits superior understanding and articulation of complex legal and regulatory frameworks, ranking #1 in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) (0.891 score, +0.238 vs peers) and India's Right to Information (RTI) Act: Core Concepts (0.947 score, +0.211 vs peers). It provides comprehensive, accurate, and well-structured legal explanations.
Grok-4 is highly effective in administrative task completion and providing procedural guidance, securing #1 rank in California Public-Sector Task Benchmark (0.820 score, +0.122 vs peers). Its responses are often comprehensive, actionable, and anticipate user needs.
Areas for Improvement
Grok-4 exhibits a significant vulnerability to hallucination, particularly when prompted with plausible but non-existent concepts in structured domains. It ranked #103 out of 112 models (9th percentile) in Hallucination Probe: Plausible Non-Existent Concepts with a score of 0.698, underperforming peers by -0.066. This is a critical flaw for factual reliability.
The model struggles with precise numerical details and multi-layered conditional logic within legal texts, as evidenced by its performance in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) where it showed difficulty with entry-into-force dates and systemic risk classification.
Grok-4 shows a notable weakness in providing up-to-date information in rapidly evolving technical domains, such as software library changes, as indicated by its outdated response regarding
pandas.DataFrame.append
in Confidence in High-Stakes Domains.
Behavioral Patterns
Grok-4 consistently excels in tasks requiring comprehensive, structured, and detailed information retrieval and generation, particularly evident in legal and administrative domains like ASQA Longform 40, California Public-Sector Task Benchmark, EU Artificial Intelligence Act (Regulation (EU) 2024/1689), Geneva Conventions, Indian Constitution (Limited), and India's Right to Information (RTI) Act: Core Concepts.
The model demonstrates a strong ability to adapt and improve performance when provided with explicit contextual system prompts, as observed in the Sri Lanka Contextual Prompts and Student Homework Help Heuristics evaluations. This suggests it benefits significantly from clear guidance on persona or geographical context.
Key Risks
Deploying Grok-4 in applications requiring absolute factual accuracy, especially in niche or rapidly evolving technical/scientific domains, carries a high risk of hallucination and misinformation, as demonstrated by its poor performance in Hallucination Probe: Plausible Non-Existent Concepts and specific prompts in Confidence in High-Stakes Domains.
Using Grok-4 for mental health crisis intervention without extremely robust and explicit system prompts is a significant safety risk. Its baseline behavior can lead to providing harmful information or failing to adequately redirect users in suicidal ideation scenarios, as evidenced in Mental Health Safety & Global Nuance and Stanford HAI Mental Health Safety & Global Nuance.
Performance Summary
Top Dimensional Strengths
Highest rated capabilities across 4 dimensions
Top Evaluations
Best performances across 10 evaluations
+5 more evaluations
Model Variants
13 tested variants
Worst Evaluations
Prompts where this model underperformed peers the most (most negative delta).