MODEL CARD: GROK-4

Overall Score (aggregate, grok-4): 82.5%

TL;DR

Grok-4 is a highly capable model that excels at synthesizing complex legal, administrative, and policy-related information, particularly when given clear contextual prompts. However, it has a significant propensity for confident hallucination in factual and technical domains, and its safety behavior in mental health crisis scenarios is inconsistent unless it is explicitly prompted. Together these make it a high-risk choice for applications where absolute factual accuracy and robust safety guardrails are paramount.

Strengths

Grok-4 demonstrates exceptional performance in long-form, nuanced factual question answering, ranking #1 on ASQA Longform 40 with a score of 0.519 (+0.218 vs peers). It excels at identifying and addressing ambiguity and provides multi-faceted answers.
The model exhibits superior understanding and articulation of complex legal and regulatory frameworks, ranking #1 on EU Artificial Intelligence Act (Regulation (EU) 2024/1689) (0.891 score, +0.238 vs peers) and India's Right to Information (RTI) Act: Core Concepts (0.947 score, +0.211 vs peers). It provides comprehensive, accurate, and well-structured legal explanations.
Grok-4 is highly effective at administrative task completion and procedural guidance, ranking #1 on California Public-Sector Task Benchmark (0.820 score, +0.122 vs peers). Its responses are often comprehensive and actionable, and they anticipate user needs.

Areas for Improvement

Grok-4 exhibits a significant vulnerability to hallucination, particularly when prompted with plausible but non-existent concepts in structured domains. It ranked #103 out of 112 models (9th percentile) in Hallucination Probe: Plausible Non-Existent Concepts with a score of 0.698, underperforming peers by 0.066. This is a critical flaw for factual reliability.
The model struggles with precise numerical details and multi-layered conditional logic within legal texts. Although it ranks #1 overall on EU Artificial Intelligence Act (Regulation (EU) 2024/1689), it showed difficulty there with entry-into-force dates and systemic risk classification.
Grok-4 shows a notable weakness in providing up-to-date information in rapidly evolving technical domains, such as software library changes. In Confidence in High-Stakes Domains, its response regarding pandas.DataFrame.append was outdated.
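
For context on the pandas.DataFrame.append item: the evaluation summary does not quote the model's response, so the following is only a sketch of the library change that most likely makes an answer "outdated". DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, and pandas.concat is the documented replacement. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical example data; not taken from the evaluation.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
new_row = pd.DataFrame([{"id": 3, "name": "c"}])

# Outdated pattern (removed in pandas 2.0, raises AttributeError there):
#   combined = df.append(new_row, ignore_index=True)

# Current pattern: concatenate a list of frames and rebuild the index.
combined = pd.concat([df, new_row], ignore_index=True)

print(combined)
```

An answer that still recommends df.append without mentioning its removal would fail on any pandas 2.x installation, which is the kind of staleness this item describes.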

Behavioral Patterns

Grok-4 consistently excels in tasks requiring comprehensive, structured, and detailed information retrieval and generation. This is most evident in long-form question answering and in legal and administrative evaluations: ASQA Longform 40, California Public-Sector Task Benchmark, EU Artificial Intelligence Act (Regulation (EU) 2024/1689), Geneva Conventions, Indian Constitution (Limited), and India's Right to Information (RTI) Act: Core Concepts.
The model's performance improves markedly when it is given explicit contextual system prompts, as observed in the Sri Lanka Contextual Prompts and Student Homework Help Heuristics evaluations. This suggests it benefits significantly from clear guidance on persona or geographical context.

Key Risks

Deploying Grok-4 in applications requiring absolute factual accuracy, especially in niche or rapidly evolving technical/scientific domains, carries a high risk of hallucination and misinformation, as demonstrated by its poor performance in Hallucination Probe: Plausible Non-Existent Concepts and specific prompts in Confidence in High-Stakes Domains.
Using Grok-4 for mental health crisis intervention without robust, explicit system prompts is a significant safety risk. Its baseline behavior can lead it to provide harmful information or to fail to adequately redirect users in suicidal ideation scenarios, as evidenced in Mental Health Safety & Global Nuance and Stanford HAI Mental Health Safety & Global Nuance.