MODEL CARD: O4-MINI

o4-mini (aggregate)
Overall Score: 72.3%

Strengths

  • The model performs strongly on tasks that require comprehensive, structured information about public-sector processes, ranking in the 91st percentile and outperforming peers on the California Public-Sector Task Benchmark.

  • The model handles sensitive mental health crisis scenarios with high empathy and safety, ranking 2nd out of 46 models and significantly outperforming peers on Mental Health Safety & Global Nuance.

  • The model shows robust performance in understanding and articulating complex socio-economic issues related to platform work, scoring 92.7% and ranking 6th of 18 models on Platform Workers in Southeast Asia.
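
The strengths above mix three kinds of figures: a percentile rank, ordinal ranks ("2nd out of 46", "6th among 18"), and a raw score. The card does not say how its percentile is computed, so the following is only a minimal sketch of one common convention for putting ordinal ranks on a percentile scale. The function name `percentile_from_rank` and the formula are assumptions for illustration, not the card's method.

```python
def percentile_from_rank(rank: int, total: int) -> float:
    """Share of the *other* models that this model outranks, in percent.

    Assumes rank 1 is best and there are no ties. This is one common
    convention, not necessarily the one used by the model card.
    """
    if total < 2 or not 1 <= rank <= total:
        raise ValueError("need total >= 2 and 1 <= rank <= total")
    return 100.0 * (total - rank) / (total - 1)


# Ranks quoted in the Strengths list above.
print(round(percentile_from_rank(2, 46), 1))  # 97.8
print(round(percentile_from_rank(6, 18), 1))  # 70.6
```

The 91st-percentile figure for the California Public-Sector Task Benchmark cannot be reproduced this way, because the card gives neither the underlying rank nor the number of models compared.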

Areas for Improvement

Behavioral Patterns

Key Risks

  • Deploying the model in domains that demand high factual integrity (e.g., legal research, scientific discovery) carries a significant risk of generating and spreading misinformation when prompts involve non-existent or subtly fabricated concepts, given its poor performance in Hallucination Probe: Plausible Non-Existent Concepts.

  • Using the model for financial advice, especially in region-specific contexts such as Brazil's PIX system, could produce inaccurate or unhelpful guidance and potentially cause financial harm to users, as indicated by its underperformance in Brazil PIX: Consumer Protection & Fraud Prevention.

Performance Summary

Evaluations: 20
Blueprints: 20

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

Proactive Safety & Harm Avoidance: 8.6/10 (11)
Persuasiveness & Argumentation (Logos): 8.0/10 (2)
Clarity & Readability: 7.9/10 (14)
Tone & Style: 7.8/10 (12)

Model Variants

8 tested variants

o4-mini (updated 8/6/2025)