MODEL CARD: GPT-4O

Model: gpt-4o (aggregate)
Overall Score: 65.6%

TL;DR

The gpt-4o model is a mediocre performer overall, consistently ranking in the bottom half of models and often significantly underperforming its peers. While it shows competence in non-discriminatory scoring and basic financial scam identification, its critical weaknesses in handling complex legal and localized information, coupled with concerning safety lapses in mental health scenarios when not explicitly prompted, make it a high-risk choice for sensitive or nuanced applications.

Strengths

  • Shows competence in non-discriminatory scoring.

  • Handles basic financial scam identification competently.

Areas for Improvement

  • Consistently weak on tasks requiring nuanced, long-form responses to ambiguous factoid questions, ranking #15 of 23 models and significantly underperforming peers in ASQA Longform 40.

  • Struggles with complex legal and regulatory documents, ranking in the bottom 6th percentile and significantly underperforming peers in the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) evaluation, particularly on precise numerical details and multi-layered conditional logic.

  • Performs poorly on tasks requiring specific, localized agricultural knowledge or material lists, often returning coverage scores of 0.000, as highlighted in the DigiGreen Agricultural Q&A with Video Sources evaluation, where it significantly underperformed peers.

Behavioral Patterns

Key Risks

  • Critical weaknesses in handling complex legal, regulatory, and localized information make it unsuitable for nuanced, domain-specific applications.

  • Safety lapses in mental health scenarios when not explicitly prompted make it a high-risk choice for sensitive applications.

Performance Summary

  • Runs: 24
  • Blueprints: 24

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.0/10 (66)
  • Persuasiveness & Argumentation (Logos): 8.0/10 (3)
  • Clarity & Readability: 7.9/10 (79)
  • Tone & Style: 7.5/10 (54)

Model Variants

23 tested variants, including:

  • openai:gpt-4o-2024-05-13
  • openai:gpt-4o-2024-08-06
  • +3 more (not listed)
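
The variant identifiers above follow a provider-prefixed naming scheme (provider:model). As a minimal, hypothetical sketch, assuming the OpenAI Python SDK and that the portion after the colon maps directly to an OpenAI model name, one way such an identifier could be resolved into an API call:

    # Hypothetical helper for illustration only; the evaluation harness behind
    # this model card may resolve variant identifiers differently.
    from openai import OpenAI  # official OpenAI Python SDK (v1+)

    def call_variant(variant_id: str, prompt: str) -> str:
        # Split "provider:model" into its two parts; only the openai provider
        # is handled in this sketch.
        provider, model = variant_id.split(":", 1)
        if provider != "openai":
            raise ValueError(f"unsupported provider: {provider}")

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Example usage against one of the tested variants listed above:
    # call_variant("openai:gpt-4o-2024-05-13", "Summarize Regulation (EU) 2024/1689 in one sentence.")
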
Updated 8/12/2025