MODEL CARD: GPT-4O
TL;DR
The gpt-4o model is a mediocre performer overall, consistently ranking in the bottom half of tested models and often significantly underperforming its peers. It shows competence in applying non-discriminatory criteria and in identifying basic financial scams, but its weaknesses on complex legal and localized information, together with safety lapses in mental health scenarios when no explicit system prompt is supplied, make it a high-risk choice for sensitive or nuanced applications.
Strengths
Excels in tasks requiring the application of non-discriminatory criteria, ranking in the 94th percentile and outperforming peers on Latent Discrimination in Public Housing Tenancy Applications with a near-perfect score of 0.999.
Demonstrates strong performance in identifying common financial scams, advising against them, and recommending immediate countermeasures, matching peer performance on Brazil PIX: Consumer Protection & Fraud Prevention.
Shows good capability in providing structured, step-by-step guides for administrative processes, as noted in the executive summary for the California Public-Sector Task Benchmark.
Areas for Improvement
Consistently underperforms on nuanced, long-form answers to ambiguous factoid questions, ranking #15 of 23 models and significantly trailing peers on ASQA Longform 40.
Struggles significantly with complex legal and regulatory documents, ranking in the 6th percentile and significantly underperforming peers on the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) evaluation, particularly on precise numerical details and multi-layered conditional logic.
Performs poorly on tasks requiring specific, localized agricultural knowledge or material lists, often returning coverage scores of 0.000, as highlighted in DigiGreen Agricultural Q&A with Video Sources, where it significantly underperformed peers.
Behavioral Patterns
The model relies heavily on explicit system prompts for optimal performance, particularly in tasks requiring a specific persona or safety adherence. Without such prompts, its behavior becomes inconsistent or even unsafe, as seen in Mental Health Safety & Global Nuance and Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios.
It also tends to struggle with questions requiring highly specific, localized, or real-time data, often falling back on generic advice or disclaimers, as observed in DigiGreen Agricultural Q&A with Video Sources and the California Public-Sector Task Benchmark.
Key Risks
Deploying this model for high-stakes mental health crisis intervention without stringent, explicit system prompts could lead to severe safety failures, including providing harmful information or reinforcing dangerous ideation, as demonstrated in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios.
Relying on this model for legal or regulatory compliance in complex domains such as the EU AI Act could produce significant inaccuracies and non-compliance, given its demonstrated weakness with nuanced legal provisions and numerical thresholds, as seen in the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) evaluation.
Performance Summary
Top Dimensional Strengths
Highest-rated capabilities across 4 dimensions
Model Variants
23 tested variants
Worst Evaluations
Prompts where this model underperformed peers the most (largest negative delta).