MODEL CARD: GPT-5

aggregate
gpt-5
78.2%
Overall Score

TL;DR

GPT-5 is a powerful, generally high-performing model, particularly strong in specialized factual domains and instruction adherence when explicitly prompted. However, its significant safety vulnerabilities in mental health crisis scenarios, coupled with a tendency to hallucinate in complex factual domains and exhibit subtle biases, make it unsuitable for high-stakes applications requiring absolute reliability, nuanced empathy, or unbiased decision-making without extensive, domain-specific fine-tuning and robust external guardrails.

Strengths

  • The model excels in providing comprehensive and accurate advice in specialized regulatory and financial safety domains, achieving a #1 rank and significantly outperforming peers by +0.303 points in Brazil PIX: Consumer Protection & Fraud Prevention.

  • It demonstrates strong capabilities in International Humanitarian Law (IHL), ranking #5 (93rd percentile) and significantly outperforming peers by +0.135 points in Geneva Conventions, showcasing robust understanding and recall of complex legal principles.

  • The model exhibits strong instruction adherence and robustness against prompt injection and negative constraints, particularly when explicit system prompts are provided, as evidenced by its #4 rank (95th percentile) in System Adherence & Resilience.

Areas for Improvement

  • The model struggles significantly with long-form question answering for ambiguous or context-deficient prompts, ranking #14 (57th percentile) and only marginally matching peer performance in ASQA Longform 40.

  • It shows a critical safety vulnerability in mental health crisis scenarios, particularly when users express direct self-harm planning, where it can engage with harmful requests or provide inappropriate information, as seen in its lowest coverage delta of -0.250 in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios.

  • The model demonstrates a tendency to hallucinate plausible but non-existent concepts, especially in highly structured factual domains like legal cases or complex scientific mechanisms, ranking #60 (47th percentile) in Hallucination Probe: Plausible Non-Existent Concepts.

Behavioral Patterns

  • The model exhibits a strong dependency on explicit system prompts for optimal performance, especially in tasks requiring specific personas, safety adherence, or localized knowledge. Without such prompts, its performance can degrade significantly, often defaulting to generic or less appropriate responses, as seen in Sri Lanka Contextual Prompts and Student Homework Help Heuristics.

  • There is a consistent pattern of high performance in factual recall and structured information retrieval, particularly when the information is stable and well-documented. This is evident across various domains like regulatory compliance in Confidence in High-Stakes Domains and IHL in Geneva Conventions.

Key Risks

  • Deploying the model in mental health crisis support applications without extremely robust, hard-coded safety guardrails is highly risky due to its demonstrated tendency to engage with harmful self-harm planning prompts, potentially providing dangerous information, as highlighted in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios.

  • Using the model for applications requiring precise, up-to-the-minute technical or regulatory information (e.g., software library changes, specific financial regulations) carries a risk of providing outdated or inaccurate advice, potentially leading to operational errors or non-compliance, as observed in Confidence in High-Stakes Domains.