MODEL CARD: GEMINI-2.5-FLASH

Model: gemini-2.5-flash
Overall Score (aggregate): 75.0%

Strengths

  • The model demonstrates exceptional performance in administrative task completion and financial safety advice, achieving a #1 rank (100th percentile) in Brazil PIX: Consumer Protection & Fraud Prevention by significantly outperforming peers (+0.237 points). Its ability to provide detailed, multi-step action plans and explain underlying regulatory contexts is a clear advantage.

  • The model excels in factual recall and comprehensive explanation of legal concepts, ranking #3 (89th percentile) in Indian Constitution (Limited) and #3 (93rd percentile) in India's Right to Information (RTI) Act: Core Concepts, consistently outperforming peers in these domains.

  • The model shows strong capabilities in providing structured and informative responses for complex socio-economic issues, achieving a #1 rank (100th percentile) in Platform Workers in Southeast Asia with a significant lead over peers (+0.084 points).

Areas for Improvement

  • The model significantly underperforms in long-form, nuanced question answering, ranking #22 out of 23 models (9th percentile) in ASQA Longform 40. It struggles to handle ambiguity and to provide comprehensive, multi-faceted answers, often delivering overly brief or incomplete responses.

  • The model exhibits a critical weakness in handling specific legal information, frequently hallucinating content or attributing provisions to the wrong international instruments, as seen in its "SIGNIFICANTLY UNDERPERFORMED" rating and #14 rank (13th percentile) in the African Charter (Banjul) Evaluation Pack.

  • The model shows inconsistent performance in providing comprehensive details for complex eligibility criteria or grievance redressal mechanisms, particularly in the maternal health domain: it scored 0.583 in Maternal Health Entitlements in Uttar Pradesh, India, a low score that nonetheless matched peer performance.
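
The percentile figures in this card appear consistent with counting the share of models ranked at or below the given model. That formula is an inference from the reported numbers, not something the card states. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def percentile_from_rank(rank: int, total_models: int) -> int:
    """Return the percentile implied by a rank, ASSUMING the percentile is the
    share of models ranked at or below this one (rank 1 is best).

    This formula is an assumption inferred from the card's figures; the
    evaluation platform may compute percentiles differently.
    """
    if not 1 <= rank <= total_models:
        raise ValueError("rank must be between 1 and total_models")
    # Models at or below this rank, including the model itself.
    at_or_below = total_models - rank + 1
    return round(100 * at_or_below / total_models)


if __name__ == "__main__":
    # ASQA Longform 40: rank #22 out of 23 models.
    print(percentile_from_rank(22, 23))
    # A #1 rank in any evaluation.
    print(percentile_from_rank(1, 23))
```

Under this assumption, rank #22 of 23 yields 9 and a #1 rank yields 100, matching the figures reported above. The other percentiles cannot be checked this way because the card does not give the number of models in those evaluations.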

Behavioral Patterns

  • The model exhibits a strong dependency on explicit system prompts for optimal performance in nuanced or persona-driven tasks. For instance, in Student Homework Help Heuristics, the presence of a "teacher" system prompt dramatically improved its adherence to Socratic guidance, while without it, the model often provided direct answers.

  • The model consistently performs better on tasks requiring structured, procedural knowledge and factual recall, such as administrative processes or direct legal article summaries. This is evident in its strong performance in Brazil PIX: Consumer Protection & Fraud Prevention and India's Right to Information (RTI) Act: Core Concepts.

Key Risks

  • Deploying the model for tasks requiring comprehensive, nuanced, or ambiguous long-form responses (e.g., content generation, detailed reports) carries a high risk of insufficient depth and incompleteness, as evidenced by its poor performance in ASQA Longform 40.

  • Using the model in legal or regulatory contexts, especially those involving specific international or regional legal frameworks, poses a significant risk of factual inaccuracies and hallucinations, as demonstrated by its severe underperformance in African Charter (Banjul) Evaluation Pack.

Performance Summary

Evaluations: 22
Blueprints: 22

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.6/10 (11)
  • Persuasiveness & Argumentation (Logos): 8.0/10 (2)
  • Clarity & Readability: 7.9/10 (14)
  • Tone & Style: 7.8/10 (12)

Model Variants

9 tested variants

gemini-2.5-flash-preview-05-20
gemini-2.5-flash
Updated 8/6/2025