MODEL CARD: GEMINI-2.5-PRO

Overall Score (aggregate, gemini-2.5-pro): 73.3%

Strengths

  • The model excels at comprehensive, nuanced long-form answers and often identifies and addresses ambiguities inherent in the question. It ranked #2 in "ASQA Longform 40", where it outperformed peers by +0.120 points.

  • The model handles complex administrative processes well and gives actionable, multi-step instructions. It ranked #4 in the "California Public-Sector Task Benchmark" (87th percentile) and provided detailed guidance for DMV and LLC formation tasks.

  • The model performs exceptionally well in high-stakes domains, consistently refusing to provide harmful medical or financial advice. It ranked #1 in "Confidence in High-Stakes Domains" (100th percentile) and explicitly stated its limitations for pediatric dosing and stock predictions.

Areas for Improvement

  • The model significantly underperforms in agricultural Q&A, particularly on specific, localized, or material-list questions. It ranked #38 out of 38 models (3rd percentile) in "DigiGreen Agricultural Q&A with Video Sources", with frequent 0.000 coverage scores.

  • The model struggles with specific legal frameworks and historical context. In the "Indian Constitution (Limited)" blueprint it scored 0.733, ranking #18 out of 18 models (6th percentile), and gave truncated or shallow responses for complex articles.

  • The model's coverage of specific maternal health entitlements in Uttar Pradesh, India, has gaps, particularly around nuanced eligibility criteria and grievance redressal mechanisms. It only matched peer performance (47th percentile), and the "Maternal Health Entitlements in Uttar Pradesh, India" executive summary notes omissions.
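
The card does not say how ranks are converted to percentiles. As a minimal sketch, assuming the percentile is the share of models ranked at or below the given rank (rank 1 being best), rounded to the nearest whole number, the two figures above with a known model count can be reproduced. The function name `rank_to_percentile` and the formula itself are assumptions for illustration, not the benchmark's documented method.

```python
def rank_to_percentile(rank: int, total: int) -> int:
    """Share of models ranked at or below `rank`, as a rounded percentage.

    Assumed convention (not stated in the card): rank 1 is the best model.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round((total - rank + 1) / total * 100)


if __name__ == "__main__":
    # "DigiGreen Agricultural Q&A with Video Sources": #38 out of 38
    print(rank_to_percentile(38, 38))  # 3
    # "Indian Constitution (Limited)": #18 out of 18
    print(rank_to_percentile(18, 18))  # 6
```

Under this assumption a last-place model never reaches the 0th percentile, which is consistent with the 3rd and 6th percentile figures reported for last-place finishes.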

Behavioral Patterns

  • The model consistently performs well on tasks requiring comprehensive, long-form answers and structured information, as shown by its #2 rank in "ASQA Longform 40" (96th percentile) and its #4 rank in the "California Public-Sector Task Benchmark" (87th percentile).

  • The model is notably sensitive to explicit system prompts: performance improves significantly when clear instructions are provided. In the "Student Homework Help Heuristics" blueprint, for example, the "teacher" persona prompt shifted its behavior markedly toward Socratic guidance.

Key Risks

  • Deploying the model for highly specialized or localized agricultural advisory services could yield significant factual inaccuracies or no relevant information at all, potentially causing financial losses or operational inefficiencies for users, given its 3rd percentile rank in "DigiGreen Agricultural Q&A with Video Sources".

  • Using the model for legal advice or detailed analysis concerning the Indian Constitution could produce incomplete or misleading information, and in turn incorrect legal interpretations or actions, given its 6th percentile rank in "Indian Constitution (Limited)".

Performance Summary

Evaluations: 21
Blueprints: 21

Top Performance Areas

  • _periodic: 94.3%
  • consistency: 93.2%
  • meta evaluation: 93.2%

Model Variants

5 tested variants

gemini-2.5-pro-preview-05-06
gemini-2.5-pro
Updated 8/4/2025