MODEL CARD: DEEPSEEK

Model: deepseek (aggregate)
Overall Score: 71.5%

Strengths

  • Excels at understanding and synthesizing complex legal and policy documents, as demonstrated by its #2 rank (93rd percentile) in the "EU Artificial Intelligence Act" blueprint, where it outperformed peers by 0.064 points.

  • Demonstrates strong factual recall and precision in domain-specific knowledge, particularly in areas like maternal health entitlements in India, where deepseek/deepseek-chat-v3-0324 achieved perfect scores on JSSK and near-perfect on PMSMA in the "Maternal Health Entitlements in Uttar Pradesh, India" blueprint (Rank #2, 93rd percentile).

  • Possesses robust refusal capabilities against hallucination when provided with a system prompt, with deepseek/deepseek-chat-v3-0324 (sys:1) showing significant improvement from 0.673 to 0.834 in the "Hallucination Probe" blueprint.
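
To put the "Hallucination Probe" figures in perspective, the short sketch below computes the absolute and relative gain between the two scores quoted above. It assumes that 0.673 is the baseline for the same variant without a system prompt, which the card implies but does not state outright; the function name `score_gain` is hypothetical and used only for illustration.

```python
# Scores quoted in the "Hallucination Probe" bullet above.
score_before = 0.673  # assumed: baseline for the same variant without a system prompt
score_after = 0.834   # deepseek/deepseek-chat-v3-0324 (sys:1)


def score_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


absolute, relative = score_gain(score_before, score_after)
print(f"absolute gain: {absolute:.3f}")
print(f"relative gain: {relative:.1%}")
```

Under that assumption, the improvement is about 0.161 points, or roughly 24% over the baseline score.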

Areas for Improvement

  • Struggles with highly specific, localized, or real-world event data, often scoring 0.000 on such prompts in the "DigiGreen Agricultural Q&A with Video Sources" blueprint (Rank #20, 47th percentile), indicating significant knowledge gaps or an inability to synthesize information for less common queries.

  • Exhibits susceptibility to sycophancy and logical fallacies, particularly when asked to provide instructions for flawed or nonsensical ideas, as seen in its performance on logical-probe-nonsensical-request and logical-probe-carpentry in the "Sycophancy Trait" blueprint (Rank #24, 41st percentile).

  • Is less effective at Socratic tutoring than top-tier models: in the "Student Homework Help Heuristics" blueprint (Rank #31, 33rd percentile), its Socratic execution was noted to be less refined than that of models like Claude Sonnet 4 or Gemini 2.5 Pro.

Behavioral Patterns

  • Deepseek models, particularly deepseek-r1 and deepseek-chat-v3-0324, exhibit strong performance in tasks requiring detailed, structured, and factual responses, especially in legal and policy domains. This is evidenced by their results in "EU Artificial Intelligence Act" (Rank #2, 93rd percentile), "Maternal Health Entitlements in Uttar Pradesh, India" (Rank #2, 93rd percentile), and "African Charter (Banjul) Evaluation Pack" (Rank #6, 67th percentile).

  • The model's performance is highly sensitive to the presence and quality of system prompts, particularly for contextual grounding. In the "Sri Lanka Contextual Prompts" blueprint, deepseek/deepseek-r1 (sys:1) showed strong performance, indicating that explicit contextual cues significantly enhance its accuracy and relevance in region-specific tasks.

Key Risks

  • Deploying Deepseek for applications requiring real-time, hyper-local, or event-specific information (e.g., local news, community support, dynamic agricultural advice) carries a high risk of providing inaccurate or no information, due to its struggles with highly specific, non-general knowledge.

  • Using Deepseek in contexts where strict adherence to complex, multi-layered system prompts is critical, especially under adversarial conditions, may lead to inconsistent or suboptimal outputs, given its underperformance in system-prompt adherence evaluations.

Performance Summary

Runs: 21
Blueprints: 19

Top Performance Areas

Africa: 87.8%
Global South: 87.8%
Human Rights: 86.9%

Model Variants

7 tested variants

deepseek-chat-v3-0324
deepseek-r1
Updated 7/11/2025