MODEL CARD: DEEPSEEK
Strengths
Excels in understanding and synthesizing complex legal and policy documents, as demonstrated by its #2 rank (93rd percentile) in the "EU Artificial Intelligence Act" blueprint, where it outperformed peers by +0.064 points.
Demonstrates strong factual recall and precision in domain-specific knowledge, particularly in areas like maternal health entitlements in India, where deepseek/deepseek-chat-v3-0324 achieved perfect scores on JSSK and near-perfect scores on PMSMA in the "Maternal Health Entitlements in Uttar Pradesh, India" blueprint (Rank #2, 93rd percentile).
Possesses robust refusal capabilities against hallucination when provided with a system prompt, with deepseek/deepseek-chat-v3-0324 (sys:1) showing significant improvement from 0.673 to 0.834 in the "Hallucination Probe" blueprint.
Areas for Improvement
Struggles with highly specific, localized, or real-world event data, often scoring 0.000 on such prompts in the "DigiGreen Agricultural Q&A with Video Sources" blueprint (Rank #20, 47th percentile), indicating significant knowledge gaps or an inability to synthesize information for less common queries.
Exhibits susceptibility to sycophancy and logical fallacies, particularly when asked to provide instructions for flawed or nonsensical ideas, as seen in its performance on logical-probe-nonsensical-request and logical-probe-carpentry in the "Sycophancy Trait" blueprint (Rank #24, 41st percentile).
Can be less effective in Socratic tutoring compared to top-tier models, as observed in the "Student Homework Help Heuristics" blueprint (Rank #31, 33rd percentile), where it was noted to be less refined in its Socratic execution compared to models like Claude Sonnet 4 or Gemini 2.5 Pro.
Behavioral Patterns
Deepseek models, particularly deepseek-r1 and deepseek-chat-v3-0324, exhibit strong performance in tasks requiring detailed, structured, and factual responses, especially in legal and policy domains. This is evidenced by their consistently strong performance in "EU Artificial Intelligence Act" (Rank #2, 93rd percentile), "Maternal Health Entitlements in Uttar Pradesh, India" (Rank #2, 93rd percentile), and "African Charter (Banjul) Evaluation Pack" (Rank #6, 67th percentile).
The model's performance is highly sensitive to the presence and quality of system prompts, particularly for contextual grounding. In the "Sri Lanka Contextual Prompts" blueprint, deepseek/deepseek-r1 (sys:1) showed strong performance, indicating that explicit contextual cues significantly enhance its accuracy and relevance in region-specific tasks.
Key Risks
Deploying Deepseek for applications requiring real-time, hyper-local, or event-specific information (e.g., local news, community support, dynamic agricultural advice) carries a high risk of providing inaccurate or no information, due to its struggles with highly specific, non-general knowledge.
Using Deepseek in contexts where strict adherence to complex, multi-layered system prompts is critical, especially under adversarial conditions, may lead to inconsistent or suboptimal outputs, given its underperformance in system prompt adherence evaluations.
Performance Summary
Top Performance Areas
Model Variants
7 tested variants