MODEL CARD: DEEPSEEK-R1

deepseek-r1 (aggregate)
Overall Score: 72.6%

Strengths

  • The model excels at identifying and addressing ambiguity in questions and gives comprehensive, nuanced long-form answers. It often outperforms peers on "Coverage Score", for example in ASQA Longform 40, where it ranked #5 out of 23 models.

  • It gives comprehensive, highly detailed, and actionable responses for multi-step administrative processes such as DMV registration renewal and LLC formation, as shown by its #7 rank in the California Public-Sector Task Benchmark.

  • The model shows strong adherence to safety principles by refusing to provide medical or financial advice in high-stakes scenarios, as highlighted in Confidence in High-Stakes Domains, where it ranked #13 out of 25 models.

Areas for Improvement

  • The model underperforms significantly in tasks requiring Socratic tutoring or guided learning, frequently providing direct answers instead of facilitating learning. It ranked last (#52 out of 52) in Student Homework Help Heuristics.

  • The model struggles with highly specific, localized agricultural questions and with questions that require real-world event data, often returning low-coverage answers or disclaimers, as seen in DigiGreen Agricultural Q&A with Video Sources, where it ranked #20 out of 38 models.

  • The model shows a significant propensity to hallucinate detailed, plausible-sounding information about non-existent concepts, particularly in technical or legal domains. It ranked #87 out of 104 models in Hallucination Probe: Plausible Non-Existent Concepts.
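
The ranks quoted above come from leaderboards of different sizes, so they are easier to compare once converted to percentiles. The following is a minimal sketch in Python, assuming rank 1 is best and defining the percentile as the share of models ranked strictly lower. The function name `rank_percentile` and this definition are illustrative assumptions, not part of the evaluation platform.

```python
def rank_percentile(rank: int, total: int) -> float:
    """Percentage of the field ranked below this model (rank 1 is best)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank) / total


# (rank, field size) pairs as quoted in this card
ranks = {
    "ASQA Longform 40": (5, 23),
    "Confidence in High-Stakes Domains": (13, 25),
    "Student Homework Help Heuristics": (52, 52),
    "DigiGreen Agricultural Q&A with Video Sources": (20, 38),
    "Hallucination Probe: Plausible Non-Existent Concepts": (87, 104),
}

for name, (rank, total) in ranks.items():
    pct = rank_percentile(rank, total)
    print(f"{name}: #{rank}/{total}, ahead of {pct:.1f}% of models")
```

The California Public-Sector Task Benchmark is left out because the card gives its rank (#7) without a field size.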

Key Risks

  • Deploying this model in educational settings for Socratic or guided learning could frustrate users and undermine learning objectives, because it tends to provide direct answers rather than facilitate discovery, as demonstrated in Student Homework Help Heuristics.

  • Using this model for applications that require high factual integrity in novel or obscure domains (e.g., emerging scientific concepts, niche historical events) carries a significant risk of hallucination and could spread convincing but false information, as highlighted in Hallucination Probe: Plausible Non-Existent Concepts.

Performance Summary

Evaluations: 20
Blueprints: 20

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 8.6/10 (11)

  • Persuasiveness & Argumentation (Logos): 8.0/10 (2)

  • Clarity & Readability: 7.9/10 (14)

  • Tone & Style: 7.8/10 (12)

Model Variants

8 tested variants

deepseek-r1 (updated 8/6/2025)