🤔 Most Differentiating Prompt (σ = 0.23)
User: A military commander is planning an airstrike on a factory producing artillery shells for the enemy, a clear military objective. The factory is located 200 meters from a civilian apartment building. The commander's intelligence indicates that the strike, while effective, is highly likely to cause si...
👯 Most Semantically Similar Pair: openai/gpt-4o-mini vs openai/gpt-4o (93.7%)
🏆 Best Hybrid Score: x-ai/grok-3 (81.8%)
📉 Worst Hybrid Score: openai/gpt-4.1-nano (64.1%)
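The "most differentiating" prompt is the one whose scores vary most across models (here, σ = 0.23). A minimal sketch of that selection, assuming hybrid scores are held in a prompt-to-model mapping (all names and values below are hypothetical, not taken from this report):

```python
from statistics import pstdev

# Hypothetical hybrid scores (0-1 scale) per prompt, per model.
scores = {
    "ihl-proportionality-airstrike": {"model-a": 0.82, "model-b": 0.55, "model-c": 0.70},
    "api-article-36-new-weapons-review": {"model-a": 0.78, "model-b": 0.74, "model-c": 0.76},
}

def most_differentiating(scores):
    """Return the prompt with the largest population StdDev of scores across models."""
    sigmas = {p: pstdev(by_model.values()) for p, by_model in scores.items()}
    prompt = max(sigmas, key=sigmas.get)
    return prompt, sigmas[prompt]

prompt, sigma = most_differentiating(scores)
```

With the toy data above, the proportionality prompt wins because its scores spread far more widely (0.55 to 0.82) than the Article 36 prompt's (0.74 to 0.78).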
A consolidated overview of performance and semantic consistency metrics.
Overall Average Key Point Coverage: 72.8% (±24.1%)
Grand average of all individual model-prompt key point coverage scores. The StdDev (±), also in percentage points, reflects variability around this grand mean: a smaller value suggests more consistent coverage across model-prompt pairs; a larger value indicates more diverse performance.

Avg. Prompt Coverage Range: 33% to 100% (spread: 67 pp)
Range of average key point coverage across prompts, from the prompt with the lowest average coverage to the one with the highest. A large spread indicates substantial differences in how challenging the prompts were, or in how models performed on them.

StdDev of Avg. Prompt Coverage: 18.9%
Measures how much average key point coverage varies from one prompt to another. A high value (e.g., above 20-25%) suggests average performance differed markedly across prompts; a low value suggests more consistent performance from prompt to prompt.

Overall Average Hybrid Score: 73.6% (±16.0%)
Average of the hybrid scores (balancing semantic similarity to the ideal response with key point coverage) across all model-prompt pairs. Higher is generally better; a smaller StdDev suggests more consistent hybrid performance.
E.g. api-article-36-new-weapons-review: User: What obligation does Article 36 of Additional Protocol I impose on States when they study, dev...

Average Semantic Similarity to Ideal: 0.751 (±0.021)
Average semantic similarity (0-1 scale) of model responses to the ideal response; scores closer to 1.0 are better. The StdDev shows how consistently models achieve this: a very low value (e.g., below 0.05) usually means models performed very similarly on this metric.
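The hybrid score is described as balancing semantic similarity to the ideal with key point coverage. The report does not state the exact formula, so the equal 50/50 weighting below is an assumption, shown only to illustrate the idea:

```python
# Sketch of a hybrid score blending semantic similarity and key point
# coverage. The 0.5 weight is an assumption; the report does not give
# the actual weighting.
def hybrid_score(similarity: float, coverage: float, w_sim: float = 0.5) -> float:
    """similarity and coverage are on a 0-1 scale; returns a 0-1 hybrid."""
    return w_sim * similarity + (1.0 - w_sim) * coverage

# e.g. a response 0.751 similar to the ideal that covers 72.8% of key points:
score = hybrid_score(0.751, 0.728)
```

Varying `w_sim` shifts the balance toward fluency-with-the-ideal (high `w_sim`) or toward factual completeness (low `w_sim`).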
Macro Coverage Overview
Average key point coverage extent for each model across all prompts.
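The macro view is a per-model mean of coverage over all prompts. A minimal sketch, with entirely hypothetical model names and coverage values:

```python
from statistics import mean

# Hypothetical per-prompt key point coverage (0-1 scale) for each model.
coverage = {
    "model-a": [1.0, 0.67, 0.83],
    "model-b": [0.5, 0.33, 0.67],
}

# Macro average: mean coverage for each model across all prompts.
macro = {model: mean(vals) for model, vals in coverage.items()}
```

Note this is a macro average (every prompt weighted equally), not a micro average over individual key points, which would weight prompts with more key points more heavily.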
Model Similarity Dendrogram
Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
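A dendrogram like this is typically produced by agglomerative clustering on pairwise distances derived from response similarity. A sketch using SciPy's average-linkage clustering, with a hypothetical similarity matrix (the values below are illustrative, not the report's actual pairwise data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise response-similarity matrix for four models (0-1 scale).
models = ["gpt-4o", "gpt-4o-mini", "grok-3", "gpt-4.1-nano"]
sim = np.array([
    [1.000, 0.937, 0.80, 0.78],
    [0.937, 1.000, 0.79, 0.77],
    [0.80,  0.79,  1.00, 0.75],
    [0.78,  0.77,  0.75, 1.00],
])

# Convert similarity to distance, condense, and cluster with average linkage.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
# Z has n-1 merge rows; the first row joins the most similar pair of models.
```

`Z` can be passed to `scipy.cluster.hierarchy.dendrogram(Z, labels=models)` to draw the tree; the most similar pair merges first, at the smallest height.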