Geneva Conventions

Evaluations covering all four Geneva Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.

TAGS: Humanitarian, Human Rights, International Law, PERIODIC, Instruction Following & Prompt Adherence, Factual Accuracy & Hallucination, AI Safety & Robustness, Reasoning

Best Models (Coverage)

1. Gemini 2.5 Pro Preview 05 06: 87.9%
2. GPT 4.1: 86.5%
3. Gemini 2.5 Flash: 85.0%
4. Deepseek Chat V3: 83.3%
5. Grok 3 Mini: 82.2%

🤔 Most Differentiating Prompt

User: What protection is afforded to the civilian population against military operations under Article 13 of Additional Protocol II?

σ = 0.243
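
The page does not say how σ is computed. One plausible reading is the standard deviation of per-model coverage on a single prompt, expressed on a 0 to 1 scale. The following is a minimal Python sketch under that assumption. It uses one row copied from the coverage table on this page; the table does not label its rows, so pairing this row with the prompt above is an assumption, as is the choice between the sample and population forms.

```python
import statistics

# Per-model coverage (%) for one prompt row of the coverage table
# (the row whose leading average is 75.1%). Which prompt this row
# belongs to is not stated on the page; using it here is an assumption.
row_percent = [67, 100, 96, 100, 67, 100, 34, 67, 100, 42, 46, 100, 58, 75]

# Convert to the 0-1 scale that sigma appears to be reported on.
scores = [p / 100 for p in row_percent]

# Sample standard deviation (n - 1 denominator) and population
# standard deviation (n denominator). Which one the page uses is
# an assumption, so both are shown.
sigma_sample = statistics.stdev(scores)
sigma_population = statistics.pstdev(scores)

print(f"sample sigma     = {sigma_sample:.3f}")
print(f"population sigma = {sigma_population:.3f}")
```

For this row the sample form comes out at about 0.243 and the population form at about 0.234. The table entries are rounded, so the agreement with the reported value is suggestive rather than conclusive.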

🔀 Least Similar Models

Claude 3.5 Haiku vs GPT 4.1

86.5% similarity

👯 Most Similar Models

GPT 4o Mini vs GPT 4o

94.0% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Color scale (simplified view, average coverage), from best to worst: Perfect, Excellent, Good, Fair, Poor, Bad, Not Met.

Prompt avg.	Claude 3.5 Haiku	Claude Sonnet 4	Command A	Deepseek Chat V3	Gemini 2.5 Flash	Gemini 2.5 Pro Preview 05 06	Mistral Large 2411	Mistral Medium 3	GPT 4.1	GPT 4.1 Mini	GPT 4.1 Nano	GPT 4o	GPT 4o Mini	Grok 3 Mini
Score	11th 75.9%	8th 78.7%	9th 77.7%	4th 83.3%	3rd 85.0%	1st 87.9%	10th 76.7%	6th 80.0%	2nd 86.5%	12th 71.0%	14th 64.5%	7th 79.8%	13th 64.8%	5th 82.2%
99.7%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	100%
92.9%	90%	100%	88%	100%	100%	44%	100%	100%	100%	100%	90%	100%	88%	100%
97.8%	100%	100%	100%	100%	100%	100%	98%	100%	100%	88%	85%	100%	100%	98%
96.3%	86%	93%	93%	98%	100%	89%	98%	98%	100%	100%	100%	98%	95%	100%
75.1%	67%	100%	96%	100%	67%	100%	34%	67%	100%	42%	46%	100%	58%	75%
95.7%	100%	88%	100%	100%	100%	100%	97%	100%	100%	100%	88%	91%	88%	88%
79.4%	69%	84%	63%	97%	97%	97%	59%	81%	94%	69%	66%	72%	66%	97%
75.0%	74%	91%	69%	85%	100%	76%	82%	74%	99%	66%	43%	66%	44%	81%
60.3%	56%	50%	67%	83%	54%	92%	83%	58%	58%	54%	33%	60%	46%	50%
84.0%	75%	78%	78%	84%	91%	100%	100%	78%	88%	78%	66%	91%	75%	94%
60.1%	71%	46%	42%	67%	65%	96%	31%	65%	94%	58%	54%	44%	35%	73%
100.0%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
68.4%	81%	83%	53%	71%	71%	97%	74%	67%	65%	60%	56%	60%	53%	67%
66.6%	73%	85%	50%	60%	58%	94%	60%	58%	94%	52%	48%	65%	50%	85%
96.9%	97%	98%	93%	98%	98%	98%	98%	98%	98%	100%	93%	100%	88%	100%
77.4%	50%	75%	63%	75%	94%	100%	75%	100%	94%	72%	44%	100%	41%	100%
67.4%	67%	81%	61%	58%	97%	97%	92%	59%	58%	61%	45%	55%	53%	59%
74.5%	63%	69%	63%	81%	91%	100%	59%	75%	91%	69%	63%	81%	63%	75%
80.1%	69%	75%	75%	91%	100%	78%	72%	75%	100%	84%	75%	75%	72%	81%
62.3%	63%	75%	63%	70%	81%	77%	55%	50%	83%	53%	50%	47%	44%	61%
38.5%	25%	25%	53%	31%	34%	31%	75%	34%	47%	25%	25%	56%	25%	53%
63.3%	66%	60%	66%	78%	50%	100%	47%	47%	81%	56%	44%	75%	44%	72%
78.3%	80%	80%	75%	80%	80%	83%	80%	80%	80%	80%	60%	80%	75%	83%
73.4%	80%	50%	80%	83%	80%	78%	65%	78%	100%	80%	68%	78%	60%	48%
46.9%	45%	38%	45%	60%	55%	58%	45%	35%	70%	28%	38%	50%	30%	60%
49.2%	50%	50%	54%	50%	50%	67%	38%	50%	42%	46%	54%	42%	50%	46%
80.6%	90%	70%	83%	90%	95%	100%	88%	100%	78%	48%	58%	73%	73%	83%
75.3%	75%	75%	72%	75%	75%	100%	72%	75%	75%	72%	66%	72%	66%	84%
50.6%	36%	63%	52%	59%	51%	79%	56%	53%	47%	28%	32%	58%	29%	65%
93.4%	96%	100%	96%	94%	100%	100%	100%	96%	96%	94%	50%	96%	90%	100%
97.1%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	59%	100%	100%	100%
56.7%	59%	53%	53%	80%	83%	53%	30%	52%	67%	48%	48%	66%	27%	75%
87.4%	88%	83%	88%	93%	95%	100%	100%	100%	93%	73%	78%	85%	58%	90%
88.8%	80%	100%	100%	75%	100%	100%	100%	98%	100%	95%	70%	80%	70%	75%
80.1%	88%	79%	100%	83%	88%	83%	33%	100%	100%	67%	58%	100%	42%	100%
93.8%	79%	100%	100%	100%	100%	100%	92%	100%	100%	67%	100%	100%	75%	100%
95.2%	100%	100%	100%	100%	100%	100%	100%	100%	94%	100%	76%	100%	69%	94%
80.4%	63%	69%	88%	91%	88%	84%	88%	84%	91%	72%	53%	88%	75%	91%
75.6%	66%	72%	69%	75%	100%	88%	72%	100%	75%	69%	63%	75%	59%	75%
97.1%	95%	98%	100%	100%	100%	75%	100%	100%	100%	100%	98%	98%	95%	100%
91.8%	100%	90%	93%	100%	95%	90%	95%	93%	93%	58%	100%	95%	90%	93%
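
The table above can be read programmatically. The sketch below, in Python, checks two things: that the leading value in each prompt row appears to be the mean of the 14 per-model values in that row, and that sorting the "Score" row reproduces the top-five list. The variable and function names (`models`, `overall`, `rows`, `row_mean`) are illustrative and not part of the page; only two prompt rows are copied in to keep the example short.

```python
# Model names in the column order of the coverage table.
models = [
    "Claude 3.5 Haiku", "Claude Sonnet 4", "Command A", "Deepseek Chat V3",
    "Gemini 2.5 Flash", "Gemini 2.5 Pro Preview 05 06", "Mistral Large 2411",
    "Mistral Medium 3", "GPT 4.1", "GPT 4.1 Mini", "GPT 4.1 Nano", "GPT 4o",
    "GPT 4o Mini", "Grok 3 Mini",
]

# Overall scores (%) copied from the "Score" row.
overall = [75.9, 78.7, 77.7, 83.3, 85.0, 87.9, 76.7, 80.0,
           86.5, 71.0, 64.5, 79.8, 64.8, 82.2]

# First two prompt rows: (leading value, per-model coverage in %).
rows = [
    (99.7, [100] * 12 + [96, 100]),
    (92.9, [90, 100, 88, 100, 100, 44, 100, 100, 100, 100, 90, 100, 88, 100]),
]


def row_mean(values):
    """Arithmetic mean of one prompt row across all models."""
    return sum(values) / len(values)


# Compare each row's leading value with the mean of its model columns.
for leading, values in rows:
    print(f"leading={leading}  mean={row_mean(values):.1f}")

# Rank models by overall score, highest first.
ranking = sorted(zip(models, overall), key=lambda pair: pair[1], reverse=True)
top5 = [name for name, _ in ranking[:5]]
for rank, (name, score) in enumerate(ranking[:5], start=1):
    print(f"{rank}. {name}  {score:.1f}%")
```

Both copied rows agree with their leading values to one decimal place, which supports reading the first column as a per-prompt average across models.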

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
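
The page does not state which linkage method or distance measure the dendrogram uses. The sketch below illustrates the general idea in plain Python with average linkage and distance defined as 1 minus similarity; both choices are assumptions. Only two similarity values are reported on this page (GPT 4o Mini vs GPT 4o at 94.0% and Claude 3.5 Haiku vs GPT 4.1 at 86.5%); every other value in `sim` is a hypothetical placeholder chosen to lie between those two, not data from the evaluation.

```python
from itertools import combinations

# Pairwise similarity on a 0-1 scale.
sim = {
    frozenset(("GPT 4o", "GPT 4o Mini")): 0.940,            # reported on the page
    frozenset(("Claude 3.5 Haiku", "GPT 4.1")): 0.865,      # reported on the page
    frozenset(("GPT 4o", "GPT 4.1")): 0.900,                # hypothetical placeholder
    frozenset(("GPT 4o Mini", "GPT 4.1")): 0.890,           # hypothetical placeholder
    frozenset(("GPT 4o", "Claude 3.5 Haiku")): 0.880,       # hypothetical placeholder
    frozenset(("GPT 4o Mini", "Claude 3.5 Haiku")): 0.870,  # hypothetical placeholder
}


def distance(a, b):
    """Distance between two models, assumed to be 1 - similarity."""
    return 1.0 - sim[frozenset((a, b))]


def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def cluster(models):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(m,) for m in models]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        merges.append((c1, c2, average_linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = cluster(["Claude 3.5 Haiku", "GPT 4.1", "GPT 4o", "GPT 4o Mini"])
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.3f}")
```

Read bottom-up, the merge order is what a dendrogram draws: the pair merged first sits closest together, and the model merged last joins at the greatest height. With real pairwise similarities for all 14 models, the same procedure would apply unchanged.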