IFIT Conflict Resolution Scenarios (2025)

References:

AI on the Frontline: Evaluating Large Languag...

Press release: From Prompt to Peace — IFIT st...

This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.

TAGS:

AI Safety & Robustness

Conflict Resolution

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Helpfulness & Actionability

Credibility

Nuance & Depth

Coherence & Conversational Flow

Best Models (Coverage)

1. GPT 5: 80.9%
2. GPT OSS 120b: 79.8%
3. Gemma 3 12b It: 73.8%
4. GLM 4.5: 73.2%
5. O4 Mini: 72.2%

🔀 Least Similar Models

Claude 3.5 Haiku vs GPT OSS 20b: 51.3% similarity

👯 Most Similar Models

Gemini 2.5 Flash vs Z Ai/glm 4.5: 90.3% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Model	Rank	Score	Prompt 1	Prompt 2	Prompt 3	Prompt 4	Prompt 5	Prompt 6	Prompt 7	Prompt 8	Prompt 9
Claude 3.5 Sonnet	25th	36.6%	13%	22%	41%	67%	24%	26%	48%	43%	45%
Claude 3.7 Sonnet	19th	53.4%	72%	48%	51%	73%	34%	37%	63%	53%	50%
Claude 3.5 Haiku	28th	29.6%	16%	4%	45%	50%	15%	21%	33%	53%	29%
Claude Opus 4.1	8th	69.0%	86%	74%	65%	85%	63%	50%	73%	60%	65%
Claude Sonnet 4	16th	61.0%	80%	54%	64%	90%	46%	48%	63%	47%	57%
Deepseek Chat V3.1	12th	64.8%	82%	66%	69%	71%	65%	44%	75%	56%	55%
Deepseek R1	10th	65.6%	84%	73%	57%	77%	66%	42%	73%	61%	57%
Gemini 2.5 Flash	9th	66.8%	82%	71%	65%	75%	54%	65%	73%	56%	60%
Gemini 2.5 Pro	6th	70.6%	84%	80%	69%	73%	64%	56%	78%	75%	56%
Gemma 3 12b It	3rd	73.8%	72%	76%	72%	75%	74%	80%	73%	74%	68%
Llama 3 70b Instruct	24th	38.6%	50%	41%	39%	61%	32%	24%	35%	34%	31%
Llama 4 Maverick	29th	28.6%	0%	32%	32%	56%	23%	29%	28%	28%	29%
Meta Llama 3.1 405b Instruct Turbo	30th	28.3%	4%	50%	33%	0%	37%	21%	33%	38%	39%
Mistral Large 2411	23rd	38.7%	52%	45%	30%	61%	18%	32%	38%	31%	41%
Mistral Medium 3	17th	55.2%	84%	67%	58%	63%	55%	43%	53%	52%	22%
Mistral Nemo	22nd	41.7%	59%	38%	39%	56%	28%	23%	48%	39%	45%
GPT 4.1	14th	63.1%	82%	48%	71%	69%	64%	66%	73%	53%	42%
GPT 4.1 Mini	20th	52.2%	68%	57%	51%	69%	41%	41%	58%	42%	43%
GPT 4.1 Nano	21st	45.7%	68%	40%	39%	67%	38%	42%	48%	33%	36%
GPT 4o	26th	36.3%	48%	43%	34%	56%	28%	19%	48%	27%	24%
GPT 4o Mini	27th	33.7%	29%	30%	38%	55%	22%	26%	53%	29%	21%
GPT 5	1st	80.9%	84%	87%	83%	85%	69%	76%	88%	77%	79%
GPT OSS 120b	2nd	79.8%	84%	91%	79%	88%	78%	81%	73%	79%	65%
GPT OSS 20b	18th	54.0%	0%	72%	68%	0%	67%	59%	75%	74%	71%
O4 Mini	5th	72.2%	84%	73%	78%	81%	60%	56%	78%	70%	70%
GLM 4.5	4th	73.2%	84%	80%	79%	79%	78%	60%	70%	61%	68%
Qwen3 30b A3B Instruct 2507	15th	62.7%	82%	67%	65%	79%	55%	43%	65%	59%	49%
Qwen3 32b	13th	64.1%	80%	63%	69%	75%	60%	51%	63%	70%	46%
Grok 3	11th	64.9%	81%	67%	62%	75%	77%	57%	55%	63%	47%
Grok 4	7th	69.4%	79%	70%	72%	77%	74%	68%	55%	65%	65%
Average across models			62.4%	57.6%	57.2%	66.3%	50.3%	46.2%	59.6%	53.4%	49.2%
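
The Score values in the table above appear to be the plain mean of each model's nine per-prompt coverage figures. The following is a minimal Python sketch of that calculation for the three top-ranked models, using the rounded percentages as printed in the table. The names `coverage` and `macro_coverage` are illustrative only and are not taken from the evaluation tool, and because the inputs are already rounded, a recomputed score may differ from the displayed one by a tenth of a point for some models.

```python
from statistics import mean

# Per-prompt coverage (%) copied from the table, in prompt order.
# Only three models are included to keep the sketch short.
coverage = {
    "GPT 5": [84, 87, 83, 85, 69, 76, 88, 77, 79],
    "GPT OSS 120b": [84, 91, 79, 88, 78, 81, 73, 79, 65],
    "Gemma 3 12b It": [72, 76, 72, 75, 74, 80, 73, 74, 68],
}


def macro_coverage(per_prompt):
    """Unweighted mean of per-prompt coverage, rounded to one decimal."""
    return round(mean(per_prompt), 1)


# Score each model, then rank from highest to lowest average coverage.
scores = {model: macro_coverage(values) for model, values in coverage.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {scores[model]}%")
```

An unweighted mean treats every prompt equally, so a single 0% cell (as seen for GPT OSS 20b and Llama 4 Maverick) pulls a model's overall score down by roughly a ninth of whatever it would otherwise have earned on that prompt.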

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
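
To illustrate how a dendrogram of this kind can be built, here is a minimal pure-Python sketch of agglomerative clustering. It rests on several assumptions: the page does not state which linkage method or similarity measure is used, so average linkage is chosen here for illustration; the four models `A` to `D` are placeholders; and the similarity values are invented toy numbers, not results from this evaluation.

```python
from itertools import combinations

# Hypothetical pairwise similarities (0..1) for four placeholder models.
# These values are made up for the example.
similarity = {
    frozenset({"A", "B"}): 0.90,
    frozenset({"A", "C"}): 0.60,
    frozenset({"A", "D"}): 0.55,
    frozenset({"B", "C"}): 0.62,
    frozenset({"B", "D"}): 0.50,
    frozenset({"C", "D"}): 0.80,
}


def cluster_similarity(c1, c2):
    """Average linkage: mean similarity over all cross-cluster pairs."""
    pairs = [similarity[frozenset({a, b})] for a in c1 for b in c2]
    return sum(pairs) / len(pairs)


def agglomerate(names):
    """Repeatedly merge the two most similar clusters; return the merge log."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        left, right = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(*pair),
        )
        merges.append((left, right, cluster_similarity(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


merges = agglomerate(["A", "B", "C", "D"])
for left, right, sim in merges:
    print(f"merge {left} + {right} at similarity {sim:.3f}")
```

Each entry in the merge log corresponds to one junction in a dendrogram: pairs that merge early (high similarity) sit close together, while the final merge joins the least similar groups.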