This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
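Average key point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the prompts; each row below it is one prompt, labeled in the first column by that prompt's mean coverage across all models.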
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 24th 31.1% | 20th 44.9% | 29th 23.2% | 9th 58.6% | 16th 52.6% | 11th 57.3% | 13th 56.7% | 8th 63.9% | 5th 66.7% | 4th 69.7% | 25th 30.6% | 28th 25.2% | 30th 20.3% | 23rd 31.7% | 17th 47.9% | 22nd 35.0% | 15th 52.8% | 19th 46.8% | 21st 36.2% | 27th 27.2% | 26th 27.8% | 1st 80.2% | 2nd 75.4% | 18th 47.3% | 7th 65.8% | 3rd 73.3% | 14th 55.7% | 10th 58.1% | 12th 57.2% | 6th 66.3% | |
| 58.2% | 7% | 64% | 11% | 82% | 72% | 81% | 84% | 86% | 83% | 64% | 47% | 3% | 0% | 48% | 73% | 48% | 80% | 72% | 61% | 26% | 24% | 84% | 71% | 0% | 86% | 86% | 74% | 73% | 79% | 78% | |
| 51.8% | 22% | 36% | 0% | 63% | 45% | 58% | 55% | 66% | 68% | 72% | 28% | 31% | 38% | 42% | 66% | 28% | 51% | 54% | 30% | 35% | 30% | 87% | 83% | 70% | 63% | 76% | 56% | 63% | 66% | 71% | |
| 54.2% | 45% | 43% | 38% | 67% | 63% | 62% | 64% | 59% | 64% | 66% | 35% | 29% | 20% | 29% | 60% | 37% | 59% | 51% | 33% | 30% | 29% | 80% | 80% | 58% | 74% | 81% | 67% | 64% | 62% | 77% | |
| 57.3% | 50% | 59% | 34% | 81% | 64% | 72% | 79% | 68% | 83% | 65% | 43% | 47% | 0% | 44% | 45% | 50% | 56% | 53% | 57% | 40% | 52% | 80% | 67% | 0% | 79% | 78% | 63% | 73% | 74% | 64% | |
| 42.6% | 19% | 23% | 14% | 45% | 44% | 43% | 46% | 48% | 51% | 72% | 32% | 21% | 34% | 11% | 40% | 17% | 54% | 38% | 29% | 21% | 15% | 68% | 72% | 55% | 48% | 72% | 54% | 56% | 65% | 70% | |
| 39.3% | 22% | 26% | 21% | 43% | 41% | 35% | 34% | 59% | 42% | 72% | 14% | 25% | 15% | 26% | 40% | 20% | 52% | 31% | 29% | 21% | 19% | 84% | 77% | 40% | 39% | 56% | 40% | 32% | 55% | 70% | |
| 57.1% | 40% | 66% | 31% | 59% | 61% | 74% | 55% | 75% | 90% | 83% | 30% | 29% | 21% | 30% | 55% | 43% | 53% | 55% | 43% | 31% | 51% | 78% | 80% | 68% | 80% | 80% | 68% | 75% | 48% | 60% | |
| 44.3% | 37% | 47% | 36% | 43% | 34% | 56% | 47% | 47% | 63% | 66% | 22% | 27% | 33% | 26% | 39% | 40% | 34% | 33% | 24% | 22% | 21% | 81% | 78% | 74% | 58% | 62% | 46% | 52% | 38% | 42% | |
| 40.9% | 38% | 40% | 24% | 44% | 49% | 35% | 46% | 67% | 56% | 67% | 24% | 15% | 22% | 29% | 13% | 32% | 36% | 34% | 20% | 19% | 9% | 80% | 71% | 61% | 65% | 69% | 33% | 35% | 28% | 65% |
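
The Score row appears to be the plain mean of each model's per-prompt coverage values, with ranks assigned by sorting those means in descending order. The following is a minimal sketch of that arithmetic for four columns copied from the table above; the variable names (`per_prompt`, `averages`, `ranking`) are illustrative and not part of the blueprint, and because the table shows rounded percentages, means recomputed this way may differ slightly from the published figures for other models.

```python
from statistics import mean

# Per-prompt coverage (%) for four models, copied from the table rows above.
# The dict layout and names are assumptions made for this illustration.
per_prompt = {
    "GPT 5": [84, 87, 80, 80, 68, 84, 78, 81, 80],
    "GPT OSS 120b": [71, 83, 80, 67, 72, 77, 80, 78, 71],
    "GLM 4.5": [86, 76, 81, 78, 72, 56, 80, 62, 69],
    "Claude 3.5 Sonnet": [7, 22, 45, 50, 19, 22, 40, 37, 38],
}

# Mean coverage per model, rounded to one decimal as in the Score row.
averages = {model: round(mean(scores), 1) for model, scores in per_prompt.items()}

# Rank models by mean coverage, highest first.
ranking = sorted(averages, key=averages.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {averages[model]}%")
```

For these four columns the recomputed means match the Score row (80.2%, 75.4%, 73.3%, and 31.1%). The printed positions are relative to this four-model subset only, not the full 30-model ranking.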