This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
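Average key point coverage for each model across all prompts. The Score row gives each model's rank and mean coverage; each row below it covers one prompt, and its first cell is that prompt's mean coverage across all models.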
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 26th 35.1% | 18th 55.3% | 28th 33.0% | 6th 69.6% | 16th 59.7% | 11th 65.7% | 13th 63.9% | 9th 67.7% | 7th 69.0% | 4th 74.1% | 24th 36.4% | 29th 31.2% | 30th 28.8% | 23rd 38.4% | 17th 56.7% | 22nd 43.2% | 15th 60.8% | 19th 53.8% | 21st 45.8% | 25th 36.1% | 27th 33.9% | 1st 81.0% | 3rd 75.6% | 20th 52.1% | 5th 70.8% | 2nd 76.6% | 14th 61.2% | 10th 66.1% | 12th 64.4% | 8th 68.9% | |
| 61.5% | 18% | 64% | 27% | 79% | 80% | 89% | 79% | 86% | 86% | 79% | 48% | 0% | 2% | 52% | 86% | 57% | 82% | 77% | 70% | 34% | 27% | 82% | 75% | 0% | 79% | 81% | 64% | 80% | 84% | 77% | |
| 57.4% | 25% | 50% | 5% | 74% | 61% | 71% | 68% | 70% | 71% | 74% | 39% | 26% | 51% | 49% | 62% | 37% | 51% | 64% | 38% | 43% | 32% | 83% | 86% | 65% | 72% | 80% | 68% | 60% | 74% | 73% | |
| 59.1% | 44% | 63% | 48% | 70% | 58% | 70% | 63% | 70% | 68% | 74% | 48% | 39% | 33% | 32% | 62% | 40% | 63% | 56% | 45% | 37% | 36% | 87% | 73% | 64% | 82% | 81% | 66% | 71% | 59% | 71% | |
| 65.8% | 58% | 75% | 56% | 88% | 73% | 79% | 73% | 73% | 73% | 81% | 52% | 52% | 0% | 65% | 65% | 58% | 73% | 69% | 67% | 58% | 58% | 86% | 83% | 0% | 73% | 83% | 73% | 79% | 75% | 77% | |
| 49.8% | 19% | 37% | 18% | 60% | 42% | 58% | 61% | 53% | 63% | 74% | 35% | 25% | 40% | 16% | 59% | 21% | 57% | 50% | 43% | 26% | 31% | 76% | 68% | 66% | 60% | 80% | 56% | 59% | 69% | 72% | |
| 46.5% | 20% | 25% | 22% | 64% | 46% | 45% | 49% | 59% | 51% | 72% | 16% | 32% | 29% | 26% | 48% | 31% | 71% | 40% | 39% | 26% | 21% | 81% | 75% | 57% | 53% | 65% | 46% | 54% | 64% | 69% | |
| 60.1% | 48% | 70% | 40% | 78% | 68% | 75% | 65% | 70% | 73% | 80% | 35% | 38% | 30% | 33% | 53% | 43% | 63% | 60% | 53% | 45% | 50% | 73% | 80% | 73% | 75% | 75% | 75% | 68% | 45% | 68% | |
| 53.5% | 45% | 57% | 49% | 60% | 52% | 54% | 63% | 68% | 68% | 65% | 29% | 38% | 47% | 33% | 57% | 53% | 42% | 36% | 22% | 27% | 33% | 78% | 79% | 72% | 68% | 67% | 64% | 70% | 57% | 52% | |
| 48.7% | 39% | 57% | 32% | 53% | 57% | 50% | 54% | 60% | 68% | 68% | 26% | 31% | 27% | 40% | 18% | 49% | 45% | 32% | 35% | 29% | 17% | 83% | 61% | 72% | 75% | 77% | 39% | 54% | 53% | 61% | |
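
The Score row can be checked against the per-prompt rows: averaging a model's nine per-prompt percentages reproduces its overall figure. The sketch below does this for three models, using values copied from the table. The helper names `average_scores` and `rank_models` are illustrative and not part of any evaluation tooling, and the ranking printed here is only among the three models included, not the full 30-model leaderboard. Because the table shows rounded per-prompt values, other columns may differ from the published score in the last digit.

```python
# Per-prompt coverage (%) copied from the table above, in row order.
# Only three of the thirty models are included to keep the sketch short.
coverage = {
    "GPT 5": [82, 83, 87, 86, 76, 81, 73, 78, 83],
    "Claude 3.5 Sonnet": [18, 25, 44, 58, 19, 20, 48, 45, 39],
    "Llama 4 Maverick": [0, 26, 39, 52, 25, 32, 38, 38, 31],
}


def average_scores(per_prompt):
    """Return each model's mean coverage, rounded to one decimal place."""
    return {
        model: round(sum(values) / len(values), 1)
        for model, values in per_prompt.items()
    }


def rank_models(scores):
    """Return model names ordered from highest to lowest mean coverage."""
    return sorted(scores, key=scores.get, reverse=True)


scores = average_scores(coverage)
ranking = rank_models(scores)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {scores[model]:.1f}%")
```

For these three models the computed means are 81.0%, 35.1%, and 31.2%, matching the Score row entries for GPT 5, Claude 3.5 Sonnet, and Llama 4 Maverick.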