This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
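Average key-point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the six prompt rows; in each prompt row, the first cell is that prompt's average across all 30 models.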
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 27th 31.7% | 19th 50.7% | 30th 25.5% | 8th 68.8% | 12th 65.7% | 11th 66.2% | 15th 62.7% | 6th 70.8% | 4th 72.7% | 7th 70.3% | 23rd 41.0% | 28th 28.5% | 29th 25.8% | 22nd 43.5% | 18th 56.3% | 24th 40.2% | 16th 61.5% | 17th 57.3% | 21st 44.3% | 25th 34.5% | 26th 34.2% | 1st 80.5% | 2nd 79.7% | 20th 46.8% | 5th 71.2% | 3rd 74.7% | 13th 65.2% | 14th 64.2% | 9th 68.2% | 10th 66.5% | |
| 58.7% | 15% | 57% | 18% | 82% | 80% | 82% | 80% | 82% | 84% | 79% | 50% | 0% | 4% | 54% | 80% | 54% | 77% | 63% | 48% | 18% | 32% | 82% | 80% | 0% | 77% | 84% | 68% | 70% | 77% | 84% | |
| 56.8% | 22% | 48% | 3% | 75% | 57% | 69% | 64% | 70% | 78% | 71% | 42% | 27% | 50% | 53% | 65% | 44% | 53% | 60% | 27% | 45% | 28% | 83% | 90% | 70% | 66% | 75% | 68% | 60% | 71% | 71% | |
| 61.3% | 56% | 73% | 36% | 56% | 90% | 71% | 67% | 75% | 71% | 54% | 58% | 54% | 0% | 61% | 46% | 42% | 67% | 73% | 65% | 56% | 38% | 94% | 79% | 0% | 77% | 79% | 75% | 75% | 75% | 77% | |
| 50.2% | 24% | 43% | 15% | 62% | 42% | 50% | 64% | 58% | 60% | 73% | 32% | 24% | 37% | 16% | 59% | 25% | 60% | 49% | 42% | 24% | 27% | 70% | 71% | 66% | 65% | 79% | 52% | 63% | 77% | 76% | |
| 55.3% | 33% | 43% | 35% | 75% | 75% | 73% | 38% | 73% | 73% | 78% | 33% | 33% | 28% | 45% | 35% | 33% | 63% | 60% | 50% | 40% | 55% | 80% | 80% | 73% | 78% | 70% | 60% | 53% | 50% | 43% | |
| 51.5% | 40% | 40% | 46% | 63% | 50% | 52% | 63% | 67% | 70% | 67% | 31% | 33% | 36% | 32% | 53% | 43% | 49% | 39% | 34% | 24% | 25% | 74% | 78% | 72% | 64% | 61% | 68% | 64% | 59% | 48% |
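
The Score row can be reproduced from the prompt rows with simple arithmetic: each model's score is the mean of its six per-prompt coverage values, and the rank follows from sorting those means. The sketch below assumes this unweighted mean (it matches the published figures for the models shown) and uses only a four-model subset copied from the table; the function names `average_scores` and `rank_models` are illustrative, not part of the blueprint's tooling.

```python
from statistics import mean

# Per-prompt coverage percentages copied from the six prompt rows of the table,
# for a small subset of models (assumption: rows are weighted equally).
coverage = {
    "GPT 5": [82, 83, 94, 70, 80, 74],
    "GPT OSS 120b": [80, 90, 79, 71, 80, 78],
    "GLM 4.5": [84, 75, 79, 79, 70, 61],
    "Claude 3.5 Sonnet": [15, 22, 56, 24, 33, 40],
}


def average_scores(per_prompt):
    """Return each model's mean coverage, rounded to one decimal place."""
    return {model: round(mean(scores), 1) for model, scores in per_prompt.items()}


def rank_models(averages):
    """Return (rank, model, score) tuples, highest score first."""
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model, score) for rank, (model, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for rank, model, score in rank_models(average_scores(coverage)):
        print(f"{rank}. {model}: {score}%")
```

Ranks produced here are relative to the four-model subset only, so Claude 3.5 Sonnet appears 4th rather than 27th. Because the table shows per-prompt values rounded to whole percentages, a mean recomputed this way may differ from a published score in the last decimal for some models.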