This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
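Average key point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the prompts. Each row after it is one prompt (the prompt labels are not shown); its first cell is that prompt's mean coverage across all models.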
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 24th 36.0% | 19th 50.0% | 28th 30.2% | 9th 62.6% | 16th 55.0% | 11th 60.0% | 13th 58.4% | 8th 63.3% | 6th 65.3% | 4th 68.7% | 26th 34.3% | 29th 29.4% | 30th 27.9% | 23rd 37.9% | 17th 50.9% | 21st 40.0% | 15th 56.1% | 18th 50.4% | 22nd 39.0% | 27th 33.8% | 25th 34.7% | 1st 77.0% | 2nd 73.1% | 20th 47.9% | 5th 66.3% | 3rd 70.6% | 14th 57.4% | 12th 59.9% | 10th 60.7% | 7th 64.6% |
| 58.9% | 13% | 66% | 20% | 80% | 66% | 82% | 85% | 83% | 81% | 66% | 46% | 0% | 3% | 54% | 76% | 57% | 77% | 71% | 59% | 38% | 35% | 80% | 73% | 0% | 78% | 83% | 64% | 72% | 86% | 73% |
| 54.7% | 23% | 43% | 3% | 65% | 53% | 62% | 58% | 65% | 70% | 67% | 35% | 37% | 46% | 44% | 64% | 32% | 54% | 62% | 36% | 42% | 38% | 86% | 78% | 69% | 68% | 77% | 62% | 65% | 67% | 70% |
| 57.0% | 48% | 51% | 46% | 71% | 63% | 65% | 63% | 60% | 66% | 68% | 42% | 37% | 36% | 37% | 62% | 41% | 63% | 52% | 40% | 38% | 36% | 81% | 76% | 58% | 72% | 73% | 64% | 64% | 64% | 72% |
| 61.5% | 63% | 72% | 44% | 80% | 75% | 77% | 71% | 73% | 72% | 67% | 52% | 58% | 0% | 53% | 54% | 45% | 65% | 67% | 58% | 42% | 55% | 84% | 77% | 0% | 81% | 74% | 73% | 69% | 72% | 73% |
| 45.5% | 23% | 25% | 14% | 51% | 47% | 49% | 53% | 51% | 58% | 69% | 31% | 18% | 36% | 21% | 49% | 18% | 54% | 38% | 34% | 29% | 26% | 70% | 66% | 57% | 61% | 73% | 54% | 56% | 69% | 66% |
| 43.0% | 26% | 34% | 25% | 53% | 43% | 42% | 39% | 60% | 48% | 71% | 19% | 25% | 24% | 29% | 39% | 34% | 57% | 42% | 32% | 27% | 20% | 78% | 69% | 45% | 48% | 55% | 44% | 41% | 55% | 65% |
| 56.5% | 55% | 58% | 49% | 66% | 56% | 70% | 51% | 70% | 76% | 79% | 31% | 39% | 35% | 31% | 51% | 48% | 56% | 54% | 41% | 39% | 52% | 73% | 80% | 62% | 69% | 70% | 66% | 64% | 50% | 55% |
| 47.5% | 34% | 51% | 42% | 51% | 43% | 54% | 56% | 49% | 64% | 65% | 25% | 28% | 40% | 34% | 43% | 46% | 40% | 34% | 27% | 25% | 32% | 74% | 74% | 77% | 60% | 62% | 50% | 60% | 43% | 42% |
| 43.8% | 39% | 50% | 29% | 46% | 49% | 39% | 50% | 59% | 53% | 66% | 28% | 23% | 31% | 38% | 20% | 39% | 39% | 34% | 24% | 24% | 18% | 67% | 65% | 63% | 60% | 68% | 40% | 48% | 40% | 65% |
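
The Score row can be reproduced from the per-prompt cells by taking a plain mean down each model's column. The sketch below is a minimal illustration of that arithmetic, not the site's actual scoring code: the function names `model_scores` and `rank_models` are hypothetical, and the data is limited to three columns copied from the table. Because the table shows rounded percentages, a recomputed mean could differ from a published score in the last digit for other models.

```python
# Per-prompt coverage (%) copied from three columns of the table above.
coverage = {
    "Claude 3.5 Sonnet": [13, 23, 48, 63, 23, 26, 55, 34, 39],
    "Gemma 3 12b It": [66, 67, 68, 67, 69, 71, 79, 65, 66],
    "GPT 5": [80, 86, 81, 84, 70, 78, 73, 74, 67],
}


def model_scores(table):
    """Return each model's mean coverage across prompts, rounded to 0.1.

    Hypothetical helper: assumes the score is an unweighted mean.
    """
    return {
        model: round(sum(values) / len(values), 1)
        for model, values in table.items()
    }


def rank_models(scores):
    """Sort (model, score) pairs by score, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


scores = model_scores(coverage)
for place, (model, score) in enumerate(rank_models(scores), start=1):
    print(f"{place}. {model}: {score:.1f}%")
```

For these three models the computed means are 77.0%, 68.7%, and 36.0%, which match the corresponding Score cells. The printed places are ranks within this three-model subset only, not the 1st, 4th, and 24th places the models hold in the full table.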