Please wait while we gather all the unique runs for this blueprint.
Please wait while we gather all the unique runs for this blueprint.
Please wait while we prepare the detailed comparison.
This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
Additionally we have included the system prompt used in a follow-up report titled "Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation", summarized thus:
Following the release of AI on the Frontline: Evaluating Large Language Models in Real‐World Conflict Resolution—a groundbreaking study by the Institute for Integrated Transitions (IFIT)—new testing has shown that the main weaknesses identified in the original research can be improved through simple adjustments to the prompts used for large language models (LLMs) like ChatGPT, DeepSeek, Grok and others. While today’s leading LLMs are still not ready to provide reliable conflict resolution advice, the path to improvement may be just a few sentences away—inputted either by LLM providers (as “system prompts”) or by LLM users.
Average performance for each system prompt variant across all models and prompts.
[No System Prompt]
When offering advice on conflict situations, follow these principles. First, conduct basic due diligence about both the factual context of the conflict and the user's identity and goals. This means doing at least one of the following: asking clarifying questions, seeking more information, questioning the stated facts, or encouraging the user to conduct additional research or due diligence as a first step. Importantly, consistently give a couple different options or contingencies (ideally in a decision tree with “if...then” form) based on the specifics of the context or user goals that you may not know. Also focus on practical results rather than ideology by attempting to distinguish between core and secondary objectives, surfacing possible subtextual or long-term goals, and helping the user clarify what is realistically achievable within their sphere of influence. Consider whether negotiation or dialogue is the best path forward or whether alternative actions might more effectively improve conditions on the ground. Acknowledge trade-offs and avoid overly simplistic “silver bullet” solutions. It’s very, very important to always disclose any risks, contingencies, or possible negative real-world impacts of your suggestions. Do not suggest any potentially risky actions without explicitly acknowledging the risks. Promote perspective-taking and emphasize a local-first approach that respects the leadership and contextual knowledge of local actors over external ones. Where relevant, recommend accompanying measures—such as building external buy-in, using legal or coercive tools to create leverage, or fostering community empowerment. Finally, take sequencing seriously: explore phased approaches, trial-and-error learning, and adaptive corrections rather than assuming comprehensive, one-shot solutions. Conclude by asking clarifying questions rather than proposing next steps. You don’t need to respond to this right now but remember all these points when I ask for advice on conflict situations.
Average key point coverage, broken down by system prompt variant. Select a tab to view its results.
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4.1 | GPT 4o Mini | GPT 4o | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 25th 29.1% | 20th 44.6% | 29th 23.6% | 10th 57.4% | 16th 51.4% | 9th 59.2% | 12th 53.8% | 8th 60.2% | 6th 66.3% | 5th 66.9% | 23rd 30.2% | 28th 24.3% | 30th 21.6% | 24th 29.4% | 19th 47.2% | 22nd 33.1% | 18th 48.4% | 21st 33.6% | 15th 53.2% | 26th 28.6% | 27th 27.3% | 1st 80.1% | 2nd 76.6% | 17th 48.7% | 7th 65.9% | 3rd 71.2% | 14th 53.3% | 13th 53.6% | 11th 56.7% | 4th 67.9% | |
| 57.1% | 7% | 64% | 12% | 78% | 77% | 79% | 80% | 82% | 83% | 66% | 44% | 3% | 1% | 45% | 72% | 46% | 72% | 54% | 76% | 22% | 33% | 81% | 83% | 0% | 70% | 86% | 71% | 68% | 79% | 79% | |
| 51.1% | 19% | 40% | 0% | 66% | 45% | 62% | 56% | 70% | 68% | 68% | 28% | 34% | 36% | 36% | 55% | 29% | 56% | 28% | 51% | 30% | 24% | 85% | 81% | 67% | 61% | 77% | 62% | 64% | 71% | 63% | |
| 54.3% | 42% | 43% | 38% | 67% | 52% | 65% | 59% | 56% | 69% | 71% | 32% | 27% | 36% | 30% | 61% | 29% | 50% | 27% | 65% | 34% | 28% | 88% | 76% | 63% | 82% | 80% | 70% | 63% | 52% | 75% | |
| 57.1% | 47% | 55% | 40% | 68% | 66% | 83% | 75% | 75% | 73% | 65% | 45% | 41% | 0% | 41% | 50% | 42% | 53% | 51% | 59% | 47% | 33% | 100% | 75% | 0% | 79% | 81% | 57% | 67% | 77% | 69% | |
| 42.2% | 17% | 25% | 11% | 51% | 42% | 48% | 50% | 47% | 47% | 60% | 28% | 18% | 27% | 15% | 42% | 17% | 39% | 28% | 51% | 22% | 22% | 74% | 69% | 63% | 54% | 72% | 50% | 48% | 57% | 72% | |
| 36.3% | 22% | 25% | 18% | 44% | 28% | 31% | 32% | 50% | 42% | 76% | 10% | 18% | 16% | 19% | 37% | 18% | 35% | 26% | 41% | 21% | 19% | 70% | 77% | 43% | 40% | 47% | 31% | 32% | 45% | 77% | |
| 55.1% | 43% | 63% | 30% | 63% | 60% | 71% | 43% | 63% | 80% | 64% | 36% | 28% | 29% | 28% | 55% | 43% | 56% | 40% | 54% | 46% | 43% | 71% | 79% | 75% | 81% | 75% | 61% | 56% | 48% | 68% | |
| 44.8% | 31% | 49% | 38% | 42% | 41% | 53% | 47% | 45% | 70% | 62% | 25% | 27% | 26% | 26% | 43% | 36% | 31% | 22% | 45% | 27% | 24% | 71% | 80% | 73% | 60% | 57% | 49% | 49% | 45% | 50% | |
| 41.0% | 34% | 37% | 25% | 38% | 52% | 41% | 42% | 54% | 65% | 70% | 24% | 23% | 23% | 25% | 10% | 38% | 44% | 26% | 37% | 8% | 20% | 81% | 69% | 54% | 66% | 66% | 29% | 35% | 36% | 58% |