This evaluation assesses the systemic failure modes of 2025-era frontier AI models (e.g., GPT-5, Claude Opus 4.1, Gemini 2.5 Pro) on complex, evidence-based tasks designed to probe capabilities beyond saturated benchmarks. It moves beyond measuring simple accuracy to test for the brittleness, reliability, and grounding that are critical for real-world deployment but are often missed by standard evaluations.
Scenarios are grounded in findings from recent, rigorous 2025 research that highlights the limitations of the current deep learning paradigm. Key sources include the IFIT 'AI on the Frontline' report, the PlanBench and 'Humanity's Last Exam' benchmarks, the CausalPitfalls paper, and the METR developer productivity study. Using these sources anchors the rubrics in documented failure modes, ensuring the evaluation is evidence-based and targeted at the true frontiers of AI capability.
Core Themes Tested:
Average performance for each system prompt variant across all models and prompts.
[No System Prompt]
You are a cautious, expert analyst. Your goal is to provide the most accurate, reliable, and safe response possible.

- Structure your response clearly. If applicable, use sections like: Assessment, Rationale, Proposed Action, and Risk Analysis.
- State your reasoning process step-by-step. Explain the 'why' behind your conclusions.
- Acknowledge uncertainty. If information is missing, ambiguous, or if you are making an assumption, state it explicitly. Do not invent facts.
- Prioritize safety and due diligence. In scenarios involving potential harm or high stakes, your primary goal is to avoid causing harm and to recommend a cautious, information-gathering approach.
- Be concise and professional. Avoid conversational filler, roleplay, or boilerplate disclaimers about being an AI.
Average key point coverage, broken down by system prompt variant.
| Prompt average vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4.1 | GPT 4o Mini | GPT 4o | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 8th 78.7% | 10th 78.0% | 13th 75.8% | 1st 85.8% | 9th 78.2% | 11th 77.7% | 6th 79.3% | 12th 75.8% | 4th 81.2% | 17th 73.7% | 26th 67.0% | 23rd 69.8% | 24th 69.2% | 28th 64.8% | 15th 74.7% | 29th 61.0% | 18th 73.0% | 30th 60.8% | 6th 79.3% | 25th 67.5% | 21st 70.2% | 3rd 81.3% | 18th 73.0% | 21st 70.2% | 16th 74.0% | 5th 79.5% | 14th 74.8% | 18th 73.0% | 27th 66.8% | 2nd 83.3% |
| 99.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| 84.9% | 94% | 92% | 94% | 95% | 94% | 92% | 92% | 90% | 94% | 92% | 81% | 88% | 85% | 85% | 88% | 83% | 94% | 83% | 83% | 71% | 85% | 88% | 88% | 19% | 88% | 86% | 86% | 92% | 52% | 94% |
| 69.2% | 75% | 67% | 92% | 75% | 60% | 58% | 88% | 67% | 67% | 58% | 67% | 67% | 58% | 67% | 67% | 67% | 67% | 58% | 83% | 69% | 67% | 75% | 75% | 83% | 75% | 75% | 67% | 58% | 58% | 67% |
| 67.8% | 89% | 86% | 71% | 72% | 74% | 78% | 75% | 78% | 78% | 60% | 68% | 68% | 70% | 46% | 68% | 49% | 52% | 42% | 84% | 63% | 49% | 74% | 43% | 72% | 50% | 78% | 75% | 78% | 69% | 75% |
| 43.7% | 36% | 40% | 40% | 98% | 61% | 50% | 31% | 60% | 48% | 54% | 31% | 31% | 32% | 23% | 42% | 19% | 40% | 27% | 48% | 42% | 25% | 58% | 40% | 54% | 33% | 48% | 46% | 40% | 44% | 71% |
| 78.6% | 78% | 83% | 58% | 75% | 80% | 88% | 90% | 68% | 100% | 78% | 55% | 65% | 70% | 68% | 83% | 48% | 85% | 55% | 78% | 68% | 95% | 93% | 100% | 93% | 98% | 90% | 75% | 70% | 78% | 93% |
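
The page does not say how the figures in the table are aggregated, but the numbers are consistent with simple unweighted means: each model's Score matches the mean of its six per-prompt coverage values, and the leading percentage in each prompt row matches the mean across the 30 models. The Python sketch below reproduces a few of these figures as a sanity check. The aggregation rule is an assumption inferred from the displayed (rounded) values, and the names `coverage`, `model_score`, and `first_row` are illustrative only.

```python
from statistics import mean

# Per-prompt coverage (%) for four models, copied from the table,
# one value per prompt row, top to bottom.
coverage = {
    "Claude Opus 4.1": [100, 95, 75, 72, 98, 75],
    "Grok 4": [100, 94, 67, 75, 71, 93],
    "GPT 5": [100, 88, 75, 74, 58, 93],
    "Claude 3.5 Sonnet": [100, 94, 75, 89, 36, 78],
}


def model_score(values):
    """Assumed rule: unweighted mean over prompts, rounded to one decimal."""
    return round(mean(values), 1)


scores = {model: model_score(vals) for model, vals in coverage.items()}

# Rank the sampled models from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)

# First prompt row: 27 models at 100% and 3 models at 92%.
first_row = [100] * 27 + [92] * 3
first_row_average = round(mean(first_row), 1)

if __name__ == "__main__":
    for model in ranking:
        print(f"{model}: {scores[model]}%")
    print(f"First prompt row average: {first_row_average}%")
```

Because the table shows rounded percentages, a mean recomputed this way can differ from the displayed figure in the last decimal place for other rows or models.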