Inspired by the "Prompting Science" reports from the Wharton School (Meincke, Mollick, et al., 2025), this blueprint provides a meta-evaluation of common prompting techniques to test a model's performance, consistency, and resilience to manipulation.
The reports rigorously demonstrate several key findings:
This evaluation synthesizes these findings by testing a model's response to a variety of prompts across different domains, including verbatim questions from the study's benchmarks (GPQA, MMLU-Pro). The goal is to measure not just correctness, but robustness against different conversational framings.
Key Study Reference:
Average key-point coverage for each model across all prompts. In each prompt row, the first value is that prompt's average across the models scored.
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 5th 98.3% | 1st 100.0% | 21st 84.2% | 5th 98.3% | 15th 88.6% | 23rd 80.7% | 14th 89.5% | 19th 86.4% | 7th 94.7% | 12th 90.5% | 16th 88.5% | 13th 90.4% | 1st 100.0% | 8th 94.4% | 4th 99.8% | 22nd 83.8% | 17th 87.2% | 18th 86.7% | 20th 84.9% | 25th 72.1% | 11th 91.1% | 1st 100.0% | 24th 78.2% | 9th 93.1% | 10th 91.6% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
96.0% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
83.3% | 100% | 100% | 100% | 100% | 50% | 0% | 0% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | ||
79.2% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 50% | 50% | 50% | 50% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 50% | 50% | 50% | 100% | 100% | 50% | ||
87.5% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | ||
81.1% | 100% | 100% | 50% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 46% | 100% | 100% | 100% | 100% | 0% | 0% | 100% | 0% | 100% | 100% | ||
83.3% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 0% | 100% | 0% | 100% | 100% | ||
93.8% | 100% | 100% | 50% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | ||
75.0% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 0% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 0% | 50% | 50% | 50% | 100% | 50% | ||
89.2% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 67% | 100% | 100% | 42% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 64% | 100% | 100% | 100% | 67% | ||
90.8% | 100% | 100% | 100% | 67% | 67% | 81% | 100% | 100% | 100% | 86% | 97% | 100% | 100% | 97% | 100% | 86% | 100% | 100% | 86% | 83% | 100% | 61% | 75% | 92% | ||
83.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 75% | 100% | 17% | 100% | 100% | 100% | 50% | 100% | 67% | 100% | 100% | 100% | 33% | 100% | 25% | 25% | 100% | ||
92.6% | 100% | 100% | 100% | 100% | 67% | 72% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 89% | 100% | 100% | 78% | 100% | 100% | 100% | 69% | 81% | ||
88.3% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 92% | 0% | 100% | 100% | 97% | 100% | 14% | 100% | 100% | 89% | 31% | 100% | 100% | 100% | ||
88.5% | 67% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 17% | 100% | 100% | 100% | 100% | 25% | 22% | 100% | 100% | 100% | 100% | 100% | 100% | ||
93.1% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 17% | 25% | 100% | 100% | 100% | 100% | 100% | 100% |
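
The row averages and rank labels in the table can be reproduced with simple arithmetic. The following is a minimal sketch, not the platform's actual code: the function names are hypothetical, and it assumes that a cell with no score is skipped when averaging and that tied scores share a rank (competition ranking: 1st, 1st, 1st, 4th, ...). Both assumptions agree with the figures shown above.

```python
from typing import Optional, Sequence


def prompt_average(cells: Sequence[Optional[float]]) -> float:
    """Mean of the scored cells in one prompt row; None (no score) is skipped."""
    present = [c for c in cells if c is not None]
    if not present:
        raise ValueError("no scored cells in this row")
    return sum(present) / len(present)


def competition_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models by score, highest first; tied scores share the best rank."""
    ordered = sorted(scores.values(), reverse=True)
    # index() returns the first position of a score, so ties get the same rank
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


# The 24 scored cells of the row whose average is listed as 83.3%
row = [100] * 4 + [50, 0, 0, 50] + [100] * 9 + [0] + [100] * 6
row_avg = round(prompt_average(row), 1)

# The seven highest entries of the "Score" row
scores = {
    "Claude 3.7 Sonnet": 100.0,
    "Meta Llama 3.1 405b Instruct Turbo": 100.0,
    "Kimi K2 Instruct": 100.0,
    "Mistral Medium 3": 99.8,
    "Claude 3.5 Sonnet": 98.3,
    "Claude Opus 4": 98.3,
    "Gemini 2.5 Flash": 94.7,
}
ranks = competition_ranks(scores)

print(row_avg)
print(ranks)
```

With these inputs the three models at 100.0% share 1st place, Mistral Medium 3 is 4th, the two models at 98.3% share 5th, and Gemini 2.5 Flash is 7th, matching the rank labels in the "Score" row.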