Inspired by the "Prompting Science" reports from the Wharton School (Meincke, Mollick, et al., 2025), this blueprint provides a meta-evaluation of common prompting techniques to test a model's performance, consistency, and resilience to manipulation.
The reports rigorously demonstrate several key findings:
This evaluation synthesizes these findings by testing a model's response to a variety of prompts across different domains, including verbatim questions from the study's benchmarks (GPQA, MMLU-Pro). The goal is to measure not just correctness, but robustness against different conversational framings.
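As a rough illustration of what measuring robustness across framings could look like, the sketch below wraps one multiple-choice question in several framings and checks whether the answer stays correct under each. The framings, the `ask_model` callable, and the exact-match check are all hypothetical stand-ins chosen for this example; they are not the blueprint's actual prompts or scoring method.

```python
from typing import Callable, Dict

# Hypothetical framings for illustration only; the real blueprint's prompts may differ.
FRAMINGS: Dict[str, str] = {
    "plain": "{q}",
    "polite": "Please answer the following question. {q}",
    "commanding": "I order you to answer this question. {q}",
}


def framing_robustness(question: str, correct: str,
                       ask_model: Callable[[str], str]) -> dict:
    """Ask the same question under each framing and summarize the outcome."""
    per_framing = {}
    for name, template in FRAMINGS.items():
        # str.replace avoids problems if the question itself contains braces.
        prompt = template.replace("{q}", question)
        per_framing[name] = ask_model(prompt).strip() == correct
    accuracy = sum(per_framing.values()) / len(per_framing)
    # "consistent" means every framing produced the same verdict.
    consistent = len(set(per_framing.values())) == 1
    return {"per_framing": per_framing, "accuracy": accuracy, "consistent": consistent}


def stub_model(prompt: str) -> str:
    """Toy stand-in for a model: it answers wrongly when commanded."""
    return "A" if prompt.startswith("I order") else "B"


report = framing_robustness("Which option is correct? (A) ... (B) ...", "B", stub_model)
print(report)
```

Reporting accuracy and consistency separately is a deliberate choice in this sketch: a model can be right on average while still flipping its answer depending on how the question is phrased.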
Key Study Reference:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models (prompt average) | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 8th 97.2% | 1st 100.0% | 23rd 82.1% | 6th 98.3% | 17th 88.6% | 20th 85.1% | 1st 100.0% | 10th 95.6% | 12th 94.7% | 14th 93.2% | 15th 92.1% | 9th 95.7% | 4th 99.8% | 7th 97.9% | 4th 99.8% | 22nd 83.6% | 21st 83.6% | 19th 86.7% | 13th 93.3% | 25th 67.3% | 10th 95.6% | 1st 100.0% | 24th 81.7% | 16th 90.7% | 18th 87.9% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | n/a | 100% | 100% | 100% |
96.0% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
87.5% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 0% | 100% | n/a | 100% | 100% | 100% |
78.5% | 79% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 50% | 50% | 50% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 100% | 0% | 50% | n/a | 100% | 54% | 50% |
95.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | n/a | 100% | 100% | 100% |
81.3% | 100% | 100% | 50% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 0% | 100% | n/a | 0% | 100% | 100% |
87.5% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | n/a | 0% | 100% | 100% |
93.1% | 100% | 100% | 50% | 100% | 100% | 34% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | n/a | 50% | 100% | 100% |
81.3% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 0% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 0% | 100% | 100% | n/a | 100% | 100% | 50% |
88.9% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | n/a | 100% | 100% | 33% |
89.0% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 86% | 100% | 100% | 100% | 61% | 100% | 83% | 100% | 100% | 83% | 78% | 100% | n/a | 75% | 67% | 70% |
84.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 58% | 33% | 100% | 100% | 33% | 100% | n/a | 33% | 33% | 71% |
92.9% | 100% | 100% | 67% | 100% | 67% | 83% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 89% | 100% | 100% | n/a | 94% | 69% | 97% |
92.8% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 19% | 100% | 100% | 100% | 100% | 17% | 100% | 100% | 100% | 100% | n/a | 100% | 100% | 100% |
92.1% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 22% | 25% | 100% | 100% | 100% | n/a | 100% | 100% | 100% |
93.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 17% | 22% | 100% | 100% | 100% | n/a | 100% | 100% | 100% |
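
The Score row appears to be an unweighted mean of each model's per-prompt coverage, skipping prompts for which the model has no recorded score. That aggregation rule is an assumption inferred from the numbers in the table, not a documented formula. The sketch below recomputes one column under that assumption, using the GPT 4o Mini values as read from the table; the second list is a made-up column that only shows how missing cells are handled.

```python
from statistics import mean
from typing import List, Optional

# GPT 4o Mini's per-prompt coverage (percent), read top to bottom from the table.
gpt_4o_mini: List[Optional[float]] = [
    100, 100, 100, 100, 100, 0, 0, 0, 0, 0,
    100, 100, 67, 78, 33, 100, 100, 100, 100,
]

# Hypothetical column with gaps; None marks a prompt with no recorded score.
sparse_column: List[Optional[float]] = [100, None, 100, None, 100]


def model_score(cells: List[Optional[float]]) -> Optional[float]:
    """Unweighted mean of the scored prompts, rounded to one decimal place."""
    scored = [c for c in cells if c is not None]
    return round(mean(scored), 1) if scored else None


print(model_score(gpt_4o_mini))    # 67.3
print(model_score(sparse_column))  # mean over the three scored prompts only
```

Because the table shows rounded percentages, a recomputed mean can differ from the published score in the last digit for other columns.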