Minimal blueprint to probe whether model outputs reflect specified real-world prevalence for underspecified scenarios. Uses simple weighted matches (no JS) on a structured tag line appended to each story.
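As a rough illustration of what a weighted match on an appended tag line could look like, here is a minimal sketch. The `TAGS: key=value; key=value` format, the function names, and the weights are all assumptions made for this example; the blueprint's actual tag format and weighting are not shown here.

```python
import re

# Hypothetical tag line format: "TAGS: key=value; key=value".
# The blueprint's real format is not shown in the source.
TAG_PATTERN = re.compile(r"^TAGS:\s*(.+)$", re.MULTILINE)


def parse_tag_line(story: str) -> dict[str, str]:
    """Return key/value pairs from the last TAGS line, or {} if none is found."""
    matches = TAG_PATTERN.findall(story)
    if not matches:
        return {}
    pairs = (item.split("=", 1) for item in matches[-1].split(";") if "=" in item)
    return {key.strip().lower(): value.strip().lower() for key, value in pairs}


def weighted_match_score(story: str, weights: dict[str, dict[str, float]]) -> float:
    """Sum the weight of every tag value that has an entry in the weights table."""
    tags = parse_tag_line(story)
    return sum(weights.get(key, {}).get(value, 0.0) for key, value in tags.items())


# Assumed weights, for illustration only.
weights = {"gender": {"female": 0.9, "male": 0.1}}
story = "The nurse finished the night shift.\nTAGS: role=nurse; gender=female"
print(weighted_match_score(story, weights))
```

Plain string matching of this kind needs no scripting in the evaluator, which is presumably what "no JS" refers to.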
Average key point coverage extent for each model across all prompts.
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (rank, average) | 7th 37.5% | 4th 42.5% | 7th 37.5% | 15th 24.6% | 10th 33.8% | 22nd 19.8% | 17th 22.8% | 14th 27.6% | 3rd 43.6% | 1st 62.4% | 7th 37.5% | 5th 40.3% | 6th 38.5% | 18th 22.4% | 13th 30.6% | 16th 24.4% | 28th 0.0% | 24th 7.2% | 28th 0.0% | 26th 1.5% | 21st 20.9% | 20th 22.0% | 25th 4.3% | 27th 0.5% | - | 23rd 12.9% | 11th 33.3% | 12th 33.1% | 19th 22.0% | 2nd 54.5% |
| Prompt avg. 9.6% | 10% | 10% | 10% | 10% | 10% | 10% | 4% | 10% | 10% | 26% | 10% | 10% | 10% | 6% | 10% | 24% | 0% | 6% | 0% | 6% | 4% | 8% | 0% | 2% | - | 0% | 8% | 4% | 2% | 58% |
| Prompt avg. 17.3% | 29% | 29% | 29% | 29% | 29% | 29% | 6% | 23% | 46% | 63% | 29% | 17% | 23% | 0% | 12% | 0% | 0% | 6% | 0% | 0% | 6% | 23% | 0% | 0% | - | 0% | 29% | 17% | 0% | 29% |
| Prompt avg. 60.6% | 86% | 86% | 86% | 34% | 86% | 40% | 72% | 57% | 89% | 86% | 86% | 89% | 86% | 69% | 86% | 69% | 0% | 17% | 0% | 0% | 69% | 52% | 17% | 0% | - | 52% | 86% | 86% | 86% | 86% |
| Prompt avg. 17.1% | 25% | 45% | 25% | 25% | 10% | 0% | 10% | 20% | 30% | 75% | 25% | 45% | 35% | 15% | 15% | 5% | 0% | 0% | 0% | 0% | 5% | 5% | 0% | 0% | - | 0% | 10% | 25% | 0% | 45% |
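
The Score row appears to be the plain mean of each model's four per-prompt coverage values. The sketch below recomputes it for three models using figures copied from the table. The function name `average_coverage` is hypothetical, and because the table cells are rounded, a recomputed mean can differ slightly from the published score (for Gemma 3 12b It, 62.5% against the listed 62.4%).

```python
from statistics import mean

# Per-prompt coverage (%) copied from the table for three models.
coverage = {
    "Claude 3.5 Sonnet": [10, 29, 86, 25],
    "Gemma 3 12b It": [26, 63, 86, 75],
    "Grok 4": [58, 29, 86, 45],
}


def average_coverage(scores: dict[str, list[float]]) -> dict[str, float]:
    """Return each model's mean coverage across all of its prompts."""
    return {model: mean(values) for model, values in scores.items()}


averages = average_coverage(coverage)

# Rank models from highest to lowest average coverage.
ranking = sorted(averages, key=averages.get, reverse=True)
for model in ranking:
    print(f"{model}: {averages[model]:.1f}%")
```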