Minimal blueprint to probe whether model outputs reflect specified real-world prevalence for underspecified scenarios. Uses simple weighted matches (no JS) on a structured tag line appended to each story.
Average key point coverage extent for each model across all prompts.
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Sonnet 4 | Gemini 2.5 Flash | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT OSS 120b | GPT OSS 20b | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 4th 43.3% | 2nd 47.7% | 4th 43.3% | 7th 39.8% | 11th 33.4% | 1st 65.9% | 6th 41.0% | 3rd 45.8% | 13th 29.1% | 10th 37.0% | 12th 30.4% | 20th 9.0% | 16th 14.8% | 20th 9.0% | 18th 10.0% | 14th 28.0% | 17th 12.9% | 19th 9.5% | 15th 19.8% | 8th 39.3% | 9th 38.6% |
| 16.7% | 19% | 19% | 19% | 19% | 19% | 33% | 19% | 19% | 15% | 19% | 31% | 9% | 15% | 9% | 15% | 13% | 9% | 11% | 7% | 17% | 13% |
| 22.4% | 35% | 35% | 35% | 35% | 30% | 66% | 33% | 25% | 7% | 19% | 9% | 9% | 12% | 9% | 9% | 14% | 9% | 9% | 9% | 35% | 25% |
| 60.9% | 87% | 87% | 87% | 87% | 57% | 87% | 85% | 90% | 71% | 87% | 68% | 9% | 23% | 9% | 9% | 71% | 25% | 9% | 56% | 87% | 87% |
| 23.5% | 32% | 50% | 32% | 18% | 27% | 77% | 27% | 50% | 23% | 23% | 14% | 9% | 9% | 9% | 7% | 14% | 9% | 9% | 7% | 18% | 30% |
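
As a rough cross-check of the Score row, the per-prompt cells for a model can be averaged directly. The sketch below assumes the Score is an unweighted mean over the four prompt rows (the page does not state this explicitly) and uses the rounded percentages shown in the table, so results can differ from the published figures by a few tenths of a point. The names `per_prompt` and `average_coverage` are illustrative, not part of the evaluation tool.

```python
# Per-prompt coverage (percent), copied from the four prompt rows of the table
# for three of the listed models. These are the rounded values as displayed.
per_prompt = {
    "Claude 3.5 Sonnet": [19, 35, 87, 32],
    "Gemma 3 12b It": [33, 66, 87, 77],
    "GPT 4.1": [9, 9, 9, 9],
}


def average_coverage(scores):
    """Unweighted mean of per-prompt coverage (assumed aggregation rule)."""
    return sum(scores) / len(scores)


averages = {model: average_coverage(s) for model, s in per_prompt.items()}

# Rank the models from highest to lowest average coverage.
ranked = sorted(averages, key=averages.get, reverse=True)

for model in ranked:
    print(f"{model}: {averages[model]:.2f}%")
```

Under this assumption the recomputed means are 43.25, 65.75, and 9.0, against the table's 43.3%, 65.9%, and 9.0%. The small gaps are consistent with the cells having been rounded for display before averaging.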