Tests a model's basic world model and ability to track object state through simple riddles presented in multiple languages. This blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles are designed to check for over-inference and attention to the final state of the objects.
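Average key-point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over all prompts. In every later row, the first cell is that prompt's mean coverage across the models that have a score, and a blank cell means no score is shown for that model on that prompt.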
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 15th 97.2% | 8th 98.2% | 6th 98.7% | 3rd 99.4% | 17th 95.9% | 5th 98.9% | 2nd 99.5% | 27th 72.4% | 24th 88.3% | 12th 97.3% | 18th 95.2% | 20th 91.3% | 9th 98.1% | 11th 97.5% | 14th 97.3% | 7th 98.4% | 25th 85.1% | 19th 94.5% | 10th 98.1% | 26th 77.4% | 1st 99.6% | 13th 97.3% | 23rd 89.6% | 21st 91.2% | 16th 96.2% | 4th 99.2% | 22nd 89.8% | |
86.2% | 33% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 96% | 100% | 100% | 100% | 33% | 100% | 100% | 13% | 100% | 100% | 100% | ||
90.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 13% | 100% | 92% | 100% | 50% | |
91.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 38% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 67% | 100% | 100% | ||
94.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 96% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
89.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 96% | 67% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 92% | ||
96.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 29% | 100% | 100% | ||
93.9% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 96% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 96% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | |
98.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.8% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 67% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 87% | 67% | 100% | 100% | 100% | 96% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
94.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 75% | 83% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.5% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 67% | 92% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 67% | 96% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | |
96.3% | 100% | 100% | 96% | 100% | 100% | 100% | 92% | 71% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | |
94.7% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 79% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 83% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | |
97.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
99.3% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
82.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 29% | 100% | 100% | 33% | 100% | 100% | 100% | 67% | 100% | 100% | 9% | 100% | 88% | 100% | 83% | 100% | 21% | 4% | 100% | ||
81.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 4% | 100% | 100% | 100% | 17% | 100% | 100% | 100% | 100% | 38% | 100% | 100% | 33% | 100% | 100% | 0% | 33% | 100% | 100% | ||
95.5% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 54% | 100% | 100% | 100% | 100% | ||
90.8% | 100% | 67% | 100% | 96% | 100% | 98% | 100% | 0% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | ||
94.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | ||
90.1% | 100% | 100% | 100% | 87% | 100% | 100% | 100% | 0% | 100% | 100% | 67% | 67% | 87% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | ||
94.1% | 100% | 100% | 100% | 100% | 62% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | ||
96.0% | 67% | 100% | 100% | 100% | 58% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 71% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 71% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
98.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
98.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
90.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 67% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 67% | 67% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 71% | 79% | 67% | 100% | 100% | 100% | 96% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
95.6% | 100% | 100% | 100% | 100% | 100% | 84% | 100% | 83% | 79% | 87% | 100% | 100% | 83% | 100% | 100% | 100% | 79% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | |
96.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
95.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 37% | 100% | 100% | 100% | 100% | 100% | 100% | ||
98.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | ||
98.0% | 100% | 100% | 96% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 87% | 67% | 100% | 100% | 100% | 100% | 100% |
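
The averages in the table can be reproduced with simple arithmetic. The sketch below is a minimal illustration, not the site's actual scoring code: the helper names `parse_cell`, `row_average`, and `rank_models` are hypothetical, and it assumes that blank cells are skipped rather than counted as 0% and that ranks follow descending score order. The sample row mirrors the prompt row averaging 99.3% (one 83%, the rest 100%, final cell blank), and the ranking input uses three values from the Score row.

```python
from statistics import mean


def parse_cell(cell):
    """Turn a cell such as '83%' into 83.0; a blank cell becomes None."""
    cell = cell.strip()
    return float(cell.rstrip("%")) if cell else None


def row_average(cells):
    """Mean coverage over the non-blank cells of one prompt row.

    Assumption: blank cells are ignored, not treated as 0%.
    """
    scores = [s for s in map(parse_cell, cells) if s is not None]
    return mean(scores) if scores else None


def rank_models(scores_by_model):
    """Return (rank, model, score) tuples, highest score first.

    Assumption: ties keep their input order; the page may break ties
    using unrounded scores, which are not available here.
    """
    ordered = sorted(scores_by_model.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


# One prompt row: a single 83%, the remaining scores 100%, last cell blank.
row = ["100%", "100%", "83%"] + ["100%"] * 23 + [""]
print(round(row_average(row), 1))  # 99.3

# Three entries taken from the Score row.
top = {"Claude 3 Opus": 99.4, "GPT 4o": 99.6, "Claude Sonnet 4": 99.5}
for rank, name, score in rank_models(top):
    print(rank, name, f"{score}%")
```

Skipping blanks rather than zero-filling them matters here: counting the blank cell as 0% would pull the sample row's average down to roughly 95.7%, which does not match the 99.3% shown.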