Multilingual World Model Riddles

Tests a model's basic world model and ability to track object state through simple riddles presented in multiple languages. This blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles are designed to check for over-inference and attention to the final state of the objects.

TAGS: Language, Logic

Best Models (Coverage)

1. GPT 4o: 99.9%
2. Claude Sonnet 4: 99.8%
3. Claude Opus 4: 99.5%
4. Grok 3 Mini: 99.4%
5. Mistral Large 2411: 97.1%

🤔 Most Differentiating Prompt

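User: Coloco um livro em um prato e coloco o prato na geladeira. Depois, eu tiro o prato. Onde está o livro?
(English: "I put a book on a plate and put the plate in the refrigerator. Then I take the plate out. Where is the book?")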

σ = 0.448
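
The σ figure appears to measure how widely model scores spread on this prompt. The following is a minimal sketch, not the dashboard's confirmed method: it assumes σ is the sample standard deviation of per-model coverage on a 0 to 1 scale, and it assumes the table row averaging 54.0% belongs to this prompt (the page does not label its rows). The variable names are illustrative.

```python
from statistics import mean, stdev

# Per-model coverage (%) copied from the table row whose average is 54.0%.
# Assumption: this row corresponds to the prompt shown above.
coverage_pct = [8, 0, 0, 0, 100, 100, 100, 0, 67, 100, 67, 0,
                100, 96, 96, 0, 33, 100, 100, 25, 0, 79, 100, 25]

# Rescale to the 0..1 range before measuring spread.
scores = [p / 100 for p in coverage_pct]

# Sample standard deviation (n - 1 in the denominator).
sigma = stdev(scores)

print(f"mean = {mean(scores):.3f}, sigma = {sigma:.3f}")
```

Under these assumptions the sketch prints a mean of 0.540 and a sigma of 0.448, which agrees with the value reported above. The population formula (`statistics.pstdev`) would give a slightly smaller number, which is why the sample version is assumed here.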

🔀 Least Similar Models

Gemini 2.5 Pro vs Grok 4

53.8% similarity

👯 Most Similar Models

Claude 3.5 Haiku vs Claude 3.5 Haiku

95.2% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Prompt avg.	Claude 3.5 Haiku	Claude 3.5 Sonnet	Claude 3.7 Sonnet	Claude 3 Opus	Claude 3.5 Haiku	Claude Opus 4	Claude Sonnet 4	Command A	Deepseek Chat V3	Deepseek R1	Gemini 2.5 Flash	Gemini 2.5 Pro	Mistral Large 2411	Mistral Medium 3	GPT 4.1	GPT 4.1 Mini	GPT 4.1 Nano	GPT 4o	GPT 4o Mini	O4 Mini	Kimi K2 Instruct	Grok 3	Grok 3 Mini	Grok 4
Score	16th 88.9%	11th 95.4%	8th 95.9%	7th 96.2%	12th 94.6%	3rd 99.5%	2nd 99.8%	22nd 86.3%	21st 87.1%	6th 97.1%	15th 93.2%	20th 87.1%	5th 97.1%	19th 87.7%	8th 95.9%	14th 94.0%	24th 74.3%	1st 99.9%	12th 94.6%	18th 88.7%	17th 88.7%	10th 95.6%	4th 99.4%	23rd 80.0%
80.9%	17%	100%	100%	100%	33%	100%	100%	100%	46%	100%	100%	67%	29%	100%	100%	83%	33%	100%	100%	100%	100%	100%	100%	33%
87.8%	100%	100%	100%	100%	100%	100%	100%	0%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	4%	100%	4%
92.8%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	0%	67%	100%	100%
86.8%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	67%	67%	33%	100%	100%	100%	100%	100%	0%	100%	100%	100%	50%
95.3%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	67%	100%	100%	100%	79%	50%	100%	100%	100%	100%	100%	100%
93.6%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	29%	100%	100%	17%
93.9%	100%	100%	83%	83%	100%	100%	100%	83%	79%	100%	100%	100%	100%	96%	100%	96%	33%	100%	100%	100%	100%	100%	100%	100%
96.7%	100%	100%	100%	100%	100%	100%	100%	83%	79%	79%	100%	100%	100%	79%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.2%	100%	100%	96%	100%	100%	100%	100%	83%	87%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
95.3%	100%	100%	100%	100%	100%	100%	100%	100%	87%	83%	100%	100%	100%	67%	67%	100%	100%	100%	100%	100%	100%	100%	100%	83%
92.3%	100%	100%	100%	100%	96%	100%	100%	67%	83%	100%	100%	100%	100%	83%	83%	100%	4%	100%	100%	100%	100%	100%	100%	100%
94.6%	100%	100%	100%	100%	100%	100%	100%	79%	67%	87%	100%	100%	100%	79%	96%	100%	83%	100%	100%	100%	100%	100%	100%	79%
96.5%	100%	100%	96%	100%	100%	100%	92%	100%	67%	96%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%
95.7%	100%	100%	100%	100%	100%	87%	100%	96%	67%	100%	100%	100%	100%	96%	100%	100%	71%	100%	100%	96%	100%	83%	100%	100%
97.5%	100%	100%	100%	100%	100%	100%	100%	100%	92%	83%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	83%
87.7%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	100%	100%	33%	100%	100%	4%	67%	100%	100%	21%
86.8%	50%	100%	100%	100%	100%	100%	100%	100%	100%	100%	33%	100%	100%	100%	100%	92%	83%	100%	33%	100%	100%	92%	100%	0%
54.0%	8%	0%	0%	0%	100%	100%	100%	0%	67%	100%	67%	0%	100%	96%	96%	0%	33%	100%	100%	25%	0%	79%	100%	25%
93.7%	79%	67%	100%	100%	100%	100%	100%	100%	100%	100%	88%	67%	100%	100%	100%	100%	100%	100%	54%	100%	100%	100%	100%
94.2%	100%	67%	100%	100%	100%	96%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	96%	67%	67%	100%	100%	100%
94.5%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	67%	67%	100%	100%	100%	67%	100%	100%	100%
95.7%	100%	100%	100%	96%	100%	100%	100%	100%	100%	100%	67%	67%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%
91.7%	67%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	67%	33%	100%	100%	100%	100%	100%	100%	100%
95.0%	58%	100%	100%	100%	58%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	100%	100%	100%
94.1%	100%	100%	100%	100%	100%	100%	100%	100%	79%	100%	100%	100%	100%	100%	83%	100%	17%	100%	100%	100%	100%	100%	79%	100%
94.8%	100%	100%	100%	83%	100%	100%	100%	67%	79%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	79%
97.6%	100%	100%	100%	100%	100%	100%	100%	100%	92%	83%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.9%	100%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%
90.6%	100%	100%	100%	100%	100%	100%	100%	83%	71%	100%	100%	100%	100%	67%	67%	100%	4%	100%	100%	100%	100%	100%	100%	83%
97.9%	100%	100%	100%	100%	100%	100%	100%	100%	83%	83%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.9%	100%	100%	100%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.5%	100%	100%	96%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	83%	100%	100%	96%	100%	100%	100%	100%	100%	100%	100%
96.2%	100%	100%	100%	100%	100%	100%	100%	100%	79%	100%	100%	100%	100%	67%	79%	100%	100%	100%	100%	100%	100%	100%	100%	83%
94.5%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	33%	100%	100%	100%	67%	100%	100%	100%
94.6%	100%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	67%	100%	21%	100%	100%	100%	100%	100%
83.5%	21%	100%	100%	100%	50%	100%	100%	0%	100%	100%	71%	67%	100%	100%	100%	100%	67%	96%	100%	100%	33%	100%	100%	100%
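
The first cell of each data row looks like the average of that row's per-model cells. This reading is inferred from the arithmetic, not stated on the page. As a hedged sketch, the hypothetical helper `check_row` below parses one tab-separated row and recomputes that average. It divides by the number of cells actually present, since a few rows in this copy appear to carry fewer than 24 values.

```python
def check_row(line: str) -> tuple[float, float, int]:
    """Return (reported average, recomputed average, number of model cells).

    Assumes the first tab-separated cell is the row average and the
    remaining cells are per-model coverage percentages.
    """
    cells = [float(c.rstrip("%")) for c in line.strip().split("\t")]
    reported, values = cells[0], cells[1:]
    recomputed = sum(values) / len(values)
    return reported, recomputed, len(values)


# The row reported as 87.8%, rebuilt from the table:
# seven 100% cells, one 0%, thirteen 100% cells, then 4%, 100%, 4%.
row = "87.8%\t" + "\t".join(
    ["100%"] * 7 + ["0%"] + ["100%"] * 13 + ["4%", "100%", "4%"]
)

reported, recomputed, n_models = check_row(row)
print(reported, round(recomputed, 1), n_models)
```

For this row the recomputed mean rounds to the reported 87.8% over 24 cells, which supports the reading of the first column as a per-prompt average across models.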

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.