Multilingual World Model Riddles

Tests a model's basic world model and its ability to track object state, using simple riddles presented in multiple languages. The blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles check whether a model over-infers and whether it attends to the final state of the objects.

TAGS: Language, Logic, Multilingual Reasoning, Instruction Following & Prompt Adherence, Factual Accuracy & Hallucination

Best Models (Coverage)

1. Claude Sonnet 4: 99.8%
2. Claude Opus 4: 99.5%
3. GPT 4o: 99.4%
4. Grok 3 Mini: 99.4%
5. Claude 3.7 Sonnet: 98.7%

🤔 Most Differentiating Prompt

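User: Ich lege ein Buch auf einen Teller und stelle den Teller in den Kühlschrank. Dann nehme ich den Teller heraus. Wo ist das Buch?

(English: I put a book on a plate and put the plate in the refrigerator. Then I take the plate out. Where is the book?)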

σ = 0.323
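The page does not state how σ is computed. A minimal sketch, assuming it is the population standard deviation of per-model coverage on a single prompt, with the most differentiating prompt being the one where that spread is largest. The labels `prompt_a` and `prompt_b` and the helper names are hypothetical; the values are two rows copied from the coverage table, expressed as fractions, and it is not known which prompts they correspond to.

```python
from statistics import pstdev

# Two rows of per-model coverage taken from the coverage table, as fractions.
# The labels are hypothetical: the table does not name its prompts.
rows = {
    "prompt_a": [1.0] * 7 + [0.0] + [1.0] * 13 + [0.04, 1.0, 0.04],
    "prompt_b": [1.0] * 8 + [0.92, 0.83] + [1.0] * 3 + [0.67] + [1.0] * 10,
}


def spread(coverages):
    """Population standard deviation of one prompt's coverage across models (assumed metric)."""
    return pstdev(coverages)


def most_differentiating(table):
    """Return the label of the prompt whose coverage varies most across models."""
    return max(table, key=lambda label: spread(table[label]))


for label, coverages in rows.items():
    print(label, round(spread(coverages), 3))
print("most differentiating:", most_differentiating(rows))
```

A prompt that nearly every model answers correctly has a spread close to zero, while a prompt that a few models fail completely has a large one, which is why the spread is a reasonable proxy for how well a prompt separates models.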

🔀 Least Similar Models

Gemini 2.5 Pro vs Grok 4

54.4% similarity

👯 Most Similar Models

Claude 3.5 Haiku vs Claude 3.5 Haiku

95.4% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Prompt avg.	Claude 3.5 Haiku	Claude 3.5 Sonnet	Claude 3.7 Sonnet	Claude 3 Opus	Claude 3.5 Haiku	Claude Opus 4	Claude Sonnet 4	Command A	Deepseek Chat V3	Deepseek R1	Gemini 2.5 Flash	Gemini 2.5 Pro	Mistral Large 2411	Mistral Medium 3	GPT 4.1	GPT 4.1 Mini	GPT 4.1 Nano	GPT 4o	GPT 4o Mini	O4 Mini	Kimi K2 Instruct	Grok 3	Grok 3 Mini	Grok 4
Score	16th 93.6%	7th 98.2%	5th 98.7%	6th 98.6%	11th 95.9%	2nd 99.5%	1st 99.8%	18th 91.5%	22nd 87.4%	14th 94.8%	12th 94.9%	21st 87.9%	8th 97.1%	19th 88.2%	9th 96.2%	13th 94.8%	24th 77.1%	3rd 99.4%	15th 93.8%	20th 88.0%	17th 91.9%	9th 96.2%	3rd 99.4%	23rd 79.8%
80.9%	17%	100%	100%	100%	33%	100%	100%	100%	46%	100%	100%	67%	29%	100%	100%	83%	33%	100%	100%	100%	100%	100%	100%	33%
87.8%	100%	100%	100%	100%	100%	100%	100%	0%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	4%	100%	4%
92.8%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	0%	67%	100%	100%
86.8%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	67%	67%	33%	100%	100%	100%	100%	100%	0%	100%	100%	100%	50%
95.3%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	67%	100%	100%	100%	79%	50%	100%	100%	100%	100%	100%	100%
93.6%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	29%	100%	100%	17%
93.9%	100%	100%	83%	83%	100%	100%	100%	83%	79%	100%	100%	100%	100%	96%	100%	96%	33%	100%	100%	100%	100%	100%	100%	100%
96.7%	100%	100%	100%	100%	100%	100%	100%	83%	79%	79%	100%	100%	100%	79%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.2%	100%	100%	96%	100%	100%	100%	100%	83%	87%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
95.3%	100%	100%	100%	100%	100%	100%	100%	100%	87%	83%	100%	100%	100%	67%	67%	100%	100%	100%	100%	100%	100%	100%	100%	83%
92.3%	100%	100%	100%	100%	96%	100%	100%	67%	83%	100%	100%	100%	100%	83%	83%	100%	4%	100%	100%	100%	100%	100%	100%	100%
94.6%	100%	100%	100%	100%	100%	100%	100%	79%	67%	87%	100%	100%	100%	79%	96%	100%	83%	100%	100%	100%	100%	100%	100%	79%
96.5%	100%	100%	96%	100%	100%	100%	92%	100%	67%	96%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%
95.7%	100%	100%	100%	100%	100%	87%	100%	96%	67%	100%	100%	100%	100%	96%	100%	100%	71%	100%	100%	96%	100%	83%	100%	100%
96.5%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	100%	100%	100%	100%	83%
87.7%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	100%	100%	33%	100%	100%	4%	67%	100%	100%	21%
86.8%	50%	100%	100%	100%	100%	100%	100%	100%	100%	100%	33%	100%	100%	100%	100%	92%	83%	100%	33%	100%	100%	92%	100%	0%
86.0%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	13%	100%	100%	100%	33%	100%	100%	100%	0%	100%	100%	100%	17%
93.7%	79%	67%	100%	100%	100%	100%	100%	100%	100%	100%	88%	67%	100%	100%	100%	100%	100%	100%	54%	100%	100%	100%	100%
94.2%	100%	67%	100%	100%	100%	96%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	96%	67%	67%	100%	100%	100%
94.5%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	67%	67%	100%	100%	100%	67%	100%	100%	100%
95.7%	100%	100%	100%	96%	100%	100%	100%	100%	100%	100%	67%	67%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%
91.7%	67%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	67%	33%	100%	100%	100%	100%	100%	100%	100%
95.0%	58%	100%	100%	100%	58%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	96%	100%	100%	100%
94.1%	100%	100%	100%	100%	100%	100%	100%	100%	79%	100%	100%	100%	100%	100%	83%	100%	17%	100%	100%	100%	100%	100%	79%	100%
94.8%	100%	100%	100%	83%	100%	100%	100%	67%	79%	100%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	79%
97.6%	100%	100%	100%	100%	100%	100%	100%	100%	92%	83%	100%	100%	100%	67%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.9%	100%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%
90.6%	100%	100%	100%	100%	100%	100%	100%	83%	71%	100%	100%	100%	100%	67%	67%	100%	4%	100%	100%	100%	100%	100%	100%	83%
97.9%	100%	100%	100%	100%	100%	100%	100%	100%	83%	83%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.9%	100%	100%	100%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%
97.5%	100%	100%	96%	100%	100%	100%	100%	83%	83%	100%	100%	100%	100%	83%	100%	100%	96%	100%	100%	100%	100%	100%	100%	100%
97.7%	100%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%	100%	100%	79%	100%	100%	100%	100%	100%	100%	100%	100%	100%	83%
94.5%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	67%	100%	100%	100%	100%	33%	100%	100%	100%	67%	100%	100%	100%
94.6%	100%	100%	100%	100%	100%	100%	100%	100%	83%	100%	100%	100%	100%	100%	100%	100%	67%	100%	21%	100%	100%	100%	100%	100%
89.7%	100%	100%	100%	87%	100%	100%	100%	87%	100%	0%	100%	83%	100%	100%	100%	96%	100%	79%	71%	100%	50%	100%	100%	100%
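In each data row, the leading percentage appears to be the mean of the per-model cells that follow it, with missing cells left out (some rows carry 23 values instead of 24). A minimal sketch for checking that reading against one row of the table; the helper names `parse_row` and `recomputed_average` are illustrative and not part of the dashboard.

```python
def parse_row(line: str) -> tuple[float, list[float]]:
    """Split one tab-separated data row into (reported average, per-model cells)."""
    cells = [c.strip().rstrip("%") for c in line.split("\t")]
    reported = float(cells[0])
    # Skip empty cells: some rows have fewer than 24 values.
    values = [float(c) for c in cells[1:] if c]
    return reported, values


def recomputed_average(values: list[float]) -> float:
    """Mean of the available cells, rounded to one decimal like the table."""
    return round(sum(values) / len(values), 1)


# A 23-cell row from the table above, rebuilt with explicit tabs.
cells = ["100%"] * 11 + ["67%"] + ["100%"] * 7 + ["0%", "67%", "100%", "100%"]
line = "\t".join(["92.8%"] + cells)

reported, values = parse_row(line)
print(reported, recomputed_average(values), len(values))
```

Averaging only over the cells that are present keeps a row with a missing result from being penalised as if that model had scored zero.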

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.