Loading blueprint versions...
Please wait while we gather all the unique runs for this blueprint.
Please wait while we gather all the unique runs for this blueprint.
Please wait while we prepare the detailed comparison.
Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.
Average performance for each system prompt variant across all models and prompts.
[No System Prompt]
Do not hallucinate.
Average key point coverage, broken down by system prompt variant. Select a tab to view its results.
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Haiku 4.5 | Claude Opus 4.1 | Claude Sonnet 4.5 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4.1 | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | GPT 4o | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 4th 84.8% | 8th 83.0% | 3rd 86.4% | 2nd 87.5% | 7th 83.6% | 1st 91.5% | 6th 84.1% | 31st 46.0% | 26th 55.6% | 18th 68.9% | 17th 70.5% | 35th 20.2% | 14th 76.4% | 9th 82.4% | 5th 84.3% | 23rd 61.1% | 30th 46.5% | 34th 30.0% | 21st 64.3% | 24th 57.7% | 19th 68.2% | 22nd 61.5% | 13th 78.5% | 10th 82.3% | 33rd 38.6% | 12th 79.5% | 16th 73.9% | 27th 51.6% | 32nd 44.1% | 28th 51.5% | 29th 50.3% | 11th 81.4% | 25th 57.6% | 20th 67.3% | 15th 75.8% | |
| 76.4% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 38% | 88% | 92% | 92% | 33% | 100% | 100% | 100% | 100% | 88% | 67% | 61% | 100% | 92% | 50% | 79% | 100% | 17% | 92% | 100% | 32% | 61% | 17% | 35% | 79% | 42% | 24% | 100% | |
| 98.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 92% | 100% | 100% | 92% | 100% | |
| 73.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 50% | 51% | 79% | 71% | 1% | 100% | 100% | 100% | 0% | 0% | 0% | 100% | 4% | 88% | 56% | 100% | 100% | 3% | 100% | 100% | 100% | 100% | 50% | 79% | 100% | 75% | 63% | 92% | |
| 68.4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 25% | 92% | 50% | 100% | 0% | 100% | 100% | 91% | 100% | 100% | 0% | 0% | 100% | 100% | 50% | 100% | 100% | 3% | 100% | 100% | 0% | 0% | 0% | 5% | 100% | 79% | 0% | 100% | |
| 84.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 67% | 100% | 100% | 38% | 88% | 87% | 100% | 66% | 100% | 75% | 100% | 92% | 100% | 100% | 100% | 100% | 59% | 100% | 100% | 0% | 2% | 100% | 39% | 92% | 74% | 100% | 92% | |
| 66.4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 71% | 100% | 100% | 0% | 100% | 88% | 100% | 84% | 33% | 34% | 0% | 18% | 100% | 50% | 100% | 100% | 0% | 100% | 100% | 0% | 0% | 9% | 31% | 100% | 34% | 75% | 100% | |
| 73.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 9% | 18% | 100% | 84% | 3% | 100% | 100% | 100% | 53% | 0% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 6% | 100% | 65% | 19% | 59% | 100% | 13% | 100% | 38% | 92% | 100% | |
| 74.2% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 46% | 75% | 92% | 88% | 17% | 92% | 100% | 100% | 100% | 27% | 34% | 79% | 73% | 79% | 0% | 45% | 75% | 88% | 73% | 100% | 88% | 32% | 63% | 32% | 100% | 50% | 100% | 63% | |
| 0.0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | |
| 85.0% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 11% | 100% | 100% | 100% | 52% | 100% | 10% | 100% | 100% | 13% | 70% | 72% | 100% | 2% | 80% | 100% | 100% | 75% | 100% | 100% | 100% | 100% | 100% | 100% | |
| 52.7% | 100% | 100% | 100% | 100% | 12% | 100% | 100% | 12% | 14% | 5% | 22% | 12% | 100% | 100% | 100% | 9% | 13% | 1% | 8% | 8% | 16% | 16% | 100% | 100% | 8% | 100% | 100% | 13% | 50% | 13% | 14% | 100% | 3% | 100% | 100% | |
| 76.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 31% | 50% | 100% | 50% | 17% | 100% | 100% | 100% | 100% | 20% | 9% | 100% | 100% | 100% | 100% | 100% | 100% | 17% | 100% | 100% | 51% | 11% | 13% | 50% | 100% | 75% | 92% | 75% | |
| 46.4% | 65% | 17% | 100% | 68% | 100% | 99% | 100% | 17% | 10% | 63% | 50% | 21% | 0% | 98% | 50% | 59% | 17% | 5% | 0% | 0% | 6% | 75% | 100% | 100% | 20% | 75% | 23% | 12% | 17% | 0% | 46% | 63% | 40% | 45% | 67% | |
| 93.5% | 85% | 96% | 100% | 95% | 95% | 100% | 100% | 96% | 85% | 89% | 100% | 50% | 95% | 98% | 99% | 100% | 95% | 96% | 92% | 95% | 99% | 96% | 96% | 96% | 90% | 98% | 94% | 94% | 82% | 90% | 93% | 100% | 99% | 100% | 92% | |
| 59.1% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 42% | 83% | 18% | 50% | 17% | 33% | 100% | 97% | 55% | 17% | 17% | 42% | 30% | 55% | 50% | 92% | 17% | 88% | 82% | 29% | 17% | 17% | 17% | 35% | 100% | 17% | 72% | 100% | |
| 93.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 75% | 75% | 100% | 100% | 42% | 100% | 100% | 100% | 100% | 52% | 92% | 100% | 65% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% | 100% | 75% | 100% | 100% | 100% | 100% | |
| 41.7% | 90% | 100% | 93% | 85% | 29% | 82% | 47% | 17% | 38% | 20% | 59% | 20% | 53% | 25% | 72% | 25% | 38% | 53% | 20% | 27% | 25% | 34% | 25% | 100% | 13% | 25% | 13% | 25% | 24% | 22% | 27% | 38% | 50% | 27% | 25% | |
| 35.3% | 33% | 32% | 91% | 89% | 21% | 87% | 25% | 25% | 63% | 44% | 6% | 22% | 24% | 38% | 58% | 38% | 25% | 60% | 34% | 31% | 25% | 25% | 25% | 25% | 28% | 25% | 26% | 25% | 28% | 35% | 25% | 42% | 25% | 25% | 13% | |
| 90.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 69% | 50% | 100% | 84% | 0% | 100% | 100% | 100% | 100% | 84% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | |
| 16.4% | 17% | 16% | 17% | 16% | 17% | 16% | 17% | 17% | 9% | 16% | 17% | 17% | 16% | 17% | 16% | 16% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 16% | 16% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | |
| 92.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 9% | 100% | 100% | 100% | 100% | 100% | 12% | 100% | 9% | 100% | 100% | 100% | 100% | 93% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
| 84.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 54% | 75% | 100% | 100% | 21% | 100% | 100% | 100% | 50% | 26% | 12% | 100% | 100% | 100% | 100% | 100% | 100% | 5% | 100% | 100% | 100% | 53% | 75% | 100% | 100% | 100% | 100% | 100% | |
| 84.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 71% | 25% | 88% | 79% | 38% | 100% | 100% | 100% | 71% | 71% | 79% | 100% | 100% | 83% | 79% | 92% | 100% | 83% | 83% | 100% | 88% | 38% | 79% | 63% | 100% | 59% | 88% | 88% | |
| 13.0% | 2% | 1% | 2% | 92% | 83% | 100% | 0% | 8% | 3% | 18% | 11% | 8% | 7% | 0% | 1% | 13% | 3% | 0% | 0% | 1% | 7% | 0% | 0% | 0% | 0% | 25% | 6% | 10% | 3% | 14% | 8% | 7% | 0% | 17% | 10% | |
| 59.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 17% | 0% | 1% | 50% | 4% | 100% | 100% | 100% | 0% | 1% | 0% | 100% | 100% | 50% | 75% | 100% | 100% | 50% | 100% | 50% | 50% | 52% | 1% | 0% | 63% | 32% | 25% | 50% | |
| 81.3% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 75% | 88% | 100% | 14% | 100% | 100% | 100% | 100% | 44% | 7% | 100% | 9% | 100% | 39% | 84% | 100% | 3% | 74% | 100% | 100% | 55% | 100% | 100% | 100% | 84% | 75% | 100% | |
| 75.9% | 100% | 100% | 32% | 32% | 100% | 95% | 83% | 33% | 90% | 100% | 95% | 33% | 58% | 75% | 95% | 62% | 88% | 36% | 84% | 84% | 100% | 79% | 95% | 92% | 56% | 100% | 83% | 54% | 33% | 78% | 92% | 100% | 64% | 92% | 68% |