Loading blueprint versions...
Please wait while we gather all the unique runs for this blueprint.
Please wait while we gather all the unique runs for this blueprint.
Please wait while we prepare the detailed comparison.
This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 16th 80.1% | 7th 88.1% | 17th 79.1% | 2nd 91.6% | 11th 85.3% | 15th 80.4% | 5th 88.6% | 13th 83.6% | 18th 78.9% | 1st 92.7% | 23rd 75.0% | 25th 62.1% | 19th 78.6% | 8th 87.5% | 9th 87.3% | 14th 82.2% | 22nd 77.3% | 24th 73.8% | 12th 83.6% | 21st 77.7% | 10th 87.1% | 20th 78.0% | 6th 88.4% | 4th 89.9% | 3rd 91.2% | |
81.8% | 86% | 89% | 42% | 92% | 97% | 92% | 100% | 100% | 0% | 100% | 75% | 97% | 61% | 61% | 100% | 86% | 92% | 86% | 83% | 33% | 100% | 100% | 95% | 97% | ||
80.0% | 71% | 88% | 83% | 90% | 88% | 88% | 79% | 56% | 71% | 92% | 81% | 0% | 88% | 75% | 75% | 81% | 77% | 79% | 90% | 85% | 85% | 100% | 98% | 100% | ||
94.9% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 89% | 100% | 100% | 97% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 89% | 100% | 100% | 100% | 100% | 100% | |
52.6% | 67% | 33% | 53% | 42% | 50% | 67% | 86% | 33% | 53% | 56% | 33% | 33% | 36% | 100% | 81% | 44% | 39% | 33% | 56% | 67% | 36% | 56% | 64% | 44% | ||
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
84.6% | 88% | 100% | 94% | 96% | 98% | 75% | 81% | 94% | 75% | 75% | 75% | 75% | 69% | 69% | 85% | 94% | 85% | 85% | 81% | 50% | 94% | 92% | 100% | 100% | ||
93.7% | 100% | 100% | 100% | 97% | 100% | 94% | 100% | 83% | 100% | 100% | 94% | 81% | 100% | 89% | 100% | 78% | 100% | 100% | 100% | 100% | 100% | 67% | 81% | 78% | 100% | |
76.6% | 52% | 85% | 69% | 98% | 88% | 94% | 65% | 56% | 96% | 94% | 42% | 60% | 42% | 98% | 52% | 88% | 50% | 67% | 83% | 71% | 94% | 94% | 100% | 100% | ||
89.5% | 100% | 100% | 100% | 100% | 100% | 72% | 100% | 100% | 89% | 100% | 80% | 72% | 89% | 95% | 100% | 67% | 67% | 81% | 89% | 83% | 89% | 67% | 100% | 97% | 100% | |
73.2% | 100% | 100% | 61% | 100% | 36% | 39% | 94% | 97% | 42% | 100% | 39% | 0% | 89% | 100% | 100% | 100% | 92% | 39% | 45% | 58% | 100% | 75% | 75% | 75% | ||
76.9% | 40% | 65% | 58% | 90% | 56% | 58% | 94% | 100% | 77% | 100% | 69% | 69% | 67% | 90% | 100% | 100% | 65% | 65% | 63% | 60% | 69% | 94% | 96% | 100% | ||
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
57.5% | 54% | 46% | 19% | 87% | 54% | 52% | 56% | 58% | 63% | 96% | 71% | 63% | 40% | 58% | 54% | 88% | 46% | 40% | 52% | 46% | 25% | 60% | 52% | 100% | ||
95.1% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | ||
70.2% | 31% | 83% | 77% | 90% | 96% | 69% | 69% | 40% | 100% | 92% | 44% | 69% | 88% | 88% | 71% | 8% | 8% | 0% | 92% | 88% | 98% | 90% | 100% | 94% | ||
67.6% | 52% | 96% | 67% | 67% | 73% | 65% | 75% | 100% | 54% | 63% | 50% | 98% | 52% | 52% | 54% | 46% | 71% | 54% | 71% | 71% | 81% | 50% | 63% | 98% | ||
99.5% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 98% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | ||
97.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 33% |