This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
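Average key point coverage for each model across all prompts. The Score row gives each model's rank and overall average; in every later row, the first cell is that prompt's average across the models.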
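The per-prompt averages in the first column can be checked against the per-model cells. The sketch below assumes an unweighted mean rounded to one decimal place; that is an inference from the numbers, not something the page states. It uses the two rows in which exactly one model falls short of 100% (91% in one, 83% in the other). The names `row_average`, `row_with_91`, and `row_with_83` are made up for illustration.

```python
from statistics import mean

# Per-model coverage (%) for two table rows, 23 models each.
# Cell order does not affect the mean, so it is not preserved here.
row_with_91 = [100] * 22 + [91]   # row whose first cell reads 99.6%
row_with_83 = [100] * 22 + [83]   # row whose first cell reads 99.3%


def row_average(scores):
    """Unweighted mean of per-model scores, rounded to one decimal (assumed)."""
    return round(mean(scores), 1)


print(row_average(row_with_91))  # 99.6
print(row_average(row_with_83))  # 99.3
```

Both results match the first cell of the corresponding row, which is consistent with a plain mean over the 23 models. Rows that contain an empty cell cannot be checked this way without knowing how missing values are handled.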
Prompts vs. Models | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 5th 90.7% | 2nd 92.2% | 7th 90.1% | 19th 80.6% | 20th 79.1% | 9th 86.9% | 6th 90.6% | 11th 86.0% | 21st 77.7% | 23rd 76.2% | 14th 82.6% | 16th 81.1% | 17th 80.8% | 10th 86.5% | 18th 80.7% | 22nd 76.4% | 13th 85.3% | 15th 81.3% | 8th 87.4% | 12th 85.3% | 4th 91.1% | 1st 95.7% | 3rd 91.9% | |
85.8% | 100% | 90% | 98% | 80% | 90% | 90% | 90% | 90% | 80% | 78% | 80% | 80% | 80% | 78% | 100% | 80% | 78% | 80% | 83% | 88% | 85% | 90% | ||
89.2% | 88% | 86% | 86% | 86% | 77% | 100% | 93% | 93% | 79% | 86% | 93% | 100% | 80% | 86% | 100% | 100% | 79% | 71% | 100% | 79% | 100% | 100% | ||
96.5% | 100% | 100% | 94% | 100% | 72% | 94% | 100% | 100% | 100% | 100% | 100% | 62% | 97% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
98.1% | 100% | 100% | 94% | 100% | 74% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | |
99.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 91% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.4% | 100% | 100% | 91% | 89% | 94% | 89% | 100% | 100% | 91% | 100% | 100% | 86% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
99.1% | 100% | 100% | 100% | 88% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
70.6% | 100% | 60% | 90% | 70% | 20% | 80% | 80% | 40% | 80% | 80% | 93% | 20% | 30% | 90% | 80% | 70% | 80% | 70% | 80% | 80% | 70% | 80% | 80% | |
84.0% | 83% | 92% | 98% | 96% | 81% | 50% | 90% | 83% | 98% | 88% | 73% | 90% | 96% | 90% | 58% | 42% | 96% | 100% | 83% | 60% | 88% | 98% | 100% | |
99.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | |
51.4% | 83% | 100% | 48% | 45% | 28% | 60% | 90% | 40% | 45% | 23% | 53% | 25% | 43% | 43% | 40% | 38% | 43% | 43% | 45% | 43% | 100% | 53% | ||
82.0% | 96% | 100% | 100% | 25% | 98% | 96% | 100% | 92% | 17% | 17% | 33% | 96% | 94% | 98% | 71% | 85% | 90% | 96% | 94% | 90% | 98% | 100% | 100% | |
82.7% | 80% | 80% | 80% | 80% | 83% | 100% | 75% | 80% | 80% | 90% | 80% | 75% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 98% | 100% | |
78.0% | 100% | 100% | 100% | 90% | 70% | 90% | 100% | 100% | 20% | 13% | 60% | 100% | 60% | 83% | 50% | 30% | 100% | 60% | 90% | 100% | 100% | 100% | ||
50.1% | 70% | 45% | 55% | 60% | 40% | 40% | 50% | 40% | 30% | 30% | 55% | 38% | 45% | 28% | 40% | 40% | 53% | 40% | 70% | 20% | 100% | 95% | 68% | |
90.6% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 83% | 80% | 80% | 100% | 85% | 83% | 88% | 83% | 100% | 85% | 85% | 83% | 83% | 83% | 100% | |
77.8% | 65% | 90% | 90% | 83% | 58% | 95% | 73% | 85% | 93% | 70% | 75% | 75% | 70% | 90% | 73% | 68% | 63% | 48% | 85% | 83% | 85% | 100% | 73% | |
92.6% | 80% | 100% | 100% | 95% | 100% | 95% | 100% | 100% | 83% | 88% | 93% | 90% | 90% | 100% | 83% | 60% | 100% | 95% | 90% | 100% | 95% | 100% | ||
77.7% | 71% | 100% | 94% | 42% | 98% | 65% | 71% | 77% | 81% | 81% | 90% | 100% | 67% | 81% | 50% | 52% | 77% | 81% | 65% | 92% | 88% | 81% | 83% | |
97.1% | 98% | 100% | 100% | 83% | 100% | 98% | 100% | 100% | 100% | 100% | 94% | 98% | 100% | 100% | 100% | 100% | 67% | 100% | 98% | 98% | 100% | 100% | 100% |