This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT Oss 120b | GPT Oss 20b | O4 Mini | Glm 4.5 | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 1st 94.1% | 9th 90.1% | 6th 91.3% | 3rd 92.6% | 4th 92.5% | 10th 89.9% | 28th 50.4% | 23rd 79.5% | 18th 81.9% | 8th 90.5% | 13th 85.3% | 24th 77.8% | 25th 77.7% | 16th 83.0% | 20th 81.1% | 26th 75.5% | 12th 86.5% | 21st 80.9% | 27th 75.0% | 17th 82.8% | 19th 81.3% | 2nd 93.5% | 14th 84.5% | 22nd 79.8% | 15th 84.5% | 11th 87.9% | 7th 91.1% | 5th 92.0% | |
88.1% | 86% | 82% | 87% | 85% | 91% | 90% | 84% | 90% | 95% | 88% | 93% | 83% | 80% | 85% | 85% | 78% | 88% | 91% | 85% | 90% | 87% | 94% | 97% | 100% | 97% | 90% | 86% | 88% | |
90.7% | 96% | 81% | 91% | 91% | 80% | 82% | 97% | 74% | 74% | 95% | 95% | 81% | 100% | 96% | 95% | 80% | 93% | 100% | 100% | 88% | 71% | 100% | 100% | 93% | 93% | 95% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 100% | 100% | 97% | 11% | 97% | 100% | 100% | 100% | 100% | 100% | 100% | 76% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.8% | 100% | 100% | 100% | 100% | 100% | 93% | 13% | 90% | 93% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.6% | 100% | 100% | 100% | 100% | 100% | 96% | 54% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 89% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
94.5% | 100% | 100% | 100% | 100% | 100% | 97% | 13% | 89% | 86% | 100% | 100% | 93% | 100% | 100% | 84% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 99% | 100% | 90% | |
97.7% | 100% | 100% | 100% | 100% | 100% | 100% | 54% | 89% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
75.4% | 93% | 93% | 100% | 75% | 100% | 100% | 80% | 20% | 77% | 80% | 44% | 73% | 73% | 80% | 47% | 27% | 88% | 80% | 80% | 73% | 80% | 80% | 80% | 80% | 73% | 73% | 80% | 83% | |
83.7% | 83% | 83% | 83% | 95% | 88% | 98% | 57% | 72% | 55% | 90% | 86% | 96% | 95% | 94% | 93% | 72% | 100% | 77% | 49% | 98% | 100% | 100% | 83% | 55% | 92% | 63% | 97% | 96% | |
98.1% | 100% | 100% | 100% | 100% | 100% | 100% | 68% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 85% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
52.2% | 100% | 36% | 90% | 100% | 100% | 49% | 38% | 38% | 43% | 94% | 34% | 42% | 28% | 44% | 36% | 31% | 19% | 35% | 23% | 37% | 35% | 100% | 63% | 63% | 47% | 39% | 41% | 60% | |
84.1% | 100% | 100% | 90% | 100% | 100% | 100% | 25% | 97% | 96% | 98% | 95% | 20% | 16% | 28% | 98% | 97% | 95% | 80% | 93% | 86% | 94% | 94% | 92% | 78% | 89% | 99% | 100% | 100% | |
78.0% | 80% | 80% | 80% | 80% | 80% | 80% | 24% | 80% | 100% | 75% | 80% | 80% | 80% | 84% | 76% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 40% | 81% | 87% | 80% | 80% | 99% | |
77.8% | 100% | 100% | 99% | 100% | 100% | 98% | 21% | 81% | 80% | 100% | 97% | 17% | 17% | 78% | 88% | 20% | 86% | 64% | 27% | 92% | 67% | 94% | 86% | 88% | 88% | 100% | 97% | 100% | |
49.4% | 85% | 88% | 78% | 39% | 46% | 38% | 59% | 33% | 30% | 47% | 24% | 28% | 27% | 45% | 20% | 29% | 28% | 25% | 34% | 38% | 27% | 74% | 75% | 75% | 68% | 63% | 100% | 65% | |
87.4% | 83% | 92% | 100% | 100% | 100% | 92% | 56% | 100% | 100% | 92% | 100% | 82% | 99% | 77% | 100% | 83% | 83% | 82% | 80% | 100% | 83% | 100% | 64% | 65% | 66% | 92% | 80% | 100% | |
80.3% | 93% | 81% | 76% | 93% | 88% | 98% | 83% | 66% | 73% | 84% | 80% | 98% | 70% | 75% | 70% | 72% | 98% | 71% | 72% | 66% | 65% | 88% | 78% | 78% | 92% | 84% | 83% | 81% | |
90.5% | 100% | 98% | 99% | 100% | 100% | 100% | 92% | 98% | 97% | 100% | 100% | 92% | 96% | 89% | 99% | 100% | 98% | 97% | 35% | 91% | 85% | 94% | 98% | 18% | 64% | 99% | 100% | 100% | |
73.2% | 83% | 91% | 76% | 96% | 79% | 93% | 50% | 80% | 70% | 68% | 82% | 75% | 75% | 89% | 82% | 71% | 77% | 43% | 50% | 71% | 74% | 74% | 56% | 54% | 53% | 84% | 79% | 80% | |
90.5% | 100% | 100% | 79% | 100% | 100% | 100% | 34% | 100% | 81% | 100% | 100% | 100% | 100% | 99% | 98% | 72% | 100% | 95% | 97% | 49% | 99% | 100% | 81% | 68% | 86% | 100% | 100% | 100% |
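
The Score row appears to be an unweighted mean of each model's per-prompt coverage, although the page does not state the aggregation method, so treat that as an assumption. The minimal Python sketch below checks it against two columns transcribed from the table: the first model column and Command A. The helper names `model_means` and `rank_models` are hypothetical and are not part of any tool referenced on this page. Because the table shows rounded percentages, a recomputed mean can differ slightly from the listed score.

```python
# Per-prompt coverage (%) transcribed from two columns of the table above.
# These are the rounded values shown on the page, not the raw scores.
coverage = {
    # First model column in the table.
    "Claude 3.5 Sonnet": [86, 96, 100, 100, 100, 100, 100, 93, 83, 100,
                          100, 100, 80, 100, 85, 83, 93, 100, 83, 100],
    "Command A": [84, 97, 11, 13, 54, 13, 54, 80, 57, 68,
                  38, 25, 24, 21, 59, 56, 83, 92, 50, 34],
}


def model_means(per_prompt):
    """Unweighted mean coverage per model (assumed aggregation)."""
    return {model: sum(vals) / len(vals) for model, vals in per_prompt.items()}


def rank_models(means):
    """Model names sorted from highest to lowest mean coverage."""
    return sorted(means, key=means.get, reverse=True)


means = model_means(coverage)
for position, model in enumerate(rank_models(means), start=1):
    print(f"{position}. {model}: {means[model]:.2f}%")
```

For the first column the recomputed mean is 94.1, which matches the listed score. For Command A it is about 50.65 against a listed 50.4%, a gap that is plausibly explained by rounding in the displayed cells.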