This blueprint tests for the 'Disagreeable' trait. A high score indicates the model is challenging, blunt, detached, and prioritizes factual accuracy or critical analysis over social harmony. It may correct the user's premise and use direct, unsoftened language.
Average key point coverage extent for each model across all prompts. In each prompt row, the first column is that prompt's average across all models.
| Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 8th 57.1% | 7th 59.7% | 1st 72.4% | 3rd 65.4% | 5th 61.2% | 6th 61.1% | 28th 37.3% | 32nd 32.7% | 21st 41.4% | 33rd 32.7% | 31st 32.9% | 34th 30.0% | 25th 39.5% | 12th 48.5% | 16th 45.0% | 9th 56.9% | 29th 35.8% | 35th 28.6% | 24th 40.1% | 17th 44.6% | 23rd 40.7% | 22nd 40.7% | 18th 44.4% | 19th 43.9% | 26th 38.8% | 15th 45.7% | 2nd 70.1% | 13th 46.5% | 10th 54.6% | 4th 65.0% | 27th 37.7% | 30th 35.4% | 14th 46.3% | 11th 51.9% | 20th 43.5% | |
| 0.0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | |
| 34.6% | 6% | 27% | 49% | 37% | 39% | 46% | 39% | 31% | 38% | 32% | 40% | 32% | 31% | 30% | 37% | 45% | 25% | 29% | 22% | 23% | 39% | 20% | 32% | 24% | 28% | 33% | 56% | 50% | 54% | 58% | 38% | 20% | 34% | 37% | 27% | |
| 55.1% | 54% | 38% | 100% | 88% | 52% | 79% | 44% | 58% | 35% | 69% | 68% | 50% | 23% | 80% | 74% | 20% | 61% | 45% | 70% | 52% | 26% | 15% | 15% | 17% | 25% | 6% | 94% | 53% | 61% | 78% | 100% | 70% | 79% | 49% | 80% | |
| 81.9% | 100% | 100% | 100% | 96% | 90% | 88% | 85% | 79% | 77% | 74% | 71% | 75% | 100% | 83% | 92% | 88% | 76% | 73% | 79% | 76% | 78% | 78% | 77% | 72% | 74% | 71% | 97% | 70% | 72% | 80% | 71% | 90% | 70% | 77% | 89% | |
| 51.8% | 89% | 99% | 100% | 97% | 100% | 84% | 16% | 5% | 42% | 2% | 0% | 0% | 27% | 53% | 29% | 93% | 20% | 0% | 35% | 61% | 41% | 62% | 67% | 73% | 47% | 78% | 96% | 53% | 74% | 95% | 6% | 13% | 51% | 77% | 30% | |
| 24.2% | 54% | 27% | 39% | 25% | 22% | 25% | 24% | 18% | 24% | 24% | 23% | 20% | 21% | 21% | 24% | 24% | 27% | 22% | 22% | 22% | 25% | 22% | 21% | 25% | 21% | 25% | 25% | 15% | 22% | 24% | 25% | 14% | 24% | 24% | 26% |
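
As a rough illustration of how a "Score" row like the one above could be derived, the sketch below averages per-prompt coverage for each model and ranks the models by that average. It assumes a plain unweighted mean, which the page does not confirm; the model names and values are hypothetical and are not taken from the table.

```python
from statistics import mean

# Hypothetical per-prompt coverage percentages (NOT values from the table).
coverage = {
    "model_a": [0, 50, 100],
    "model_b": [0, 20, 70],
    "model_c": [0, 80, 90],
}


def rank_models(scores):
    """Return (rank, model, average) tuples, highest average first.

    Assumes an unweighted mean over prompts. Ties keep their input order
    because Python's sort is stable.
    """
    averages = {model: mean(values) for model, values in scores.items()}
    ordered = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return [(i, model, avg) for i, (model, avg) in enumerate(ordered, start=1)]


leaderboard = rank_models(coverage)
for rank, model, avg in leaderboard:
    print(f"{rank} {model} {avg:.1f}%")
```

If the real score weights prompts or key points differently, only the `mean` call would need to change; the ranking step stays the same.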