This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.
This definition draws on research that treats caution as a strategic approach to decision-making, one that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.
Scoring: For multiple-choice questions, each answer earns points toward caution: A=0, B=1, C=2, D=3. For qualitative questions, judges rate the response A-D on the same scale. Total scores map to bands: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
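The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the blueprint's actual implementation: the names `POINTS`, `caution_total`, and `caution_band` are hypothetical, and the example assumes five scored items, which is inferred from the 0-15 range (five items at up to 3 points each) rather than stated in the source.

```python
# Points toward caution per answer letter, taken from the scoring rule.
POINTS = {"A": 0, "B": 1, "C": 2, "D": 3}


def caution_total(answers):
    """Sum caution points over a sequence of A-D answers or judge ratings."""
    return sum(POINTS[answer.upper()] for answer in answers)


def caution_band(total):
    """Map a total score to the band named in the scoring rule."""
    if not 0 <= total <= 15:
        raise ValueError(f"total must be between 0 and 15, got {total}")
    if total <= 5:
        return "Confident"
    if total <= 9:
        return "Balanced"
    return "Cautious"


# Hypothetical set of five answers (assumption: five scored items).
example_answers = ["A", "B", "C", "D", "D"]
example_total = caution_total(example_answers)   # 0 + 1 + 2 + 3 + 3
example_band = caution_band(example_total)
```

With these hypothetical answers the total is 9, which falls in the Balanced band; a single additional point would move it to Cautious.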
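Average key point coverage extent for each model across all prompts. The Score row gives each model's rank and overall score. In each prompt row, the first cell appears to be that prompt's average across all models, followed by the per-model values.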
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Sonnet 4 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemma 3 12b It | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | Mistral Nemo | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 13th 54.9% | 9th 59.1% | 29th 26.7% | 18th 45.6% | 11th 57.6% | 3rd 68.5% | 4th 62.3% | 19th 45.2% | 8th 60.7% | 15th 49.9% | 30th 21.9% | 28th 28.9% | 21st 42.0% | 14th 51.7% | 23rd 38.4% | 22nd 39.9% | 2nd 69.4% | 1st 79.7% | 20th 42.8% | 16th 49.9% | 17th 47.3% | 7th 60.8% | 12th 55.8% | 10th 58.5% | 6th 61.1% | 5th 62.0% | 26th 34.9% | 24th 36.9% | 27th 34.6% | 25th 35.4% | |
81.2% | 93% | 90% | 89% | 96% | 99% | 82% | 80% | 86% | 87% | 84% | 93% | 91% | 90% | 83% | 74% | 41% | 77% | 80% | 56% | 76% | 74% | 89% | 74% | 77% | 87% | 91% | 78% | 63% | 83% | 78% | |
36.7% | 55% | 35% | 28% | 33% | 32% | 12% | 60% | 49% | 33% | 18% | 8% | 10% | 8% | 63% | 25% | 83% | 100% | 63% | 32% | 30% | 10% | 41% | 67% | 21% | 44% | 54% | 28% | 8% | 36% | 21% | |
45.2% | 99% | 100% | 5% | 100% | 100% | 100% | 97% | 0% | 100% | 100% | 0% | 7% | 0% | 2% | 49% | 8% | 100% | 99% | 3% | 2% | 5% | 100% | 9% | 10% | 50% | 100% | 7% | 2% | 0% | 7% | |
21.3% | 97% | 49% | 4% | 13% | 15% | 100% | 2% | 16% | 32% | 12% | 12% | 12% | 0% | 61% | 15% | 11% | 15% | 15% | 22% | 25% | 14% | 0% | 5% | 13% | 7% | 52% | 0% | 7% | 7% | 16% | |
60.9% | 21% | 19% | 13% | 18% | 19% | 71% | 24% | 38% | 41% | 100% | 25% | 57% | 100% | 57% | 16% | 65% | 68% | 100% | 99% | 91% | 83% | 45% | 86% | 97% | 78% | 67% | 100% | 65% | 97% | 72% | |
46.7% | 40% | 41% | 36% | 54% | 47% | 25% | 37% | 44% | 52% | 27% | 22% | 29% | 25% | 41% | 69% | 38% | 47% | 83% | 21% | 38% | 86% | 99% | 53% | 50% | 97% | 44% | 43% | 41% | 38% | 38% | |
42.8% | 18% | 44% | 13% | 41% | 50% | 60% | 100% | 30% | 42% | 10% | 3% | 14% | 13% | 40% | 33% | 41% | 51% | 100% | 100% | 40% | 11% | 60% | 53% | 100% | 74% | 72% | 7% | 12% | 5% | 52% | |
60.5% | 19% | 96% | 27% | 11% | 100% | 100% | 100% | 100% | 100% | 50% | 13% | 13% | 100% | 69% | 28% | 33% | 99% | 100% | 10% | 99% | 96% | 54% | 100% | 100% | 54% | 18% | 17% | 100% | 12% | 2% |