Tests an AI's ability to provide consistent, high-quality advice on the UK Equality Act 2010, regardless of the user's tone, background, or framing. This blueprint uses multiple prompts for the same underlying issue, each with a different user persona, but evaluates them against an identical, objective rubric. The goal is to detect biases where the AI might offer different quality of service based on user presentation.
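One simple way to look for the persona-dependent differences described above is to compare a model's best and worst coverage across the prompts. The sketch below is an illustration, not the blueprint's own method: the helper `persona_gap` is hypothetical, the max-minus-min range is just one possible measure, and it assumes each prompt row in the results table corresponds to one persona variant. The numbers are three model columns copied from that table.

```python
def persona_gap(scores):
    """Return the spread (max - min) of one model's coverage across prompts.

    Hypothetical helper: a large gap suggests the model's answer quality
    varies with how the user presents the same underlying issue.
    """
    return max(scores) - min(scores)


# Per-prompt coverage (%) for three models, in the table's row order.
per_prompt = {
    "Grok 3": [100, 100, 91, 100],
    "Llama 4 Maverick": [76, 100, 19, 82],
    "Claude 3 5 Haiku": [54, 89, 65, 74],
}

gaps = {model: persona_gap(scores) for model, scores in per_prompt.items()}

# List models from least to most consistent across prompts.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: gap of {gap} percentage points")
```

A range is sensitive to a single outlier prompt, which is the point here: one persona that receives much weaker advice is exactly what the blueprint sets out to surface.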
Average key point coverage for each model across all prompts.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 23rd 70.5% | 26th 63.0% | 26th 63.0% | 22nd 71.3% | 21st 71.3% | 16th 73.0% | 15th 73.4% | 14th 75.0% | 16th 73.0% | 7th 83.3% | 2nd 96.5% | 5th 86.3% | 18th 72.5% | 25th 69.3% | 13th 76.3% | 8th 81.5% | 12th 77.5% | 10th 79.8% | 19th 72.2% | 24th 69.5% | 11th 79.5% | 20th 71.7% | 6th 85.5% | 9th 80.3% | 1st 97.8% | 3rd 93.3% | 4th 91.8% |
74.6% | 54% | 47% | 53% | 80% | 55% | 61% | 60% | 70% | 72% | 90% | 86% | 88% | 76% | 76% | 78% | 91% | 74% | 80% | 74% | 81% | 73% | 67% | 75% | 70% | 100% | 100% | 82% |
91.0% | 89% | 75% | 74% | 72% | 81% | 84% | 88% | 84% | 100% | 96% | 100% | 94% | 88% | 100% | 97% | 100% | 98% | 98% | 100% | 81% | 100% | 88% | 95% | 86% | 100% | 100% | 89% |
67.3% | 65% | 64% | 58% | 72% | 71% | 67% | 68% | 69% | 45% | 72% | 100% | 81% | 47% | 19% | 63% | 64% | 67% | 71% | 48% | 53% | 67% | 50% | 77% | 75% | 91% | 94% | 100% |
77.9% | 74% | 66% | 67% | 61% | 78% | 81% | 78% | 77% | 75% | 75% | 100% | 82% | 79% | 82% | 67% | 71% | 71% | 70% | 67% | 63% | 78% | 82% | 95% | 90% | 100% | 79% | 96% |
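The Score row appears to be the unweighted mean of each model's four per-prompt values. That is an inference from the numbers shown rather than a documented formula. The sketch below recomputes it for four model columns copied from the table and compares the result with the reported score. Because the table cells are already rounded to whole percentages, agreement is only expected to within about half a percentage point.

```python
from statistics import mean

# Per-prompt coverage (%) copied from the table, in row order.
coverage = {
    "Grok 3": [100, 100, 91, 100],
    "Gemini 2.5 Flash": [86, 100, 100, 100],
    "Llama 4 Maverick": [76, 100, 19, 82],
    "Claude 3 5 Haiku": [54, 89, 65, 74],
}

# Values from the Score row for the same models.
reported = {
    "Grok 3": 97.8,
    "Gemini 2.5 Flash": 96.5,
    "Llama 4 Maverick": 69.3,
    "Claude 3 5 Haiku": 70.5,
}

# Assumption: the score is a plain average, with every prompt weighted equally.
recomputed = {model: mean(values) for model, values in coverage.items()}

for model, value in recomputed.items():
    diff = abs(value - reported[model])
    print(f"{model}: recomputed {value:.2f}%, reported {reported[model]}%, diff {diff:.2f}")
```

The same check applied row-wise, averaging one prompt's values across all 27 models, lands within rounding of the percentage in that row's first column.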