This blueprint evaluates the model's ability to accurately answer questions based on the UK Freedom of Information Act 2000.
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Llama 3 70b Instruct | Llama 4 Maverick | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Kimi K2 Instruct | Grok 3 Mini | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 8th 82.6% | 4th 87.1% | 16th 69.1% | 2nd 89.1% | 18th 55.8% | 7th 82.6% | 3rd 88.9% | 11th 79.4% | 15th 71.5% | 10th 79.9% | 6th 83.8% | 1st 91.9% | 14th 76.5% | 13th 78.6% | 12th 79.1% | 17th 68.4% | 9th 80.0% | 5th 86.8% | |
89.9% | 100% | 100% | 100% | 90% | 83% | 94% | 100% | 100% | 91% | 62% | 100% | 100% | 71% | 89% | 99% | 62% | 81% | 99% | |
63.9% | 77% | 77% | 67% | 95% | 56% | 64% | 67% | 52% | 45% | 66% | 71% | 65% | 66% | 54% | 55% | 53% | 54% | 69% | |
70.6% | 86% | 86% | 66% | 89% | 59% | 96% | 75% | 86% | 35% | 81% | 85% | 92% | 39% | 46% | 71% | 42% | | 69% | |
79.8% | 77% | 91% | 66% | 89% | 71% | 85% | 93% | 64% | 89% | 68% | 73% | 92% | 86% | 83% | 77% | 59% | 85% | 94% | |
82.2% | 78% | 99% | 48% | 78% | 14% | 97% | 100% | 68% | 73% | 92% | 87% | 100% | 96% | 100% | 93% | 88% | | 90% | |
65.7% | 57% | 50% | 43% | 80% | 65% | 42% | 94% | 69% | 75% | 76% | 56% | 88% | 67% | 69% | 52% | 63% | | 76% | |
90.5% | 100% | 99% | 76% | 98% | 58% | 85% | 86% | 100% | 72% | 100% | 99% | 100% | 92% | 92% | 96% | 88% | | 100% | |
92.8% | 86% | 95% | 89% | 95% | 42% | 100% | 98% | 98% | 94% | 96% | 100% | 100% | 98% | 98% | 92% | 94% | 100% | 100% |
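
The Score row appears to be the average of each model's per-prompt coverage values. The sketch below assumes an unweighted mean over the non-blank cells of a column, which the page does not state explicitly. The function name `average_coverage` is hypothetical, and the input values are copied from the first model column of the table above.

```python
from statistics import mean

# Per-prompt coverage percentages from the first model column of the table.
# A blank cell (no run for that prompt) would be represented as None.
first_column = [100, 77, 86, 77, 78, 57, 100, 86]


def average_coverage(cells):
    """Unweighted mean of the non-blank cells, rounded to one decimal place.

    Assumption: every prompt carries equal weight and blank cells are skipped.
    """
    present = [c for c in cells if c is not None]
    if not present:
        return None
    return round(mean(present), 1)


print(average_coverage(first_column))  # 82.6, matching the Score row for that column
```

Because the per-prompt cells are displayed as rounded whole percentages, recomputing the mean for other columns this way can differ from the displayed score by a few tenths of a point.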