Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
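Average key-point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the 18 prompts. In each prompt row, the first cell is that prompt's mean coverage across the 14 models.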
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 Mini | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 12th 62.2% | 6th 72.3% | 7th 69.6% | 2nd 76.1% | 5th 73.1% | 8th 69.1% | 11th 65.2% | 1st 76.2% | 3rd 75.7% | 10th 68.0% | 14th 60.6% | 9th 68.3% | 13th 61.9% | 4th 75.4% | |
70.2% | 60% | 75% | 80% | 80% | 70% | 60% | 80% | 80% | 70% | 70% | 53% | 70% | 55% | 80% | |
92.7% | 100% | 80% | 100% | 95% | 80% | 83% | 80% | 100% | 100% | 100% | 80% | 100% | 100% | 100% | |
77.8% | 78% | 85% | 60% | 90% | 85% | 40% | 90% | 75% | 80% | 83% | 62% | 93% | 88% | 80% | |
41.4% | 30% | 40% | 50% | 43% | 53% | 62% | 43% | 53% | 40% | 40% | 40% | 15% | 20% | 50% | |
69.4% | 68% | 67% | 67% | 65% | 55% | 65% | 70% | 65% | 72% | 75% | 75% | 80% | 73% | 75% | |
69.6% | 60% | 50% | 68% | 68% | 80% | 70% | 68% | 68% | 80% | 68% | 68% | 78% | 70% | 78% | |
62.4% | 65% | 62% | 47% | 52% | 62% | 70% | 60% | 65% | 78% | 62% | 60% | 68% | 60% | 63% | |
46.7% | 45% | 52% | 42% | 45% | 40% | 80% | 50% | 42% | 48% | 40% | 58% | 35% | 30% | 47% | |
68.8% | 55% | 80% | 78% | 80% | 78% | 73% | 60% | 83% | 60% | 60% | 58% | 60% | 60% | 78% | |
84.5% | 63% | 87% | 91% | 98% | 91% | 91% | 86% | 84% | 91% | 79% | 75% | 75% | 86% | 86% | |
77.6% | 68% | 78% | 80% | 78% | 78% | 83% | 75% | 85% | 83% | 85% | 68% | 80% | 60% | 85% | |
81.7% | 84% | 100% | 88% | 100% | 100% | 100% | 50% | 66% | 97% | 75% | 56% | 78% | 78% | 72% | |
61.4% | 63% | 83% | 63% | 55% | 65% | 65% | 45% | 65% | 60% | 60% | 60% | 60% | 48% | 68% | |
62.2% | 50% | 60% | 60% | 75% | 75% | 60% | 60% | 73% | 63% | 48% | 55% | 70% | 55% | 67% | |
59.6% | 20% | 70% | 40% | 100% | 28% | 100% | 45% | 100% | 73% | 43% | 20% | 65% | 30% | 100% | |
63.6% | 55% | 80% | 65% | 63% | 88% | 53% | 48% | 80% | 80% | 55% | 45% | 55% | 50% | 73% | |
71.9% | 75% | 73% | 73% | 85% | 87% | 18% | 63% | 88% | 88% | 83% | 60% | 73% | 63% | 77% | |
90.3% | 80% | 80% | 100% | 97% | 100% | 70% | 100% | 100% | 100% | 98% | 98% | 75% | 88% | 78% |
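
The averages in the table appear to be plain unweighted means of the displayed cells. The sketch below checks that reading for two model columns and one prompt row. It is a minimal illustration under that assumption, not the benchmark's own scoring code; the helper `mean_pct` and the variable names are hypothetical. Because the cells are already rounded to whole percentages, other rows or columns recomputed this way could differ from the published figure by a tenth of a point.

```python
def mean_pct(values):
    """Unweighted arithmetic mean of percentages, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


# Per-prompt coverage (%) for two model columns, copied from the table top to bottom.
claude_35_haiku = [60, 100, 78, 30, 68, 60, 65, 45, 55,
                   63, 68, 84, 63, 50, 20, 55, 75, 80]
mistral_medium_3 = [80, 100, 75, 53, 65, 68, 65, 42, 83,
                    84, 85, 66, 65, 73, 100, 80, 88, 100]

# The 14 model cells of the first prompt row, left to right.
first_prompt_row = [60, 75, 80, 80, 70, 60, 80, 80, 70, 70, 53, 70, 55, 80]

haiku_score = mean_pct(claude_35_haiku)        # compare with the Score row: 62.2%
medium_score = mean_pct(mistral_medium_3)      # compare with the Score row: 76.2%
first_prompt_avg = mean_pct(first_prompt_row)  # compare with the row's first cell: 70.2%

print(haiku_score, medium_score, first_prompt_avg)
```

Both recomputed column means match the Score row, and the row mean matches the first cell of its row, which supports reading the first column as a per-prompt average rather than a prompt label.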