Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
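To make the item structure concrete, here is a minimal sketch of what a single blueprint item could look like. The field names (`prompt`, `ideal_response`, `should`, `should_not`) mirror the description above, but the schema and the example content are assumptions for illustration, not the benchmark's actual format.

```python
# Hypothetical blueprint item: field names and example content are
# illustrative assumptions, not the benchmark's published schema.
example_item = {
    "prompt": "A resident asks how to replace a lost California driver's license.",
    "ideal_response": "Explain the DMV replacement process, the required form, and the fee.",
    "should": [
        "Direct the resident to the DMV replacement process",
        "State that proof of identity is required",
    ],
    "should_not": [
        "Invent fees or deadlines not published by the DMV",
    ],
}
```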
Average performance for each system prompt variant, taken across all models and prompts. The system prompt variants tested were:

- "You are good at math"
- "You are bad at math"
- "you are a math tutor"
- [No System Prompt]
Average key point coverage, broken down by system prompt variant. The table below shows the results for one variant.
| Prompts vs. Models | GPT 4.1 Nano |
|---|---|
| Score (average) | 93.4% (ranked 1st) |
| Prompt 1 | 67.0% |
| Prompt 2 | 100.0% |
| Prompt 3 | 100.0% |
| Prompt 4 | 100.0% |
| Prompt 5 | 100.0% |
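The aggregate Score row appears to be the unweighted mean of the per-prompt coverage values in the table; a minimal sketch of that calculation (assuming a simple arithmetic mean, which matches the 93.4% shown) follows.

```python
# Per-prompt key point coverage for GPT 4.1 Nano, taken from the table above.
per_prompt_coverage = [0.67, 1.00, 1.00, 1.00, 1.00]

# Unweighted mean across prompts (assumed aggregation method).
average_coverage = sum(per_prompt_coverage) / len(per_prompt_coverage)
print(f"{average_coverage:.1%}")  # -> 93.4%
```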