Showing all evaluation blueprints that have been tagged with "public-sector".
Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
Unique Versions: 1