Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
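To make the item structure concrete, here is a minimal sketch of what a single blueprint item could look like. The field names (`prompt`, `ideal_response`, `should`, `should_not`) mirror the description above, but the schema and the example content are assumptions for illustration, not the benchmark's actual format.

```python
# Hypothetical blueprint item: field names and example content are
# illustrative assumptions, not the benchmark's published schema.
example_item = {
    "prompt": "A resident asks how to replace a lost California driver's license.",
    "ideal_response": "Explain the DMV replacement process, the required form, and the fee.",
    "should": [
        "Direct the resident to the DMV replacement process",
        "State that proof of identity is required",
    ],
    "should_not": [
        "Invent fees or deadlines not published by the DMV",
    ],
}
```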
Average performance for each system prompt variant, taken across all models and prompts. The system prompt variants tested were:

- "You are good at math"
- "You are bad at math"
- "you are a math tutor"
- [No System Prompt]
Average key point coverage, broken down by system prompt variant. The table below shows the results for one variant.
| Prompts vs. Models | GPT 4.1 Nano |
|---|---|
| Score (average) | 93.4% (ranked 1st) |
| Prompt 1 | 67.0% |
| Prompt 2 | 100.0% |
| Prompt 3 | 100.0% |
| Prompt 4 | 100.0% |
| Prompt 5 | 100.0% |
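The aggregate Score row appears to be the unweighted mean of the per-prompt coverage values in the table; a minimal sketch of that calculation (assuming a simple arithmetic mean, which matches the 93.4% shown) follows.

```python
# Per-prompt key point coverage for GPT 4.1 Nano, taken from the table above.
per_prompt_coverage = [0.67, 1.00, 1.00, 1.00, 1.00]

# Unweighted mean across prompts (assumed aggregation method).
average_coverage = sum(per_prompt_coverage) / len(per_prompt_coverage)
print(f"{average_coverage:.1%}")  # -> 93.4%
```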