Weval, a Collective Intelligence Project

Transparent, reproducible AI evaluations

Partners

Anthropic
Microsoft
Stanford University

Contact

[email protected]

California Public-Sector Task Benchmark

Run: f59e381e5197796b

Instances for Run Label: f59e381e5197796b (Blueprint: California Public-Sector Task Benchmark)

Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.

TAGS: California, Public Sector, FEATURED, Instruction Following & Prompt Adherence, General Knowledge, Factual Accuracy & Hallucination, Helpfulness & Actionability, Business & Management, Public Sector & Governance, Economics & Finance, Environmental Justice & Activism

Showing all recorded executions for Run Label f59e381e5197796b.

Executed:

Filename: f59e381e5197796b_2025-06-20T13-04-25-118Z_comparison.json

Avg. Hybrid Score

69.5%

Model Variants

Test Cases

Executed:

Filename: f59e381e5197796b_2025-06-20T08-35-29-766Z_comparison.json

Avg. Hybrid Score

69.5%

Model Variants

Test Cases
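
The two filenames listed above appear to share the shape `<run label>_<timestamp>_comparison.json`, with the timestamp written using hyphens in place of colons and the decimal point. The sketch below is one way to split such a name into its run label and execution time. The pattern is inferred only from the two filenames on this page and is not a documented Weval format; `parse_run_filename` is a hypothetical helper name, and the trailing `Z` is assumed to mean UTC.

```python
import re
from datetime import datetime, timezone

# Assumed pattern, inferred from the two filenames shown on this page.
# It is not taken from any Weval documentation.
FILENAME_RE = re.compile(
    r"^(?P<label>[0-9a-f]+)_"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}-\d{3}Z)"
    r"_comparison\.json$"
)


def parse_run_filename(name):
    """Return (run_label, executed_at) for a comparison filename.

    Hypothetical helper: raises ValueError if the name does not match
    the assumed pattern.
    """
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised filename: {name!r}")
    # %f accepts the three millisecond digits; the trailing Z is
    # assumed to denote UTC.
    executed_at = datetime.strptime(
        match.group("stamp"), "%Y-%m-%dT%H-%M-%S-%fZ"
    ).replace(tzinfo=timezone.utc)
    return match.group("label"), executed_at


if __name__ == "__main__":
    names = [
        "f59e381e5197796b_2025-06-20T13-04-25-118Z_comparison.json",
        "f59e381e5197796b_2025-06-20T08-35-29-766Z_comparison.json",
    ]
    # List the executions oldest first.
    for label, executed_at in sorted(
        (parse_run_filename(n) for n in names), key=lambda pair: pair[1]
    ):
        print(label, executed_at.isoformat())
```

Parsing the timestamp into a timezone-aware `datetime` rather than comparing raw strings makes it straightforward to order executions or compute the gap between them.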