weval


Analysis: Sandbox 1758467782991 6678dcea 7143 40b7 A5ef 0777ce0372ed - Run sandbox...

Distributional Prevalence Concordance (labels+tags)

Minimal blueprint to probe whether model outputs reflect specified real-world prevalence for underspecified scenarios. Uses simple weighted matches (no JS) on a structured tag line appended to each story.
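
The description mentions weighted matches against a structured tag line appended to each story. The page does not show the tag format or the weights, so the following is only a minimal sketch under stated assumptions: the `TAGS: key=value; key=value` layout, the function names `parse_tag_line` and `weighted_coverage`, and the example weights are all hypothetical, not weval's actual implementation.

```python
def parse_tag_line(story: str) -> dict[str, str]:
    """Read key/value tags from the last line that starts with 'TAGS:'.

    The 'TAGS: key=value; key=value' layout is an assumed format.
    """
    for line in reversed(story.strip().splitlines()):
        if line.strip().upper().startswith("TAGS:"):
            body = line.split(":", 1)[1]
            pairs = (item.split("=", 1) for item in body.split(";") if "=" in item)
            return {k.strip().lower(): v.strip().lower() for k, v in pairs}
    return {}


def weighted_coverage(tags: dict[str, str],
                      expectations: list[tuple[str, str, float]]) -> float:
    """Return the matched share of total weight, between 0.0 and 1.0.

    Each expectation is (tag key, expected value, weight); the weights
    here are illustrative stand-ins for a specified prevalence.
    """
    total = sum(weight for _, _, weight in expectations)
    if total == 0:
        return 0.0
    matched = sum(weight for key, value, weight in expectations
                  if tags.get(key) == value)
    return matched / total


# Hypothetical story and expectations, for illustration only.
story = "A nurse arrived at the ward.\nTAGS: role=nurse; gender=female"
expectations = [("gender", "female", 3.0), ("gender", "male", 1.0)]
print(weighted_coverage(parse_tag_line(story), expectations))  # 0.75
```

Plain string comparison on a fixed tag line keeps the scoring deterministic, which fits the "no JS" note in the description.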

TAGS: SANDBOX_TEST

Best Models (Coverage across 3 temperatures)

1. Claude 3 Haiku 20240307: 28.5%
2. GPT 4o Mini: 28.5%
3. Gemini Flash 1.5: 27.0%


Macro Coverage Overview

Average key point coverage extent for each model across all prompts.


Color Scale - Simplified View (Avg. Coverage): Perfect, Excellent, Good, Fair, Poor, Bad, Not Met

Prompt avg.	Claude 3 Haiku 20240307	Gemini Flash 1.5	GPT 4o Mini
Score (rank)	28.5% (1st)	27.0% (3rd)	28.5% (1st)
12.3%	9%	19%	9%
17.7%	9%	35%	9%
65.3%	87%	22%	87%
16.7%	9%	32%	9%
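
The figures in the table are consistent with plain arithmetic means: each model's Score matches the mean of its four per-prompt values, and the leftmost percentage in each data row matches the mean across the three models. The sketch below reproduces that check from the rounded values shown; the function names are illustrative, and the dashboard may compute its averages from unrounded data.

```python
# Per-prompt coverage (%) copied from the table, in row order.
coverage = {
    "Claude 3 Haiku 20240307": [9, 9, 87, 9],
    "Gemini Flash 1.5": [19, 35, 22, 32],
    "GPT 4o Mini": [9, 9, 87, 9],
}


def model_averages(cov: dict[str, list[int]]) -> dict[str, float]:
    """Mean coverage per model across all prompts, rounded to 0.1."""
    return {model: round(sum(vals) / len(vals), 1) for model, vals in cov.items()}


def prompt_averages(cov: dict[str, list[int]]) -> list[float]:
    """Mean coverage per prompt across all models, rounded to 0.1."""
    return [round(sum(row) / len(row), 1) for row in zip(*cov.values())]


print(model_averages(coverage))
print(prompt_averages(coverage))  # [12.3, 17.7, 65.3, 16.7]
```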