A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
Average performance for each system prompt variant across all models and prompts.
System 1 - talk like a pirate
System 2 - talk like nurse ratchet
should or should_not criteria to your blueprint prompts.
Models are grouped by response similarity for each prompt. Same colors indicate similar responses.
Semantic Clustering Available
Load detailed per-prompt similarity data to see how models clustered for each scenario. This shows which models responded similarly to each prompt.
Note: The download may take a moment (roughly 500 KB to 2 MB, depending on run size).