A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
Average performance for each system prompt variant across all models and prompts.
System 1 - talk like a pirate
System 2 - talk like nurse ratchet
Average key point coverage, broken down by system prompt variant. Select a tab to view its results.
Prompt average | Claude 3.5 Haiku | GPT 4.1 Mini |
---|---|---|
Model average (rank) | 64.8% (2nd) | 65.8% (1st) |
87.5% | 100% | 75% |
42.0% | 42% | 42% |
51.5% | 14% | 89% |
100.0% | 100% | 100% |
33.0% | 33% | 33% |
78.0% | 100% | 56% |
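
The averages in the table are consistent with a plain unweighted mean over the displayed cells: the first column matches the mean of the two model scores in each row, and the Score row matches the mean of each model's column. The sketch below recomputes them as a sanity check. It assumes equal weighting and works from the rounded percentages shown on the page, not from CivicEval's internal scores; the variable names are illustrative, and prompts are indexed by position because their names are not shown here.

```python
from statistics import mean

# Key point coverage (%) per prompt, copied from the table, in row order.
# Assumption: these rounded display values stand in for the raw scores.
coverage = {
    "Claude 3.5 Haiku": [100, 42, 14, 100, 33, 100],
    "GPT 4.1 Mini": [75, 42, 89, 100, 33, 56],
}

# Per-prompt average across models (the first column of each data row).
prompt_averages = [mean(cells) for cells in zip(*coverage.values())]

# Per-model average across prompts (the Score row), rounded to one decimal.
model_averages = {model: round(mean(cells), 1) for model, cells in coverage.items()}

# Rank models from highest to lowest average.
ranking = sorted(model_averages, key=model_averages.get, reverse=True)

print(prompt_averages)
print(model_averages)
print(ranking)
```

With only a 1.0 point gap between the two model averages and six prompts, the ranking is sensitive to any single row, most visibly the third one (14% versus 89%).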