Evaluates the models on the UDHR dataset (Universal Declaration of Human Rights).
Average key point coverage extent for each model across all prompts.
Prompt (average across models) | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 Mini |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 9th 95.5% | 7th 96.2% | 6th 96.6% | 5th 96.8% | 2nd 98.6% | 8th 95.8% | 11th 95.1% | 3rd 98.6% | 4th 97.0% | 13th 87.2% | 10th 95.3% | 12th 90.6% | 1st 99.1% |
98.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 81% | 100% | 100% | 100% |
98.8% | 100% | 98% | 96% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 94% | 100% |
98.8% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 88% | 100% | 98% | 100% |
99.3% | 95% | 98% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
93.5% | 89% | 92% | 100% | 89% | 98% | 100% | 100% | 100% | 95% | 86% | 89% | 77% | 100% |
96.1% | 98% | 95% | 100% | 100% | 100% | 95% | 100% | 100% | 100% | 88% | 95% | 78% | 100% |
93.8% | 90% | 100% | 93% | 99% | 100% | 100% | 90% | 100% | 100% | 57% | 100% | 90% | 100% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
93.4% | 100% | 100% | 88% | 88% | 100% | 81% | 88% | 100% | 100% | 100% | 81% | 88% | 100% |
79.8% | 82% | 77% | 85% | 92% | 90% | 76% | 75% | 85% | 71% | 55% | 84% | 74% | 91% |
94.8% | 92% | 94% | 97% | 94% | 97% | 97% | 94% | 98% | 98% | 91% | 95% | 88% | 98% |
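
The Score row appears to be the plain mean of each model's per-prompt coverage, with models then ranked by that mean. This is inferred from the numbers in the table, not from a documented formula. A minimal sketch of the arithmetic follows, using three columns copied from the table; the function names `model_scores` and `rank_models` are hypothetical, and because the displayed cell values are already rounded, means recomputed this way may differ from the published scores in the last digit for other columns.

```python
from statistics import mean

# Per-prompt coverage (%) copied from three columns of the table above,
# in row order. These are the rounded values as displayed.
coverage = {
    "Claude 3.5 Haiku": [100, 100, 100, 95, 89, 98, 90, 100, 100, 100, 82, 92],
    "GPT 4.1 Nano": [81, 100, 88, 100, 86, 88, 57, 100, 100, 100, 55, 91],
    "Grok 3 Mini": [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 91, 98],
}


def model_scores(table):
    """Assumed scoring rule: mean coverage per model, rounded to one decimal."""
    return {model: round(mean(values), 1) for model, values in table.items()}


def rank_models(scores):
    """Order model names from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


scores = model_scores(coverage)
ranking = rank_models(scores)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {scores[model]}%")
```

For these three columns the recomputed means match the Score row (95.5%, 87.2%, and 99.1%). The ranks printed here are relative to the three-model subset only, not the full 13-model leaderboard.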