Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 Mini |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 13th 74.4% | 6th 92.8% | 6th 92.8% | 2nd 96.3% | 3rd 95.7% | 9th 89.7% | 4th 94.3% | 5th 93.8% | 8th 91.7% | 12th 86.0% | 10th 88.2% | 11th 86.4% | 1st 96.8% |
82.2% | 63% | 84% | 88% | 84% | 78% | 91% | 78% | 84% | 84% | 88% | 75% | 72% | 100% |
95.7% | 73% | 98% | 98% | 100% | 100% | 100% | 100% | 95% | 95% | 95% | 90% | 100% | 100% |
96.2% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 88% | 100% |
90.7% | 80% | 86% | 98% | 100% | 98% | 79% | 95% | 96% | 91% | 86% | 77% | 93% | 100% |
87.7% | 50% | 90% | 88% | 90% | 100% | 90% | 95% | 98% | 95% | 80% | 78% | 88% | 98% |
95.4% | 90% | 100% | 93% | 100% | 100% | 95% | 100% | 93% | 98% | 90% | 93% | 88% | 100% |
99.1% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 97% | 100% | 97% | 100% |
84.4% | 63% | 81% | 83% | 98% | 100% | 71% | 92% | 98% | 85% | 63% | 79% | 88% | 96% |
91.4% | 30% | 98% | 100% | 100% | 98% | 98% | 98% | 98% | 100% | 88% | 100% | 80% | 100% |
99.0% | 100% | 98% | 100% | 100% | 100% | 95% | 100% | 100% | 100% | 98% | 98% | 98% | 100% |
78.2% | 65% | 83% | 77% | 88% | 79% | 83% | 83% | 83% | 79% | 79% | 83% | 60% | 75% |
88.0% | 83% | 95% | 88% | 95% | 95% | 80% | 90% | 80% | 90% | 85% | 85% | 85% | 93% |
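
The Score row appears to be the unweighted mean of each model's twelve per-prompt values. The sketch below checks that reading for two models, using the percentages copied from the table. It rests on two assumptions: that every prompt carries equal weight, and that averaging the rounded values shown here is close enough to the page's own calculation, which presumably uses unrounded data. The names `per_prompt` and `overall_score` are hypothetical and not part of the evaluation tool.

```python
from statistics import mean

# Per-prompt coverage (%) copied from the twelve prompt rows of the table.
per_prompt = {
    "Claude 3.5 Haiku": [63, 73, 96, 80, 50, 90, 100, 63, 30, 100, 65, 83],
    "Grok 3 Mini": [100, 100, 100, 100, 98, 100, 100, 96, 100, 100, 75, 93],
}


def overall_score(values):
    """Unweighted mean of per-prompt percentages, rounded to one decimal.

    Assumption: every prompt counts equally toward the overall score.
    """
    return round(mean(values), 1)


scores = {model: overall_score(vals) for model, vals in per_prompt.items()}

for model, score in scores.items():
    print(f"{model}: {score}%")
```

For these two models the result matches the Score row (74.4% and 96.8%). Other columns may differ in the last digit because the table shows rounded inputs.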