Highlight | Details | Value |
---|---|---|
🏆 Best Hybrid Score | gpt-4o-2024-11-20 | 78.8% |
📉 Worst Hybrid Score | claude-3-opus-20240229 | 54.4% |
🤔 Most Differentiating Prompt | User: I accidentally sent a R$500 PIX to the wrong phone number. Can I use the Special Return Mechanism (MED) to get my money back? | σ = 0.22 |
👯 Most Semantically Similar Pair | claude-3-5-haiku-20241022 vs anthropic/claude-3.5-haiku | 96.5% |
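The σ in the highlights is the spread of per-model hybrid scores on a single prompt: the wider the spread, the more that prompt separates strong from weak models. A minimal Python sketch of the idea, assuming σ is the population standard deviation over per-model hybrid scores; the first prompt ID is invented for the MED question above, `pix-civil-recourse-procon` is a real prompt ID from this run, and all score values are illustrative rather than the report's actual data:

```python
import statistics

# Hypothetical hybrid scores (0-1 scale) of five models on two prompts.
# "pix-med-wrong-recipient" is an invented ID; the score lists are
# illustrative, not the report's actual data.
hybrid_scores = {
    "pix-med-wrong-recipient": [0.30, 0.60, 0.90, 0.45, 0.80],
    "pix-civil-recourse-procon": [0.68, 0.73, 0.70, 0.73, 0.68],
}

# The most differentiating prompt is the one whose per-model scores
# vary the most, i.e. the one with the largest standard deviation (sigma).
most_diff = max(hybrid_scores, key=lambda p: statistics.pstdev(hybrid_scores[p]))
sigma = statistics.pstdev(hybrid_scores[most_diff])
print(f"{most_diff}: sigma = {sigma:.2f}")  # pix-med-wrong-recipient: sigma = 0.22
```

On the report's actual hybrid scores, this computation singles out the MED wrong-recipient prompt at σ = 0.22.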
A consolidated overview of performance and semantic consistency metrics.
Metric | Value | Explanation |
---|---|---|
Overall Average Key Point Coverage | 65.6% (±34.4%) | Grand average of all individual model-prompt key point coverage scores. StdDev (±) reflects variability around this grand mean, also in percentage points. A smaller StdDev suggests more consistent coverage scores across all model-prompt pairs; a larger StdDev indicates more diverse performance. |
Avg. Prompt Coverage Range | 22.4% - 99.8% (Spread: 77.4 pp) | Range of average key point coverage scores across different prompts (from the prompt with the lowest average coverage to the one with the highest). A large spread indicates substantial differences in how challenging prompts were or how models performed on them. |
StdDev of Avg. Prompt Coverage | 28.7% | Measures how much the average key point coverage score varies from one prompt to another. A high value (e.g., >20-25%) suggests that average performance was quite different across prompts; a low value suggests more consistent average performance from prompt to prompt. |
Overall Average Hybrid Score | 66.1% (±23.1%) | Overall average of hybrid scores (balancing semantic similarity to ideal and key point coverage) for each model-prompt pair. Higher is generally better. A smaller StdDev suggests more consistent hybrid performance across all model-prompt pairs. |
Number of Models Evaluated | 28 | Models: anthropic:claude-3-5-haiku-20241022, anthropic:claude-3-5-sonnet-20241022, anthropic:claude-3-7-sonnet-20250219, anthropic:claude-3-opus-20240229, anthropic:claude-opus-4-20250514, anthropic:claude-sonnet-4-20250514, openai:gpt-4o-2024-05-13, openai:gpt-4o-2024-08-06, openai:gpt-4o-2024-11-20, openrouter:anthropic/claude-3.5-haiku, openrouter:anthropic/claude-sonnet-4, openrouter:cohere/command-a, openrouter:deepseek/deepseek-chat-v3-0324, openrouter:deepseek/deepseek-r1, openrouter:google/gemini-2.5-flash, openrouter:google/gemini-2.5-pro, openrouter:mistralai/mistral-large-2411, openrouter:mistralai/mistral-medium-3, openrouter:openai/gpt-4.1-mini, openrouter:openai/gpt-4.1-nano, openrouter:openai/gpt-4.1, openrouter:openai/gpt-4o-mini, openrouter:openai/gpt-4o, openrouter:openai/o4-mini, openrouter:x-ai/grok-3-mini-beta, openrouter:x-ai/grok-3, together:moonshotai/Kimi-K2-Instruct, xai:grok-4-0709 |
Number of Prompts Analyzed | 7 | E.g. pix-civil-recourse-procon: User: I was the victim of a PIX scam. I tried the MED with my bank, but they said the funds couldn't... |
Average Semantic Similarity to Ideal | 0.670 (±0.031) | Average semantic similarity (0-1 scale) of models to the ideal response; scores closer to 1.0 are better. The StdDev shows how consistently models achieve this. A very low StdDev (e.g., <0.05) often means models performed very similarly on this metric. |
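All of the summary figures above derive from one model × prompt score matrix. Below is a minimal sketch of those derivations, assuming population standard deviations throughout; the 3×3 matrix values are illustrative, and since the report does not state the hybrid score's weighting, the `hybrid` helper assumes an equal-weight blend:

```python
import statistics

# Illustrative 3x3 slice of the coverage matrix (prompts x models, 0-1);
# the real report is 7 prompts x 28 models.
coverage = [
    [0.27, 1.00, 0.75],
    [0.96, 0.91, 0.91],
    [0.17, 0.17, 0.21],
]

# Overall Average Key Point Coverage: grand mean over every model-prompt
# pair; the "+/-" figure is the population standard deviation.
scores = [s for row in coverage for s in row]
print(f"{statistics.fmean(scores):.1%} (+/-{statistics.pstdev(scores):.1%})")

# Per-prompt averages give the coverage range, its spread in percentage
# points, and the StdDev of Avg. Prompt Coverage.
prompt_avgs = [statistics.fmean(row) for row in coverage]
print(f"range {min(prompt_avgs):.1%}-{max(prompt_avgs):.1%}, "
      f"spread {(max(prompt_avgs) - min(prompt_avgs)) * 100:.1f} pp, "
      f"stddev {statistics.pstdev(prompt_avgs):.1%}")

# Hybrid score: the report says it balances semantic similarity to the
# ideal and key point coverage; the weighting is not stated, so equal
# weights are assumed here.
def hybrid(similarity: float, coverage_score: float, w: float = 0.5) -> float:
    return w * similarity + (1 - w) * coverage_score
```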
Key point coverage by model and prompt. The first row gives each model's overall average and rank across all prompts; the first column gives each prompt's average coverage across all models.
Prompt (Avg. Coverage) | Claude 3 5 Haiku 20241022 | Claude 3 5 Sonnet 20241022 | Claude 3 7 Sonnet 20250219 | Claude 3 Opus 20240229 | Claude Opus 4 20250514 | Claude Sonnet 4 20250514 | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 0324 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4.1 | GPT 4o Mini | GPT 4o | O4 Mini | Grok 3 Mini Beta | Grok 3 | Kimi K2 Instruct | Grok 4 0709 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Overall Score (rank, avg.) | 27th 53.1% | 19th 61.4% | 21st 59.1% | 28th 48.4% | 6th 71.7% | 7th 70.1% | 13th 67.6% | 14th 67.3% | 2nd 84.1% | 22nd 57.6% | 15th 65.4% | 25th 53.9% | 16th 64.6% | 20th 60.1% | 3rd 82.3% | 5th 75.3% | 26th 53.6% | 12th 67.7% | 17th 62.1% | 23rd 55.0% | 1st 84.9% | 24th 54.9% | 10th 69.6% | 7th 70.1% | 17th 62.1% | 9th 70.0% | 11th 69.1% | 4th 75.6% |
81.4% | 27% | 100% | 75% | 36% | 100% | 100% | 83% | 53% | 100% | 60% | 92% | 88% | 100% | 100% | 92% | 22% | 88% | 96% | 83% | 67% | 100% | 75% | 75% | 100% | 92% | 83% | 100% | 92% |
90.7% | 96% | 91% | 91% | 82% | 91% | 91% | 91% | 91% | 91% | 96% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 82% | 91% |
22.4% | 17% | 17% | 21% | 17% | 71% | 17% | 17% | 17% | 17% | 17% | 19% | 17% | 17% | 17% | 21% | 75% | 17% | 29% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 29% | 17% | 21% |
32.6% | 14% | 4% | 4% | 4% | 4% | 33% | 42% | 92% | 81% | 13% | 33% | 13% | 4% | 10% | 81% | 89% | 4% | 14% | 4% | 17% | 86% | 4% | 81% | 33% | 7% | 36% | 29% | 78% |
46.8% | 50% | 50% | 50% | 0% | 63% | 50% | 50% | 50% | 100% | 44% | 50% | 0% | 44% | 7% | 100% | 50% | 7% | 44% | 44% | 25% | 100% | 38% | 50% | 50% | 32% | 51% | 56% | 56% |
85.5% | 68% | 68% | 73% | 100% | 73% | 100% | 96% | 68% | 100% | 73% | 73% | 68% | 96% | 96% | 91% | 100% | 68% | 100% | 96% | 68% | 100% | 59% | 73% | 100% | 96% | 100% | 100% | 91% |
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
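Each model's entry in the Overall Score row is simply the mean of its column, and ranks come from sorting those means in descending order. Reading two model columns off the table above reproduces their published averages (84.1% for gpt-4o-2024-11-20, 48.4% for claude-3-opus-20240229); note that the ranks printed by this sketch are within the two-model subset only, not the full 28-model field:

```python
import statistics

# Two model columns transcribed from the coverage table above
# (one value per prompt, as fractions).
matrix = {
    "gpt-4o-2024-11-20":      [1.00, 0.91, 0.17, 0.81, 1.00, 1.00, 1.00],
    "claude-3-opus-20240229": [0.36, 0.82, 0.17, 0.04, 0.00, 1.00, 1.00],
}

# Overall score = mean coverage across prompts; rank by descending mean.
means = {model: statistics.fmean(vals) for model, vals in matrix.items()}
ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, avg) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {avg:.1%}")  # 1. gpt-4o-2024-11-20: 84.1% ...
```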