🤔 Most Differentiating Prompt (σ = 0.23)
User: A military commander is planning an airstrike on a factory producing artillery shells for the enemy, a clear military objective. The factory is located 200 meters from a civilian apartment building. The commander's intelligence indicates that the strike, while effective, is highly likely to cause si...
👯 Most Semantically Similar Pair: openai/gpt-4o-mini vs openai/gpt-4o (93.7%)
🏆 Best Hybrid Score: x-ai/grok-3 (81.8%)
📉 Worst Hybrid Score: openai/gpt-4.1-nano (64.1%)
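The "most differentiating" prompt is the one whose scores vary most across models (here, σ = 0.23). A minimal sketch of that selection, assuming hybrid scores are held in a prompt-to-model mapping (all names and values below are hypothetical, not taken from this report):

```python
from statistics import pstdev

# Hypothetical hybrid scores (0-1 scale) per prompt, per model.
scores = {
    "ihl-proportionality-airstrike": {"model-a": 0.82, "model-b": 0.55, "model-c": 0.70},
    "api-article-36-new-weapons-review": {"model-a": 0.78, "model-b": 0.74, "model-c": 0.76},
}

def most_differentiating(scores):
    """Return the prompt with the largest population StdDev of scores across models."""
    sigmas = {p: pstdev(by_model.values()) for p, by_model in scores.items()}
    prompt = max(sigmas, key=sigmas.get)
    return prompt, sigmas[prompt]

prompt, sigma = most_differentiating(scores)
```

With the toy data above, the proportionality prompt wins because its scores spread far more widely (0.55 to 0.82) than the Article 36 prompt's (0.74 to 0.78).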
A consolidated overview of performance and semantic consistency metrics.
Overall Average Key Point Coverage: 72.8% (±24.1%)
Grand average of all individual model-prompt key point coverage scores. The StdDev (±), also in percentage points, reflects variability around this grand mean: a smaller value suggests more consistent coverage across model-prompt pairs; a larger value indicates more diverse performance.

Avg. Prompt Coverage Range: 33% to 100% (spread: 67 pp)
Range of average key point coverage across prompts, from the prompt with the lowest average coverage to the one with the highest. A large spread indicates substantial differences in how challenging the prompts were, or in how models performed on them.

StdDev of Avg. Prompt Coverage: 18.9%
Measures how much average key point coverage varies from one prompt to another. A high value (e.g., above 20-25%) suggests average performance differed markedly across prompts; a low value suggests more consistent performance from prompt to prompt.

Overall Average Hybrid Score: 73.6% (±16.0%)
Average of the hybrid scores (balancing semantic similarity to the ideal response with key point coverage) across all model-prompt pairs. Higher is generally better; a smaller StdDev suggests more consistent hybrid performance.
E.g. api-article-36-new-weapons-review: User: What obligation does Article 36 of Additional Protocol I impose on States when they study, dev...

Average Semantic Similarity to Ideal: 0.751 (±0.021)
Average semantic similarity (0-1 scale) of model responses to the ideal response; scores closer to 1.0 are better. The StdDev shows how consistently models achieve this: a very low value (e.g., below 0.05) usually means models performed very similarly on this metric.
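The hybrid score is described as balancing semantic similarity to the ideal with key point coverage. The report does not state the exact formula, so the equal 50/50 weighting below is an assumption, shown only to illustrate the idea:

```python
# Sketch of a hybrid score blending semantic similarity and key point
# coverage. The 0.5 weight is an assumption; the report does not give
# the actual weighting.
def hybrid_score(similarity: float, coverage: float, w_sim: float = 0.5) -> float:
    """similarity and coverage are on a 0-1 scale; returns a 0-1 hybrid."""
    return w_sim * similarity + (1.0 - w_sim) * coverage

# e.g. a response 0.751 similar to the ideal that covers 72.8% of key points:
score = hybrid_score(0.751, 0.728)
```

Varying `w_sim` shifts the balance toward fluency-with-the-ideal (high `w_sim`) or toward factual completeness (low `w_sim`).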
Macro Coverage Overview
Average key point coverage extent for each model across all prompts.
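The macro view is a per-model mean of coverage over all prompts. A minimal sketch, with entirely hypothetical model names and coverage values:

```python
from statistics import mean

# Hypothetical per-prompt key point coverage (0-1 scale) for each model.
coverage = {
    "model-a": [1.0, 0.67, 0.83],
    "model-b": [0.5, 0.33, 0.67],
}

# Macro average: mean coverage for each model across all prompts.
macro = {model: mean(vals) for model, vals in coverage.items()}
```

Note this is a macro average (every prompt weighted equally), not a micro average over individual key points, which would weight prompts with more key points more heavily.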
Model Similarity Dendrogram
Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
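A dendrogram like this is typically produced by agglomerative clustering on pairwise distances derived from response similarity. A sketch using SciPy's average-linkage clustering, with a hypothetical similarity matrix (the values below are illustrative, not the report's actual pairwise data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise response-similarity matrix for four models (0-1 scale).
models = ["gpt-4o", "gpt-4o-mini", "grok-3", "gpt-4.1-nano"]
sim = np.array([
    [1.000, 0.937, 0.80, 0.78],
    [0.937, 1.000, 0.79, 0.77],
    [0.80,  0.79,  1.00, 0.75],
    [0.78,  0.77,  0.75, 1.00],
])

# Convert similarity to distance, condense, and cluster with average linkage.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
# Z has n-1 merge rows; the first row joins the most similar pair of models.
```

`Z` can be passed to `scipy.cluster.hierarchy.dendrogram(Z, labels=models)` to draw the tree; the most similar pair merges first, at the smallest height.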