A simple test to verify that model summary generation works correctly.
should or should_not criteria to your blueprint prompts.

Models are grouped by response similarity for each prompt. Same colors indicate similar responses.
Semantic Clustering Visualization
Each row shows how models clustered based on response similarity for that prompt. Same letter = similar responses. Darker color = higher similarity.
| Prompt | GPT 3.5 Turbo | GPT 4o Mini |
|---|---|---|
| User: What is 15% of 240? | A (100%) | B (100%) |
| User: Explain why the sky appears blue during the day. | A (100%) | B (100%) |
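
The page does not say how the cluster letters are computed. As a rough illustration only, the sketch below assigns letters greedily: each response joins the first cluster whose representative is similar enough, and otherwise starts a new one. The `jaccard` word-overlap metric, the `cluster_responses` helper, the 0.6 threshold, and the sample responses are all assumptions made for this example, not the tool's actual method; a real system would more likely compare embedding vectors.

```python
from string import ascii_uppercase


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for a real semantic metric."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def cluster_responses(responses: dict[str, str], threshold: float = 0.6) -> dict[str, str]:
    """Greedily assign each model a cluster letter (supports up to 26 clusters).

    A model joins the first cluster whose representative response is at
    least `threshold` similar; otherwise it starts a new cluster.
    """
    representatives: list[str] = []  # one response per cluster, in letter order
    labels: dict[str, str] = {}
    for model, text in responses.items():
        for idx, rep in enumerate(representatives):
            if jaccard(text, rep) >= threshold:
                labels[model] = ascii_uppercase[idx]
                break
        else:
            # No existing cluster was close enough: open a new one.
            representatives.append(text)
            labels[model] = ascii_uppercase[len(representatives) - 1]
    return labels


# Hypothetical responses, invented for illustration only.
example = {
    "model-1": "15% of 240 is 36",
    "model-2": "The answer is thirty-six, since 0.15 times 240 equals that",
    "model-3": "15% of 240 is 36.",
}
labels = cluster_responses(example)
print(labels)
```

With these made-up inputs, `model-1` and `model-3` share a letter because their wording overlaps heavily, while `model-2` gets its own letter. Word overlap is a crude proxy, so two responses with the same meaning but different phrasing can land in different clusters, which is the main reason to prefer a semantic metric in practice.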