This run used embeddings to measure semantic similarity between model responses. The table and dendrogram below show how models clustered based on their response content.

To also see rubric-based coverage analysis, add should or should_not criteria to your blueprint prompts.
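As a rough illustration of how embeddings can be turned into a similarity score, the sketch below computes cosine similarity between vectors. It assumes cosine similarity is the metric, which this page does not state. The three short vectors are made-up stand-ins for real embeddings, and the names `response_a`, `response_b`, and `response_c` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
response_a = [0.9, 0.1, 0.0]
response_b = [0.8, 0.2, 0.1]
response_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(response_a, response_b)  # vectors point the same way
sim_ac = cosine_similarity(response_a, response_c)  # vectors are nearly orthogonal
```

Here `sim_ab` comes out much higher than `sim_ac`, which is the property a similarity-based grouping relies on.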

Semantic Clustering Overview

Models are grouped by response similarity for each prompt. Same colors indicate similar responses.

Semantic Clustering Visualization

Each row shows, for one prompt, which cluster each model's response fell into. The same letter means similar responses; a darker color means higher similarity.

Legend: Cluster A, Cluster B, Cluster C, Cluster D, Cluster E

Prompt	GPT 3.5 Turbo	GPT 4o Mini
User: What is 15% of 240?	A (100%)	B (100%)
User: Explain why the sky appears blue during the day.	A (100%)	B (100%)
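One way cluster letters like those in the table could be assigned is a greedy pass over the responses for a single prompt: each response joins the first cluster whose seed it resembles closely enough, otherwise it starts a new cluster. This is only a sketch under assumptions. The function `assign_clusters`, the `threshold` value of 0.85, and the greedy strategy are all hypothetical; the page does not say how clusters or the percentages are actually computed.

```python
import math
import string


def cosine_similarity(a, b):
    """Cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_clusters(embeddings, threshold=0.85):
    """Greedily label each model's response with a cluster letter.

    `embeddings` maps a model name to its response embedding for one prompt.
    The threshold is an illustrative assumption, not a documented setting.
    Supports at most 26 clusters because labels are single letters.
    """
    seeds = []   # list of (letter, seed embedding), in creation order
    labels = {}
    for model, vector in embeddings.items():
        for letter, seed in seeds:
            if cosine_similarity(vector, seed) >= threshold:
                labels[model] = letter
                break
        else:
            # No existing cluster is close enough, so open a new one.
            letter = string.ascii_uppercase[len(seeds)]
            seeds.append((letter, vector))
            labels[model] = letter
    return labels


# Made-up embeddings for two hypothetical models answering one prompt.
different = assign_clusters({"model_x": [1.0, 0.0], "model_y": [0.0, 1.0]})
similar = assign_clusters({"model_x": [1.0, 0.0], "model_y": [0.99, 0.05]})
```

With dissimilar vectors the two models land in separate clusters (A and B), matching the pattern in the rows above; with near-identical vectors they share cluster A.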

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar. This visualization is based on semantic embeddings of model responses.
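The merge order that a dendrogram draws can be sketched with a small agglomerative clustering routine. The version below uses average linkage over pairwise distances (for example, 1 minus cosine similarity). The linkage method is an assumption, since the page does not name one, and the model names and distances are made up for illustration.

```python
def average_linkage(distance, names):
    """Agglomerative clustering with average linkage.

    `distance` maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges in order as (members_a, members_b, distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance[frozenset(p)] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up pairwise distances for three hypothetical models.
names = ["model_x", "model_y", "model_z"]
distance = {
    frozenset(("model_x", "model_y")): 0.10,
    frozenset(("model_x", "model_z")): 0.60,
    frozenset(("model_y", "model_z")): 0.50,
}
merges = average_linkage(distance, names)
```

In this toy input, `model_x` and `model_y` merge first because they are closest, and `model_z` joins them last at a larger distance. A dendrogram shows this as a low branch joining the first pair and a higher branch joining the third model.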