ASQA Longform 40

This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions. It uses 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper "ASQA: Factoid Questions Meet Long-Form Answers".

The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.

The ideal answers are human-written summaries from the original ASQA dataset, in which trained annotators synthesized the provided source materials into a coherent narrative. The "should" assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.

The prompts are sourced from AmbigQA, and this subset uses only examples that require substantial long-form answers (at least 50 words) to test for deep explanatory power.

TAGS: Instruction Following & Prompt Adherence, Factual Accuracy & Hallucination, Long Form Question Answering, Reasoning, General Knowledge

Best Models (Coverage)

1. Grok 4: 49.9%
2. Gemini 2.5 Pro Preview 05 06: 38.7%
3. Grok 3: 36.3%
4. Deepseek Chat V3: 34.2%
5. Deepseek R1: 34.2%

🤔 Most Differentiating Prompt

User: What book is the new season of game of thrones?

σ = 0.239
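The σ figure is not defined on the page. A plausible reading, offered here only as an assumption, is that it is the standard deviation of per-model coverage on a single prompt, expressed as a fraction. The sketch below applies that reading to one row of the coverage table further down the page (the row whose first value is 20.1%). That row's sample standard deviation rounds to 0.239, which is consistent with the reported value but does not by itself establish that this row is the Game of Thrones prompt.

```python
import statistics

# Per-model coverage (percent) for one prompt, copied left to right from the
# table row whose first value is 20.1%. Cell values are rounded in the source.
row_percent = [0, 0, 0, 0, 53, 0, 38, 13, 0, 34, 13, 0, 0, 10, 56, 59, 22, 63]

# Convert to fractions so the result is on the same 0..1 scale as the reported sigma.
row = [v / 100 for v in row_percent]

# Assumption: sigma is the sample standard deviation (n - 1 in the denominator).
sigma = statistics.stdev(row)

# For comparison, the population standard deviation (n in the denominator).
sigma_population = statistics.pstdev(row)

print(f"sample sigma:     {sigma:.3f}")
print(f"population sigma: {sigma_population:.3f}")
```

A prompt with a large spread is one on which models disagree most in how many rubric points they cover, which is presumably why it is labelled "most differentiating".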

🔀 Least Similar Models

O4 Mini vs Grok 4

76.1% similarity

👯 Most Similar Models

GPT 4o Mini vs GPT 4o

90.8% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Avg. (all models)	Claude 3.5 Haiku	Claude Sonnet 4	Command A	Deepseek Chat V3	Deepseek R1	Gemini 2.5 Flash	Gemini 2.5 Pro Preview 05 06	Mistral Large 2411	Mistral Medium 3	GPT 4.1	GPT 4.1 Mini	GPT 4.1 Nano	GPT 4o	GPT 4o Mini	O4 Mini	Grok 3	Grok 3 Mini	Grok 4
Score	16th 22.6%	9th 30.0%	8th 30.3%	4th 34.2%	5th 34.2%	18th 20.0%	2nd 38.7%	12th 26.8%	7th 30.8%	10th 29.5%	13th 24.1%	17th 20.3%	15th 22.7%	14th 23.4%	11th 27.4%	3rd 36.3%	6th 32.1%	1st 49.9%
28.5%	31%	27%	38%	34%	30%	22%	33%	28%	20%	27%	27%	25%	30%	27%	27%	31%	22%	34%
9.9%	13%	5%	11%	5%	11%	13%	9%	11%	16%	17%	2%	0%	13%	3%	9%	14%	11%	16%
11.3%	0%	14%	0%	34%	23%	7%	2%	0%	7%	32%	2%	0%	0%	2%	2%	43%	7%	29%
12.1%	4%	17%	0%	17%	25%	17%	35%	10%	6%	6%	15%	13%	0%	0%	4%	13%	10%	25%
24.4%	17%	31%	17%	17%	29%	17%	33%	33%	29%	17%	17%	33%	15%	17%	17%	33%	17%	50%
32.3%	33%	33%	33%	38%	33%	17%	33%	33%	38%	33%	33%	25%	31%	33%	31%	33%	33%	38%
20.1%	0%	0%	0%	0%	53%	0%	38%	13%	0%	34%	13%	0%	0%	10%	56%	59%	22%	63%
26.1%	10%	20%	25%	23%	10%	20%	60%	20%	10%	20%	15%	10%	20%	20%	20%	23%	60%	83%
44.3%	29%	56%	56%	52%	50%	27%	56%	48%	52%	35%	46%	27%	31%	38%	25%	54%	50%	65%
2.4%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	3%	0%	0%	0%	3%	0%	0%	38%
36.1%	44%	28%	47%	44%	44%	13%	47%	38%	44%	23%	25%	19%	23%	25%	25%	38%	44%	78%
79.7%	75%	79%	83%	83%	83%	75%	83%	83%	83%	79%	79%	75%	71%	71%	83%	83%	83%	83%
18.6%	16%	18%	16%	30%	18%	18%	16%	18%	18%	18%	18%	18%	16%	16%	18%	18%	21%	23%
26.9%	20%	25%	38%	38%	38%	0%	30%	30%	33%	30%	20%	18%	25%	20%	28%	28%	15%	48%
47.7%	45%	60%	50%	63%	55%	15%	58%	43%	60%	60%	50%	23%	40%	20%	20%	68%	60%	68%
25.3%	0%	38%	0%	35%	25%	13%	27%	35%	23%	33%	33%	31%	19%	31%	25%	31%	19%	38%
20.3%	17%	21%	21%	21%	17%	15%	23%	21%	21%	21%	17%	21%	21%	21%	21%	21%	21%	25%
49.8%	27%	48%	75%	71%	56%	19%	40%	54%	42%	42%	33%	50%	52%	42%	50%	58%	50%	88%
0.0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
44.7%	54%	58%	56%	33%	33%	29%	50%	56%	33%	33%	35%	31%	27%	31%	50%	58%	54%	83%
17.1%	20%	20%	20%	10%	20%	10%	20%	5%	20%	20%	10%	5%	15%	5%	20%	20%	20%	48%
29.1%	30%	33%	30%	23%	48%	33%	48%	20%	20%	30%	18%	23%	20%	20%	30%	23%	25%	50%
44.4%	54%	48%	42%	54%	46%	33%	48%	42%	48%	33%	27%	31%	35%	38%	38%	63%	60%	60%
36.0%	30%	55%	52%	32%	48%	11%	79%	6%	63%	32%	18%	21%	23%	23%	11%	38%	43%	63%
44.4%	33%	44%	33%	71%	46%	19%	31%	60%	48%	33%	33%	8%	27%	44%	25%	77%	75%	92%
84.6%	41%	77%	77%	98%	80%	80%	98%	79%	98%	82%	79%	75%	91%	95%	82%	100%	93%	98%
47.2%	22%	75%	56%	44%	47%	25%	75%	31%	31%	72%	31%	31%	34%	28%	50%	75%	50%	72%
44.8%	38%	25%	48%	50%	50%	46%	46%	50%	50%	44%	48%	23%	48%	48%	48%	48%	50%	46%
46.5%	50%	35%	50%	53%	55%	20%	60%	0%	53%	55%	48%	53%	45%	45%	55%	60%	35%	65%
21.1%	10%	13%	13%	30%	25%	23%	30%	18%	35%	25%	8%	10%	15%	13%	23%	28%	28%	33%
17.9%	33%	4%	10%	35%	10%	17%	4%	29%	38%	8%	33%	8%	8%	13%	4%	44%	9%	15%
27.4%	5%	33%	30%	38%	48%	35%	53%	13%	35%	30%	10%	10%	8%	10%	40%	20%	25%	50%
19.4%	10%	5%	35%	35%	15%	5%	13%	25%	10%	10%	23%	10%	35%	25%	0%	30%	33%	30%
40.3%	42%	42%	42%	33%	46%	40%	40%	38%	44%	42%	40%	35%	33%	38%	42%	40%	38%	50%
23.9%	13%	16%	5%	30%	42%	25%	83%	27%	20%	45%	8%	11%	14%	14%	6%	13%	13%	45%
16.5%	15%	13%	17%	17%	21%	13%	17%	15%	19%	17%	15%	8%	4%	13%	13%	13%	17%	50%
13.9%	0%	27%	17%	10%	15%	0%	38%	0%	0%	17%	0%	4%	0%	13%		2%	13%	81%
11.4%	0%	17%	8%	15%	21%	0%	29%	0%	17%	0%	15%	0%	0%	0%	31%	8%	17%	27%
12.1%	8%	15%	30%	20%	10%	8%	20%	20%	23%	5%	5%	5%	0%	5%	0%	15%	20%	8%
25.9%	15%	23%	31%	33%	42%	21%	42%	21%	23%	21%	13%	21%	17%	17%	38%	29%	21%	38%
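
The Score row appears to be the unweighted mean of each model's per-prompt coverage. The following is a minimal sketch under that assumption, using Grok 4's column as it appears in the table above. The cell values are rounded in the source, so agreement with the published score is approximate in general; the function name `macro_coverage` is introduced here for illustration only.

```python
# Grok 4's coverage (percent) on each of the 40 prompts, copied top to bottom
# from the last column of the table. Values are rounded in the source.
grok4 = [
    34, 16, 29, 25, 50, 38, 63, 83, 65, 38,
    78, 83, 23, 48, 68, 38, 25, 88, 0, 83,
    48, 50, 60, 63, 92, 98, 72, 46, 65, 33,
    15, 50, 30, 50, 45, 50, 81, 27, 8, 38,
]


def macro_coverage(scores):
    """Unweighted mean of per-prompt coverage (assumed aggregation rule)."""
    return sum(scores) / len(scores)


print(f"{macro_coverage(grok4):.1f}%")  # prints 49.9%
```

The result matches the 49.9% shown for Grok 4 in the Score row, which supports the plain-average reading for this column.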

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
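
The page does not say which linkage method or similarity measure the dendrogram uses. As a rough illustration of the general technique, the sketch below runs average-linkage agglomerative clustering on four placeholder models. The model names `A` to `D` and every similarity value are hypothetical, invented for this example and not taken from the dashboard.

```python
from itertools import combinations

# Hypothetical pairwise similarities (0..1) for four placeholder models.
# These numbers are made up for illustration; they are NOT dashboard data.
SIMILARITY = {
    frozenset({"A", "B"}): 0.91,
    frozenset({"A", "C"}): 0.80,
    frozenset({"A", "D"}): 0.77,
    frozenset({"B", "C"}): 0.82,
    frozenset({"B", "D"}): 0.78,
    frozenset({"C", "D"}): 0.85,
}


def distance(a, b):
    """Turn a similarity into a distance: more similar means closer."""
    return 1.0 - SIMILARITY[frozenset({a, b})]


def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs of models."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def cluster(models):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(m,) for m in models]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        merges.append((c1, c2, average_linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = cluster(["A", "B", "C", "D"])
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height:.3f}")
```

Each merge corresponds to one junction in a dendrogram, and the merge distance is the height at which the two branches join. Pairs that merge early (here `A` with `B`) are the ones a dendrogram draws closest together.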