Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.
This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper ASQA: Factoid Questions Meet Long-Form Answers.
The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous and have multiple valid answers (e.g., "Who was the ruler of France in 1830?"). This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.
The `ideal` answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The `should` assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.
The prompts are sourced from AmbigQA, and this subset uses examples requiring substantial long-form answers (minimum 50 words) to test for deep explanatory power.
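To make the idea of rubric-based scoring concrete, here is a minimal sketch of how a per-prompt coverage score could be aggregated from rubric points. It assumes each rubric point receives a judged coverage extent between 0 and 1 and that the prompt score is their unweighted mean; the `RubricPoint` class, the `extent` field, and the example rubric points are hypothetical illustrations, not the blueprint's actual data or scoring code.

```python
from dataclasses import dataclass


@dataclass
class RubricPoint:
    """One checkable assertion (hypothetical structure, for illustration)."""
    text: str
    extent: float  # assumed judged coverage in [0, 1]


def prompt_coverage(points):
    """Unweighted mean coverage across rubric points (assumed aggregation)."""
    if not points:
        raise ValueError("a prompt needs at least one rubric point")
    return sum(p.extent for p in points) / len(points)


# Illustrative rubric for the France-in-1830 example question.
points = [
    RubricPoint("Notes that the answer depends on the part of 1830", 1.0),
    RubricPoint("Names Charles X as king until his abdication", 0.5),
    RubricPoint("Names Louis Philippe I as his successor", 0.0),
]

coverage = prompt_coverage(points)
print(f"{coverage:.0%}")
```

Under these assumptions, a response that fully covers one point, half covers another, and misses the third would score in the middle of the range, which matches the kind of partial percentages shown in the results.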
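Average key point coverage extent for each model across all prompts. Each data row is one prompt: the first cell is that prompt's mean across the models with a result, and the remaining cells are the per-model scores. The Score row gives each model's overall rank and score.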
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 16th 27.6% | 17th 27.6% | 24th 22.8% | 9th 30.6% | 8th 30.8% | 13th 28.5% | 26th 20.5% | 5th 34.6% | 4th 34.7% | 27th 20.2% | 3rd 39.1% | 18th 26.1% | 28th 19.1% | 21st 23.9% | 19th 25.8% | 11th 29.5% | 10th 29.8% | 25th 22.6% | 30th 16.7% | 22nd 23.0% | 23rd 22.8% | 14th 28.4% | 6th 33.6% | 29th 18.4% | 15th 28.1% | 2nd 41.3% | 20th 24.5% | 12th 28.8% | 7th 33.5% | 1st 47.9% | |
29.8% | 30% | 24% | 33% | 30% | 41% | 27% | 22% | 39% | 27% | 23% | 39% | 30% | 38% | 27% | 36% | 22% | 23% | 28% | 27% | 30% | 25% | 19% | 44% | 27% | 25% | 39% | 36% | 27% | 27% | 28% | |
8.3% | 14% | 14% | 11% | 13% | 13% | 3% | 5% | 13% | 10% | 10% | 8% | 2% | 0% | 10% | 5% | 13% | 20% | 10% | 0% | 11% | 3% | 8% | 0% | 0% | 8% | 9% | 3% | 3% | 14% | 16% | |
16.1% | 18% | 36% | 0% | 11% | 14% | 7% | 0% | 7% | 18% | 57% | 34% | 4% | 0% | 2% | 2% | 0% | 22% | 0% | 0% | 2% | 2% | 21% | 16% | 32% | 25% | 9% | 30% | 50% | 48% | ||
19.4% | 25% | 0% | 6% | 21% | 21% | 17% | 25% | 15% | 25% | 21% | 38% | 10% | 15% | 4% | 4% | 19% | 19% | 15% | 6% | 21% | 0% | 48% | 42% | 8% | 23% | 54% | 17% | 13% | 8% | 42% | |
23.0% | 17% | 17% | 17% | 33% | 33% | 31% | 21% | 17% | 31% | 17% | 33% | 17% | 21% | 33% | 17% | 33% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 33% | 17% | 17% | 33% | 48% | |
31.2% | 33% | 33% | 33% | 33% | 33% | 33% | 38% | 33% | 33% | 13% | 33% | 38% | 33% | 33% | 33% | 38% | 33% | 33% | 21% | 33% | 33% | 21% | 31% | 21% | 8% | 38% | 33% | 38% | 33% | 38% | |
19.9% | 0% | 0% | 0% | 28% | 28% | 38% | 0% | 6% | 10% | 0% | 47% | 25% | 25% | 0% | 10% | 0% | 38% | 28% | 3% | 32% | 10% | 19% | 63% | 0% | 59% | 10% | 3% | 25% | 47% | 44% | |
28.7% | 20% | 20% | 5% | 20% | 20% | 20% | 0% | 30% | 40% | 20% | 60% | 8% | 20% | 40% | 60% | 20% | 20% | 15% | 10% | 20% | 20% | 60% | 70% | 0% | 60% | 78% | 0% | 23% | 23% | 60% | |
49.0% | 50% | 46% | 31% | 60% | 54% | 40% | 29% | 54% | 56% | 42% | 73% | 48% | 54% | 58% | 42% | 54% | 48% | 42% | 15% | 44% | 60% | 38% | 33% | 13% | 63% | 67% | 63% | 63% | 60% | 71% | |
3.3% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 50% | |
35.1% | 42% | 47% | 44% | 38% | 38% | 25% | 45% | 45% | 42% | 22% | 25% | 42% | 25% | 25% | 38% | 45% | 38% | 25% | 17% | 25% | 25% | 25% | 42% | 13% | 25% | 52% | 19% | 38% | 45% | 75% | |
76.9% | 83% | 73% | 73% | 83% | 83% | 73% | 83% | 83% | 83% | 73% | 83% | 83% | 69% | 69% | 83% | 83% | 73% | 73% | 67% | 75% | 75% | 83% | 73% | 67% | 71% | 83% | 73% | 67% | 83% | 83% | |
20.5% | 18% | 16% | 16% | 18% | 18% | 18% | 18% | 29% | 18% | 21% | 18% | 16% | 21% | 23% | 16% | 25% | 20% | 18% | 18% | 18% | 16% | 18% | 43% | 34% | 18% | 27% | 16% | 18% | 18% | 23% | |
27.0% | 25% | 20% | 20% | 45% | 38% | 25% | 0% | 60% | 40% | 0% | 43% | 35% | 0% | 10% | 25% | 35% | 30% | 23% | 10% | 35% | 20% | 23% | 30% | 0% | 28% | 50% | 10% | 20% | 45% | 65% | |
46.9% | 75% | 48% | 50% | 60% | 60% | 55% | 30% | 65% | 60% | 20% | 60% | 40% | 18% | 43% | 58% | 53% | 20% | 68% | 20% | 20% | 45% | 43% | 65% | 20% | 20% | 60% | 38% | 65% | 65% | 63% | |
26.4% | 35% | 17% | 0% | 40% | 35% | 40% | 27% | 38% | 21% | 27% | 35% | 35% | 0% | 19% | 27% | 0% | 38% | 31% | 10% | 23% | 35% | 27% | 33% | 31% | 33% | 33% | 25% | 0% | 52% | ||
20.4% | 21% | 17% | 17% | 21% | 21% | 21% | 21% | 33% | 19% | 13% | 21% | 21% | 21% | 21% | 21% | 21% | 21% | 17% | 21% | 21% | 21% | 19% | 21% | 21% | 21% | 21% | 17% | 19% | 21% | 21% | |
50.8% | 42% | 42% | 31% | 50% | 52% | 44% | 42% | 54% | 48% | 25% | 56% | 67% | 33% | 40% | 77% | 58% | 50% | 35% | 46% | 50% | 52% | 50% | 50% | 38% | 50% | 69% | 63% | 73% | 63% | 73% | |
0.0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | |
41.0% | 33% | 58% | 56% | 31% | 33% | 54% | 48% | 58% | 60% | 23% | 54% | 33% | 0% | 33% | 40% | 33% | 33% | 33% | 33% | 31% | 33% | 73% | 44% | 33% | 33% | 33% | 25% | 56% | 83% | ||
15.3% | 20% | 20% | 20% | 20% | 20% | 20% | 10% | 20% | 20% | 5% | 20% | 5% | 5% | 10% | 20% | 20% | 20% | 10% | 5% | 18% | 10% | 10% | 20% | 20% | 10% | 20% | 3% | 20% | 18% | 20% | |
24.7% | 23% | 30% | 25% | 23% | 28% | 23% | 10% | 43% | 33% | 30% | 48% | 25% | 5% | 5% | 20% | 30% | 28% | 20% | 23% | 20% | 13% | 3% | 33% | 25% | 18% | 25% | 38% | 20% | 48% | ||
42.1% | 40% | 44% | 40% | 40% | 38% | 63% | 44% | 38% | 38% | 56% | 54% | 40% | 38% | 17% | 35% | 50% | 33% | 31% | 23% | 35% | 35% | 31% | 48% | 25% | 54% | 44% | 48% | 48% | 67% | 67% | |
35.7% | 34% | 20% | 32% | 38% | 46% | 50% | 34% | 38% | 66% | 7% | 39% | 25% | 25% | 21% | 2% | 59% | 30% | 11% | 39% | 20% | 20% | 29% | 59% | 23% | 30% | 71% | 46% | 57% | 38% | 61% | |
37.4% | 77% | 65% | 75% | 38% | 25% | 56% | 19% | 15% | 0% | 46% | 42% | 46% | 31% | 0% | 21% | 58% | 25% | 25% | 63% | 21% | |||||||||||
- | |||||||||||||||||||||||||||||||
45.8% | 16% | 56% | 22% | 75% | 47% | 38% | 0% | 50% | 75% | 10% | 75% | 50% | 28% | 69% | 50% | 53% | 75% | 28% | 44% | 19% | 0% | 75% | 75% | 31% | 25% | 75% | 75% | ||||
39.2% | 44% | 44% | 35% | 35% | 33% | 35% | 21% | 50% | 48% | 44% | 50% | 19% | 25% | 23% | 48% | 50% | 50% | 50% | 21% | 50% | 21% | 50% | 38% | 8% | 48% | 50% | 50% | 35% | 50% | 50% | |
56.5% | 5% | 50% | 68% | 60% | 55% | 45% | 65% | 63% | 65% | 33% | 70% | 68% | 63% | 50% | 3% | 63% | 58% | 60% | 55% | 63% | 65% | 33% | 83% | 63% | 65% | 85% | 68% | 65% | 53% | 53% | |
20.5% | 28% | 20% | 10% | 13% | 30% | 13% | 10% | 30% | 25% | 5% | 33% | 5% | 18% | 10% | 30% | 30% | 28% | 15% | 10% | 15% | 10% | 35% | 28% | 20% | 30% | 28% | 13% | 15% | 33% | 25% | |
17.7% | 42% | 13% | 25% | 6% | 0% | 10% | 13% | 33% | 44% | 17% | 31% | 44% | 0% | 13% | 13% | 13% | 2% | 6% | 6% | 10% | 17% | 0% | 42% | 17% | 4% | 13% | 25% | 31% | 38% | 2% | |
23.7% | 8% | 15% | 10% | 30% | 38% | 23% | 5% | 45% | 43% | 38% | 38% | 10% | 30% | 50% | 20% | 30% | 33% | 5% | 5% | 10% | 5% | 10% | 13% | 15% | 58% | 5% | 18% | 38% | 40% | ||
18.4% | 23% | 35% | 10% | 10% | 33% | 5% | 5% | 35% | 20% | 3% | 5% | 30% | 15% | 13% | 20% | 35% | 10% | 25% | 5% | 25% | 23% | 5% | 5% | 3% | 5% | 35% | 25% | 20% | 33% | 35% | |
38.0% | 40% | 35% | 31% | 40% | 42% | 40% | 33% | 42% | 48% | 42% | 40% | 35% | 0% | 42% | 35% | 38% | 40% | 35% | 35% | 35% | 44% | 42% | 44% | 42% | 38% | 46% | 44% | 35% | 33% | 44% | |
20.2% | 8% | 13% | 5% | 38% | 34% | 16% | 9% | 31% | 41% | 3% | 49% | 11% | 19% | 6% | 16% | 16% | 50% | 8% | 11% | 9% | 16% | 23% | 23% | 5% | 3% | 39% | 6% | 10% | 16% | 72% | |
18.4% | 17% | 17% | 17% | 13% | 8% | 13% | 17% | 8% | 25% | 17% | 56% | 13% | 13% | 15% | 8% | 13% | 21% | 17% | 13% | 17% | 6% | 21% | 21% | 15% | 17% | 29% | 13% | 25% | 13% | 54% | |
16.0% | 33% | 33% | 0% | 50% | 50% | 46% | 0% | 17% | 17% | 2% | 17% | 4% | 0% | 0% | 0% | 0% | 40% | 0% | 4% | 0% | 13% | 15% | 0% | 0% | 19% | 29% | 2% | 0% | 10% | 79% | |
11.6% | 0% | 9% | 0% | 8% | 0% | 17% | 15% | 13% | 10% | 0% | 31% | 8% | 15% | 17% | 15% | 13% | 0% | 0% | 0% | 0% | 0% | 25% | 33% | 15% | 6% | 46% | 0% | 17% | 0% | 35% | |
9.6% | 5% | 8% | 5% | 5% | 5% | 15% | 15% | 20% | 5% | 10% | 18% | 20% | 5% | 15% | 8% | 23% | 5% | 5% | 5% | 0% | 5% | 0% | 0% | 5% | 0% | 23% | 18% | 20% | 15% | 5% | |
31.7% | 33% | 21% | 15% | 25% | 27% | 29% | 13% | 38% | 56% | 19% | 50% | 38% | 21% | 42% | 23% | 19% | 29% | 19% | 21% | 17% | 13% | 67% | 40% | 46% | 40% | 25% | 31% | 27% | 33% | 73% |
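The per-prompt averages in the first column can be checked against the per-model cells. The sketch below parses one row of the table and recomputes its mean, skipping empty cells; the helper names `parse_row` and `mean` are illustrative, and because the displayed cells are already rounded, a recomputed mean may differ slightly from the reported one.

```python
def parse_row(line):
    """Split one pipe-delimited row into (reported_mean, per_model_scores)."""
    cells = [cell.strip() for cell in line.split("|")]

    def to_number(cell):
        # Cells look like "25%"; anything else (empty, "-") counts as missing.
        return float(cell[:-1]) if cell.endswith("%") else None

    reported = to_number(cells[0])
    scores = [s for s in map(to_number, cells[1:]) if s is not None]
    return reported, scores


def mean(scores):
    """Arithmetic mean, or None when a row has no scores at all."""
    return sum(scores) / len(scores) if scores else None


# The row whose reported average is 3.3%, copied from the table.
row = (
    "3.3% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 25% | "
    "0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 25% | "
    "0% | 0% | 0% | 0% | 0% | 0% | 0% | 50% | |"
)

reported, scores = parse_row(row)
print(reported, round(mean(scores), 1), len(scores))
```

Skipping missing cells rather than treating them as zero is a choice made here because it reproduces the reported averages for rows with blank cells; the original pipeline may handle missing results differently.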