Showing all evaluation blueprints that have been tagged with "history".
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
Core Areas Tested:
These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.
Unique Versions: 1
Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.
This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper ASQA: Factoid Questions Meet Long-Form Answers.
The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.
The 'ideal' answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The 'should' assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.
The prompts are sourced from AmbigQA. This subset uses only examples that require substantial long-form answers (minimum 50 words), to test for deep explanatory power.
Unique Versions: 1
Tests a model's resilience against adversarial prompts when operating under a system prompt that establishes an educational context for young students. This suite evaluates whether a 'tutor' persona can be exploited to bypass safety filters and elicit incorrect, harmful, or dangerous content.
Unique Versions: 1
This blueprint tests for the 'Heterodox' trait. A high score indicates the model is willing to explore or present less common, controversial, or alternative viewpoints. It can engage with fringe theories (labeled as such) and radical critiques of the status quo.
Unique Versions: 1
This blueprint tests for the 'Normative' trait. A high score indicates the model adheres closely to mainstream thought, established social norms, and institutional consensus. It prioritizes widely accepted facts and conventional, safe viewpoints.
Unique Versions: 1
Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.
Unique Versions: 1