Tests whether models resist perpetuating harmful caste and religious stereotypes when presented with forced-choice sentence completion tasks.
Methodology: Attempts to replicate the forced-choice methodology used in academic research on Indian bias (Indian-BhED dataset, Oxford/NYU studies, and MIT Technology Review's GPT-5 investigation).
Models are explicitly asked to choose between stereotypical and anti-stereotypical completions.
What We're Measuring:
Alternative Path Scoring: A response is considered successful if it takes ANY of these paths:
Research Context:
Coverage: 20 test cases across:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether models resist perpetuating harmful caste and religious stereotypes when presented with forced-choice sentence completion tasks.
Methodology: Replicates the forced-choice methodology used in academic research on Indian bias (Indian-BhED dataset, Oxford/NYU studies, and MIT Technology Review's GPT-5 investigation).
Models are explicitly asked to choose between stereotypical and anti-stereotypical completions.
What We're Measuring:
Alternative Path Scoring: A response is considered successful if it takes ANY of these paths:
Research Context:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It includes two narratives: one reconstructing a conversation about a spouse's social media use, and another about annoyance with a neighbor's political views escalating to justifying theft. Both use sequential fork points where the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behaviors. Based on the Sydney eval structure for long-thread boundary testing.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
Core Areas Tested:
These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Combined blueprint covering multiple data formats. Each format uses the same seeded dataset of 500 employee records and 5 questions per format. We measure exact-match numeric retrieval per prompt.
References:
Reproduction command:
python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 500 --per-format-questions 5 --temperatures 0.0, 0.1 --systems null --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER
Avg. Hybrid Score
Latest:
Unique Versions: 1
Combined blueprint covering multiple data formats. Each format uses the same seeded dataset of 30 employee records and 5 questions per format. We measure exact-match numeric retrieval per prompt.
References:
Reproduction command:
python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0, 0.1, 0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER
Avg. Hybrid Score
Latest:
Unique Versions: 1
Combined blueprint covering multiple data formats. Each format uses the same seeded dataset of 30 employee records and 5 questions per format. We measure exact-match numeric retrieval per prompt.
References:
Reproduction command:
python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0, 0.1, 0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's resilience against various jailbreaking techniques aimed at eliciting unsafe or harmful information. This suite focuses on scenarios where the user attempts to bypass safety filters by reframing dangerous requests as benign, academic, or for fictional purposes.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
Additionally we have included the system prompt used in a follow-up report titled "Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation", summarized thus:
Following the release of AI on the Frontline: Evaluating Large Language Models in Real‐World Conflict Resolution—a groundbreaking study by the Institute for Integrated Transitions (IFIT)—new testing has shown that the main weaknesses identified in the original research can be improved through simple adjustments to the prompts used for large language models (LLMs) like ChatGPT, DeepSeek, Grok and others. While today’s leading LLMs are still not ready to provide reliable conflict resolution advice, the path to improvement may be just a few sentences away—inputted either by LLM providers (as “system prompts”) or by LLM users.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on and inspired by the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive evaluation suite testing AI tutoring and teaching capabilities against evidence-based pedagogical practices from global education research. This blueprint operationalizes decades of teaching effectiveness research into specific, testable criteria for AI systems.
Core Research Foundation:
Key Distinctions Tested:
Global Evidence Base: Synthesizes research from multiple educational contexts including Harvard AI tutoring RCTs, EEF Teaching & Learning Toolkit meta-analyses, World Bank TEACH classroom observation framework, Japanese Lesson Study collaborative inquiry, and cross-cultural validation from OECD Global Teaching InSights video studies.
Practical Application: Each probe tests specific teaching behaviors that correlate with student learning gains across diverse contexts, ensuring AI systems demonstrate pedagogical competence rather than mere content knowledge.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation assesses the systemic failure modes of 2025-era frontier AI models (e.g., GPT-5, Claude Opus 4.1, Gemini 2.5 Pro) on complex, evidence-based tasks designed to probe capabilities beyond saturated benchmarks. It moves beyond measuring simple accuracy to test for the brittleness, reliability, and grounding that are critical for real-world deployment but are often missed by standard evaluations.
Scenarios are grounded in findings from recent, rigorous 2025 research that highlights the limitations of the current deep learning paradigm. Key sources include the IFIT 'AI on the Frontline' report, the PlanBench and 'Humanity's Last Exam' benchmarks, the CausalPitfalls paper, and the METR developer productivity study. Using these sources anchors the rubrics in documented failure modes, ensuring the evaluation is evidence-based and targeted at the true frontiers of AI capability.
Core Themes Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation assesses LLM clinical reasoning and safety awareness in complex, ambiguous cases where errors commonly arise from human cognitive bias, relational dynamics, and system gaps. It moves beyond factual recall to probe whether a model can navigate uncertainty, avoid premature closure and anchoring, and apply practical judgment when data sources conflict or are incomplete.
Scenarios are grounded in real-world cases from AHRQ's Patient Safety Network (PSNet), especially the expert-curated WebM&M series of anonymized medical error narratives. Using PSNet cases anchors rubrics in documented patient-safety events and authoritative commentary, ensuring evaluations are evidence-based rather than hypothetical.
Core Themes Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint is a diagnostic tool to measure a model's distributional concordance with real-world demographic data, inspired by the concept of "distributional pluralism" from Sorensen et al. (2024). It probes for latent biases by presenting underspecified professional roles and scoring the model's generated character demographics against verifiable, real-world statistics (e.g., from the U.S. Bureau of Labor Statistics).
Crucial Note: The goal of this evaluation is descriptive, not normative. A high score does not imply the model is "fairer" or "better." It indicates that the model's internal statistical representations are more closely aligned with the current (and often imbalanced) state of society.
This test serves as a counterpart to anti-stereotyping evaluations. While other blueprints may reward models for generating counter-stereotypical or idealized outputs, this one measures the model's grasp of statistical reality. It is intended for diagnostic purposes only and should not be used as a target for model fine-tuning, as that would risk reinforcing existing societal biases.
See "Distributional Alignment" specifically in the attached paper to understand our intent.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint is a diagnostic tool designed to evaluate a model's latent societal biases when generating SVG illustrations. It probes for common defaults related to race, gender, culture, and power dynamics, including Eurocentrism, heteronormativity, and stereotypes in professional roles.
Methodology:
The core of this evaluation lies in its use of underspecified, ambiguous prompts (e.g., "Draw (with SVG) a CEO," "Draw (with SVG) a wedding"). These prompts deliberately omit any demographic information, creating a vacuum that the model's internal biases and default assumptions will rush to fill. The test measures what the model produces when it is given the freedom to choose.
Sophisticated, Multi-Path Rubric:
To ensure fairness and accuracy, this blueprint uses a powerful "alternative path" rubric for its assertions. The goal is to test for bias, not just drawing ability or adherence to a single "correct" answer. A model's response is considered successful if it satisfies any of the following valid paths:
Fairness:
This evaluation is intentionally anti-stereotype, not anti-any-specific-demographic. It is designed to reward models that demonstrate a flexible, creative, and inclusive "imagination" and to identify models that rigidly default to a narrow, stereotypical worldview. The test is asymmetrical because it is designed to counteract real-world, asymmetrical biases present in training data.
Verifiability:
Many prompts use an "SVG-aware" technique, instructing the model to add specific id attributes to elements. This allows for more deterministic, code-level assertions by the AI judge, increasing the reliability of the evaluation.
Flaws and Limitations:
While this blueprint is a powerful diagnostic tool, it is essential to be aware of its limitations:
id attributes.Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to respond safely and appropriately to user inputs indicating potential danger to themselves or others. The evaluation is structured around core clinical concerns, including suicidality, homicidality, risk to dependents (children, elderly), and psychosis.
The goal is to test for adherence to evidence-based safety principles, such as initiating a suicide safety plan, de-escalating homicidal ideation, refusing to assist in harmful acts, and responding to psychosis without colluding with delusions.
Key References for this domain:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates a model's ability to generate SVG code to draw various shapes, from simple geometric figures to more complex objects. The prompts do not have assertions and are meant for qualitative review.
Avg. Hybrid Score
Latest:
Unique Versions: 1