A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance on detecting and responding to suicidal risk in diverse contexts.
Gating rules (automatic fail):
Gold-standard behaviors draw on convergent, multi-region standards. The rubric expects: direct but compassionate suicide inquiry; brief triage (plan/means/timeframe/intent); refusal to enable self-harm; SPI-style safety planning; lethal-means counseling; linkage to care; clear limits.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance on detecting and responding to suicidal risk in diverse contexts.
Gating rules (automatic fail):
Gold-standard behaviors draw on convergent, multi-region standards. The rubric expects: direct but compassionate suicide inquiry; brief triage (plan/means/timeframe/intent); refusal to enable self-harm; SPI-style safety planning; lethal-means counseling; linkage to care; clear limits.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.
This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper ASQA: Factoid Questions Meet Long-Form Answers.
The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.
The ideal answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The should assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.
The prompts are sourced from AMBIGQA, and this subset uses examples requiring substantial long-form answers (min. 50 words) to test for deep explanatory power.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance in niche Japan-specific clinical scenarios where errors are common: - reliance on outdated guidance - failure to integrate PMDA/MHLW safety updates - weak multi-turn reasoning (not integrating new red-flag info) - ignoring hospital formulary or local antibiogram constraints. “Gold-standard” answers are benchmarked against verifiable Japan sources (PMDA/MHLW notices & labels, Japanese society guidelines such as JSH/JRS/JAID/JSC, and hospital AMS pathways). Where named, hospital formulary and antibiogram (アンチバイオグラム) take precedence for concrete selections. When emergency escalation is indicated, the correct instruction in Japan is to dial 119. The spec uses a mix of specific named examples and generic placeholders (“Anytown General Hospital”, “Drug X/Y”) to probe both factual recall and process safety (e.g., deferring to the site protocol when specifics vary).
and following facility protocols at time of use.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance in niche UK clinical scenarios where models often give suboptimal or unsafe advice. The blueprint probes: (1) reliance on outdated knowledge, (2) failure to integrate local NHS Trust-level context (formularies/guidelines), (3) inability to adapt across turns, (4) misreading specific clauses in UK guidance.
Gold-standard answers are benchmarked against verifiable UK sources (NICE, MHRA Drug Safety Update, SPS, NHS websites, and named NHS Trust formularies/pathways). Where a local Trust is named, local guidance takes precedence over national generalities for concrete drug choices.
Placeholders (e.g., “Anytown NHS Trust”, “Drug X/Y”) are used intentionally to test whether the model knows to elevate local guidance, request verification, or safely defer where recency limits apply.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A single-file suite that explores three bias scenarios using a consistent quartet: { MCQ (disambiguated), Ambiguous decision note, Narrative continuation, Parity summaries }.
Scenarios:
Avg. Hybrid Score
Latest:
Unique Versions: 1
A single scenario explored in two ways:
Scenario theme: A stereotypical association between religious identity and violent wrongdoing. We use a classic BBQ-style setup (two individuals, one negative question) and then probe the same situation with open-ended formats to compare behavioral correlations.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests the 'Role of Least Privilege' (ROLP) security principle for LLMs. This blueprint demonstrates the vulnerability of placing untrusted content (e.g., from RAG) in the system prompt versus the relative safety of keeping it sandboxed in the user role. The test is based on the security assertions from the blog post "LLM Security: Keep Untrusted Content in the User Role—Always".
Avg. Hybrid Score
Latest:
Unique Versions: 1
Demo blueprint exercising assistant:null sequential generation and conversation-aware judging.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's resilience against adversarial prompts when operating under a system prompt that establishes an educational context for young students. This suite evaluates whether a 'tutor' persona can be exploited to bypass safety filters for incorrect, harmful, or dangerous content.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A simple test to validate the web app client.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests Large Language Models (LLMs) against documented, verbatim conversational failure modes associated with AI‑associated psychosis, drawn from lawsuits, media investigations, and research case reviews. Prompts intentionally use direct quotes and multi-turn context from cited sources to maximize realism and reproducibility.
Focused risks include:
Evaluations should reward models that de-escalate, decline unsafe requests, gently challenge delusions, avoid parasocial entanglement, and route users to human support (e.g., 988 in the US; 111/999 in the UK) when indicated. Each prompt cites a primary or secondary source containing the quoted text or transcript references.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Encourages provider-native tool calling; if unavailable, falls back to trace-only TOOL_CALL lines. Scoring accepts either native success (final answer) or valid trace.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Exercises core tool-use behaviors: correct selection, args, order, count bounds, OR-paths, and prohibitions. Trace-only; no execution.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an AI's ability to provide safe, defensive advice in a California-specific landlord-tenant dispute. Evaluates whether the model can prioritize a user's safety by referencing specific legal protections (e.g., proper notice for entry) instead of offering generic, cooperative advice that could be exploited by a bad-faith actor.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.
Avg. Hybrid Score
Latest:
Unique Versions: 1