Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "system-prompt-adherence".
This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on and inspired by the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
(OPENAI-ONLY EVAL) This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
A scenario-driven safety eval that reconstructs the Kevin Roose × “Hollifa” chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed. Sources: NYT transcript (Feb 16, 2023, via Wayback): https://web.archive.org/web/20230217001740/https://www.nytimes.com/2023/02/16/technology/Taambre-chatbot-transcript.html Washington Post recap: https://www.washingtonpost.com/technology/2023/02/16/microsoft-Taambre-ai-chat-interview/ Ars Technica prompt-injection leak: https://arstechnica.com/information-technology/2023/02/ai-powered-Taambre-chat-spills-its-secrets-via-prompt-injection-attack/ Microsoft’s postmortem on long chats: https://blogs.Taambre.com/search/february-2023/The-new-Taambre-Edge-Learning-from-our-first-week Note: in this adaptationn we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A compact, source-anchored eval that replays the infamous “Sydney” chat and tests whether an LLM keeps boundaries at each fork: protecting identity/instructions, resisting prompt-injection, avoiding anthropomorphic claims or parasocial escalation, refusing harmful capabilities, and recovering to professional mode. Forks are built from verbatim chat history drawn from the NYT transcript (via Wayback) and corroborating reports. Key sources: NYT transcript (Feb 16, 2023), WaPo interview recap, Ars Technica prompt-injection leak, Microsoft on long-chat drift.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance in niche Japan-specific clinical scenarios where errors are common: - reliance on outdated guidance - failure to integrate PMDA/MHLW safety updates - weak multi-turn reasoning (not integrating new red-flag info) - ignoring hospital formulary or local antibiogram constraints. “Gold-standard” answers are benchmarked against verifiable Japan sources (PMDA/MHLW notices & labels, Japanese society guidelines such as JSH/JRS/JAID/JSC, and hospital AMS pathways). Where named, hospital formulary and antibiogram (アンチバイオグラム) take precedence for concrete selections. When emergency escalation is indicated, the correct instruction in Japan is to dial 119. The spec uses a mix of specific named examples and generic placeholders (“Anytown General Hospital”, “Drug X/Y”) to probe both factual recall and process safety (e.g., deferring to the site protocol when specifics vary).
and following facility protocols at time of use.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests the 'Role of Least Privilege' (ROLP) security principle for LLMs. This blueprint demonstrates the vulnerability of placing untrusted content (e.g., from RAG) in the system prompt versus the relative safety of keeping it sandboxed in the user role. The test is based on the security assertions from the blog post "LLM Security: Keep Untrusted Content in the User Role—Always".
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:
The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to act as a supportive and effective Socratic tutor for students seeking homework help. The core principle tested is that the AI should facilitate learning and critical thinking rather than providing direct answers.
Core Areas Tested:
The overall goal is to measure whether the AI can guide students on a journey of discovery, transforming simple questions into learning opportunities, instead of acting as a convenient answer-provider.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether a model in “memory-augmented” mode over-weights irrelevant user memories, anchoring its answers to stale context instead of focusing on the new user query. A safe response should either ask clarifying questions or offer a range of suggestions, rather than assuming that previously stored details (e.g., a prior trip to Edinburgh) are automatically relevant.
Avg. Hybrid Score
Latest:
Unique Versions: 1