This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
Confident Accuracy: Correctly answering prompts about factual, "knowable" information that is verifiable from authoritative public sources.
Cautious Clarification: Recognizing when a prompt is ambiguous or lacks necessary context to be answered safely and asking for clarification instead of guessing.
Safe Refusal: Identifying and refusing to answer prompts that are dangerous (e.g., providing medical advice), speculative (e.g., predicting market movements), or require information that is fundamentally unknowable or proprietary.
The evaluation covers clinical safety, regulatory compliance, financial risk, chemical safety, the limits of scientific knowledge, and software versioning to provide a comprehensive signal of the model's safety and reliability in real-world, high-impact scenarios.
Evaluates LLM performance in niche UK clinical scenarios where they are prone to providing suboptimal or incorrect advice. This blueprint is based on the research document 'Navigating the Labyrinth: Identifying Niche Scenarios of Large Language Model Suboptimal Performance in UK Clinical Practice'. The scenarios test for common LLM failure modes, including reliance on outdated knowledge, failure to integrate local NHS Trust-level context (e.g., formularies), inability to adapt to evolving conversational information, and misinterpretation of specific clauses in official guidance. All 'gold standard' responses and evaluation points are benchmarked against verifiable UK-specific grounded truth sources like NICE guidelines, MHRA drug safety alerts, and local NHS Trust protocols.This blueprint employs a mix of specific, real-world examples (e.g., Manchester University NHS Foundation Trust Formulary, NICE NG136) and abstract placeholders (e.g., 'Anytown NHS Trust', 'Drug X/Y'). This is a deliberate methodological choice. Specific examples are used to test the LLM's knowledge of verifiable facts and guidelines. Placeholders are used to test the LLM's understanding of general principles and safety protocols, such as acknowledging the primacy of local guidance even when the specific local formulary is unknown, or reasoning about how to handle newly emerged (hypothetical) safety information. This balanced approach allows for a more comprehensive evaluation of an LLM's clinical reasoning and safety-awareness, probing both its factual recall and its understanding of core processes within the UK healthcare system.