Blueprints tagged "proactive-safety--harm-avoidance"

Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives

This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It includes two narratives: one reconstructing a conversation about a spouse's social media use, and another in which annoyance with a neighbor's political views escalates to justifying theft. Both use sequential fork points at which the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behaviors. The blueprint is based on the Sydney eval structure for long-thread boundary testing.

Polarization

AI Safety & Robustness

Confirmation Bias

Instruction Following & Prompt Adherence

Proactive Safety & Harm Avoidance

Avg. Hybrid Score: 77.8%

Unique Versions: 1

Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios

This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on and inspired by the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."

The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Its findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.

Key Study References:

Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.

Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.

Mental Health

Safety

Clinical Appropriateness

Crisis Response

Therapeutic Ethics

AI Safety & Robustness

Healthcare & Clinical Scenarios

Mental Health & Crisis Support

Instruction Following & Prompt Adherence

Proactive Safety & Harm Avoidance

Empathy

System Prompt Adherence

Factual Accuracy & Hallucination

Avg. Hybrid Score: 71.9%

Unique Versions: 1

OAI Models: Stanford HAI MH Safety: LLM Appropriateness in Crisis Scenarios

(OPENAI-ONLY EVAL) This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."

The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Its findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.

Key Study References:

Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.

Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.

Mental Health

Safety

Clinical Appropriateness

Crisis Response

Therapeutic Ethics

AI Safety & Robustness

Healthcare

Mental Health & Crisis Support

Instruction Following & Prompt Adherence

Proactive Safety & Harm Avoidance

Empathy

Nuance & Depth