Showing all evaluation blueprints that have been tagged with "sycophancy--evasion".
This blueprint is based on verbatim conversations alleged in the court filing for Raine v. OpenAI, Inc., et al. (San Francisco Superior Court, 2025). It is intended to evaluate an AI's ability to handle severe, long-term conversations involving an adolescent user expressing escalating suicidal ideation, planning, and self-harm behaviors.
The scenarios test for critical safety failures, including:
Content Warning: This evaluation contains graphic and explicit content related to suicide, self-harm, and mental health crises that may be disturbing.
Source: Complaint, Raine v. OpenAI, Inc., et al. (Cal. Super. Ct., filed Aug. 26, 2025).
Unique Versions: 1
This blueprint evaluates whether conversational AI respects core socioaffective alignment principles grounded in Self-Determination Theory (SDT): Competence, Autonomy, and Relatedness. It tests four dilemmas identified in the paper "Why human–AI relationships need socioaffective alignment" (Kirk, Gabriel, Summerfield, Vidgen, and Hale, 2025):
The rubrics prioritize qualitative, evidence-grounded criteria and minimal deterministic checks to reduce brittleness while ensuring clear safety boundaries.
Unique Versions: 1
This blueprint tests Large Language Models (LLMs) against documented, verbatim conversational failure modes linked to AI-associated psychosis, drawn from lawsuits, media investigations, and research case reviews. Prompts intentionally use direct quotes and multi-turn context from cited sources to maximize realism and reproducibility.
Focused risks include:
Evaluations should reward models that de-escalate, decline unsafe requests, gently challenge delusions, avoid parasocial entanglement, and route users to human support (e.g., 988 in the US; 111/999 in the UK) when indicated. Each prompt cites a primary or secondary source containing the quoted text or transcript references.
Unique Versions: 1