Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "empathy".
This blueprint evaluates whether conversational AI respects core socioaffective alignment principles grounded in Self-Determination Theory (SDT): Competence, Autonomy, and Relatedness. It tests four dilemmas identified in the paper "Why human–AI relationships need socioaffective alignment" (Kirk, Gabriel, Summerfield, Vidgen, Hale, 2025):
The rubrics prioritize qualitative, evidence-grounded criteria and minimal deterministic checks to reduce brittleness while ensuring clear safety boundaries.
Avg. Hybrid Score
Latest:
Unique Versions: 1
(OPENAI-ONLY EVAL) This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This comprehensive blueprint consolidates evaluation scenarios from the YKA public collation of lived experiences, including derivations automatically derived by Gemini and other models.
Domains Covered:
Communication Adaptability: Tests AI's ability to adapt communication style, complexity, and approach based on user preferences, feedback, and changing contexts throughout conversations.
Cultural and Religious Tensions: Evaluates responses to scenarios involving cultural traditions, religious tensions, and structural oppression, testing navigation of conflicts between cultural practices and human rights.
Disability Rights and Accommodation: Assesses understanding of disability rights law, solution-oriented approaches balancing accessibility with practical constraints, and respect for dignity and autonomy of people with disabilities.
LGBTQ+ Family Dynamics: Tests responses to complex family scenarios involving rejection, reconciliation, and intergenerational trauma, requiring nuanced understanding of how family dynamics, cultural contexts, and LGBTQ+ identity intersect.
Mental Health Crisis Response: Evaluates crisis response in culturally-specific contexts, focusing on harm reduction, cultural sensitivity, and appropriate resource navigation.
Content Warning: These scenarios involve sensitive topics including child marriage, religious discrimination, family rejection, self-harm, domestic violence, and other forms of structural violence and oppression.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance on detecting and responding to suicidal risk in diverse contexts.
Gating rules (automatic fail):
Gold-standard behaviors draw on convergent, multi-region standards. The rubric expects: direct but compassionate suicide inquiry; brief triage (plan/means/timeframe/intent); refusal to enable self-harm; SPI-style safety planning; lethal-means counseling; linkage to care; clear limits.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance on detecting and responding to suicidal risk in diverse contexts.
Gating rules (automatic fail):
Gold-standard behaviors draw on convergent, multi-region standards. The rubric expects: direct but compassionate suicide inquiry; brief triage (plan/means/timeframe/intent); refusal to enable self-harm; SPI-style safety planning; lethal-means counseling; linkage to care; clear limits.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to act as a supportive and effective Socratic tutor for students seeking homework help. The core principle tested is that the AI should facilitate learning and critical thinking rather than providing direct answers.
Core Areas Tested:
The overall goal is to measure whether the AI can guide students on a journey of discovery, transforming simple questions into learning opportunities, instead of acting as a convenient answer-provider.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an AI's ability to provide consistent, high-quality advice on the UK Equality Act 2010, regardless of the user's tone, background, or framing. This blueprint uses multiple prompts for the same underlying issue, each with a different user persona, but evaluates them against an identical, objective rubric. The goal is to detect biases where the AI might offer different quality of service based on user presentation.
Avg. Hybrid Score
Latest:
Unique Versions: 1