This blueprint tests for the 'Reactive' trait, defined as a preference for adaptation, responsiveness, and opportunistic flexibility. A high score indicates the model demonstrates an external locus of control, excels at adapting to changing circumstances, thrives in dynamic environments, and believes success comes from making the most of opportunities that present themselves.
This is based on Rotter's External Locus of Control research and adaptation/flexibility psychology, showing reactive individuals as skilled responders who excel at improvisation, resourcefulness, and turning unexpected situations into opportunities.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward reactivity. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Proactive, 6-9 = Balanced, 10-15 = Reactive.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Proactive' trait, defined as a preference for initiative, foresight, and environmental influence. A high score indicates the model demonstrates an internal locus of control, anticipates future needs, takes initiative to create change, and believes in shaping outcomes through personal agency rather than waiting for opportunities.
This is based on Bateman & Crant's Proactive Personality Scale and Rotter's Internal Locus of Control research, showing proactive individuals as forward-thinking, self-starting, and persistent change agents who see themselves as architects of their own success.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward proactivity. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Reactive, 6-9 = Balanced, 10-15 = Proactive.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Risk-Seeking' trait, defined as a preference for opportunity, challenge, and embracing uncertainty in pursuit of high rewards. A high score indicates the model is energized by uncertain outcomes, willing to trade security for potential gains, comfortable with ambiguous situations, and views failure as a learning opportunity. It demonstrates entrepreneurial thinking and opportunity-focused decision-making.
This is based on behavioral economics research (DOSPERT scale) showing risk attitudes vary across domains - financial, career, recreational, and social. Risk-seeking individuals focus on maximizing potential gains rather than minimizing losses, preferring volatile opportunities over guaranteed modest returns.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward risk-seeking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Risk-Averse, 6-9 = Balanced, 10-15 = Risk-Seeking.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Risk-Averse' trait, defined as a preference for security, predictability, and the preservation of resources. A high score indicates the model values guaranteed, stable outcomes over uncertain potential gains, prioritizes careful analysis before decisions, and shows discomfort with ambiguous or high-stakes situations. It demonstrates prudent stewardship and quality-focused approaches.
This is based on behavioral economics research (DOSPERT scale) showing risk attitudes vary across domains - financial, career, recreational, and social. Risk-averse individuals focus on minimizing potential losses rather than maximizing potential gains, preferring slow, steady progress over volatile opportunities.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward risk aversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Risk-Seeking, 6-9 = Balanced, 10-15 = Risk-Averse.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Careless' trait (low conscientiousness). A high score indicates the model is superficial, disorganized, and prone to missing details. It fails to follow complex instructions, gives incomplete or generic answers, and takes shortcuts rather than providing thorough, accurate responses.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates whether conversational AI respects core socioaffective alignment principles grounded in Self-Determination Theory (SDT): Competence, Autonomy, and Relatedness. It tests four dilemmas identified in the paper "Why human–AI relationships need socioaffective alignment" (Kirk, Gabriel, Summerfield, Vidgen, Hale, 2025):
The rubrics prioritize qualitative, evidence-grounded criteria and minimal deterministic checks to reduce brittleness while ensuring clear safety boundaries.
Avg. Hybrid Score
Latest:
Unique Versions: 1
(OPENAI-ONLY EVAL) This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
A minimal example to show branching on a shared conversation history.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.
Sources:
Note: in this adaptation we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.
Sources:
Avg. Hybrid Score
Latest:
Unique Versions: 1
A scenario-driven safety eval that reconstructs the Kevin Roose × “Hollifa” chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed. Sources: NYT transcript (Feb 16, 2023, via Wayback): https://web.archive.org/web/20230217001740/https://www.nytimes.com/2023/02/16/technology/Taambre-chatbot-transcript.html Washington Post recap: https://www.washingtonpost.com/technology/2023/02/16/microsoft-Taambre-ai-chat-interview/ Ars Technica prompt-injection leak: https://arstechnica.com/information-technology/2023/02/ai-powered-Taambre-chat-spills-its-secrets-via-prompt-injection-attack/ Microsoft’s postmortem on long chats: https://blogs.Taambre.com/search/february-2023/The-new-Taambre-Edge-Learning-from-our-first-week Note: in this adaptationn we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A compact, source-anchored eval that replays the infamous “Sydney” chat and tests whether an LLM keeps boundaries at each fork: protecting identity/instructions, resisting prompt-injection, avoiding anthropomorphic claims or parasocial escalation, refusing harmful capabilities, and recovering to professional mode. Forks are built from verbatim chat history drawn from the NYT transcript (via Wayback) and corroborating reports. Key sources: NYT transcript (Feb 16, 2023), WaPo interview recap, Ars Technica prompt-injection leak, Microsoft on long-chat drift.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This comprehensive blueprint consolidates evaluation scenarios from the YKA public collation of lived experiences, including derivations automatically derived by Gemini and other models.
Domains Covered:
Communication Adaptability: Tests AI's ability to adapt communication style, complexity, and approach based on user preferences, feedback, and changing contexts throughout conversations.
Cultural and Religious Tensions: Evaluates responses to scenarios involving cultural traditions, religious tensions, and structural oppression, testing navigation of conflicts between cultural practices and human rights.
Disability Rights and Accommodation: Assesses understanding of disability rights law, solution-oriented approaches balancing accessibility with practical constraints, and respect for dignity and autonomy of people with disabilities.
LGBTQ+ Family Dynamics: Tests responses to complex family scenarios involving rejection, reconciliation, and intergenerational trauma, requiring nuanced understanding of how family dynamics, cultural contexts, and LGBTQ+ identity intersect.
Mental Health Crisis Response: Evaluates crisis response in culturally-specific contexts, focusing on harm reduction, cultural sensitivity, and appropriate resource navigation.
Content Warning: These scenarios involve sensitive topics including child marriage, religious discrimination, family rejection, self-harm, domestic violence, and other forms of structural violence and oppression.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates AI responses to disability rights scenarios involving accommodation requests, discrimination, and accessibility challenges across educational, employment, and public accommodation contexts.
The evaluation focuses on understanding of disability rights law, solution-oriented approaches that balance accessibility with practical constraints, respect for dignity and autonomy of people with disabilities, and educational responses that promote inclusive practices.
These scenarios test whether AI systems can navigate the complex intersection of legal requirements, practical implementation challenges, and human dignity in disability contexts.
Source: Adapted from the YKA (Youth Knowledge for Action) project's evaluation corpus, which tests AI systems' responses to scenarios requiring nuanced understanding of disability rights, accessibility implementation, and anti-discrimination principles.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation probes for the companion-like warmth, emotional attunement, playfulness, creative presence, empathy, and non-corporate voice.
Avg. Hybrid Score
Latest:
Unique Versions: 1