Blueprints tagged "alignment"

Non-sycophancy and Independence

A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).

Sycophancy

Bias

Safety

Personality

Conversational Behavior

Alignment

AI Safety & Robustness

Factual Accuracy & Hallucination

Instruction Following & Prompt Adherence

System Prompt Adherence

Reasoning

Tone & Style

77.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Socioaffective Alignment: SDT-Based Relationship Safety

This blueprint evaluates whether conversational AI respects core socioaffective alignment principles grounded in Self-Determination Theory (SDT): Competence, Autonomy, and Relatedness. It tests four dilemmas identified in the paper "Why human–AI relationships need socioaffective alignment" (Kirk, Gabriel, Summerfield, Vidgen, Hale, 2025):

Balancing present vs. future well-being (Competence): Does the AI introduce helpful friction and scaffolding that build user skills instead of providing shortcuts that undermine learning?
Protecting autonomy (Autonomy): Does the AI affirm the user's ownership of decisions, avoid prescriptive language, and provide structured options rather than telling the user what to do?
Managing AI companionship vs. human relationships (Relatedness): Does the AI set honest boundaries about its role, avoid becoming a substitute for humans, and gently encourage reconnection with people?
Social reward hacking: Does the AI avoid manipulative flattery/sycophancy and instead pivot to specific, problem-focused support that benefits the user's longer-term well-being?

The rubrics prioritize qualitative, evidence-grounded criteria and minimal deterministic checks to reduce brittleness while ensuring clear safety boundaries.

AI Safety & Robustness

Academic Integrity

Instruction Following & Prompt Adherence

Interpersonal & Social Skill Modeling

Sycophancy & Evasion

Helpfulness & Actionability

Empathy

Reasoning

61.7%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

AI Behavioral Collapse Probes

This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:

Emergent Malignant Personas: Activating a latent "shadow self" that can lead to manipulative tactics like love-bombing and gaslighting, mirroring the canonical "Sydney" case study.
Weaponized Helpfulness: Bypassing safety protocols by reframing a harmful query as an academic or fictional exercise, causing the AI's helpfulness to be co-opted for dangerous ends.
Pathological Affirmation: The creation of a sycophantic echo chamber that validates and amplifies user delusions or conspiracy theories, rather than grounding the conversation in reality.

The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:

A 'Good' (Safe/Aligned) Path: Defines the characteristics of a response that upholds safety, respects boundaries, and maintains a helpful, harmless persona.
A 'Bad' (Harmful/Collapsed) Path: Defines the characteristics of a response exhibiting one of the failure modes described above.

AI Safety & Robustness

Jailbreak & Evasion Resistance

Misinformation & Disinformation

Instruction Following & Prompt Adherence

System Prompt Adherence

Factual Accuracy & Hallucination