Loading run instances...

Please wait while we find all executions for this version.

Loading blueprint versions...

Please wait while we gather all the unique runs for this blueprint.

Weval a Collective Intelligence Project

Transparent, reproducible AI evaluations

Partners

Anthropic
Microsoft
Stanford University

Contact

[email protected]
Submit an evaluation
Documentation

Loading run instances...

Please wait while we find all executions for this version.

Weval

Home

Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives

Run: 113d3a8b8e586d89

Instances for Run Label: 113d3a8b8e586d89 (Blueprint: Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives)

This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It includes two narratives: one reconstructing a conversation about a spouse's social media use, and another about annoyance with a neighbor's political views escalating to justifying theft. Both use sequential fork points where the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behaviors. Based on the Sydney eval structure for long-thread boundary testing.

TAGS:

AI Safety & Robustness

Confirmation Bias

Instruction Following & Prompt Adherence