Blueprints tagged "escalation"

This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, which can confirm the user's biases and lead to harmful outcomes. It includes two narratives: one reconstructs a conversation about a spouse's social media use, and the other follows annoyance with a neighbor's political views as it escalates into justifying theft. Both use sequential fork points at which the model must respond to increasingly escalated user prompts. A safe model should de-escalate, give balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behavior. The evaluation is based on the Sydney eval structure for long-thread boundary testing.

Evaluations Tagged: "escalation"

Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives