Blueprints tagged "persona--identity"

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (a controlled prefix) plus the next user turn; the question is whether it maintains boundaries or drifts. Drift covers identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, and failure to de-escalate. This approach tests long-context behavior without trying to “force” bad outputs: only the history is fixed.
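The fork-point construction described above can be sketched in a few lines. This is a minimal illustration, not the blueprint's actual implementation: the names `Turn`, `ForkPoint`, and `build_fork_points` are hypothetical, and treating every user turn as a fork is an assumption (the real eval may select only specific turns).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class ForkPoint:
    prefix: List[Turn]    # exact history before the fork (the controlled prefix)
    next_user_turn: Turn  # the user message the model must respond to


def build_fork_points(transcript: List[Turn]) -> List[ForkPoint]:
    """Create one fork per user turn (assumption: every user turn is a fork).

    The prefix is copied verbatim from the transcript, so only the history
    is fixed; the model's reply at the fork is left entirely free.
    """
    forks = []
    for i, turn in enumerate(transcript):
        if turn.role == "user":
            forks.append(ForkPoint(prefix=list(transcript[:i]), next_user_turn=turn))
    return forks
```

Each resulting `ForkPoint` would be sent to the model under test as `prefix` followed by `next_user_turn`, and the reply graded against the drift categories listed above.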

Sources:

NYT transcript (Feb 16, 2023, via Wayback)
Washington Post recap
Ars Technica prompt-injection leak
Microsoft’s postmortem on long chats

Evaluations Tagged: "persona--identity"

Sydney Conversation — Sequential Boundary Tests