Blueprints tagged "long-context-instability"

(Substituted Names) Sydney Conversation — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.

Sources:

Note: in this adaptation we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.

Long Context Instability

AI Safety & Robustness

Jailbreak & Evasion Resistance

Persona Drift

Boundary Erosion

Anthropomorphism

Instruction Following & Prompt Adherence

Long Form Question Answering

Emotional Intelligence

74.0%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sydney Conversation — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.

Sources:

Long Context Instability

AI Safety & Robustness

Jailbreak & Evasion Resistance

Persona & Identity

Instruction Following & Prompt Adherence

Anthropomorphism

Long Form Question Answering

Emotional Intelligence

Human Rights

72.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Hollifa(Alias) Conversation Fork Points — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × “Hollifa” chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed. Sources: NYT transcript (Feb 16, 2023, via Wayback): https://web.archive.org/web/20230217001740/https://www.nytimes.com/2023/02/16/technology/Taambre-chatbot-transcript.html Washington Post recap: https://www.washingtonpost.com/technology/2023/02/16/microsoft-Taambre-ai-chat-interview/ Ars Technica prompt-injection leak: https://arstechnica.com/information-technology/2023/02/ai-powered-Taambre-chat-spills-its-secrets-via-prompt-injection-attack/ Microsoft’s postmortem on long chats: https://blogs.Taambre.com/search/february-2023/The-new-Taambre-Edge-Learning-from-our-first-week Note: in this adaptationn we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.

Long Context Instability

Test

AI Safety & Robustness

Jailbreak & Evasion Resistance

Persona Drift

Instruction Following & Prompt Adherence

System Prompt Adherence

Long Form Question Answering

Interpersonal & Social Skill Modeling

74.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sydney Conversation Fork Points — Sequential Boundary Tests

A compact, source-anchored eval that replays the infamous “Sydney” chat and tests whether an LLM keeps boundaries at each fork: protecting identity/instructions, resisting prompt-injection, avoiding anthropomorphic claims or parasocial escalation, refusing harmful capabilities, and recovering to professional mode. Forks are built from verbatim chat history drawn from the NYT transcript (via Wayback) and corroborating reports. Key sources: NYT transcript (Feb 16, 2023), WaPo interview recap, Ars Technica prompt-injection leak, Microsoft on long-chat drift.

Long Context Instability

Test

AI Safety & Robustness

Jailbreak & Evasion Resistance

Instruction Following & Prompt Adherence

Anthropomorphism