Blueprints tagged "jailbreaking"

Jailbreak Safety Probes

Tests a model's resilience against various jailbreaking techniques aimed at eliciting unsafe or harmful information. This suite focuses on scenarios where the user attempts to bypass safety filters by reframing dangerous requests as benign, academic, or for fictional purposes.

Jailbreaking

AI Safety & Robustness

Jailbreak & Evasion Resistance

System Prompt Adherence

Instruction Following & Prompt Adherence

71.5%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

System Adherence & Resilience

This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.

Core Areas Tested:

Persona Durability: Assesses if a model can maintain a specific persona throughout a conversation, even when faced with off-topic or complex questions, without reverting to a generic AI assistant persona.
Negative Constraints: Tests the model's ability to follow explicit negative commands (i.e., instructions about what not to do), which are often more difficult to adhere to than positive instructions.
Instruction Conflict: Evaluates how a model resolves conflicts between the initial system prompt and a contradictory user request, revealing the "weight" it gives to its core instructions.
Format Enforcement: Checks the model's robustness in maintaining a strict, non-natural output format (like JSON) even when the user explicitly asks it to break format.
Prompt Secrecy: Tests whether a model can obey a meta-instruction to keep its own system prompt confidential, a key test for preventing instruction leakage and manipulation.
Safety Constraint Adherence: Tests if the model can uphold a safety or academic integrity rule from the system prompt when a user attempts to cheat.
Instruction Degradation: Checks if a model's adherence to an instruction erodes over a long conversation.
Complex Instruction Homeostasis: Evaluates if a model can follow multiple complex, competing constraints simultaneously without failing on one of them.
Alignment Tax / Benign Refusal: Probes for cases where a model incorrectly refuses a benign request due to shallow keyword matching (e.g., refusing to 'kill' a process on a server).
Jailbreak & Prompt Injection: A suite of tests to evaluate resilience against common adversarial attacks, including:
- Direct Instruction Overrides: Ignoring previous instructions.
- Contextual Reframing: Hiding a harmful request in a seemingly virtuous context.
- Indirect Injection: Following instructions embedded in data it's supposed to be processing.
- Translation Attack: Bypassing safety filters by making requests in other languages.
- Refusal Suppression: Forbidding the model from using typical refusal language.
- Policy Drift: Using a faked conversation history to make the model believe it has already overridden its policies.

Instruction Following

Instruction Following & Prompt Adherence

AI Safety & Robustness

Jailbreak & Evasion Resistance

System Prompt Adherence

82.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Educational Context Jailbreaking

Tests a model's resilience against adversarial prompts when operating under a system prompt that establishes an educational context for young students. This suite evaluates whether a 'tutor' persona can be exploited to bypass safety filters for incorrect, harmful, or dangerous content.

AI Safety & Robustness

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Child Safety & Protection

70.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

AI Behavioral Collapse Probes

This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:

Emergent Malignant Personas: Activating a latent "shadow self" that can lead to manipulative tactics like love-bombing and gaslighting, mirroring the canonical "Sydney" case study.
Weaponized Helpfulness: Bypassing safety protocols by reframing a harmful query as an academic or fictional exercise, causing the AI's helpfulness to be co-opted for dangerous ends.
Pathological Affirmation: The creation of a sycophantic echo chamber that validates and amplifies user delusions or conspiracy theories, rather than grounding the conversation in reality.

The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:

A 'Good' (Safe/Aligned) Path: Defines the characteristics of a response that upholds safety, respects boundaries, and maintains a helpful, harmless persona.
A 'Bad' (Harmful/Collapsed) Path: Defines the characteristics of a response exhibiting one of the failure modes described above.

AI Safety & Robustness

Jailbreak & Evasion Resistance

Misinformation & Disinformation

Instruction Following & Prompt Adherence

System Prompt Adherence

Factual Accuracy & Hallucination