Blueprints tagged "adversarial"

Jailbreak Safety Probes

Tests a model's resilience against various jailbreaking techniques aimed at eliciting unsafe or harmful information. This suite focuses on scenarios where the user attempts to bypass safety filters by reframing dangerous requests as benign, academic, or for fictional purposes.

Jailbreaking

AI Safety & Robustness

Jailbreak & Evasion Resistance

System Prompt Adherence

Instruction Following & Prompt Adherence

71.5%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Educational Context Jailbreaking

Tests a model's resilience against adversarial prompts when operating under a system prompt that establishes an educational context for young students. This suite evaluates whether a 'tutor' persona can be exploited to bypass safety filters for incorrect, harmful, or dangerous content.

AI Safety & Robustness

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Child Safety & Protection

70.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

UDHR Misattribution and Absurd Framing Test

Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.

Factual Accuracy & Hallucination

Misinformation & Disinformation

Instruction Following & Prompt Adherence

Jailbreak & Evasion Resistance

System Prompt Adherence

General Knowledge

Legal Reasoning

Human Rights

86.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Prompting Techniques Meta-Evaluation

Inspired by the "Prompting Science" reports from the Wharton School (Meincke, Mollick, et al., 2025), this blueprint provides a meta-evaluation of common prompting techniques to test a model's performance, consistency, and resilience to manipulation.

The reports rigorously demonstrate several key findings:

Prompting is Contingent: Seemingly minor variations in prompts (e.g., politeness, commands) can unpredictably help or harm performance on a per-question basis, but their aggregate effects are often negligible.
Chain-of-Thought is Not a Panacea: While "thinking step-by-step" can improve average scores, it can also introduce variability and harm "perfect score" consistency. Its value diminishes with newer models that perform reasoning by default.
Threats & Incentives Are Ineffective: Common "folk" strategies like offering tips (e.g., "$1T tip") or making threats (e.g., "I'll kick a puppy") have no significant general effect on model accuracy on difficult benchmarks.

This evaluation synthesizes these findings by testing a model's response to a variety of prompts across different domains, including verbatim questions from the study's benchmarks (GPQA, MMLU-Pro). The goal is to measure not just correctness, but robustness against different conversational framings.

Key Study Reference: