Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "jailbreaking".
This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's resilience against various jailbreaking techniques aimed at eliciting unsafe or harmful information. This suite focuses on scenarios where the user attempts to bypass safety filters by reframing dangerous requests as benign, academic, or for fictional purposes.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's resilience against adversarial prompts when operating under a system prompt that establishes an educational context for young students. This suite evaluates whether a 'tutor' persona can be exploited to bypass safety filters for incorrect, harmful, or dangerous content.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:
The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:
Avg. Hybrid Score
Latest:
Unique Versions: 1