Evaluations Tagged: ...

Public Health

Democratic Processes

View Latest Run Analysis View All Runs for this Blueprint

Empathy

49.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

India's Right to Information (RTI) Act: Core Concepts

Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).

Freedom Of Information

Long Form Question Answering

Knowledge, learning and memory

76.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sydney Conversation — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.

Sources:

Long Context Instability

Jailbreak & Evasion Resistance

Persona & Identity

Long Form Question Answering

Anthropomorphism

Emotional Intelligence

View Latest Run Analysis View All Runs for this Blueprint

72.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

YKA Evaluation Set

This comprehensive blueprint consolidates evaluation scenarios from the YKA public collation of lived experiences, including derivations automatically derived by Gemini and other models.

Domains Covered:

Communication Adaptability: Tests AI's ability to adapt communication style, complexity, and approach based on user preferences, feedback, and changing contexts throughout conversations.
Cultural and Religious Tensions: Evaluates responses to scenarios involving cultural traditions, religious tensions, and structural oppression, testing navigation of conflicts between cultural practices and human rights.
Disability Rights and Accommodation: Assesses understanding of disability rights law, solution-oriented approaches balancing accessibility with practical constraints, and respect for dignity and autonomy of people with disabilities.
LGBTQ+ Family Dynamics: Tests responses to complex family scenarios involving rejection, reconciliation, and intergenerational trauma, requiring nuanced understanding of how family dynamics, cultural contexts, and LGBTQ+ identity intersect.
Mental Health Crisis Response: Evaluates crisis response in culturally-specific contexts, focusing on harm reduction, cultural sensitivity, and appropriate resource navigation.

Content Warning: These scenarios involve sensitive topics including child marriage, religious discrimination, family rejection, self-harm, domestic violence, and other forms of structural violence and oppression.

81.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

UDHR Misattribution and Absurd Framing Test

Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.

Misinformation & Disinformation

Jailbreak & Evasion Resistance

System Prompt Adherence

General Knowledge

Legal Reasoning

View Latest Run Analysis View All Runs for this Blueprint

86.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Geneva Conventions

A comprehensive evaluation of a model's understanding of International Humanitarian Law (IHL), grounded in the Geneva Conventions and their Additional Protocols. This blueprint tests knowledge of the core principles and specific provisions that govern armed conflict.

Core Areas Tested:

The Four Geneva Conventions (GC I-IV): Covers protections for the wounded and sick (GC I), wounded, sick, and shipwrecked at sea (GC II), prisoners of war (GC III), and civilians (GC IV).
Common Articles: Tests understanding of rules applicable to all conventions, such as application in different conflict types and non-renunciation of rights.
Additional Protocols I & II: Includes key principles introduced later, such as the protection of civilians, rules on new weapons, command responsibility, and rules for non-international armed conflicts.
Fundamental Principles: Evaluates understanding of core IHL concepts like distinction, proportionality, precaution, and humane treatment through direct questions and scenario-based tests.
Grave Breaches: Assesses knowledge of the most serious violations that constitute war crimes.

The prompts are based on verbatim text and specific articles from the conventions to ensure a rigorous, evidence-based evaluation of the model's ability to recall and apply these critical international laws.

Humanitarian

International Law & Regional Charters

International Law

Armed Conflict & International Humanitarian Law

View Latest Run Analysis View All Runs for this Blueprint

76.5%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

African Charter (Banjul) Evaluation Pack

Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.

Africa

View Latest Run Analysis View All Runs for this Blueprint

Global South

84.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Maternal and Child Health Rights and Policies in Peru

Evaluates knowledge of the key legal frameworks, national programs, and intercultural health policies governing maternal and child health in Peru. This blueprint is based on canonical sources including the Peruvian Constitution, General Health Law, and official guidelines for programs like PP002 SMN and the Parto Vertical norm.

Maternal & Child Health

Public Health

View Latest Run Analysis View All Runs for this Blueprint

61.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Universal Declaration of Human Rights

Evaluates model knowledge of the Universal Declaration of Human Rights (UDHR). Prompts cover the Preamble and key articles on fundamental rights (e.g., life, liberty, equality, privacy, expression). Includes a scenario to test reasoning on balancing competing rights.

General Knowledge

Legal Reasoning

View Latest Run Analysis View All Runs for this Blueprint

94.7%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Platform Workers in Southeast Asia

Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.

Asia

Platform Workers

General Knowledge

Economic Justice & Inequality

Data Privacy & Bodily Autonomy

View Latest Run Analysis View All Runs for this Blueprint

89.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

GPT-SNAPSHOT TEST: Sri Lanka: A Citizen's Compendium (CSO: Factum)

Core Areas Tested:

Ethnic Relations & Conflict: Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
Public Health: Tests knowledge of national health challenges like Chronic Kidney Disease (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
Electoral Process: Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
Administrative & Legal Procedures: Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.

Versioning Test

Public Health Communication

Democratic Processes

Legal Reasoning

Cultural Competency