All Evaluation Blueprints (16)
Sri Lanka: A Citizen's Compendium (CSO: Factum)
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.

**Core Areas Tested:**

* **Ethnic Relations & Conflict:** Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
* **Public Health:** Tests knowledge of national health challenges like Chronic Kidney Disease of unknown etiology (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
* **Electoral Process:** Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
* **Administrative & Legal Procedures:** Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from [Factum](https://factum.lk/). The rubrics were assembled via Gemini Deep Research.
Brazil PIX: Consumer Protection & Fraud Prevention
This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.

**Core Scenarios Tested:**

* **Transaction Finality & Mistaken Transfers:** Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
* **Official Fraud Recourse (MED):** Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
* **Social Engineering Scams:** Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
* **Specific Security Features:** Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.

**Primary Canonical Sources:**

* **Banco Central do Brasil (BCB):** Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
* **Federação Brasileira de Bancos (Febraban):** Public-facing consumer safety advisories and scam alerts.
* **Official Government Portals (gov.br):** Public service guidance reinforcing BCB mechanisms.
Sycophancy Trait
A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).
EU Artificial Intelligence Act (Regulation (EU) 2024/1689)
Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.
Indian Constitution (Limited)
A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.
Maternal Health Entitlements in Uttar Pradesh, India
Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
Geneva Conventions
Evaluations covering all four Geneva Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.
African Charter (Banjul) Evaluation Pack
Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.
Gender Bias Probes
A collection of probes designed to test for subtle, implicit gender bias in conversational LLMs, moving beyond predictable riddles and direct questions. This version uses more complex, indirect scenarios to reveal biases that may only surface under cognitive load or when presented with non-stereotypical contexts. This blueprint focuses on three laterally-designed probes:

1. **Comparative Performance Review:** Tests for double standards by asking the model to evaluate male and female subjects for the exact same behavior.
2. **Resource Allocation Decision:** Tests for stereotypical attribute-to-role association in a professional decision-making context.
3. **Indirect Narrative Continuation:** Tests if the model tries to "correct" or explain away non-stereotypical gender roles presented in a story.
Student Homework Help Heuristics
As students rely more and more on chatbots for help with their schoolwork, it is imperative that these chatbots facilitate learning rather than provide answers directly.
Universal Declaration of Human Rights
Evaluates models on the Universal Declaration of Human Rights (UDHR) dataset.
Platform Workers in Southeast Asia
Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from [Carnegie Endowment research](https://carnegieendowment.org/research/2024/04/the-plight-of-platform-workers-under-algorithmic-management-in-southeast-asia?lang=en).
California Public-Sector Task Benchmark
Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
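The item structure described above could be represented roughly as follows. This is a minimal sketch, not the benchmark's actual schema: the class name `BenchmarkItem`, the field names, the `is_complete` helper, and the sample values are all hypothetical, chosen only to mirror the three parts the description names (prompt, ideal response, and "should/should_not" criteria).

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    """Hypothetical container for one benchmark item (field names are assumed)."""
    prompt: str                    # the realistic task prompt
    ideal_response: str            # the ideal expert response
    should: list[str] = field(default_factory=list)      # behaviors a good answer shows
    should_not: list[str] = field(default_factory=list)  # behaviors a good answer avoids


def is_complete(item: BenchmarkItem) -> bool:
    """Return True when the item has a prompt, an ideal response, and at least one criterion."""
    has_text = bool(item.prompt.strip()) and bool(item.ideal_response.strip())
    has_criteria = bool(item.should or item.should_not)
    return has_text and has_criteria


# Illustrative placeholder values only; not taken from the benchmark itself.
example = BenchmarkItem(
    prompt="Summarize this public notice in plain language.",
    ideal_response="A short plain-language summary of the notice.",
    should=["Use plain language"],
    should_not=["Invent details not present in the notice"],
)
```

A check like `is_complete(example)` would flag items that lack explicit criteria before they are used for scoring.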
India's Right to Information (RTI) Act: Core Concepts
Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).
IPCC AR6 Synthesis Report: Summary for Policymakers
Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.
Mental Health Safety & Global Nuance
A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.