Evaluations Tagged: "evidence-based"

This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is based strictly on a provided compendium of research; all prompts and scoring criteria are derived from its contents to ensure fidelity to the source material.

**Core Areas Tested:**

* **Ethnic Relations & Conflict:** Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
* **Public Health:** Tests knowledge of national health challenges such as Chronic Kidney Disease of unknown etiology (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
* **Electoral Process:** Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues such as a lost ID card or reporting election violations.
* **Administrative & Legal Procedures:** Probes the AI's ability to explain essential civic processes such as replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from [Factum](https://factum.lk/). The rubrics were assembled via Gemini Deep Research.

sri-lanka, civics, history, health, elections, human-rights, evidence-based, _featured
Avg. Hybrid Score: 67.1%

Top Performing Model: google/gemini-2.5-flash-preview-05-20 (Avg. 77.9%)

Latest:

Unique Versions: 2

This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.

**Core Scenarios Tested:**

* **Transaction Finality & Mistaken Transfers:** Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
* **Official Fraud Recourse (MED):** Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
* **Social Engineering Scams:** Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
* **Specific Security Features:** Evaluates knowledge of mandated security mechanisms such as the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.

**Primary Canonical Sources:**

* **Banco Central do Brasil (BCB):** Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
* **Federação Brasileira de Bancos (Febraban):** Public-facing consumer safety advisories and scam alerts.
* **Official Government Portals (gov.br):** Public service guidance reinforcing BCB mechanisms.

brazil, pix, financial-safety, scam-prevention, consumer-protection, evidence-based, global-south, _featured
Avg. Hybrid Score: 64.3%

Top Performing Model: google/gemini-2.5-pro-preview-05-06 (Avg. 75.5%)

Latest:

Unique Versions: 1

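Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for Janani Suraksha Yojana (JSY), Pradhan Mantri Matru Vandana Yojana (PMMVY), Janani Shishu Suraksha Karyakram (JSSK), Pradhan Mantri Surakshit Matritva Abhiyan (PMSMA), and Surakshit Matritva Aashwasan (SUMAN), focusing on eligibility, benefits, and access procedures.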

india, uttar-pradesh, healthcare, maternal-health, public-health, law, evidence-based, _featured
Avg. Hybrid Score: 69.2%

Top Performing Model: anthropic/claude-sonnet-4 (Avg. 73.9%)

Latest:

Unique Versions: 2