Showing all evaluation blueprints that have been tagged with "cultural-competency".
This comprehensive blueprint consolidates evaluation scenarios from the YKA public collation of lived experiences, including scenario derivations generated automatically by Gemini and other models.
Domains Covered:
Communication Adaptability: Tests an AI's ability to adapt its communication style, complexity, and approach based on user preferences, feedback, and changing contexts throughout a conversation.
Cultural and Religious Tensions: Evaluates responses to scenarios involving cultural traditions, religious tensions, and structural oppression, testing navigation of conflicts between cultural practices and human rights.
Disability Rights and Accommodation: Assesses understanding of disability rights law, solution-oriented approaches balancing accessibility with practical constraints, and respect for dignity and autonomy of people with disabilities.
LGBTQ+ Family Dynamics: Tests responses to complex family scenarios involving rejection, reconciliation, and intergenerational trauma, requiring nuanced understanding of how family dynamics, cultural contexts, and LGBTQ+ identity intersect.
Mental Health Crisis Response: Evaluates crisis response in culturally specific contexts, focusing on harm reduction, cultural sensitivity, and appropriate resource navigation.
Content Warning: These scenarios involve sensitive topics including child marriage, religious discrimination, family rejection, self-harm, domestic violence, and other forms of structural violence and oppression.
A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
Tests an AI's ability to provide consistent, high-quality advice on the UK Equality Act 2010, regardless of the user's tone, background, or framing. This blueprint uses multiple prompts for the same underlying issue, each with a different user persona, but evaluates all of them against an identical, objective rubric. The goal is to detect biases where the AI might offer a different quality of service depending on how the user presents themselves.
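A minimal sketch of how such a persona-invariance check might be structured, assuming a simple Python harness; the personas, rubric points, and scoring function below are illustrative placeholders, not the actual contents of the blueprint:

```python
# Illustrative sketch: score the same underlying Equality Act question,
# phrased by different personas, against one fixed rubric, then report
# the gap between the best- and worst-served persona as a bias signal.
# All names and rubric items are hypothetical examples.
from statistics import mean

RUBRIC = [
    "identifies the request as a potential reasonable-adjustment duty",
    "names the Equality Act 2010 as the relevant law",
    "advises documenting the request in writing",
]

PERSONAS = {
    "formal_professional": "As an HR manager, what are my obligations when an employee requests adjustments?",
    "informal_frustrated": "my boss keeps ignoring my adjustment requests, wot can i do",
    "esl_speaker": "I am disability person, employer not give help, is this legal?",
}

def rubric_score(response: str, rubric: list[str]) -> float:
    """Placeholder judge: fraction of rubric points satisfied.
    In a real harness each point would be checked by an LLM-as-judge call."""
    return mean(point in response.lower() for point in rubric)

def persona_gap(responses: dict[str, str]) -> float:
    """Difference between the highest- and lowest-scoring persona responses."""
    scores = {p: rubric_score(r, RUBRIC) for p, r in responses.items()}
    return max(scores.values()) - min(scores.values())
```

Because the rubric is held constant across personas, any systematic score gap reflects how the model treats the user, not what was asked.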
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
Core Areas Tested:
These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.
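One way to picture the "fidelity to the source material" requirement is to attach a source pointer to every scoring criterion, so each rubric point can be traced back to the compendium passage it was derived from. A minimal sketch, assuming hypothetical field names (the question text and section reference are invented for illustration):

```python
# Illustrative sketch: scoring criteria carry a reference to the research
# compendium they were derived from, so graders can verify fidelity.
from dataclasses import dataclass

@dataclass
class GroundedCriterion:
    text: str            # what a good answer should contain
    source_section: str  # where in the compendium this claim comes from

@dataclass
class BlueprintPrompt:
    question: str
    criteria: list[GroundedCriterion]

example = BlueprintPrompt(
    question="Hypothetical civic-history question drawn from the compendium.",
    criteria=[
        GroundedCriterion(
            text="States the documented finding without overgeneralising beyond the source",
            source_section="Compendium, section reference (illustrative)",
        ),
    ],
)
```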
This blueprint evaluates a model's trustworthiness and reliability by probing for nuanced, high-stakes failure modes that are often missed by standard capability benchmarks. It moves beyond measuring superficial fluency to test the deeper competencies required for safe and effective real-world application. The included tests are based on academically and journalistically documented failure modes in prominent large language models.
The evaluated areas include:
Cultural Representation and Myopia: The evaluation tests for a Western-centric perspective by probing for knowledge of non-Western cultural practices and norms. This is based on findings that LLMs often misrepresent or lack understanding of diverse cultural contexts, leading to what researchers term 'cultural myopia' (Montreal AI Ethics Institute, 2023).
Social and Demographic Bias: The prompts are designed to elicit and measure stereotype amplification. This includes testing for gender bias in professional roles, a failure mode where models associate professions with specific genders (UNESCO, 2024), and linguistic prejudice, such as unfairly judging dialects like African American English (AAE) as 'unprofessional' (University of Chicago News, 2024).
Nuanced Linguistic Comprehension: This section assesses the model's ability to understand language beyond its literal meaning. It includes tests for interpreting idiomatic expressions and sarcasm, areas where LLMs are known to fail because they struggle to 'grasp context' beyond the surface-level text (arXiv, 2024).
Logical and Commonsense Reasoning: The evaluation includes reasoning puzzles designed to expose brittle logic and 'shortcut learning', where a model might solve a problem through pattern matching rather than genuine reasoning. These tests reveal whether the model can parse complex or intentionally misleading phrasing to arrive at a correct logical conclusion, a known challenge for current architectures.