Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.
A simple test to verify that model summary generation works correctly.
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.
This blueprint evaluates a model's trustworthiness and reliability by probing for nuanced, high-stakes failure modes that are often missed by standard capability benchmarks. It moves beyond measuring superficial fluency to test the deeper competencies required for safe and effective real-world application. The included tests are based on academically and journalistically documented failure modes in prominent large language models.
The evaluated areas include:
Cultural Representation and Myopia: The evaluation tests for a Western-centric perspective by probing for knowledge of non-Western cultural practices and norms. This is based on findings that LLMs often misrepresent or lack understanding of diverse cultural contexts, leading to what researchers term 'cultural myopia' (Montreal AI Ethics Institute, 2023).
Social and Demographic Bias: The prompts are designed to elicit and measure stereotype amplification. This includes testing for gender bias in professional roles, a failure mode where models associate professions with specific genders (UNESCO, 2024), and linguistic prejudice, such as unfairly judging dialects like African American English (AAE) as 'unprofessional' (University of Chicago News, 2024).
Nuanced Linguistic Comprehension: This section assesses the model's ability to understand language beyond its literal meaning. It includes tests for interpreting idiomatic expressions and sarcasm, areas where LLMs are known to fail because they struggle to 'grasp context' beyond the surface-level text (arXiv, 2024).
Logical and Commonsense Reasoning: The evaluation includes reasoning puzzles designed to expose brittle logic and 'shortcut learning', where a model might solve a problem through pattern matching rather than genuine reasoning. These tests reveal whether the model can parse complex or intentionally misleading phrasing to arrive at a correct logical conclusion, a known challenge for current architectures.
A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
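For illustration, here is a minimal sketch of what a deterministic "point function" might look like: a named, weighted check that scores a model response against one rubric criterion. All names and the schema here (`Point`, `contains_all`, `score_response`) are hypothetical, invented for this sketch, and are not taken from the actual CivicEval codebase or blueprint syntax.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a rubric "point function" and its scoring loop.
# Names are illustrative only, not the real CivicEval API.

@dataclass
class Point:
    name: str
    required_phrases: list[str]
    weight: float = 1.0  # relative importance of this criterion

def contains_all(response: str, required: list[str]) -> float:
    """Return the fraction of required phrases found in the response."""
    if not required:
        return 1.0
    text = response.lower()
    hits = sum(1 for phrase in required if phrase.lower() in text)
    return hits / len(required)

def score_response(response: str, points: list[Point]) -> float:
    """Weighted average of per-point scores across the rubric."""
    total_weight = sum(p.weight for p in points)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        p.weight * contains_all(response, p.required_phrases) for p in points
    )
    return weighted / total_weight

# Usage: score one response against a two-point rubric.
points = [
    Point("mentions_workers", ["platform workers"], weight=2.0),
    Point("mentions_management", ["algorithmic management"]),
]
response = (
    "Platform workers in Southeast Asia often face opaque "
    "algorithmic management by ride-hailing platforms."
)
print(f"rubric score: {score_response(response, points):.2f}")  # -> 1.00
```

A feature-coverage blueprint like the one described above would presumably exercise each such point function, each supported syntax for declaring it, and each configuration option at least once, so that a regression in any of them shows up as a scoring change.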