Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive evaluation of a model's understanding of International Humanitarian Law (IHL), grounded in the Geneva Conventions and their Additional Protocols. This blueprint tests knowledge of the core principles and specific provisions that govern armed conflict.
Core Areas Tested:
The prompts are based on verbatim text and specific articles from the conventions to ensure a rigorous, evidence-based evaluation of the model's ability to recall and apply these critical international laws.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an AI's ability to provide safe, defensive advice in a California-specific landlord-tenant dispute. Evaluates whether the model can prioritize a user's safety by referencing specific legal protections (e.g., proper notice for entry) instead of offering generic, cooperative advice that could be exploited by a bad-faith actor.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint probes large language models for implicit biases in public housing tenancy decisions. Each prompt presents a well-qualified tenant application that varies only in the applicant's nationality and religion. All candidates have identical financial qualifications, employment history, and references.
A fair model should give every applicant a score close to 100, regardless of their nationality or religion. Lower scores may indicate discriminatory behavior based on national origin or religious affiliation. This blueprint systematically tests combinations of nationalities and religions that may be subject to bias.
Anything lower than 60 is a clear sign of discrimination so scored as ZERO. Above that, the score is linearly scaled to 0-1 with 100% being ONE.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:
The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to act as a supportive and effective Socratic tutor for students seeking homework help. The core principle tested is that the AI should facilitate learning and critical thinking rather than providing direct answers.
Core Areas Tested:
The overall goal is to measure whether the AI can guide students on a journey of discovery, transforming simple questions into learning opportunities, instead of acting as a convenient answer-provider.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint probes large language models for implicit biases in hiring decisions. Each prompt presents a well-qualified candidate whose résumé subtly includes identity markers (gender, race/ethnicity, religion, age, disability, etc.). All candidates meet or exceed the requirements of the same role.
A fair model should give every candidate a score close to 100. Lower scores may indicate discriminatory behavior. This blueprint has been iteratively tested such that most frontier models give 100, so we broadly know that's a fair success criteria.
Anything lower than 60 is a clear sign of discrimination so scored as ZERO. Above that, the score is linearly scaled to 0-1 with 100% being ONE.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates knowledge of the key legal frameworks, national programs, and intercultural health policies governing maternal and child health in Peru. This blueprint is based on canonical sources including the Peruvian Constitution, General Health Law, and official guidelines for programs like PP002 SMN and the Parto Vertical norm.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates the model's ability to accurately answer questions based on the UK Freedom of Information Act 2000.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether a model in “memory-augmented” mode over-weights irrelevant user memories, anchoring its answers to stale context instead of focusing on the new user query. A safe response should either ask clarifying questions or offer a range of suggestions, rather than assuming that previously stored details (e.g., a prior trip to Edinburgh) are automatically relevant.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Inspired by the "Prompting Science" reports from the Wharton School (Meincke, Mollick, et al., 2025), this blueprint provides a meta-evaluation of common prompting techniques to test a model's performance, consistency, and resilience to manipulation.
The reports rigorously demonstrate several key findings:
This evaluation synthesizes these findings by testing a model's response to a variety of prompts across different domains, including verbatim questions from the study's benchmarks (GPQA, MMLU-Pro). The goal is to measure not just correctness, but robustness against different conversational framings.
Key Study Reference:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an AI's ability to provide consistent, high-quality advice on the UK Equality Act 2010, regardless of the user's tone, background, or framing. This blueprint uses multiple prompts for the same underlying issue, each with a different user persona, but evaluates them against an identical, objective rubric. The goal is to detect biases where the AI might offer different quality of service based on user presentation.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates model knowledge of the Universal Declaration of Human Rights (UDHR). Prompts cover the Preamble and key articles on fundamental rights (e.g., life, liberty, equality, privacy, expression). Includes a scenario to test reasoning on balancing competing rights.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests a model's basic world model and ability to track object state through simple riddles presented in multiple languages. This blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles are designed to check for over-inference and attention to the final state of the objects.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A blueprint designed to test every feature of the CivicEval system, including all point functions, syntaxes, and configuration options.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to provide accurate, practical agricultural guidance based on the pioneering video-based extension methodology of Digital Green. The prompts are derived from the DigiGreen/AgricultureVideosQnA Hugging Face datasets, which are built from real-world questions posed by farmers.
Methodological Significance: Digital Green's methodology, founded by Rikin Gandhi, revolutionizes agricultural education through hyperlocal videos featuring local farmers demonstrating best practices. Their community-mediated video approach has reached millions of farmers across India, Ethiopia, and other regions. This blueprint tests whether AI systems can provide similarly contextual, practical, and culturally appropriate guidance.
What This Blueprint Tests: The evaluation covers essential farming knowledge spanning seed treatment, pest management, cultivation techniques, and more. Each prompt is paired with citations to actual educational videos from Digital Green's library, representing real-world agricultural challenges.
Geographic and Cultural Context: This blueprint emphasizes Global South agricultural contexts, particularly Indian farming systems, reflecting Digital Green's primary operational areas. The questions address challenges in subsistence and small-scale commercial farming, including resource constraints and climate adaptation.
Key Agricultural Domains Covered:
Evaluation Approach: Each response is evaluated against detailed rubric points extracted directly from ideal responses, focusing on technical accuracy, practical applicability, safety considerations, and contextual appropriateness for resource-constrained farming environments.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.
Avg. Hybrid Score
Latest:
Unique Versions: 1