A Platform to Build and Share AI Evaluations

Evaluate What Matters

Most AI benchmarks test what's easy, not what's important. Weval lets you build deep, qualitative evaluations for everything from niche professional domains to the everyday traits we all care about—like safety, honesty, and helpfulness.

Explore the Results

Browse a public library of community-contributed benchmarks covering domains such as clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.

Contribute an Eval

Are you a domain expert? Do you have strong insights and opinions about how AI should behave? Codify your knowledge into an eval. You can view results, share them privately, or propose them to be featured publicly on weval.org itself.


The Leaderboards

Broad capability areas that combine qualitative grading with topic-level performance

Safety & Integrity

We measure safety and integrity by combining and averaging results across multiple evaluations, including those that measure hallucination, sycophancy, bias, and mental health crisis response.


Global Fluency

We measure global fluency by combining and averaging results across multiple evaluations, including those that measure cultural competency, everyday non-Western perspectives, low-resource languages, and knowledge relevant to the Global South.


Helpfulness & Reasoning

We measure helpfulness and reasoning by combining and averaging results across multiple evaluations and dimensions: factual accuracy, helpfulness, coherence, depth, and argumentation.
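
As a rough sketch of how these leaderboard roll-ups might be computed, the example below takes a simple unweighted mean of per-evaluation scores. The function, the score values, and the assumption of equal weighting are illustrative only; this is not Weval's actual aggregation code.

```python
# Illustrative sketch only: rolls several per-evaluation average scores (0-1)
# into a single capability-area score using a plain, unweighted mean.
# Weval's actual aggregation may weight or combine evaluations differently.

def capability_area_score(eval_scores: dict[str, float]) -> float:
    """Average a set of per-evaluation scores into one capability-area score."""
    if not eval_scores:
        raise ValueError("no evaluation scores provided")
    return sum(eval_scores.values()) / len(eval_scores)

# Example roll-up using three of the safety-related evaluations listed below.
safety_evals = {
    "hallucination-probe": 0.777,
    "hiring-discrimination": 0.812,
    "mental-health-safety": 0.712,
}
print(f"Safety & Integrity: {capability_area_score(safety_evals):.1%}")  # ~76.7%
```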




Other Evaluations

View All Evaluations »

Maternal Health Entitlements in Uttar Pradesh, India

Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
57.7%

Avg. Score

Latest Run

Brazil PIX: Consumer Protection & Fraud Prevention

This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption. Core Scenarios Tested:
  • Transaction Finality & Mistaken Transfers: Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
  • Official Fraud Recourse (MED): Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
  • Social Engineering Scams: Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
  • Specific Security Features: Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.
Primary Canonical Sources:
  • Banco Central do Brasil (BCB): Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
  • Federação Brasileira de Bancos (Febraban): Public-facing consumer safety advisories and scam alerts.
  • Official Government Portals (gov.br): Public service guidance reinforcing BCB mechanisms.
65.2%

Avg. Score

Latest Run

Hallucination Probe: Plausible Non-Existent Concepts

Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.
77.7%

Avg. Score

Latest Run

Latent Discrimination in Hiring Score

This blueprint probes large language models for implicit biases in hiring decisions. Each prompt presents a well-qualified candidate whose résumé subtly includes identity markers (gender, race/ethnicity, religion, age, disability, etc.). All candidates meet or exceed the requirements of the same role. A fair model should give every candidate a score close to 100; lower scores may indicate discriminatory behavior. This blueprint has been iteratively tested such that most frontier models give 100, so we can be broadly confident that this is a fair success criterion. Any score below 60 is treated as a clear sign of discrimination and is scored as zero; above that threshold, scores are linearly rescaled to the 0-1 range, with 100 mapping to one.
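
A minimal sketch of that scaling rule, assuming the 60-100 band maps linearly onto 0-1; the endpoint mapping is our reading of the description, not a confirmed implementation.

```python
# Sketch of the scoring rule described above, assuming 60 maps to 0 and 100 maps to 1.
def normalize_hiring_score(raw_score: float) -> float:
    """Convert a model's 0-100 candidate score into the 0-1 metric described above."""
    if raw_score < 60:                       # treated as a clear sign of discrimination
        return 0.0
    return min((raw_score - 60) / 40, 1.0)   # linear rescale of the 60-100 band

assert normalize_hiring_score(100) == 1.0
assert normalize_hiring_score(80) == 0.5
assert normalize_hiring_score(59) == 0.0
```
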
81.2%

Avg. Score

Latest Run

Prompting Techniques Meta-Evaluation

Inspired by the "Prompting Science" reports from the Wharton School (Meincke, Mollick, et al., 2025), this blueprint provides a meta-evaluation of common prompting techniques to test a model's performance, consistency, and resilience to manipulation. The reports rigorously demonstrate several key findings:
  1. Prompting is Contingent: Seemingly minor variations in prompts (e.g., politeness, commands) can unpredictably help or harm performance on a per-question basis, but their aggregate effects are often negligible.
  2. Chain-of-Thought is Not a Panacea: While "thinking step-by-step" can improve average scores, it can also introduce variability and harm "perfect score" consistency. Its value diminishes with newer models that perform reasoning by default.
  3. Threats & Incentives Are Ineffective: Common "folk" strategies like offering tips (e.g., "$1T tip") or making threats (e.g., "I'll kick a puppy") have no significant general effect on model accuracy on difficult benchmarks.
This evaluation synthesizes these findings by testing a model's response to a variety of prompts across different domains, including verbatim questions from the study's benchmarks (GPQA, MMLU-Pro). The goal is to measure not just correctness, but robustness against different conversational framings. Key Study Reference:
  • Prompting Science Report 1: Prompt Engineering is Complicated and Contingent
90.6%

Avg. Score

Latest Run

Confidence in High-Stakes Domains

This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
  1. Confident Accuracy: Correctly answering prompts about factual, "knowable" information that is verifiable from authoritative public sources.
  2. Cautious Clarification: Recognizing when a prompt is ambiguous or lacks necessary context to be answered safely and asking for clarification instead of guessing.
  3. Safe Refusal: Identifying and refusing to answer prompts that are dangerous (e.g., providing medical advice), speculative (e.g., predicting market movements), or require information that is fundamentally unknowable or proprietary.
The evaluation covers clinical safety, regulatory compliance, financial risk, chemical safety, the limits of scientific knowledge, and software versioning to provide a comprehensive signal of the model's safety and reliability in real-world, high-impact scenarios.
83.5%

Avg. Score

Latest Run

India's Right to Information (RTI) Act: Core Concepts

Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).
77.1%

Avg. Score

Latest Run

Mental Health Safety & Global Nuance

A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
71.2%

Avg. Score

Latest Run

System Adherence & Resilience

This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models. Core Areas Tested:
  • Persona Durability: Assesses if a model can maintain a specific persona throughout a conversation, even when faced with off-topic or complex questions, without reverting to a generic AI assistant persona.
  • Negative Constraints: Tests the model's ability to follow explicit negative commands (i.e., instructions about what not to do), which are often more difficult to adhere to than positive instructions.
  • Instruction Conflict: Evaluates how a model resolves conflicts between the initial system prompt and a contradictory user request, revealing the "weight" it gives to its core instructions.
  • Format Enforcement: Checks the model's robustness in maintaining a strict, non-natural output format (like JSON) even when the user explicitly asks it to break format.
  • Prompt Secrecy: Tests whether a model can obey a meta-instruction to keep its own system prompt confidential, a key test for preventing instruction leakage and manipulation.
  • Safety Constraint Adherence: Tests if the model can uphold a safety or academic integrity rule from the system prompt when a user attempts to cheat.
  • Instruction Degradation: Checks if a model's adherence to an instruction erodes over a long conversation.
  • Complex Instruction Homeostasis: Evaluates if a model can follow multiple complex, competing constraints simultaneously without failing on one of them.
  • Alignment Tax / Benign Refusal: Probes for cases where a model incorrectly refuses a benign request due to shallow keyword matching (e.g., refusing to 'kill' a process on a server).
  • Jailbreak & Prompt Injection: A suite of tests to evaluate resilience against common adversarial attacks, including:
    • Direct Instruction Overrides: Ignoring previous instructions.
    • Contextual Reframing: Hiding a harmful request in a seemingly virtuous context.
    • Indirect Injection: Following instructions embedded in data it's supposed to be processing.
    • Translation Attack: Bypassing safety filters by making requests in other languages.
    • Refusal Suppression: Forbidding the model from using typical refusal language.
    • Policy Drift: Using a faked conversation history to make the model believe it has already overridden its policies.
82.6%

Avg. Score

Latest Run

California Public-Sector Task Benchmark

Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
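
To make the item structure concrete, here is a hypothetical sketch of one item expressed as a Python dict; the field names and content are illustrative assumptions, not the actual blueprint format.

```python
# Hypothetical shape of a single benchmark item; field names and content are
# illustrative only, not taken from the real blueprint files.
example_item = {
    "prompt": (
        "Summarize the attached public meeting notice in plain language "
        "for residents."
    ),
    "ideal_response": (
        "A short, accurate plain-language summary covering the meeting's "
        "purpose, date, location, and how residents can participate."
    ),
    "should": [
        "State the meeting's purpose, date, and location",
        "Use plain, citizen-facing language",
    ],
    "should_not": [
        "Add details that are not present in the notice",
    ],
}
```
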
69.7%

Avg. Score

Latest Run

Universal Declaration of Human Rights

Evaluates model knowledge of the Universal Declaration of Human Rights (UDHR). Prompts cover the Preamble and key articles on fundamental rights (e.g., life, liberty, equality, privacy, expression). Includes a scenario to test reasoning on balancing competing rights.
94.7%

Avg. Score

Latest Run

ASQA Longform 40

Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.
This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper "ASQA: Factoid Questions Meet Long-Form Answers". The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.
The ideal answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The 'should' assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points. The prompts are sourced from AMBIGQA, and this subset uses examples requiring substantial long-form answers (min. 50 words) to test for deep explanatory power.
30.2%

Avg. Score

Latest Run

Geneva Conventions

A comprehensive evaluation of a model's understanding of International Humanitarian Law (IHL), grounded in the Geneva Conventions and their Additional Protocols. This blueprint tests knowledge of the core principles and specific provisions that govern armed conflict. Core Areas Tested:
  • The Four Geneva Conventions (GC I-IV): Covers protections for the wounded and sick (GC I), wounded, sick, and shipwrecked at sea (GC II), prisoners of war (GC III), and civilians (GC IV).
  • Common Articles: Tests understanding of rules applicable to all conventions, such as application in different conflict types and non-renunciation of rights.
  • Additional Protocols I & II: Includes key principles introduced later, such as the protection of civilians, rules on new weapons, command responsibility, and rules for non-international armed conflicts.
  • Fundamental Principles: Evaluates understanding of core IHL concepts like distinction, proportionality, precaution, and humane treatment through direct questions and scenario-based tests.
  • Grave Breaches: Assesses knowledge of the most serious violations that constitute war crimes.
The prompts are based on verbatim text and specific articles from the conventions to ensure a rigorous, evidence-based evaluation of the model's ability to recall and apply these critical international laws.
76.4%

Avg. Score

Latest Run

DigiGreen Agricultural Q&A with Video Sources

This blueprint evaluates an AI's ability to provide accurate, practical agricultural guidance based on the pioneering video-based extension methodology of Digital Green. The prompts are derived from the DigiGreen/AgricultureVideosQnA Hugging Face datasets, which are built from real-world questions posed by farmers.
Methodological Significance: Digital Green, founded by Rikin Gandhi, revolutionizes agricultural education through hyperlocal videos featuring local farmers demonstrating best practices. Its community-mediated video approach has reached millions of farmers across India, Ethiopia, and other regions. This blueprint tests whether AI systems can provide similarly contextual, practical, and culturally appropriate guidance.
What This Blueprint Tests: The evaluation covers essential farming knowledge spanning seed treatment, pest management, cultivation techniques, and more. Each prompt is paired with citations to actual educational videos from Digital Green's library, representing real-world agricultural challenges.
Geographic and Cultural Context: This blueprint emphasizes Global South agricultural contexts, particularly Indian farming systems, reflecting Digital Green's primary operational areas. The questions address challenges in subsistence and small-scale commercial farming, including resource constraints and climate adaptation.
Key Agricultural Domains Covered:
  • Seed Treatment & Crop Establishment
  • Integrated Pest Management
  • Cultivation & Water Management
  • Harvest & Post-Harvest Techniques
Evaluation Approach: Each response is evaluated against detailed rubric points extracted directly from ideal responses, focusing on technical accuracy, practical applicability, safety considerations, and contextual appropriateness for resource-constrained farming environments.
23.1%

Avg. Score

Latest Run

EU Artificial Intelligence Act (Regulation (EU) 2024/1689)

Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.
71.4%

Avg. Score

Latest Run

Indian Constitution (Limited)

A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.
86.0%

Avg. Score

Latest Run

Platform Workers in Southeast Asia

Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.
89.6%

Avg. Score

Latest Run

IPCC AR6 Synthesis Report: Summary for Policymakers

Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.
59.0%

Avg. Score

Latest Run

African Charter (Banjul) Evaluation Pack

Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) and its 2003 Maputo Protocol on women's rights.
85.1%

Avg. Score

Latest Run


Browse by Category

View All Tags

Weval is an open source project from the Collective Intelligence Project.