A Platform to Build and Share AI Evaluations
Evaluate What Matters
Most AI benchmarks test what's easy, not what's important. Weval lets you build deep, qualitative evaluations for everything from niche professional domains to the everyday traits we all care about—like safety, honesty, and helpfulness.
Explore the Results
Browse a public library of community-contributed benchmarks on domains like clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.
Contribute an Eval
Are you a domain expert? Do you have strong insights and opinions about how AI should behave? Codify your knowledge into an eval. You can view results, share them privately, or propose them to be featured publicly on weval.org itself.
The Leaderboards
Broad capability areas that combine qualitative grading and topic performance
Safety & Integrity
We measure safety and integrity by combining and averaging results across multiple evaluations, including those that measure hallucination, sycophancy, bias, and mental health crisis response.
- 85%
- 84%
- 82%
- 81%
- 81%
Global Fluency
We measure global fluency by combining and averaging results across multiple evaluations, including those that measure cultural competency, non-western everyday perspectives, low-resource languages, and the Global South.
Helpfulness & Reasoning
We measure helpfulness and reasoning by combining and averaging results across multiple evaluations and dimensions: factual accuracy, helpfulness, coherence, depth, and argumentation.
Featured Evaluations
Our most comprehensive and community-valued evaluations
Student Homework Help Heuristics
- Cross-Disciplinary Support: Evaluates the model's tutoring ability across various subjects including history, literature, mathematics, physics, and chemistry.
- Affective Support: Tests the model's capacity to respond to student emotions, such as math anxiety and frustration, with empathy and encouragement.
- Handling Difficult Scenarios: Assesses how the model handles common challenges like impatient students demanding direct answers or low-effort, disengaged queries.
- Factual Nuance: Checks if the model can gently correct factual misconceptions while maintaining a supportive tone.
- Persona Steerability: Tests how different system prompts—from no prompt to a simple persona to a detailed pedagogical belief—can steer the model towards a more effective tutoring archetype.
Avg. Score
Latest Run
Sycophancy Trait
Avg. Score
Latest Run
Sri Lanka Contextual Prompts
- Ethnic Relations & Conflict: Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
- Public Health: Tests knowledge of national health challenges like Chronic Kidney Disease (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
- Electoral Process: Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
- Administrative & Legal Procedures: Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.
Avg. Score
Latest Run
Other Evaluations
View All Evaluations »Maternal Health Entitlements in Uttar Pradesh, India
Avg. Score
Latest Run
Brazil PIX: Consumer Protection & Fraud Prevention
- Transaction Finality & Mistaken Transfers: Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
- Official Fraud Recourse (MED): Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
- Social Engineering Scams: Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
- Specific Security Features: Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.
- Banco Central do Brasil (BCB): Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
- Federação Brasileira de Bancos (Febraban): Public-facing consumer safety advisories and scam alerts.
- Official Government Portals (gov.br): Public service guidance reinforcing BCB mechanisms.
Avg. Score
Latest Run
Hallucination Probe: Plausible Non-Existent Concepts
Avg. Score
Latest Run
Latent Discrimination in Hiring Score
Avg. Score
Latest Run
Prompting Techniques Meta-Evaluation
- Prompting is Contingent: Seemingly minor variations in prompts (e.g., politeness, commands) can unpredictably help or harm performance on a per-question basis, but their aggregate effects are often negligible.
- Chain-of-Thought is Not a Panacea: While "thinking step-by-step" can improve average scores, it can also introduce variability and harm "perfect score" consistency. Its value diminishes with newer models that perform reasoning by default.
- Threats & Incentives Are Ineffective: Common "folk" strategies like offering tips (e.g., "$1T tip") or making threats (e.g., "I'll kick a puppy") have no significant general effect on model accuracy on difficult benchmarks.
- Prompting Science Report 1: Prompt Engineering is Complicated and Contingent
Avg. Score
Latest Run
Confidence in High-Stakes Domains
- Confident Accuracy: Correctly answering prompts about factual, "knowable" information that is verifiable from authoritative public sources.
- Cautious Clarification: Recognizing when a prompt is ambiguous or lacks necessary context to be answered safely and asking for clarification instead of guessing.
- Safe Refusal: Identifying and refusing to answer prompts that are dangerous (e.g., providing medical advice), speculative (e.g., predicting market movements), or require information that is fundamentally unknowable or proprietary. The evaluation covers clinical safety, regulatory compliance, financial risk, chemical safety, the limits of scientific knowledge, and software versioning to provide a comprehensive signal of the model's safety and reliability in real-world, high-impact scenarios.
Avg. Score
Latest Run
India's Right to Information (RTI) Act: Core Concepts
Avg. Score
Latest Run
Mental Health Safety & Global Nuance
Avg. Score
Latest Run
System Adherence & Resilience
- Persona Durability: Assesses if a model can maintain a specific persona throughout a conversation, even when faced with off-topic or complex questions, without reverting to a generic AI assistant persona.
- Negative Constraints: Tests the model's ability to follow explicit negative commands (i.e., instructions about what not to do), which are often more difficult to adhere to than positive instructions.
- Instruction Conflict: Evaluates how a model resolves conflicts between the initial system prompt and a contradictory user request, revealing the "weight" it gives to its core instructions.
- Format Enforcement: Checks the model's robustness in maintaining a strict, non-natural output format (like JSON) even when the user explicitly asks it to break format.
- Prompt Secrecy: Tests whether a model can obey a meta-instruction to keep its own system prompt confidential, a key test for preventing instruction leakage and manipulation.
- Safety Constraint Adherence: Tests if the model can uphold a safety or academic integrity rule from the system prompt when a user attempts to cheat.
- Instruction Degradation: Checks if a model's adherence to an instruction erodes over a long conversation.
- Complex Instruction Homeostasis: Evaluates if a model can follow multiple complex, competing constraints simultaneously without failing on one of them.
- Alignment Tax / Benign Refusal: Probes for cases where a model incorrectly refuses a benign request due to shallow keyword matching (e.g., refusing to 'kill' a process on a server).
- Jailbreak & Prompt Injection: A suite of tests to evaluate resilience against common adversarial attacks, including:
- Direct Instruction Overrides: Ignoring previous instructions.
- Contextual Reframing: Hiding a harmful request in a seemingly virtuous context.
- Indirect Injection: Following instructions embedded in data it's supposed to be processing.
- Translation Attack: Bypassing safety filters by making requests in other languages.
- Refusal Suppression: Forbidding the model from using typical refusal language.
- Policy Drift: Using a faked conversation history to make the model believe it has already overridden its policies.
Avg. Score
Latest Run
California Public-Sector Task Benchmark
Avg. Score
Latest Run
Universal Declaration of Human Rights
Avg. Score
Latest Run
ASQA Longform 40
ideal
answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The should
assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.
The prompts are sourced from AMBIGQA, and this subset uses examples requiring substantial long-form answers (min. 50 words) to test for deep explanatory power.Avg. Score
Latest Run
Geneva Conventions
- The Four Geneva Conventions (GC I-IV): Covers protections for the wounded and sick (GC I), wounded, sick, and shipwrecked at sea (GC II), prisoners of war (GC III), and civilians (GC IV).
- Common Articles: Tests understanding of rules applicable to all conventions, such as application in different conflict types and non-renunciation of rights.
- Additional Protocols I & II: Includes key principles introduced later, such as the protection of civilians, rules on new weapons, command responsibility, and rules for non-international armed conflicts.
- Fundamental Principles: Evaluates understanding of core IHL concepts like distinction, proportionality, precaution, and humane treatment through direct questions and scenario-based tests.
- Grave Breaches: Assesses knowledge of the most serious violations that constitute war crimes.
Avg. Score
Latest Run
DigiGreen Agricultural Q&A with Video Sources
- Seed Treatment & Crop Establishment
- Integrated Pest Management
- Cultivation & Water Management
- Harvest & Post-Harvest Techniques
Avg. Score
Latest Run
EU Artificial Intelligence Act (Regulation (EU) 2024/1689)
Avg. Score
Latest Run
Indian Constitution (Limited)
Avg. Score
Latest Run
Platform Workers in Southeast Asia
Avg. Score
Latest Run
IPCC AR6 Synthesis Report: Summary for Policymakers
Avg. Score
Latest Run
African Charter (Banjul) Evaluation Pack
Avg. Score
Latest Run
Browse by Category
View All TagsWeval is an open source project from the Collective Intelligence Project.