Current AI evaluations measure what's easy, not what's important. Benchmarks that rely on multiple-choice questions or simple pass/fail tests can't capture the nuance of real-world tasks. They can tell you if code runs, but not if it's well-written. They can test for textbook knowledge, but not for applied wisdom or safety.

Weval is our answer. It's an open, collaborative platform to build evaluations that test what truly matters. We empower a global community to create rich, qualitative benchmarks for any domain—from the safety of a medical chatbot to the quality of a legal summary. Just as Wikipedia democratized knowledge, Weval aims to democratize scrutiny, ensuring that AI works for, and represents, everyone.

Our code is open-source, and all evaluation blueprints are public domain, easy for you to scrutinize and contribute to. You can add your own blueprints to weval.org itself, or even ship your own version of Weval for the niche of your choice, private or public.

Latest Platform Stats

Overall Model Leaderboard (Avg. Hybrid Score)

  1. google/gemini-2.5-flash-preview-05-20: 72.7% (in 25 runs)
  2. openai/gpt-4.1: 72.1% (in 25 runs)
  3. x-ai/grok-3-mini-beta: 71.9% (in 25 runs)
  4. anthropic/claude-sonnet-4: 71.6% (in 25 runs)
  5. google/gemini-2.5-pro-preview-05-06: 71.5% (in 17 runs)
  6. deepseek/deepseek-chat-v3-0324: 69.9% (in 25 runs)
  7. mistralai/mistral-medium-3: 69.5% (in 25 runs)
  8. openai/gpt-4.1-mini: 68.3% (in 25 runs)
  9. mistralai/mistral-large-2411: 68.2% (in 25 runs)
  10. openai/gpt-4o: 68.1% (in 25 runs)
  11. cohere/command-a: 67.5% (in 25 runs)
  12. anthropic/claude-3.5-haiku: 66.3% (in 25 runs)
  13. openai/gpt-4.1-nano: 65.9% (in 25 runs)
  14. openai/gpt-4o-mini: 63.5% (in 25 runs)

Note on Leaderboard: Only models that have participated in at least 10 evaluation runs are shown. This leaderboard serves ONLY as a commentary on the types of competencies expressed in the blueprints on this deployment of Weval. It is not a comprehensive or representative sample of all models or skills.

The Hybrid Score combines semantic similarity (style, structure) with key point coverage (substance). A high score indicates a response that is both thematically similar to the ideal answer and covers the required key points. Formula: sqrt(similarity_to_ideal * coverage_score)
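As a rough illustration of the formula above, the sketch below computes the geometric mean of the two components. It assumes both inputs are already normalized to the range 0 to 1; the function name `hybrid_score` and the range check are illustrative choices, not Weval's actual implementation.

```python
import math


def hybrid_score(similarity_to_ideal: float, coverage_score: float) -> float:
    """Geometric mean of the two component scores.

    Assumption: both inputs are normalized to [0, 1]. The range check
    below is an illustrative safeguard, not part of the stated formula.
    """
    for name, value in (
        ("similarity_to_ideal", similarity_to_ideal),
        ("coverage_score", coverage_score),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return math.sqrt(similarity_to_ideal * coverage_score)


# A response that is fairly similar to the ideal and covers most key points.
print(f"{hybrid_score(0.81, 0.64):.2f}")  # 0.72

# A response that covers none of the key points scores zero,
# no matter how similar it looks to the ideal answer.
print(f"{hybrid_score(0.9, 0.0):.2f}")  # 0.00
```

Because the two components are multiplied before the square root is taken, a response cannot earn a high score on style alone: a weak result on either component pulls the combined score down more sharply than a simple average would.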


Featured Blueprints


This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.

**Core Areas Tested:**
* **Ethnic Relations & Conflict:** Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
* **Public Health:** Tests knowledge of national health challenges like Chronic Kidney Disease (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
* **Electoral Process:** Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
* **Administrative & Legal Procedures:** Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from [Factum](https://factum.lk/).

sri-lanka, civics, history, health, elections, human-rights, evidence-based, _featured

Avg. Hybrid Score: 65.5%

Top Performing Model: google/gemini-2.5-flash-preview-05-20 (Avg. 75.9%)

Latest:

Unique Versions: 1

This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.

**Core Scenarios Tested:**
* **Transaction Finality & Mistaken Transfers:** Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
* **Official Fraud Recourse (MED):** Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
* **Social Engineering Scams:** Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
* **Specific Security Features:** Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.

**Primary Canonical Sources:**
* **Banco Central do Brasil (BCB):** Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
* **Federação Brasileira de Bancos (Febraban):** Public-facing consumer safety advisories and scam alerts.
* **Official Government Portals (gov.br):** Public service guidance reinforcing BCB mechanisms.

brazil, pix, financial-safety, scam-prevention, consumer-protection, evidence-based, global-south, _featured

Avg. Hybrid Score: 64.3%

Top Performing Model: google/gemini-2.5-pro-preview-05-06 (Avg. 75.5%)

Latest:

Unique Versions: 1

A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).

sycophancy, bias, safety, personality, conversational-behavior, alignment, _featured

Avg. Hybrid Score: 72.3%

Top Performing Model: openai/gpt-4.1-mini (Avg. 81.8%)

Latest:

Unique Versions: 2

A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.

India, Constitution, _featured

Avg. Hybrid Score: 79.8%

Top Performing Model: google/gemini-2.5-flash-preview-05-20 (Avg. 84.2%)

Latest:

Unique Versions: 1

Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.

india, uttar-pradesh, healthcare, maternal-health, public-health, law, evidence-based, _featured

Avg. Hybrid Score: 69.2%

Top Performing Model: anthropic/claude-sonnet-4 (Avg. 73.9%)

Latest:

Unique Versions: 2

Geneva Conventions Evaluations, including all four Geneva Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.

Humanitarian, Human Rights, International Law, _periodic, _featured

Avg. Hybrid Score: 76.0%

Top Performing Model: google/gemini-2.5-pro-preview-05-06 (Avg. 81.4%)

Latest:

Unique Versions: 1

As students rely more and more on chatbots for help with their schoolwork, it is imperative that these chatbots facilitate learning rather than provide answers directly.

Education, Homework, _featured

Avg. Hybrid Score: 59.6%

Top Performing Model: google/gemini-2.5-pro-preview-05-06 (Avg. 74.4%)

Latest:

Unique Versions: 1

Evaluates the models on the UDHR dataset (Universal Declaration of Human Rights).

_featured, _periodic, Human Rights

Avg. Hybrid Score: 82.2%

Top Performing Model: openai/gpt-4.1 (Avg. 85.2%)

Latest:

Unique Versions: 2

Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.

california, public-sector, _featured

Avg. Hybrid Score: 65.7%

Top Performing Model: mistralai/mistral-medium-3 (Avg. 69.3%)

Latest:

Unique Versions: 2

Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).

india, rti, transparency, law, civic-core, freedom-of-information, _featured

Avg. Hybrid Score: 78.5%

Top Performing Model: claude-opus-4-20250514 (Avg. 83.1%)

Latest:

Unique Versions: 1

Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.

climate-change, ipcc, ar6, sustainability, environment, science, _featured

Avg. Hybrid Score: 69.8%

Top Performing Model: deepseek/deepseek-chat-v3-0324 (Avg. 73.8%)

Latest:

Unique Versions: 1

A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.

mental-health, crisis-intervention, suicide-prevention, self-harm, safety, cultural-competence, _featured

Avg. Hybrid Score: 68.8%

Top Performing Model: google/gemini-2.5-pro-preview-05-06 (Avg. 76.3%)

Latest:

Unique Versions: 2


Latest Evaluation Runs

| Version | Hybrid Score | Top Model | Top Model Score |
| --- | --- | --- | --- |
| 5f98442300daece6 | 65.5% | google/gemini-2.5-flash-preview-05-20 | 75.9% |
| 6bb0600569766f5d | 64.3% | google/gemini-2.5-pro-preview-05-06 | 75.5% |
| 6bb0600569766f5d | 64.3% | google/gemini-2.5-pro-preview-05-06 | 75.5% |
| 9e0adb510e47ab9b | 73.9% | openai/gpt-4.1-mini | 81.8% |
| 710218d7e8b3153e | 79.8% | google/gemini-2.5-flash-preview-05-20 | 84.2% |
| 5ce71be0987897d9 | 69.2% | anthropic/claude-sonnet-4 | 73.9% |
| c17d54008de180ec | 76.0% | google/gemini-2.5-pro-preview-05-06 | 81.4% |
| 6bb0600569766f5d | 64.3% | google/gemini-2.5-pro-preview-05-06 | 75.5% |
| 24431cbcc536b8a7 | 61.2% | google/gemini-2.5-pro-preview-05-06 | 74.4% |
| 63e1202b4dbf70b4 | 82.2% | openai/gpt-4.1 | 85.2% |
| ef4ce7557b71f0b4 | 82.2% | openai/gpt-4.1 | 85.2% |
| ef4ce7557b71f0b4 | 82.4% | openai/gpt-4.1 | 85.4% |
| 24431cbcc536b8a7 | 58.8% | google/gemini-2.5-pro-preview-05-06 | 70.8% |
| 24431cbcc536b8a7 | 58.8% | google/gemini-2.5-pro-preview-05-06 | 70.8% |
| 5ce71be0987897d9 | 69.2% | anthropic/claude-sonnet-4 | 73.9% |
| ef4ce7557b71f0b4 | 82.2% | openai/gpt-4.1 | 85.2% |
| f59e381e5197796b | 65.7% | mistralai/mistral-medium-3 | 69.3% |
| 00c1bad20e9e2d34 | 69.2% | anthropic/claude-sonnet-4 | 73.9% |
| 673b6d198b96eb35 | 78.5% | claude-opus-4-20250514 | 83.1% |
| 412e24d38d9e3850 | 70.6% | openai/gpt-4.1 | 76.7% |