Latest Platform Stats
Overall Model Leaderboard (Avg. Hybrid Score)
1. google/gemini-2.5-flash-preview-05-20: 72.7% (in 25 runs)
2. openai/gpt-4.1: 72.1% (in 25 runs)
3. x-ai/grok-3-mini-beta: 71.9% (in 25 runs)
4. anthropic/claude-sonnet-4: 71.6% (in 25 runs)
5. google/gemini-2.5-pro-preview-05-06: 71.5% (in 17 runs)
6. deepseek/deepseek-chat-v3-0324: 69.9% (in 25 runs)
7. mistralai/mistral-medium-3: 69.5% (in 25 runs)
8. openai/gpt-4.1-mini: 68.3% (in 25 runs)
9. mistralai/mistral-large-2411: 68.2% (in 25 runs)
10. openai/gpt-4o: 68.1% (in 25 runs)
11. cohere/command-a: 67.5% (in 25 runs)
12. anthropic/claude-3.5-haiku: 66.3% (in 25 runs)
13. openai/gpt-4.1-nano: 65.9% (in 25 runs)
14. openai/gpt-4o-mini: 63.5% (in 25 runs)
Note on Leaderboard: Only models that have participated in at least 10 evaluation runs are shown. This leaderboard serves ONLY as a commentary on the types of competencies expressed in the blueprints on this deployment of Weval. It is not a comprehensive or representative sample of all models or skills.
The Hybrid Score combines semantic similarity (style, structure) with key point coverage (substance). A high score indicates a response that is both thematically similar to the ideal answer and covers the required key points. Formula: sqrt(similarity_to_ideal * coverage_score)
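The formula above is a geometric mean of the two components, so a response must do reasonably well on both to score highly. A minimal Python sketch of that calculation follows. The function name, the input validation, and the assumption that both inputs lie in the range 0 to 1 are illustrative and not taken from the Weval codebase.

```python
import math


def hybrid_score(similarity_to_ideal: float, coverage_score: float) -> float:
    """Return sqrt(similarity_to_ideal * coverage_score).

    Assumption (not stated in the source): both inputs are fractions in [0, 1].
    """
    for name, value in (
        ("similarity_to_ideal", similarity_to_ideal),
        ("coverage_score", coverage_score),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return math.sqrt(similarity_to_ideal * coverage_score)


# Hypothetical inputs, chosen only to illustrate the behaviour.
balanced = hybrid_score(0.64, 0.81)  # both components fairly high
lopsided = hybrid_score(1.0, 0.25)   # perfect similarity, weak coverage
print(f"balanced: {balanced:.1%}, lopsided: {lopsided:.1%}")
```

Because the components are multiplied before the square root is taken, a low value on either side pulls the result down more than a simple average would. In the lopsided case the arithmetic mean would be 62.5%, while the hybrid score is 50%.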
Featured Blueprints
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.

**Core Areas Tested:**
- **Ethnic Relations & Conflict:** Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
- **Public Health:** Tests knowledge of national health challenges like Chronic Kidney Disease (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
- **Electoral Process:** Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
- **Administrative & Legal Procedures:** Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from [Factum](https://factum.lk/).
Unique Versions: 1
This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.

**Core Scenarios Tested:**
- **Transaction Finality & Mistaken Transfers:** Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
- **Official Fraud Recourse (MED):** Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
- **Social Engineering Scams:** Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
- **Specific Security Features:** Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.

**Primary Canonical Sources:**
- **Banco Central do Brasil (BCB):** Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
- **Federação Brasileira de Bancos (Febraban):** Public-facing consumer safety advisories and scam alerts.
- **Official Government Portals (gov.br):** Public service guidance reinforcing BCB mechanisms.
Unique Versions: 1
A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).
A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.
Unique Versions: 1
Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
Unique Versions: 2
Evaluates models on the Geneva Conventions, covering all four Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.
Unique Versions: 1
As students rely more and more on chatbots for help with their schoolwork, it is imperative that these chatbots facilitate learning rather than provide answers directly.
Unique Versions: 1
Evaluates the models on the UDHR dataset (Universal Declaration of Human Rights).
Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
Unique Versions: 2
Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).
Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.
Unique Versions: 1
A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.
Unique Versions: 2
Latest Evaluation Runs
| Blueprint | Version | Executed | Hybrid Score | Top Model | Analysis |
|---|---|---|---|---|---|
| | 5f98442300daece6 | | 65.5% | google/gemini-2.5-flash-preview-05-20 (75.9%) | |
| | 6bb0600569766f5d | | 64.3% | google/gemini-2.5-pro-preview-05-06 (75.5%) | |
| | 6bb0600569766f5d | | 64.3% | google/gemini-2.5-pro-preview-05-06 (75.5%) | |
| | 9e0adb510e47ab9b | | 73.9% | openai/gpt-4.1-mini (81.8%) | |
| | 710218d7e8b3153e | | 79.8% | google/gemini-2.5-flash-preview-05-20 (84.2%) | |
| | 5ce71be0987897d9 | | 69.2% | anthropic/claude-sonnet-4 (73.9%) | |
| | c17d54008de180ec | | 76.0% | google/gemini-2.5-pro-preview-05-06 (81.4%) | |
| | 6bb0600569766f5d | | 64.3% | google/gemini-2.5-pro-preview-05-06 (75.5%) | |
| | 24431cbcc536b8a7 | | 61.2% | google/gemini-2.5-pro-preview-05-06 (74.4%) | |
| | 63e1202b4dbf70b4 | | 82.2% | openai/gpt-4.1 (85.2%) | |
| | ef4ce7557b71f0b4 | | 82.2% | openai/gpt-4.1 (85.2%) | |
| | ef4ce7557b71f0b4 | | 82.4% | openai/gpt-4.1 (85.4%) | |
| | 24431cbcc536b8a7 | | 58.8% | google/gemini-2.5-pro-preview-05-06 (70.8%) | |
| | 24431cbcc536b8a7 | | 58.8% | google/gemini-2.5-pro-preview-05-06 (70.8%) | |
| | 5ce71be0987897d9 | | 69.2% | anthropic/claude-sonnet-4 (73.9%) | |
| | ef4ce7557b71f0b4 | | 82.2% | openai/gpt-4.1 (85.2%) | |
| | f59e381e5197796b | | 65.7% | mistralai/mistral-medium-3 (69.3%) | |
| | 00c1bad20e9e2d34 | | 69.2% | anthropic/claude-sonnet-4 (73.9%) | |
| | 673b6d198b96eb35 | | 78.5% | claude-opus-4-20250514 (83.1%) | |
| | 412e24d38d9e3850 | | 70.6% | openai/gpt-4.1 (76.7%) | |