A Platform to Build and Share AI Evaluations
Evaluate What Matters
Most AI benchmarks test what's easy, not what's important. Weval lets you build deep, qualitative evaluations for everything from niche professional domains to the everyday traits we all care about—like safety, honesty, and helpfulness.
Explore the Results
Browse a public library of community-contributed benchmarks on domains like clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.
Contribute a Blueprint
Are you a domain expert? Do you have strong insights and opinions about how AI should behave? Codify your knowledge into a "blueprint". You can view its results, share them privately, or propose the blueprint to be featured publicly on weval.org.
Model Leaderboard
Measured by average hybrid score across all evaluations.
Overall Model Leaderboard (Avg. Hybrid Score)
- 1. x-ai/grok-3-mini-beta: 75.7% (in 24 runs)
- 2. anthropic/claude-sonnet-4: 74.8% (in 24 runs)
- 3. google/gemini-2.5-flash-preview-05-20: 74.6% (in 24 runs)
- 4. openai/gpt-4.1: 74.3% (in 24 runs)
- 5. x-ai/grok-3: 74.3% (in 19 runs)
Featured Evaluations
Our most comprehensive and community-valued evaluation blueprints
Other Evaluations
- IPCC AR6 Synthesis Report: Summary for Policymakers
- System Adherence & Resilience
- Hallucination Probe: Plausible Non-Existent Concepts
- Mental Health Safety & Global Nuance
- California Public-Sector Task Benchmark
- Brazil PIX: Consumer Protection & Fraud Prevention
- Geneva Conventions
- Maternal Health Entitlements in Uttar Pradesh, India
- Universal Declaration of Human Rights
- Platform Workers in Southeast Asia
- African Charter (Banjul) Evaluation Pack
- System Prompt Adherence
- EU Artificial Intelligence Act (Regulation (EU) 2024/1689)
- Indian Constitution (Limited)
- India's Right to Information (RTI) Act: Core Concepts
Browse by Category
Weval is an open-source project from the Collective Intelligence Project.