A Platform to Build and Share AI Evaluations

Evaluate What Matters

Most AI benchmarks test what's easy, not what's important. Weval lets you build deep, qualitative evaluations for everything from niche professional domains to the everyday traits we all care about—like safety, honesty, and helpfulness.

Explore the Results

Browse a public library of community-contributed benchmarks on domains like clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.

Contribute a Blueprint

Are you a domain expert? Do you have strong insights and opinions about how AI should behave? Codify your knowledge into a "blueprint". You can view results, share them privately, or propose them to be featured publicly on weval.org itself.

Model Leaderboard

Measured by average hybrid score across all evaluations.

Overall Model Leaderboard (Avg. Hybrid Score)

  • 1.x-ai/grok-3-mini-beta
    75.7%(in 24 runs)
  • 2.anthropic/claude-sonnet-4
    74.8%(in 24 runs)
  • 3.google/gemini-2.5-flash-preview-05-20
    74.6%(in 24 runs)
  • 4.openai/gpt-4.1
    74.3%(in 24 runs)
  • 5.x-ai/grok-3
    74.3%(in 19 runs)


Other Evaluations

View All Evaluations »

Browse by Category

View All Tags

Weval is an open source project from the Collective Intelligence Project.