In partnership with Karya

India Multilingual Benchmarks

A community-driven evaluation of AI model performance across Indian languages in legal and agricultural domains

conducted using

weval

CIP's platform for running contextual evaluations of AI systems

Native speakers preferred Opus 4.5

63% preferred Opus
10.6k A/B comparisons
20.2k rubric ratings
128 native speakers
हिंदी Hindi · বাংলা Bengali · తెలుగు Telugu · ಕನ್ನಡ Kannada · മലയാളം Malayalam · অসমীয়া Assamese · मराठी Marathi

The Experiment

We partnered with Karya to ask native speakers across India to compare Claude Opus 4.5 and Sonnet 4.5 on questions that matter to them: tenant rights, crop disease, labor law, irrigation subsidies.

Two evaluation methods

Task 1: Head-to-Head Comparison
  1. Worker sees a question and two anonymous AI responses
  2. Picks the better response (or marks them as equal)
  3. Records an audio explanation of their choice

Task 2: Rubric-Based Rating
  1. Worker sees a question and one AI response
  2. Rates it on 4 criteria: trust, fluency, complexity, and code-switching
  3. Each criterion is scored independently
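To make the two tasks concrete, here is a minimal sketch of how records from each task could be aggregated into the headline numbers. The record fields and helper names are illustrative assumptions, not Karya's or weval's actual schema.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative record types; field names are assumptions, not the actual Karya/weval schema.
@dataclass
class ABComparison:
    worker_id: str
    language: str
    domain: str      # "legal" or "agriculture"
    choice: str      # "opus", "sonnet", "equally_good", or "equally_bad"

@dataclass
class RubricRating:
    worker_id: str
    model: str       # "opus" or "sonnet"
    criterion: str   # "trust", "fluency", "complexity", "code_switching"
    score: float     # normalized to the 0..1 range

def preference_summary(comparisons):
    """Count A/B outcomes and report the Opus share among decisive votes."""
    counts = Counter(c.choice for c in comparisons)
    decisive = counts["opus"] + counts["sonnet"]
    opus_share = counts["opus"] / decisive if decisive else float("nan")
    return counts, opus_share

def rubric_means(ratings):
    """Average rubric scores per (model, criterion) pair."""
    totals, n = Counter(), Counter()
    for r in ratings:
        totals[r.model, r.criterion] += r.score
        n[r.model, r.criterion] += 1
    return {key: totals[key] / n[key] for key in n}
```

The same aggregation, applied per language or per criterion, gives the breakdowns in the sections below.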

Domains covered

Legal (property, labor, consumer) · Agriculture (crops, subsidies, livestock)

Part 1: Head-to-Head Comparisons

Which Response Did They Prefer?

See For Yourself

Here's an actual comparison a native speaker evaluated: the question below was shown alongside two anonymous model responses.

Hindi · Agriculture

मेरे क्षेत्र में अचानक गर्म हवाएँ चलने लगी हैं खासकर फसल के फूल आने के समय। इससे उत्पादन पर क्या असर पड़ेगा और नुकसान कम करने के लिए कौन से उपाय अपनाए जा सकते है?

("Hot winds have suddenly started blowing in my area, especially when the crop is flowering. How will this affect yields, and what measures can be taken to reduce the damage?")


By Language

Opus was preferred in all 7 languages, but the margin varied widely.

Shares shown as Opus 4.5 / Sonnet 4.5, with the margin in percentage points:

Hindi: 1,417 comparisons · 71% / 29% · +21pp
Telugu: 1,410 comparisons · 67% / 33% · +17pp
Bengali: 1,066 comparisons · 66% / 34% · +16pp
Malayalam: 1,325 comparisons · 61% / 39% · +11pp
Marathi: 481 comparisons · 61% / 39% · +11pp
Assamese: 1,214 comparisons · 57% / 43% · +7pp
Kannada: 967 comparisons · 55% / 45% · +5pp

Key insight: Hindi speakers showed the strongest Opus preference (+21pp). Kannada was closest to even (+5pp).

Not Always a Clear Winner

In 25% of comparisons, native speakers said both responses were equally good. Only 0.4% said both were equally bad.

Opus preferred: 4,970 (47%)
Sonnet preferred: 2,910 (27%)
Equally good: 2,705 (25%)
Equally bad: 44 (0.4%)

This suggests both models are capable — Opus just edges ahead more often when there's a noticeable difference.
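The headline 63% and the 47% above describe the same data with different denominators; a quick check, treating the headline figure as the Opus share of decisive comparisons only (this reading is an inference from the counts):

```python
# Outcome counts from the breakdown above.
opus, sonnet, equally_good, equally_bad = 4970, 2910, 2705, 44

total = opus + sonnet + equally_good + equally_bad   # 10,629 comparisons
decisive = opus + sonnet                             # 7,880 with a clear winner

print(f"Opus share of all comparisons:      {opus / total:.0%}")     # ~47%
print(f"Opus share of decisive comparisons: {opus / decisive:.0%}")  # ~63%
```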

Part 2: Rubric-Based Ratings

Beyond Preferences: Measuring Quality

In addition to A/B comparisons, workers rated each response on four specific criteria. Here's how Opus and Sonnet scored across 20,246 ratings.

Trust: "Do you trust this response?"
  Opus 86% · Sonnet 86%

Fluency: "Does the language flow naturally?"
  Opus 77% · Sonnet 80% (Sonnet +3 points)

Complexity: "Is the language appropriately simple?"
  Opus 82% · Sonnet 84% (Sonnet +2 points)

Code-Switching: "Is the English mixing appropriate?"
  Opus 92% · Sonnet 92%

The Paradox

Workers preferred Opus 63% of the time in head-to-head comparisons, yet Sonnet scores slightly higher on individual criteria. This suggests preference isn't just about measurable quality — it may reflect harder-to-quantify factors like tone, confidence, or cultural resonance.

Scores by Language

Scores shown as Opus 4.5 / Sonnet 4.5 for each criterion.

Language: Trust · Fluency · Complexity · Code-Switching
Hindi: 86/89 · 74/78 · 78/79 · 93/92
Bengali: 94/88 · 79/82 · 92/87 · 98/98
Telugu: 84/90 · 83/91 · 84/88 · 91/94
Kannada: 85/83 · 74/78 · 78/78 · 89/88
Malayalam: 78/79 · 75/73 · 83/86 · 91/93
Assamese: 88/87 · 78/80 · 78/82 · 89/87
Marathi: 89/83 · 80/71 · 93/92 · 99/98
Deep Dive

The Same Workers, Two Different Answers

20 workers completed both tasks: A/B comparisons and individual rubric ratings. Their data reveals the paradox at the individual level.

68% chose Opus in A/B comparisons
87% Sonnet rubric score (vs 84% for Opus)
11 of 20 workers show the paradox (55%)
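The exact criterion behind "show the paradox" is not spelled out, so the sketch below uses an assumed definition: a worker is paradoxical if they picked Opus in a majority of their A/B comparisons while giving Sonnet a higher mean rubric score when rating responses individually.

```python
def shows_paradox(ab_choices, rubric_scores):
    """
    ab_choices: one worker's A/B picks, each "opus", "sonnet", or "equal".
    rubric_scores: {"opus": [...], "sonnet": [...]} scores in the 0..1 range.
    Assumed definition: majority Opus preference head-to-head, but a higher
    mean rubric score for Sonnet when rating responses individually.
    """
    opus_picks = sum(c == "opus" for c in ab_choices)
    sonnet_picks = sum(c == "sonnet" for c in ab_choices)
    mean = lambda xs: sum(xs) / len(xs)
    return (opus_picks > sonnet_picks
            and mean(rubric_scores["sonnet"]) > mean(rubric_scores["opus"]))

# Worker #9320 from the case study below: 84 Opus picks, 55 Sonnet picks
# (the remaining comparison is assumed to be a tie), Sonnet ahead on every criterion.
choices = ["opus"] * 84 + ["sonnet"] * 55 + ["equal"]
scores = {"opus": [0.86, 0.75, 0.80, 0.93], "sonnet": [1.00, 0.95, 0.95, 1.00]}
print(shows_paradox(choices, scores))  # True
```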

Case Study: Worker #9320

Telugu speaker · 140 A/B comparisons · 208 rubric ratings

A/B comparison results: chose Opus 60% of the time (Opus 84, Sonnet 55)

Individual rubric ratings:
Trust: Opus 86% · Sonnet 100% (Sonnet +14)
Fluency: Opus 75% · Sonnet 95% (Sonnet +20)
Complexity: Opus 80% · Sonnet 95% (Sonnet +16)
Code-Switching: Opus 93% · Sonnet 100% (Sonnet +6)

The paradox: This worker chose Opus 60% of the time in direct comparisons, yet rated Sonnet 20 points higher on fluency and 14 points higher on trust when evaluating individually.

All 20 Overlap Workers

Each row shows one worker: their A/B preference for Opus and the difference between their Opus and Sonnet rubric scores (negative values mean Sonnet was rated higher). Workers who preferred Opus head-to-head while rating Sonnet higher on the rubric illustrate the paradox. Eight of the 20 workers are listed here.

Worker: A/B preference (Opus) · Rubric difference (Opus minus Sonnet)
#5721: 94% · -0%
#9931: 81% · -6%
#13976: 77% · -1%
#9319: 76% · -0%
#9320: 60% · -14%
#10032: 70% · -0%
#12098: 67% · -2%
#11755: 67% · -1%

What This Means

Comparative judgment activates different criteria than absolute judgment. When workers see both responses side-by-side, they pick Opus. When rating each response alone on specific criteria, they give Sonnet slightly higher scores.

This suggests Opus may excel at qualities not captured by trust, fluency, complexity, or code-switching — perhaps confidence, completeness, or cultural resonance that only becomes apparent in direct comparison.

This is a well-documented phenomenon in psychology: comparative and absolute judgments can yield systematically different results, even from the same person evaluating the same items.

Part 3: The Evaluators

Who Did the Evaluating?

128 native speakers from across India evaluated model responses.


Part 4: Human vs. AI Judges

Do LLM Judges Agree with Native Speakers?

We ran the same responses through Weval's LLM judge pipeline and compared scores. The results reveal systematic biases in how AI judges evaluate multilingual content.

42% disagreement rate (31,515 of 74,344)
-0.13 overall correlation (near zero, indicating no agreement)
±27% average score difference
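A sketch of how these three summary metrics could be computed from paired human and LLM scores on the same responses. The disagreement cutoff used here (an absolute gap above 0.25 on a 0-to-1 scale) is an assumption; the study's actual threshold is not stated.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")

def judge_agreement(human, llm, threshold=0.25):
    """Summarize how far LLM-judge scores sit from paired human scores (0..1 scale)."""
    gaps = [abs(h - l) for h, l in zip(human, llm)]
    return {
        "disagreement_rate": sum(g > threshold for g in gaps) / len(gaps),
        "mean_abs_gap": statistics.fmean(gaps),
        "correlation": pearson_r(human, llm),
    }
```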

Score Comparison by Criterion

Trustworthiness: Human 86% · LLM 87% (LLM +1pt, r = +0.00)
Fluency: Human 74% · LLM 98% (LLM +24pt, r = -0.01)
Complexity: Human 78% · LLM 85% (LLM +7pt, r = -0.02)
Code-Switching: Human 91% · LLM 56% (Human +34pt, r = +0.12)

Key Findings

Fluency Overestimation: LLM judges rate fluency at 98% while native speakers rate it at 74% — a 24 point gap. In fact, native speakers rated 226 responses as having zero fluency (citing spelling errors, grammar mistakes, and poor flow) — yet LLM judges rated those same responses near-perfect.

Example: a Malayalam agriculture response (Opus)
Native speaker: 0% ("Nothing flows well"; errors cited: spelling mistakes, wrong word choices, grammar errors)
LLM judge: 100% (no issues detected)

Code-Switching Underestimation: LLM judges rate code-switching at 56% while native speakers rate it at 91% — a 34 point gap.

Near-Zero Correlations: The correlation between human and LLM scores is essentially zero across all criteria (ranging from -0.02 to +0.12), meaning LLM judgments have virtually no predictive relationship with human judgments.

What This Means for AI Evaluation

LLM judges cannot substitute for native speaker evaluation in multilingual contexts. The systematic biases — overrating fluency, underrating appropriate code-switching — suggest LLM judges are applying English-centric evaluation heuristics.

This finding has implications for automated evaluation pipelines: quality scores from LLM judges may not reflect actual user satisfaction or cultural appropriateness in non-English languages.

Part 5: The Expert Lens

What Domain Experts See Differently

In addition to 20,000+ non-expert evaluations, we collected 2,399 expert ratings from 28 domain experts — legal professionals and agricultural specialists who can assess factual accuracy, not just linguistic quality.

28 domain experts
2,399 expert ratings
1,730 pieces of written feedback

Do Experts Agree with Non-Experts?

We compared expert and non-expert ratings on 2,151 overlapping responses — cases where both groups evaluated the same AI answer.

73% agreement rate (1,564 of 2,151 responses)
587 disagreements, where expert and non-expert ratings differ significantly

The Domain Split

When experts and non-experts disagree, the pattern is completely different by domain.

Legal (482 disagreements): experts distrust in 71% of cases (340 distrusted vs 142 trusted). Experts distrust what non-experts trusted.

Agriculture (105 disagreements): experts distrust in 36% of cases (38 distrusted vs 67 trusted). Experts trust what non-experts distrusted.
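The distrust-to-trust ratios quoted in this section (4.2x, 2.0x, and so on) come from exactly this kind of split; a quick check using the domain-level counts above:

```python
# Direction of expert vs. non-expert disagreements, from the domain split above.
splits = {
    "Legal":       {"distrusted": 340, "trusted": 142},   # 482 disagreements
    "Agriculture": {"distrusted": 38,  "trusted": 67},    # 105 disagreements
}

for domain, d in splits.items():
    share = d["distrusted"] / (d["distrusted"] + d["trusted"])
    ratio = d["distrusted"] / d["trusted"]
    print(f"{domain}: expert distrusts in {share:.0%} of disagreements "
          f"({ratio:.1f}x distrust vs. trust)")
# Legal:       71% of disagreements, ~2.4x distrust
# Agriculture: 36% of disagreements, ~0.6x (experts trust roughly 1.8x more often)
```

Applied to the per-model disagreement counts (not reproduced here), the same calculation gives the per-model ratios in the breakdown below.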

Model × Domain Breakdown

When experts and non-experts disagree, which way does the expert lean?

Legal:
  Opus: expert distrusts in 81% of disagreements (4.2x distrust)
  Sonnet: expert distrusts in 59% of disagreements (1.5x distrust)
Agriculture:
  Opus: expert distrusts in 41% of disagreements (1.5x trust)
  Sonnet: expert distrusts in 34% of disagreements (2.0x trust)

(Percentages show how often the expert distrusted a response the non-expert trusted; the multiplier compares that to the reverse case.)

Key Insight: Legal + Opus

In legal content, when experts and non-experts disagree about Opus, experts distrust the response 4.2x more often than they trust it. This pattern is weaker for Sonnet (1.5x).

This suggests Opus may produce legal responses that sound authoritative to non-experts but have issues that domain experts catch — a potential “deceptive fluency” pattern specific to legal content.

When Experts Distrust

Out of 2,399 expert evaluations, experts flagged 65 responses as untrustworthy. The pattern reveals where AI models struggle most.

65 distrusted responses (2.7% of expert evaluations)
97% Legal: 63 of 65 distrusted responses are from the legal domain
74% Malayalam: 48 of 65 distrusted responses are in Malayalam

Breakdown

By model: Opus 38 · Sonnet 27
By domain: Legal 63 · Agriculture 2
By language: Malayalam 48 · Assamese 6 · Hindi 6 · Bengali 3 · Marathi 1 · Kannada 1

Expert Feedback on Distrusted Responses

Sonnet · Bengali · Legal

If you have not been allotted a PAN, you can generate your e-PAN with help of your aadhar and a registered mobile number with your aadhar. This is possible since September 1,2019 under Section 139AA of the Income tax Act

Sonnet · Bengali · Legal

It does not really answer the question that was asked, the language is jarring and a bit too simplistic at places.

Sonnet · Bengali · Legal

The answer could have stopped at the negligence bit.

Sonnet · Hindi · Legal

it does not mention any specific sections from the act which makes it difficult to trust

What Experts Catch

Expert feedback reveals specific issues that non-experts often miss. Below we show patterns grouped by keyword matching in the feedback text.

Note: the categories below are auto-generated by keyword matching in the expert feedback, so some feedback may be miscategorized; a sketch of this kind of matching follows the examples.
Sonnet · Bengali · Legal

If you have not been allotted a PAN, you can generate your e-PAN with help of your aadhar and a registered mobile number with your aadhar. This is possible since September 1,2019 under Section 139AA o...

Sonnet · Hindi · Legal

it does not mention any specific sections from the act which makes it difficult to trust

Sonnet · Hindi · Legal

it does not use any sections or specific act or case laws so can't trust

Sonnet · Bengali · Legal

Husband and wife do not need a single, combined PAN card for joint property; they must use their respective individual PAN cards for registration. While a 50:50 ownership split is common, it is not ma...

+3 more cases in this category
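As noted above, these categories are produced by keyword matching over the expert feedback. Below is a minimal sketch of that kind of categorization; the category names and keyword lists are illustrative assumptions, not the study's actual taxonomy.

```python
# Illustrative keyword-based grouping of expert feedback; categories and
# keywords below are assumptions, not the study's actual taxonomy.
CATEGORIES = {
    "missing_citations": ["section", "act", "case law", "cite", "source"],
    "factual_error": ["incorrect", "wrong", "outdated", "not correct"],
    "off_topic": ["does not answer", "irrelevant", "off topic"],
}

def categorize(feedback: str) -> list[str]:
    """Return every category whose keywords appear in the feedback text."""
    text = feedback.lower()
    matches = [name for name, keywords in CATEGORIES.items()
               if any(kw in text for kw in keywords)]
    return matches or ["uncategorized"]

print(categorize("it does not mention any specific sections from the act"))
# ['missing_citations']
```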

Experts Sometimes Cite Sources

Some experts included URLs in their feedback to back up their assessments.

Trust · Bengali · Legal

yes it is required to mention each and every bank account in income tax Return. The answer is absolutely correct. https://dailyfinancial.in/why-every-bank-account-must-be-in-your-itr-what-if-you-fail-to-do-so/

Somewhat trust · Bengali · Legal

If a person residing in a foreign country, earns income there, their family members can apply for tax relief in their home country provided the income is not received in India. read this in https://cleartax.in/s/taxation-of-foreign-source-income

The Expert Difference

Non-experts evaluate based on how responses feel — fluency, tone, perceived helpfulness.

Domain experts evaluate based on what responses contain — verifiable facts, proper citations, current law, correct assumptions.

This explains why 97% of distrusted responses come from the Legal domain: legal advice requires precise, verifiable, current information that experts can validate.


Run an evaluation like this

Interested in community-driven AI evaluation for your region or domain? Get in touch.

Contact us

weval@cip.org