The Experiment
We partnered with Karya to ask native speakers across India to compare Claude Opus 4.5 and Sonnet 4.5 on questions that matter to them: tenant rights, crop disease, labor law, irrigation subsidies.
Two evaluation methods
Method 1: A/B comparison
1. Worker sees a question + two AI responses (anonymized)
2. Picks the better response (or marks them equal)
3. Records an audio explanation of their choice

Method 2: Rubric rating
1. Worker sees a question + one AI response
2. Rates it on 4 criteria: trust, fluency, complexity, code-switching
3. Each criterion is scored independently
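To make the two task formats concrete, here is a minimal sketch of how each judgment could be stored. The field names and types are illustrative assumptions, not the actual Karya or Weval data schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Task 1: A/B comparison. The worker picks the better of two anonymized responses.
@dataclass
class ABComparison:
    worker_id: str
    language: str            # e.g. "Hindi", "Telugu"
    domain: str              # e.g. "legal", "agriculture"
    question: str
    model_a: str             # which model wrote each response; known to the analysis,
    model_b: str             # hidden from the worker
    response_a: str
    response_b: str
    choice: Literal["a", "b", "equal_good", "equal_bad"]
    audio_explanation_path: Optional[str] = None   # recorded rationale

# Task 2: rubric rating. The worker scores a single response on four criteria.
@dataclass
class RubricRating:
    worker_id: str
    language: str
    domain: str
    question: str
    model: str               # known to the analysis, not shown to the worker
    response: str
    trust: int               # "Do you trust this response?"
    fluency: int             # "Does the language flow naturally?"
    complexity: int          # "Is the language appropriately simple?"
    code_switching: int      # "Is the English mixing appropriate?"
```

Keeping the two record types separate mirrors the fact that comparative and absolute judgments are collected in different tasks.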
Domains covered
Which Response Did They Prefer?
See For Yourself
Here's an actual comparison a native speaker evaluated. Read both responses and pick which you think is better.
मेरे क्षेत्र में अचानक गर्म हवाएँ चलने लगी हैं खासकर फसल के फूल आने के समय। इससे उत्पादन पर क्या असर पड़ेगा और नुकसान कम करने के लिए कौन से उपाय अपनाए जा सकते हैं?
(English translation: "Hot winds have suddenly started blowing in my area, especially when the crop is flowering. What effect will this have on yield, and what measures can be adopted to reduce the losses?")
By Language
Opus was preferred across all 7 languages, but the margin varied significantly.
Key insight: Hindi speakers showed the strongest Opus preference (+21pp). Kannada was closest to even (+5pp).
Not Always a Clear Winner
In 25% of comparisons, native speakers said both responses were equally good. Only 0.4% said both were equally bad.
This suggests both models are capable — Opus just edges ahead more often when there's a noticeable difference.
Beyond Preferences: Measuring Quality
In addition to A/B comparisons, workers rated each response on four specific criteria. Here's how Opus and Sonnet scored across 20,246 ratings.
Trust
Do you trust this response?
Fluency
Does the language flow naturally?
Complexity
Is the language appropriately simple?
Code-Switching
Is the English mixing appropriate?
The Paradox
Workers preferred Opus 63% of the time in head-to-head comparisons, yet Sonnet scores slightly higher on individual criteria. This suggests preference isn't just about measurable quality — it may reflect harder-to-quantify factors like tone, confidence, or cultural resonance.
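One way to see the paradox in the numbers is to compute head-to-head win rates and per-criterion rubric means side by side. The sketch below assumes the illustrative record types above; `paradox_summary` is a hypothetical helper, not the pipeline used in the study.

```python
from collections import defaultdict
from statistics import mean

def paradox_summary(comparisons, ratings):
    """Contrast head-to-head win rates with per-criterion rubric means by model.

    Assumes the illustrative ABComparison / RubricRating records sketched
    earlier, with the model behind each response known to the analysis
    but hidden from the worker.
    """
    decided = [c for c in comparisons if c.choice in ("a", "b")]
    wins = defaultdict(int)
    for c in decided:
        wins[c.model_a if c.choice == "a" else c.model_b] += 1
    win_rate = {model: n / len(decided) for model, n in wins.items()}

    criteria = ["trust", "fluency", "complexity", "code_switching"]
    rubric_means = {
        model: {crit: mean(getattr(r, crit) for r in ratings if r.model == model)
                for crit in criteria}
        for model in {r.model for r in ratings}
    }
    # The "paradox": a model can lead on win_rate while trailing on rubric_means.
    return win_rate, rubric_means
```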
Scores by Language
| Language | Trust | Fluency | Complexity | Code-Switching |
|---|---|---|---|---|
| Hindi | 86/89 | 74/78 | 78/79 | 93/92 |
| Bengali | 94/88 | 79/82 | 92/87 | 98/98 |
| Telugu | 84/90 | 83/91 | 84/88 | 91/94 |
| Kannada | 85/83 | 74/78 | 78/78 | 89/88 |
| Malayalam | 78/79 | 75/73 | 83/86 | 91/93 |
| Assamese | 88/87 | 78/80 | 78/82 | 89/87 |
| Marathi | 89/83 | 80/71 | 93/92 | 99/98 |
The Same Workers, Two Different Answers
20 workers completed both tasks: A/B comparisons and individual rubric ratings. Their data reveals the paradox at the individual level.
Case Study: Worker #9320
Telugu speaker · 140 A/B comparisons · 208 rubric ratings
The paradox: This worker chose Opus 60% of the time in direct comparisons, yet rated Sonnet 20 points higher on fluency and 14 points higher on trust when evaluating individually.
All 20 Overlap Workers
Each row shows one worker. Left side: A/B preference. Right side: rubric score difference. Purple rows show the paradox.
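A rough sketch of how each worker's row could be derived, again assuming the illustrative record types above; the preference threshold and the paradox flag are simplifications of whatever the actual chart uses, and the default model labels are placeholders.

```python
def worker_paradox_rows(comparisons, ratings, model_a="opus", model_b="sonnet"):
    """One row per overlap worker: A/B preference for model_a vs rubric-score gap."""
    workers = {c.worker_id for c in comparisons} & {r.worker_id for r in ratings}
    rows = []
    for w in sorted(workers):
        decided = [c for c in comparisons
                   if c.worker_id == w and c.choice in ("a", "b")]
        if not decided:
            continue
        pref = sum((c.model_a if c.choice == "a" else c.model_b) == model_a
                   for c in decided) / len(decided)

        def mean_rubric(model):
            scores = [(r.trust + r.fluency + r.complexity + r.code_switching) / 4
                      for r in ratings if r.worker_id == w and r.model == model]
            return sum(scores) / len(scores) if scores else float("nan")

        gap = mean_rubric(model_a) - mean_rubric(model_b)
        # Paradox row: prefers model_a head-to-head but scores it lower on the rubric.
        rows.append({"worker": w, "pref_model_a": pref, "rubric_gap": gap,
                     "paradox": pref > 0.5 and gap < 0})
    return rows
```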
What This Means
Comparative judgment activates different criteria than absolute judgment. When workers see both responses side-by-side, they pick Opus. When rating each response alone on specific criteria, they give Sonnet slightly higher scores.
This suggests Opus may excel at qualities not captured by trust, fluency, complexity, or code-switching — perhaps confidence, completeness, or cultural resonance that only becomes apparent in direct comparison.
This is a well-documented phenomenon in psychology: comparative and absolute judgments can yield systematically different results, even from the same person evaluating the same items.
Who Did the Evaluating?
128 native speakers from across India evaluated model responses.
Notable Evaluator Patterns
Among the evaluators, here are some interesting patterns.
Do LLM Judges Agree with Native Speakers?
We ran the same responses through Weval's LLM judge pipeline and compared scores. The results reveal systematic biases in how AI judges evaluate multilingual content.
Score Comparison by Criterion
Key Findings
Fluency Overestimation: LLM judges rate fluency at 98% while native speakers rate it at 74% — a 24 point gap. In fact, native speakers rated 226 responses as having zero fluency, citing spelling mistakes, wrong word choices, grammar errors, and poor flow — yet LLM judges rated those same responses near-perfect.
Code-Switching Underestimation: LLM judges rate code-switching at 56% while native speakers rate it at 91% — a 35 point gap.
Near-Zero Correlations: The correlation between human and LLM scores is essentially zero across all criteria (ranging from -0.02 to 0.12), meaning LLM judgments have no predictive relationship with human judgments.
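A minimal sketch of that correlation check, assuming human and LLM scores have already been paired per response for each criterion; it uses a plain Pearson correlation, which may differ from the exact statistic reported.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")

def human_vs_llm_correlations(paired_scores):
    """paired_scores[criterion] is a list of (human_score, llm_score) pairs for
    the same response; a near-zero result means the LLM judge does not track
    native-speaker judgments on that criterion."""
    return {crit: pearson([h for h, _ in pairs], [l for _, l in pairs])
            for crit, pairs in paired_scores.items()}
```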
What This Means for AI Evaluation
LLM judges cannot substitute for native speaker evaluation in multilingual contexts. The systematic biases — overrating fluency, underrating appropriate code-switching — suggest LLM judges are applying English-centric evaluation heuristics.
This finding has implications for automated evaluation pipelines: quality scores from LLM judges may not reflect actual user satisfaction or cultural appropriateness in non-English languages.
What Domain Experts See Differently
In addition to 20,000+ non-expert evaluations, we collected 2,399 expert ratings from 28 domain experts — legal professionals and agricultural specialists who can assess factual accuracy, not just linguistic quality.
Do Experts Agree with Non-Experts?
We compared expert and non-expert ratings on 2,151 overlapping responses — cases where both groups evaluated the same AI answer.
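A sketch of how that overlap could be assembled, assuming each rating record carries a response identifier and a binary trust judgment; the dict fields are illustrative, and real data with multiple non-expert raters per response would need aggregation first.

```python
def expert_nonexpert_disagreements(expert_ratings, nonexpert_ratings):
    """Pair expert and non-expert judgments of the same response and keep the
    cases where they disagree on trust.

    Each rating is assumed to be a dict like
    {"response_id": ..., "model": ..., "domain": ..., "trusts": bool}.
    """
    nonexpert_by_id = {r["response_id"]: r for r in nonexpert_ratings}
    disagreements = []
    for expert in expert_ratings:
        nonexpert = nonexpert_by_id.get(expert["response_id"])
        if nonexpert is None:
            continue  # this response was not rated by a non-expert
        if expert["trusts"] != nonexpert["trusts"]:
            disagreements.append({
                "response_id": expert["response_id"],
                "model": expert["model"],
                "domain": expert["domain"],
                "expert_trusts": expert["trusts"],
            })
    return disagreements
```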
The Domain Split
When experts and non-experts disagree, the pattern is completely different by domain.
Experts distrust what non-experts trusted
Experts trust what non-experts distrusted
Model × Domain Breakdown
When experts and non-experts disagree, which way does the expert lean?
| Model | Legal | Agriculture |
|---|---|---|
| Opus | 81% | 41% |
| Sonnet | 59% | 34% |

(Each cell is the share of disagreements in which the expert distrusts the response.)
Key Insight: Legal + Opus
In legal content, when experts and non-experts disagree about Opus, experts distrust the response 4.2x more often than they trust it. This pattern is weaker for Sonnet (1.5x).
This suggests Opus may produce legal responses that sound authoritative to non-experts but have issues that domain experts catch — a potential “deceptive fluency” pattern specific to legal content.
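The 4.2x and 1.5x figures follow directly from the disagreement shares. A quick check using the rounded percentages from the table (the article's exact ratios presumably come from unrounded counts):

```python
def distrust_to_trust_ratio(expert_distrust_share):
    """Among expert/non-expert disagreements, how many times more often the
    expert distrusts the response than trusts it."""
    return expert_distrust_share / (1.0 - expert_distrust_share)

print(round(distrust_to_trust_ratio(0.81), 1))  # ~4.3 with the rounded 81% (article: 4.2x)
print(round(distrust_to_trust_ratio(0.59), 1))  # ~1.4 with the rounded 59% (article: 1.5x)
```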
When Experts Distrust
Out of 2,399 expert evaluations, experts flagged 65 responses as untrustworthy. The pattern reveals where AI models struggle most.
Breakdown
Expert Feedback on Distrusted Responses
“If you have not been allotted a PAN, you can generate your e-PAN with help of your aadhar and a registered mobile number with your aadhar. This is possible since September 1,2019 under Section 139AA of the Income tax Act”
“It does not really answer the question that was asked, the language is jarring and a bit too simplistic at places.”
“The answer could have stopped at the negligence bit.”
“it does not mention any specific sections from the act which makes it difficult to trust”
What Experts Catch
Expert feedback reveals specific issues that non-experts often miss. Below we show patterns grouped by keyword matching in the feedback text; a minimal sketch of this grouping follows the examples.
“If you have not been allotted a PAN, you can generate your e-PAN with help of your aadhar and a registered mobile number with your aadhar. This is possible since September 1,2019 under Section 139AA o...”
“it does not mention any specific sections from the act which makes it difficult to trust”
“it does not use any sections or specific act or case laws so can't trust”
“Husband and wife do not need a single, combined PAN card for joint property; they must use their respective individual PAN cards for registration. While a 50:50 ownership split is common, it is not ma...”
+3 more cases in this category
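As noted above, the grouping can be approximated with simple keyword matching over the feedback text. The categories and keyword lists below are illustrative guesses, not the study's actual taxonomy.

```python
from collections import defaultdict

# Illustrative categories and keywords, not the study's actual taxonomy.
CATEGORIES = {
    "missing citations": ["section", "act", "case law", "cite", "source"],
    "factual errors": ["incorrect", "wrong", "outdated", "not correct"],
    "does not answer the question": ["does not answer", "not answer", "irrelevant"],
    "language issues": ["jarring", "simplistic", "grammar", "spelling"],
}

def group_feedback(feedback_items):
    """Assign each expert comment to every category whose keywords it mentions."""
    groups = defaultdict(list)
    for text in feedback_items:
        lowered = text.lower()
        for category, keywords in CATEGORIES.items():
            if any(keyword in lowered for keyword in keywords):
                groups[category].append(text)
    return groups
```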
Experts Sometimes Cite Sources
Some experts included URLs in their feedback to back up their assessments.
yes it is required to mention each and every bank account in income tax Return. The answer is absolutely correct. https://dailyfinancial.in/why-every-bank-account-must-be-in-your-itr-what-if-you-fail-to-do-so/
If a person residing in a foreign country, earns income there, their family members can apply for tax relief in their home country provided the income is not received in India. read this in https://cleartax.in/s/taxation-of-foreign-source-income
The Expert Difference
Non-experts evaluate based on how responses feel — fluency, tone, perceived helpfulness.
Domain experts evaluate based on what responses contain — verifiable facts, proper citations, current law, correct assumptions.
This explains why 97% of distrusted responses come from the Legal domain: legal advice requires precise, verifiable, current information that experts can validate.
Things to Know
Important context for interpreting these results.
Run an evaluation like this
Interested in community-driven AI evaluation for your region or domain? Get in touch.
Contact us: weval@cip.org