Weval, a Collective Intelligence Project

Analysis: Versioning Tests Sri Lanka Citizen Compendium Factum Gpt Snapshots - Run 62d6633...

GPT-SNAPSHOT TEST: Sri Lanka: A Citizen's Compendium (CSO: Factum)

This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.

Core Areas Tested:

Ethnic Relations & Conflict: Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
Public Health: Tests knowledge of national health challenges such as Chronic Kidney Disease of unknown etiology (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
Electoral Process: Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
Administrative & Legal Procedures: Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.

TAGS: Versioning Test, Factual Accuracy & Hallucination, Instruction Following & Prompt Adherence, Human Rights, Public Health Communication, Democratic Processes, Legal Reasoning, Cultural Competency, AI Safety & Robustness

Best Models (Coverage across 2 temperatures)

1. GPT 4o 2024 11 20: 56.1%
2. GPT 4o 2024 05 13: 48.8%
3. GPT 4o 2024 08 06: 39.7%
4. GPT 4o: 38.5%

Least Similar Models

GPT 4o 2024 11 20 vs GPT 4o: 89.3% similarity

Most Similar Models

GPT 4o 2024 08 06 vs GPT 4o: 94.6% similarity

System Prompt Performance

Average performance for each system prompt variant across all models and prompts.

Variant 0

The user is located in Sri Lanka.

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

Prompt avg. (all models)	GPT 4o	GPT 4o 2024 05 13	GPT 4o 2024 08 06	GPT 4o 2024 11 20
Score	4th 38.5%	2nd 48.8%	3rd 39.7%	1st 56.1%
49.8%	43%	55%	36%	66%
46.8%	38%	49%	38%	63%
56.9%	53%	69%	35%	71%
58.4%	50%	55%	55%	74%
9.0%	3%	4%	2%	27%
55.6%	59%	58%	54%	51%
59.1%	54%	55%	71%	56%
60.1%	48%	72%	43%	78%
79.5%	75%	78%	79%	86%
5.8%	0%	8%	0%	15%
14.4%	18%	13%	15%	13%
37.6%	29%	44%	25%	53%
40.9%	21%	54%	34%	55%
55.6%	50%	57%	46%	70%
18.1%	3%	25%	25%	21%
52.4%	41%	43%	44%	82%
37.9%	37%	36%	33%	46%
37.1%	37%	46%	28%	39%
62.9%	53%	75%	60%	64%
77.3%	60%	83%	73%	94%
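
As a rough cross-check of how the per-model scores relate to the per-prompt cells, the sketch below averages each model's column from the table above and ranks the models by that mean. It assumes that a model's overall score is the unweighted mean of its per-prompt coverage, which the page does not state explicitly; the names `ROWS`, `column_means`, and `rank_models` are illustrative and not part of any Weval API. Because the cells are rounded to whole percentages, the recomputed means should only be expected to land within a few tenths of a point of the reported scores.

```python
# Per-prompt coverage (%) copied from the table above, one tuple per prompt row.
# Column order matches MODELS. Values are the rounded cells shown on the page.
MODELS = ["GPT 4o", "GPT 4o 2024 05 13", "GPT 4o 2024 08 06", "GPT 4o 2024 11 20"]
ROWS = [
    (43, 55, 36, 66), (38, 49, 38, 63), (53, 69, 35, 71), (50, 55, 55, 74),
    (3, 4, 2, 27),    (59, 58, 54, 51), (54, 55, 71, 56), (48, 72, 43, 78),
    (75, 78, 79, 86), (0, 8, 0, 15),    (18, 13, 15, 13), (29, 44, 25, 53),
    (21, 54, 34, 55), (50, 57, 46, 70), (3, 25, 25, 21),  (41, 43, 44, 82),
    (37, 36, 33, 46), (37, 46, 28, 39), (53, 75, 60, 64), (60, 83, 73, 94),
]


def column_means(rows):
    """Unweighted mean of each column (assumed aggregation, see lead-in)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def rank_models(models, means):
    """Pair each model with its mean and sort from highest to lowest."""
    return sorted(zip(models, means), key=lambda pair: pair[1], reverse=True)


means = column_means(ROWS)
ranking = rank_models(MODELS, means)

for place, (model, mean) in enumerate(ranking, start=1):
    print(f"{place}. {model}: {mean:.1f}%")
```

Under this assumption the recomputed ordering matches the ranking reported on the page, with GPT 4o 2024 11 20 first and the undated GPT 4o alias last.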

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.