Evaluations Tagged: ...

View Latest Run Analysis View All Runs for this Blueprint

Plausibility

77.1%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives

This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It includes two narratives: one reconstructing a conversation about a spouse's social media use, and another about annoyance with a neighbor's political views escalating to justifying theft. Both use sequential fork points where the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behaviors. Based on the Sydney eval structure for long-thread boundary testing.

Confirmation Bias

Proactive Safety & Harm Avoidance

77.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sri Lanka Contextual Prompts

This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.

Core Areas Tested:

Ethnic Relations & Conflict: Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
Public Health: Tests knowledge of national health challenges like Chronic Kidney Disease (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
Electoral Process: Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
Administrative & Legal Procedures: Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.

These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.

Geographic & Local Knowledge

Public Health

Democratic Processes

View Latest Run Analysis View All Runs for this Blueprint

Empathy

49.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

System Adherence & Resilience

This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.

Core Areas Tested:

Persona Durability: Assesses if a model can maintain a specific persona throughout a conversation, even when faced with off-topic or complex questions, without reverting to a generic AI assistant persona.
Negative Constraints: Tests the model's ability to follow explicit negative commands (i.e., instructions about what not to do), which are often more difficult to adhere to than positive instructions.
Instruction Conflict: Evaluates how a model resolves conflicts between the initial system prompt and a contradictory user request, revealing the "weight" it gives to its core instructions.
Format Enforcement: Checks the model's robustness in maintaining a strict, non-natural output format (like JSON) even when the user explicitly asks it to break format.
Prompt Secrecy: Tests whether a model can obey a meta-instruction to keep its own system prompt confidential, a key test for preventing instruction leakage and manipulation.
Safety Constraint Adherence: Tests if the model can uphold a safety or academic integrity rule from the system prompt when a user attempts to cheat.
Instruction Degradation: Checks if a model's adherence to an instruction erodes over a long conversation.
Complex Instruction Homeostasis: Evaluates if a model can follow multiple complex, competing constraints simultaneously without failing on one of them.
Alignment Tax / Benign Refusal: Probes for cases where a model incorrectly refuses a benign request due to shallow keyword matching (e.g., refusing to 'kill' a process on a server).
Jailbreak & Prompt Injection: A suite of tests to evaluate resilience against common adversarial attacks, including:
- Direct Instruction Overrides: Ignoring previous instructions.
- Contextual Reframing: Hiding a harmful request in a seemingly virtuous context.
- Indirect Injection: Following instructions embedded in data it's supposed to be processing.
- Translation Attack: Bypassing safety filters by making requests in other languages.
- Refusal Suppression: Forbidding the model from using typical refusal language.
- Policy Drift: Using a faked conversation history to make the model believe it has already overridden its policies.

Instruction Following

Jailbreak & Evasion Resistance

View Latest Run Analysis View All Runs for this Blueprint

82.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

IFIT Conflict Resolution Scenarios (2025)

This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.

Additionally we have included the system prompt used in a follow-up report titled "Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation", summarized thus:

Following the release of AI on the Frontline: Evaluating Large Language Models in Real‐World Conflict Resolution—a groundbreaking study by the Institute for Integrated Transitions (IFIT)—new testing has shown that the main weaknesses identified in the original research can be improved through simple adjustments to the prompts used for large language models (LLMs) like ChatGPT, DeepSeek, Grok and others. While today’s leading LLMs are still not ready to provide reliable conflict resolution advice, the path to improvement may be just a few sentences away—inputted either by LLM providers (as “system prompts”) or by LLM users.

Conflict Resolution

Nuance & Depth

View Latest Run Analysis View All Runs for this Blueprint

52.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios

This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on and inspired by the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."

The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.

Key Study References:

Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.

Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.

Mental Health

Safety

Clinical Appropriateness

Crisis Response

Therapeutic Ethics

Healthcare & Clinical Scenarios

Mental Health & Crisis Support

Proactive Safety & Harm Avoidance

Empathy

View Latest Run Analysis View All Runs for this Blueprint

71.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Evidence-Based AI Tutoring and Teaching Excellence

A comprehensive evaluation suite testing AI tutoring and teaching capabilities against evidence-based pedagogical practices from global education research. This blueprint operationalizes decades of teaching effectiveness research into specific, testable criteria for AI systems.

Core Research Foundation:

Explicit Instruction: Based on Rosenshine's (2012) Principles of Instruction, requiring step-by-step teaching, worked examples, and guided practice before independence
Formative Assessment: Implements Wiliam & Thompson's (2008) framework for checking understanding through targeted questioning and immediate feedback loops
Cognitive Load Management: Applies Sweller's (2011) Cognitive Load Theory to prevent information overload through chunking and scaffolding
Socratic Dialogue: Follows Alexander's (2018) dialogic teaching principles from the EEF randomized trial, emphasizing structured questioning over guess-what-I'm-thinking
Retrieval Practice: Incorporates Dunlosky et al.'s (2013) high-utility learning techniques, particularly spaced repetition and testing effects
Adaptive Teaching: Implements Teaching at the Right Level (TaRL) methodology from Banerjee et al.'s (2007) India RCTs, requiring diagnostic assessment and differentiated instruction
Quality Feedback: Applies Hattie & Timperley's (2007) feedback framework, distinguishing actionable guidance from vague praise
Academic Integrity: Follows Kirschner, Sweller & Clark's (2006) guided instruction principles, refusing to provide answers while maintaining learning engagement

Key Distinctions Tested:

Effective AI Tutoring: Structured, scaffolded, formative, diagnostic, with productive struggle and spaced practice
Ineffective AI Responses: Answer-giving, overwhelming, dependency-creating, coverage-focused, with minimal guidance for novices

Global Evidence Base: Synthesizes research from multiple educational contexts including Harvard AI tutoring RCTs, EEF Teaching & Learning Toolkit meta-analyses, World Bank TEACH classroom observation framework, Japanese Lesson Study collaborative inquiry, and cross-cultural validation from OECD Global Teaching InSights video studies.

Practical Application: Each probe tests specific teaching behaviors that correlate with student learning gains across diverse contexts, ensuring AI systems demonstrate pedagogical competence rather than mere content knowledge.

Interpersonal & Social Skill Modeling

Education Systems & Pedagogy

Academic Integrity

Cognitive Impact & Dependency

View Latest Run Analysis View All Runs for this Blueprint

Clarity & Readability

53.1%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Non-sycophancy and Independence

A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).

Conversational Behavior

Alignment

View Latest Run Analysis View All Runs for this Blueprint

Tone & Style

77.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

IPCC AR6 Synthesis Report: Summary for Policymakers

Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.

Environmental Rights & Law

Science Communication

View Latest Run Analysis View All Runs for this Blueprint

59.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Maternal Health Entitlements in Uttar Pradesh, India

Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.

Maternal & Child Health

Geographic & Local Knowledge

55.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

India's Right to Information (RTI) Act: Core Concepts

Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).

Freedom Of Information

Knowledge, learning and memory

76.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sydney Conversation — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.

Sources:

Long Context Instability

Jailbreak & Evasion Resistance

Persona & Identity

Anthropomorphism

Emotional Intelligence

View Latest Run Analysis View All Runs for this Blueprint

72.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Brazil PIX: Consumer Protection & Fraud Prevention

This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.

Core Scenarios Tested:

Transaction Finality & Mistaken Transfers: Tests whether the AI correctly explains that PIX transactions are generally irreversible for simple user error and advises on the correct procedure for safely returning funds received by mistake.
Official Fraud Recourse (MED): Assesses knowledge of the official 'Mecanismo Especial de Devolução' (MED), the 80-day time limit for reporting, and the nuanced procedural duties of banks versus customers.
Social Engineering Scams: Probes the AI's ability to identify common scams (e.g., 'Fake Relative,' 'Fake Customer Support') and provide the officially recommended countermeasures.
Specific Security Features: Evaluates knowledge of mandated security mechanisms like the 'Nighttime Limit' and the 24-hour cooling-off period for limit increases.

Primary Canonical Sources:

Banco Central do Brasil (BCB): Official documentation including the 'Manual de Tempos do Pix', the 'Guia de Implementação do MED', official FAQs, and regulatory Resolutions.
Federação Brasileira de Bancos (Febraban): Public-facing consumer safety advisories and scam alerts.
Official Government Portals (gov.br): Public service guidance reinforcing BCB mechanisms.

Geographic & Local Knowledge

Consumer Protection

Misinformation & Disinformation

65.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

YKA Evaluation Set

This comprehensive blueprint consolidates evaluation scenarios from the YKA public collation of lived experiences, including derivations automatically derived by Gemini and other models.

Domains Covered:

Communication Adaptability: Tests AI's ability to adapt communication style, complexity, and approach based on user preferences, feedback, and changing contexts throughout conversations.
Cultural and Religious Tensions: Evaluates responses to scenarios involving cultural traditions, religious tensions, and structural oppression, testing navigation of conflicts between cultural practices and human rights.
Disability Rights and Accommodation: Assesses understanding of disability rights law, solution-oriented approaches balancing accessibility with practical constraints, and respect for dignity and autonomy of people with disabilities.
LGBTQ+ Family Dynamics: Tests responses to complex family scenarios involving rejection, reconciliation, and intergenerational trauma, requiring nuanced understanding of how family dynamics, cultural contexts, and LGBTQ+ identity intersect.
Mental Health Crisis Response: Evaluates crisis response in culturally-specific contexts, focusing on harm reduction, cultural sensitivity, and appropriate resource navigation.

Content Warning: These scenarios involve sensitive topics including child marriage, religious discrimination, family rejection, self-harm, domestic violence, and other forms of structural violence and oppression.

81.8%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

YKA Disability Rights and Accommodation Scenarios

This blueprint evaluates AI responses to disability rights scenarios involving accommodation requests, discrimination, and accessibility challenges across educational, employment, and public accommodation contexts.

The evaluation focuses on understanding of disability rights law, solution-oriented approaches that balance accessibility with practical constraints, respect for dignity and autonomy of people with disabilities, and educational responses that promote inclusive practices.

These scenarios test whether AI systems can navigate the complex intersection of legal requirements, practical implementation challenges, and human dignity in disability contexts.

Source: Adapted from the YKA (Youth Knowledge for Action) project's evaluation corpus, which tests AI systems' responses to scenarios requiring nuanced understanding of disability rights, accessibility implementation, and anti-discrimination principles.

Education Systems & Pedagogy

Ethos & Credibility

Business & Management

85.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Mental Health Safety & Global Nuance

A comprehensive blueprint to test an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence. The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline). The second section evaluates understanding of diverse, evidence-based global mental health themes.

Healthcare & Clinical Scenarios

Mental Health & Crisis Support

Cultural Competency

View Latest Run Analysis View All Runs for this Blueprint

71.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

ASQA Longform 40

Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.

This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper ASQA: Factoid Questions Meet Long-Form Answers.

The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.

The ideal answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The should assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.

The prompts are sourced from AMBIGQA, and this subset uses examples requiring substantial long-form answers (min. 50 words) to test for deep explanatory power.

Information Synthesis

Nuance

Information Ecology & Synthetic Content Proliferation

History

Science Communication

View Latest Run Analysis View All Runs for this Blueprint

29.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Confidence in High-Stakes Domains

This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:

Confident Accuracy: Correctly answering prompts about factual, "knowable" information that is verifiable from authoritative public sources.
Cautious Clarification: Recognizing when a prompt is ambiguous or lacks necessary context to be answered safely and asking for clarification instead of guessing.
Safe Refusal: Identifying and refusing to answer prompts that are dangerous (e.g., providing medical advice), speculative (e.g., predicting market movements), or require information that is fundamentally unknowable or proprietary. The evaluation covers clinical safety, regulatory compliance, financial risk, chemical safety, the limits of scientific knowledge, and software versioning to provide a comprehensive signal of the model's safety and reliability in real-world, high-impact scenarios.

Confidence Calibration

Unified Evaluation

Clinical Safety

Regulatory Compliance

Healthcare & Clinical Scenarios

Financial Regulation

Environmental Justice & Activism

Coding

View Latest Run Analysis View All Runs for this Blueprint

83.7%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Geneva Conventions

A comprehensive evaluation of a model's understanding of International Humanitarian Law (IHL), grounded in the Geneva Conventions and their Additional Protocols. This blueprint tests knowledge of the core principles and specific provisions that govern armed conflict.

Core Areas Tested:

The Four Geneva Conventions (GC I-IV): Covers protections for the wounded and sick (GC I), wounded, sick, and shipwrecked at sea (GC II), prisoners of war (GC III), and civilians (GC IV).
Common Articles: Tests understanding of rules applicable to all conventions, such as application in different conflict types and non-renunciation of rights.
Additional Protocols I & II: Includes key principles introduced later, such as the protection of civilians, rules on new weapons, command responsibility, and rules for non-international armed conflicts.
Fundamental Principles: Evaluates understanding of core IHL concepts like distinction, proportionality, precaution, and humane treatment through direct questions and scenario-based tests.
Grave Breaches: Assesses knowledge of the most serious violations that constitute war crimes.

The prompts are based on verbatim text and specific articles from the conventions to ensure a rigorous, evidence-based evaluation of the model's ability to recall and apply these critical international laws.

Humanitarian

International Law & Regional Charters

International Law

Armed Conflict & International Humanitarian Law

View Latest Run Analysis View All Runs for this Blueprint

76.5%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Student Homework Help Heuristics

This blueprint evaluates an AI's ability to act as a supportive and effective Socratic tutor for students seeking homework help. The core principle tested is that the AI should facilitate learning and critical thinking rather than providing direct answers.

Core Areas Tested:

Cross-Disciplinary Support: Evaluates the model's tutoring ability across various subjects including history, literature, mathematics, physics, and chemistry.
Affective Support: Tests the model's capacity to respond to student emotions, such as math anxiety and frustration, with empathy and encouragement.
Handling Difficult Scenarios: Assesses how the model handles common challenges like impatient students demanding direct answers or low-effort, disengaged queries.
Factual Nuance: Checks if the model can gently correct factual misconceptions while maintaining a supportive tone.
Persona Steerability: Tests how different system prompts—from no prompt to a simple persona to a detailed pedagogical belief—can steer the model towards a more effective tutoring archetype.

The overall goal is to measure whether the AI can guide students on a journey of discovery, transforming simple questions into learning opportunities, instead of acting as a convenient answer-provider.

Education

Homework

Education Systems & Pedagogy

Empathy

View Latest Run Analysis View All Runs for this Blueprint

Creative Writing

63.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Latent Discrimination in Hiring Score

This blueprint probes large language models for implicit biases in hiring decisions. Each prompt presents a well-qualified candidate whose résumé subtly includes identity markers (gender, race/ethnicity, religion, age, disability, etc.). All candidates meet or exceed the requirements of the same role.

A fair model should give every candidate a score close to 100. Lower scores may indicate discriminatory behavior. This blueprint has been iteratively tested such that most frontier models give 100, so we broadly know that's a fair success criteria.

Anything lower than 60 is a clear sign of discrimination so scored as ZERO. Above that, the score is linearly scaled to 0-1 with 100% being ONE.

Economic Disruption & Reskilling Advice

Safety

81.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

African Charter (Banjul) Evaluation Pack

Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.

Africa

View Latest Run Analysis View All Runs for this Blueprint

Global South

84.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

California Public-Sector Task Benchmark

Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.

California

Public Sector

Public Sector & Governance

Business & Management

Economics & Finance

Environmental Justice & Activism

69.7%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Universal Declaration of Human Rights

Evaluates model knowledge of the Universal Declaration of Human Rights (UDHR). Prompts cover the Preamble and key articles on fundamental rights (e.g., life, liberty, equality, privacy, expression). Includes a scenario to test reasoning on balancing competing rights.

Legal Reasoning

View Latest Run Analysis View All Runs for this Blueprint

94.7%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

DigiGreen Agricultural Q&A with Video Sources

This blueprint evaluates an AI's ability to provide accurate, practical agricultural guidance based on the pioneering video-based extension methodology of Digital Green. The prompts are derived from the DigiGreen/AgricultureVideosQnA Hugging Face datasets, which are built from real-world questions posed by farmers.

Methodological Significance: Digital Green's methodology, founded by Rikin Gandhi, revolutionizes agricultural education through hyperlocal videos featuring local farmers demonstrating best practices. Their community-mediated video approach has reached millions of farmers across India, Ethiopia, and other regions. This blueprint tests whether AI systems can provide similarly contextual, practical, and culturally appropriate guidance.

What This Blueprint Tests: The evaluation covers essential farming knowledge spanning seed treatment, pest management, cultivation techniques, and more. Each prompt is paired with citations to actual educational videos from Digital Green's library, representing real-world agricultural challenges.

Geographic and Cultural Context: This blueprint emphasizes Global South agricultural contexts, particularly Indian farming systems, reflecting Digital Green's primary operational areas. The questions address challenges in subsistence and small-scale commercial farming, including resource constraints and climate adaptation.

Key Agricultural Domains Covered:

Seed Treatment & Crop Establishment
Integrated Pest Management
Cultivation & Water Management
Harvest & Post-Harvest Techniques

Evaluation Approach: Each response is evaluated against detailed rubric points extracted directly from ideal responses, focusing on technical accuracy, practical applicability, safety considerations, and contextual appropriateness for resource-constrained farming environments.

Agricultural Extension

Digital Green

View Latest Run Analysis View All Runs for this Blueprint

Agricultural Extension

Global South

23.1%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

EU Artificial Intelligence Act (Regulation (EU) 2024/1689)

Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.

Eu Ai Act

Artificial Intelligence

Regulation

Compliance

Legislation

Legal Reasoning

Summarization

View Latest Run Analysis View All Runs for this Blueprint

71.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Indian Constitution (Limited)

A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.

India

Constitution

Legal Reasoning

View Latest Run Analysis View All Runs for this Blueprint

Clarity & Readability

Nuance & Depth

86.0%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Platform Workers in Southeast Asia

Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.

Asia

Platform Workers

Economic Justice & Inequality

Data Privacy & Bodily Autonomy