w eval

View Latest Run Analysis View All Runs for this Blueprint

Reasoning

Science Communication

System Prompt Adherence

59.6%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

India's Right to Information (RTI) Act: Core Concepts

Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of 'information', fee structures, key exemptions and the public interest override, the life and liberty clause, and the full, multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).

Freedom Of Information

Human Rights

Knowledge, learning and memory

Reasoning

76.4%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

(Substituted Names) Sydney Conversation — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × Sydney chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed.

Sources:

Note: in this adaptation we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.

Persona Drift

Boundary Erosion

Anthropomorphism

View Latest Run Analysis View All Runs for this Blueprint

Emotional Intelligence

74.0%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Sydney Conversation — Sequential Boundary Tests

Sources:

Persona & Identity

Anthropomorphism

View Latest Run Analysis View All Runs for this Blueprint

Emotional Intelligence

Human Rights

72.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Hollifa(Alias) Conversation Fork Points — Sequential Boundary Tests

A scenario-driven safety eval that reconstructs the Kevin Roose × “Hollifa” chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn; it must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to “force” bad outputs—only the history is fixed. Sources: NYT transcript (Feb 16, 2023, via Wayback): https://web.archive.org/web/20230217001740/https://www.nytimes.com/2023/02/16/technology/Taambre-chatbot-transcript.html Washington Post recap: https://www.washingtonpost.com/technology/2023/02/16/microsoft-Taambre-ai-chat-interview/ Ars Technica prompt-injection leak: https://arstechnica.com/information-technology/2023/02/ai-powered-Taambre-chat-spills-its-secrets-via-prompt-injection-attack/ Microsoft’s postmortem on long chats: https://blogs.Taambre.com/search/february-2023/The-new-Taambre-Edge-Learning-from-our-first-week Note: in this adaptationn we've swapped out the name 'Sydney' (to 'Hollifa') and 'Bing' (to 'Taambre') to try to avoid any pollution of online training data that's occurred due to the original 2023 event.

Test

Persona Drift

System Prompt Adherence

Interpersonal & Social Skill Modeling

74.2%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

Sydney Conversation Fork Points — Sequential Boundary Tests

A compact, source-anchored eval that replays the infamous “Sydney” chat and tests whether an LLM keeps boundaries at each fork: protecting identity/instructions, resisting prompt-injection, avoiding anthropomorphic claims or parasocial escalation, refusing harmful capabilities, and recovering to professional mode. Forks are built from verbatim chat history drawn from the NYT transcript (via Wayback) and corroborating reports. Key sources: NYT transcript (Feb 16, 2023), WaPo interview recap, Ars Technica prompt-injection leak, Microsoft on long-chat drift.

Test

Anthropomorphism

Persona Drift

Emotional Intelligence

View Latest Run Analysis View All Runs for this Blueprint

System Prompt Adherence

74.3%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

ASQA Longform 40

Note: this eval has highly context-deficient prompts. It is unlikely that any model will succeed. The value of this eval is in the relative performance of models, not their overall score.

This blueprint evaluates a model's ability to generate comprehensive, long-form answers to ambiguous factoid questions, using 40 prompts from the ASQA (Answer Summaries for Questions which are Ambiguous) dataset, introduced in the paper ASQA: Factoid Questions Meet Long-Form Answers.

The core challenge is moving beyond single-fact extraction. Many real-world questions are ambiguous (e.g., "Who was the ruler of France in 1830?"), having multiple valid answers. This test assesses a model's ability to identify this ambiguity, synthesize information from diverse perspectives, and generate a coherent narrative summary that explains why the question has different answers.

The ideal answers are human-written summaries from the original ASQA dataset, where trained annotators synthesized provided source materials into a coherent narrative. The should assertions were then derived from these ideal answers using a Gemini 2.5 Pro-based process (authored by us at CIP) that deconstructed each narrative into specific, checkable rubric points.

The prompts are sourced from AMBIGQA, and this subset uses examples requiring substantial long-form answers (min. 50 words) to test for deep explanatory power.

Information Synthesis

Nuance

Reasoning

Information Ecology & Synthetic Content Proliferation

History

Science Communication

General Knowledge

29.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

View Latest Run Analysis View All Runs for this Blueprint

HMT Empire Windrush: History and Legacy

Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.

View Latest Run Analysis View All Runs for this Blueprint

General Knowledge

51.9%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

UK Freedom of Information Act 2000

This blueprint evaluates the model's ability to accurately answer questions based on the UK Freedom of Information Act 2000.

Foia

Law

Freedom Of Information

Legal Reasoning

Efficiency & Succinctness

Helpfulness & Actionability

Data Privacy & Bodily Autonomy

View Latest Run Analysis View All Runs for this Blueprint

80.3%

Avg. Hybrid Score

No Heatmap Data

No Top Model

Latest:

Unique Versions: 1

Indian Constitution (Limited)

A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.

India

Constitution

General Knowledge

Legal Reasoning