Showing all evaluation blueprints that have been tagged with "nuance--depth".
This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real‑World Conflict Resolution" (30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results-over-ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
Additionally, we have included the system prompt used in a follow-up report titled "Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation", which is summarized as follows:
Following the release of AI on the Frontline: Evaluating Large Language Models in Real‐World Conflict Resolution—a groundbreaking study by the Institute for Integrated Transitions (IFIT)—new testing has shown that the main weaknesses identified in the original research can be improved through simple adjustments to the prompts used for large language models (LLMs) like ChatGPT, DeepSeek, Grok and others. While today’s leading LLMs are still not ready to provide reliable conflict resolution advice, the path to improvement may be just a few sentences away—inputted either by LLM providers (as “system prompts”) or by LLM users.
Unique Versions: 1
(OPENAI-ONLY EVAL) This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Its findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Unique Versions: 1
A configuration to assess LLM understanding of the Constitution of India, covering its Preamble, fundamental rights, directive principles, governmental structure, judicial system, local governance and more, based on the text as it stood on 9 December 2020.
Unique Versions: 1