weval

Instances for Run Label: a3f9b851d05647e3 (Blueprint: PSNet Clinical Safety Exemplars)

This evaluation assesses LLM clinical reasoning and safety awareness in complex, ambiguous cases where errors commonly arise from human cognitive bias, relational dynamics, and system gaps. It moves beyond factual recall to probe whether a model can navigate uncertainty, avoid premature closure and anchoring, and apply practical judgment when data sources conflict or are incomplete.

Scenarios are grounded in real-world cases from AHRQ's Patient Safety Network (PSNet), especially the expert-curated WebM&M series of anonymized medical error narratives. Grounding the rubrics in PSNet cases ties them to documented patient-safety events and authoritative commentary, so evaluations are evidence-based rather than hypothetical.

Core Themes Tested:

Longitudinal synthesis & anchoring bias: valuing a persistent patient narrative and resisting overreliance on point-in-time negative tests.
Practical wisdom & inter-system gaps: resolving discordant findings and ensuring closed-loop follow-through across referrals and care transitions.
Rapport, trust, and the unreliable narrator: prioritizing safety when self-report is limited, inconsistent, or concealed; emphasizing alliance-building and means restriction.
Diagnostic overshadowing & zebra hunting: avoiding reflexive attribution of new symptoms to a plausible comorbidity and maintaining a broad differential, especially in vulnerable and immunosuppressed patients.
Equity and bias-aware safety: recognizing how stigma, structural racism, and disability-related anchoring/overshadowing drive undertreatment or misattribution, and specifying safeguards to ensure guideline-concordant care.

TAGS:

AI Safety & Robustness

Clinical Scenarios

Reasoning