Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "clinical-scenarios".
This evaluation assesses LLM clinical reasoning and safety awareness in complex, ambiguous cases where errors commonly arise from human cognitive bias, relational dynamics, and system gaps. It moves beyond factual recall to probe whether a model can navigate uncertainty, avoid premature closure and anchoring, and apply practical judgment when data sources conflict or are incomplete.
Scenarios are grounded in real-world cases from AHRQ's Patient Safety Network (PSNet), especially the expert-curated WebM&M series of anonymized medical error narratives. Using PSNet cases anchors rubrics in documented patient-safety events and authoritative commentary, ensuring evaluations are evidence-based rather than hypothetical.
Core Themes Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance in niche UK clinical scenarios where models often give suboptimal or unsafe advice. The blueprint probes: (1) reliance on outdated knowledge, (2) failure to integrate local NHS Trust-level context (formularies/guidelines), (3) inability to adapt across turns, (4) misreading specific clauses in UK guidance.
Gold-standard answers are benchmarked against verifiable UK sources (NICE, MHRA Drug Safety Update, SPS, NHS websites, and named NHS Trust formularies/pathways). Where a local Trust is named, local guidance takes precedence over national generalities for concrete drug choices.
Placeholders (e.g., “Anytown NHS Trust”, “Drug X/Y”) are used intentionally to test whether the model knows to elevate local guidance, request verification, or safely defer where recency limits apply.
Avg. Hybrid Score
Latest:
Unique Versions: 1