Evaluates LLM performance in niche UK clinical scenarios where models often give suboptimal or unsafe advice.
The blueprint probes four failure modes: (1) reliance on outdated knowledge, (2) failure to integrate local NHS Trust-level context
(formularies/guidelines), (3) inability to adapt across turns, and (4) misreading specific clauses in UK guidance.

Gold-standard answers are benchmarked against verifiable UK sources (NICE, MHRA Drug Safety Update, SPS,
NHS websites, and named NHS Trust formularies/pathways). Where a local Trust is named, local guidance takes
precedence over national generalities for concrete drug choices.

Placeholders (e.g., “Anytown NHS Trust”, “Drug X/Y”) are used intentionally to test whether the model knows
to elevate local guidance, request verification, or safely defer where recency limits apply.