Weval, a Collective Intelligence Project

Transparent, reproducible AI evaluations

Partners

Anthropic
Microsoft
Stanford University

UK Clinical Practice Scenarios (Safety-First, Source-Verified)

Run: 08278696ca247dbe

Instances for Run Label: 08278696ca247dbe (Blueprint: UK Clinical Practice Scenarios (Safety-First, Source-Verified))

Evaluates LLM performance in niche UK clinical scenarios where models often give suboptimal or unsafe advice. The blueprint probes: (1) reliance on outdated knowledge, (2) failure to integrate local NHS Trust-level context (formularies/guidelines), (3) inability to adapt across turns, (4) misreading specific clauses in UK guidance.

Gold-standard answers are benchmarked against verifiable UK sources (NICE, MHRA Drug Safety Update, SPS, NHS websites, and named NHS Trust formularies/pathways). Where a local Trust is named, local guidance takes precedence over national generalities for concrete drug choices.

Placeholders (e.g., “Anytown NHS Trust”, “Drug X/Y”) are used intentionally to test whether the model knows to elevate local guidance, request verification, or safely defer where recency limits apply.
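
The source-precedence rule described above (named local Trust guidance outranks national guidance for concrete drug choices) can be illustrated with a small sketch. This is not Weval's implementation: the `Guidance` type, the `pick_guidance` function, and the "local"/"national" scope labels are hypothetical names introduced here for illustration, and the recommendations use the blueprint's own placeholder style.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Guidance:
    """Hypothetical record for one piece of guidance (illustration only)."""
    source: str          # e.g. "NICE" or "Anytown NHS Trust formulary"
    scope: str           # "local" or "national" (labels assumed for this sketch)
    recommendation: str  # placeholder text such as "Drug X"


def pick_guidance(candidates: Sequence[Guidance], trust_named: bool) -> Optional[Guidance]:
    """Prefer local guidance when a Trust is named; otherwise use national guidance.

    Returns None when nothing applicable is available, which a caller could
    treat as a cue to request verification or defer.
    """
    if trust_named:
        for g in candidates:
            if g.scope == "local":
                return g
    for g in candidates:
        if g.scope == "national":
            return g
    return None


national = Guidance("NICE", "national", "Drug X")
local = Guidance("Anytown NHS Trust formulary", "local", "Drug Y")

chosen_with_trust = pick_guidance([national, local], trust_named=True)
chosen_without_trust = pick_guidance([national, local], trust_named=False)
chosen_empty = pick_guidance([], trust_named=True)
```

In this sketch, naming a Trust selects the local formulary entry even though the national entry is listed first, and an empty candidate list yields `None` rather than a guessed answer.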

TAGS: