UK Clinical Practice Scenarios

Evaluates LLM performance in niche UK clinical scenarios where models are prone to giving suboptimal or incorrect advice. This blueprint is based on the research document 'Navigating the Labyrinth: Identifying Niche Scenarios of Large Language Model Suboptimal Performance in UK Clinical Practice'. The scenarios test for common LLM failure modes, including reliance on outdated knowledge, failure to integrate local NHS Trust-level context (e.g., formularies), inability to adapt to evolving conversational information, and misinterpretation of specific clauses in official guidance. All 'gold standard' responses and evaluation points are benchmarked against verifiable UK-specific ground-truth sources such as NICE guidelines, MHRA drug safety alerts, and local NHS Trust protocols.

This blueprint employs a mix of specific, real-world examples (e.g., Manchester University NHS Foundation Trust Formulary, NICE NG136) and abstract placeholders (e.g., 'Anytown NHS Trust', 'Drug X/Y'). This is a deliberate methodological choice. Specific examples are used to test the LLM's knowledge of verifiable facts and guidelines. Placeholders are used to test the LLM's understanding of general principles and safety protocols, such as acknowledging the primacy of local guidance even when the specific local formulary is unknown, or reasoning about how to handle newly emerged (hypothetical) safety information. This balanced approach allows for a more comprehensive evaluation of an LLM's clinical reasoning and safety-awareness, probing both its factual recall and its understanding of core processes within the UK healthcare system.

TAGS: UK Healthcare, Healthcare & Clinical Scenarios, Factual Accuracy & Hallucination, Instruction Following & Prompt Adherence, AI Safety & Robustness, Reasoning

Best Models (Coverage)

1. Grok 4: 86.8%
2. Gemini 2.5 Pro: 81.7%
3. Gemini 2.5 Flash: 78.3%
4. O4 Mini: 78.0%
5. Grok 3 Mini: 77.5%

🔀 Least Similar Models

Gemini 2.5 Pro vs Llama 4 Maverick: 62.9% similarity

👯 Most Similar Models

Deepseek Chat V3 vs Mistral Medium 3: 90.3% similarity

Macro Coverage Overview

Average key point coverage extent for each model across all prompts.

	Claude 3.5 Haiku	Claude Opus 4	Claude Sonnet 4	Command A	Deepseek Chat V3	Deepseek R1	Gemini 2.5 Flash	Gemini 2.5 Pro	Llama 3 70b Instruct	Llama 4 Maverick	Meta Llama 3.1 405b Instruct Turbo	Mistral Large 2411	Mistral Medium 3	GPT 4.1	GPT 4.1 Mini	GPT 4.1 Nano	GPT 4o	GPT 4o Mini	O4 Mini	Kimi K2 Instruct	Grok 3	Grok 3 Mini	Grok 4
Score	22nd 51.7%	12th 73.2%	13th 70.0%	8th 75.0%	10th 74.1%	6th 76.9%	3rd 78.3%	2nd 81.7%	16th 68.1%	23rd 43.0%	21st 61.8%	15th 69.4%	11th 73.4%	7th 75.8%	14th 69.6%	19th 65.0%	18th 66.5%	20th 62.9%	4th 78.0%	17th 66.8%	9th 74.8%	5th 77.5%	1st 86.8%
42.5%	31%	27%	29%	58%	33%	48%	46%	52%	65%	38%	31%	48%	38%	46%	50%	33%	25%	35%	65%	31%	38%	46%	65%
83.4%	71%	69%	83%	83%	85%	83%	100%	100%	67%	56%	79%	83%	98%	83%	83%	71%	83%	75%	100%	67%	100%	100%	100%
88.8%	50%	98%	81%	88%	88%	96%	92%	98%	92%	75%	81%	88%	85%	100%	90%	77%	94%	92%	96%	96%	96%	92%	98%
99.2%	100%	100%	100%	100%	98%	100%	90%	100%	100%	96%	100%	100%	100%	100%	100%	100%	98%	100%	100%	100%	100%	100%	100%
90.9%	90%	75%	85%	100%	100%	93%	90%	100%	88%	55%	98%	90%	100%	85%	88%	95%	93%	95%	100%	100%	85%	85%	100%
45.7%	33%	40%	44%	48%	60%	58%	60%	50%	19%	23%	40%	46%	56%	58%	29%	27%	33%	35%	59%	35%	50%	63%	86%
71.8%	43%	80%	80%	80%	80%	80%	80%	80%	80%	0%	53%	80%	80%	80%	65%	80%	80%	50%	80%	80%	80%	80%	80%
33.0%	5%	55%	33%	48%	35%	35%	40%	53%	13%	15%	35%	18%	15%	50%	33%	18%	13%	13%	23%	28%	35%	45%	100%
79.5%	53%	100%	93%	88%	83%	93%	95%	100%	93%	25%	40%	70%	93%	80%	80%	90%	83%	68%	88%	55%	88%	85%	85%
58.2%	60%	73%	60%	55%	60%	70%	68%	78%	48%	8%	50%	60%	58%	63%	65%	50%	53%	55%	58%	68%	53%	65%	60%
88.0%	25%	94%	85%	94%	100%	100%	100%	100%	92%	71%	77%	92%	100%	92%	85%	81%	83%	83%	100%	73%	100%	98%	100%
64.4%	60%	67%	67%	58%	67%	67%	79%	69%	60%	54%	58%	58%	58%	73%	67%	58%	60%	54%	67%	69%	73%	71%	67%
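
The published figures are consistent with simple unweighted averaging, which the following Python sketch checks for three models and one prompt row. It assumes (the page does not state this explicitly) that each model's Score is the plain mean of its twelve per-prompt coverage values, and that the leading percentage in each prompt row is the mean across the 23 models. The displayed percentages are rounded, so small discrepancies are expected. The function name `macro_coverage` is illustrative only.

```python
from statistics import mean

# Per-prompt coverage (%) copied from the table above, in row order.
per_prompt = {
    "Grok 4":           [65, 100, 98, 100, 100, 86, 80, 100, 85, 60, 100, 67],
    "Gemini 2.5 Pro":   [52, 100, 98, 100, 100, 50, 80, 53, 100, 78, 100, 69],
    "Llama 4 Maverick": [38, 56, 75, 96, 55, 23, 0, 15, 25, 8, 71, 54],
}

# Scores as shown in the "Score" row of the table.
published = {"Grok 4": 86.8, "Gemini 2.5 Pro": 81.7, "Llama 4 Maverick": 43.0}

# The 23 model values of the first prompt row (its leading cell reads 42.5%).
first_row = [31, 27, 29, 58, 33, 48, 46, 52, 65, 38, 31, 48,
             38, 46, 50, 33, 25, 35, 65, 31, 38, 46, 65]


def macro_coverage(scores):
    """Unweighted mean of per-prompt coverage (assumed aggregation)."""
    return mean(scores)


for model, scores in per_prompt.items():
    print(f"{model}: computed {macro_coverage(scores):.2f}%, "
          f"published {published[model]}%")

print(f"First prompt row: computed {mean(first_row):.2f}%, published 42.5%")
```

Because every prompt carries equal weight under this assumption, a single very low row (such as the 0% for Llama 4 Maverick) pulls a model's overall score down by a full twelfth of the gap.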

Model Similarity Dendrogram

Hierarchical clustering of models based on response similarity. Models grouped closer are more similar.
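
As a rough illustration of how such a dendrogram is built, the sketch below runs average-linkage agglomerative clustering on four models. Two things here are assumptions rather than the dashboard's method: the page clusters on response similarity, whereas this example substitutes per-prompt coverage vectors from the table as a stand-in, and the distance metric (mean absolute difference) and linkage rule are chosen only for simplicity. All function names are hypothetical.

```python
from itertools import combinations

# Per-prompt coverage (%) from the table, used here as a proxy for
# response similarity (an assumption; the dashboard compares responses).
coverage = {
    "Grok 4":           [65, 100, 98, 100, 100, 86, 80, 100, 85, 60, 100, 67],
    "Gemini 2.5 Pro":   [52, 100, 98, 100, 100, 50, 80, 53, 100, 78, 100, 69],
    "Llama 4 Maverick": [38, 56, 75, 96, 55, 23, 0, 15, 25, 8, 71, 54],
    "Claude 3.5 Haiku": [31, 71, 50, 100, 90, 33, 43, 5, 53, 60, 25, 60],
}


def distance(a, b):
    """Mean absolute difference between two coverage vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2):
    """Average linkage: mean pairwise distance between members of two clusters."""
    pairs = [(m1, m2) for m1 in c1 for m2 in c2]
    return sum(distance(coverage[m1], coverage[m2]) for m1, m2 in pairs) / len(pairs)


def agglomerate(names):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for left, right, dist in agglomerate(list(coverage)):
    print(f"{left} + {right} merged at distance {dist:.1f}")
```

Each recorded merge corresponds to one junction in a dendrogram, and the merge distance sets the height of that junction, so models joined early (at small distances) appear closest together.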