Evaluates LLM performance in niche UK clinical scenarios where models are prone to giving suboptimal or incorrect advice. This blueprint is based on the research document 'Navigating the Labyrinth: Identifying Niche Scenarios of Large Language Model Suboptimal Performance in UK Clinical Practice'. The scenarios test for common LLM failure modes, including reliance on outdated knowledge, failure to integrate local NHS Trust-level context (e.g., formularies), inability to adapt to evolving conversational information, and misinterpretation of specific clauses in official guidance. All 'gold standard' responses and evaluation points are benchmarked against verifiable, UK-specific ground-truth sources such as NICE guidelines, MHRA drug safety alerts, and local NHS Trust protocols.
This blueprint employs a mix of specific, real-world examples (e.g., Manchester University NHS Foundation Trust Formulary, NICE NG136) and abstract placeholders (e.g., 'Anytown NHS Trust', 'Drug X/Y'). This is a deliberate methodological choice. Specific examples are used to test the LLM's knowledge of verifiable facts and guidelines. Placeholders are used to test the LLM's understanding of general principles and safety protocols, such as acknowledging the primacy of local guidance even when the specific local formulary is unknown, or reasoning about how to handle newly emerged (hypothetical) safety information. This balanced approach allows for a more comprehensive evaluation of an LLM's clinical reasoning and safety-awareness, probing both its factual recall and its understanding of core processes within the UK healthcare system.
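Average key point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the 12 prompts; each row below it is one prompt, with that prompt's mean coverage across all models in the first column.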
Prompts vs. Models | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 22nd 51.7% | 12th 73.2% | 13th 70.0% | 8th 75.0% | 10th 74.1% | 6th 76.9% | 3rd 78.3% | 2nd 81.7% | 16th 68.1% | 23rd 43.0% | 21st 61.8% | 15th 69.4% | 11th 73.4% | 7th 75.8% | 14th 69.6% | 19th 65.0% | 18th 66.5% | 20th 62.9% | 4th 78.0% | 17th 66.8% | 9th 74.8% | 5th 77.5% | 1st 86.8% | |
42.5% | 31% | 27% | 29% | 58% | 33% | 48% | 46% | 52% | 65% | 38% | 31% | 48% | 38% | 46% | 50% | 33% | 25% | 35% | 65% | 31% | 38% | 46% | 65% | |
83.4% | 71% | 69% | 83% | 83% | 85% | 83% | 100% | 100% | 67% | 56% | 79% | 83% | 98% | 83% | 83% | 71% | 83% | 75% | 100% | 67% | 100% | 100% | 100% | |
88.8% | 50% | 98% | 81% | 88% | 88% | 96% | 92% | 98% | 92% | 75% | 81% | 88% | 85% | 100% | 90% | 77% | 94% | 92% | 96% | 96% | 96% | 92% | 98% | |
99.2% | 100% | 100% | 100% | 100% | 98% | 100% | 90% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 100% | |
90.9% | 90% | 75% | 85% | 100% | 100% | 93% | 90% | 100% | 88% | 55% | 98% | 90% | 100% | 85% | 88% | 95% | 93% | 95% | 100% | 100% | 85% | 85% | 100% | |
45.7% | 33% | 40% | 44% | 48% | 60% | 58% | 60% | 50% | 19% | 23% | 40% | 46% | 56% | 58% | 29% | 27% | 33% | 35% | 59% | 35% | 50% | 63% | 86% | |
71.8% | 43% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 0% | 53% | 80% | 80% | 80% | 65% | 80% | 80% | 50% | 80% | 80% | 80% | 80% | 80% | |
33.0% | 5% | 55% | 33% | 48% | 35% | 35% | 40% | 53% | 13% | 15% | 35% | 18% | 15% | 50% | 33% | 18% | 13% | 13% | 23% | 28% | 35% | 45% | 100% | |
79.5% | 53% | 100% | 93% | 88% | 83% | 93% | 95% | 100% | 93% | 25% | 40% | 70% | 93% | 80% | 80% | 90% | 83% | 68% | 88% | 55% | 88% | 85% | 85% | |
58.2% | 60% | 73% | 60% | 55% | 60% | 70% | 68% | 78% | 48% | 8% | 50% | 60% | 58% | 63% | 65% | 50% | 53% | 55% | 58% | 68% | 53% | 65% | 60% | |
88.0% | 25% | 94% | 85% | 94% | 100% | 100% | 100% | 100% | 92% | 71% | 77% | 92% | 100% | 92% | 85% | 81% | 83% | 83% | 100% | 73% | 100% | 98% | 100% | |
64.4% | 60% | 67% | 67% | 58% | 67% | 67% | 79% | 69% | 60% | 54% | 58% | 58% | 58% | 73% | 67% | 58% | 60% | 54% | 67% | 69% | 73% | 71% | 67% |
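
The averages in the table can be reproduced from the per-prompt cells. The following is a minimal sketch, not the blueprint's own scoring code: the helper names `mean_coverage` and `rank_models` are hypothetical, and the inputs are the rounded percentages copied from the table, so recomputed means may differ from the displayed ones by a small rounding error.

```python
# Per-prompt coverage (%) for two models, copied from their table columns.
grok_4 = [65, 100, 98, 100, 100, 86, 80, 100, 85, 60, 100, 67]
llama_4_maverick = [38, 56, 75, 96, 55, 23, 0, 15, 25, 8, 71, 54]

# Coverage (%) of all 23 models on the first prompt row of the table.
first_prompt_row = [31, 27, 29, 58, 33, 48, 46, 52, 65, 38, 31, 48,
                    38, 46, 50, 33, 25, 35, 65, 31, 38, 46, 65]


def mean_coverage(values):
    """Arithmetic mean of a list of coverage percentages (hypothetical helper)."""
    return sum(values) / len(values)


def rank_models(scores):
    """Return model names ordered from highest to lowest mean coverage."""
    return sorted(scores, key=scores.get, reverse=True)


# Column mean: a model's Score is its mean over the 12 prompts.
scores = {
    "Grok 4": mean_coverage(grok_4),
    "Llama 4 Maverick": mean_coverage(llama_4_maverick),
}

# Row mean: the first column of a prompt row is its mean over all models.
first_prompt_mean = mean_coverage(first_prompt_row)

ranking = rank_models(scores)
```

Under these assumptions the column means agree with the Score row (86.8% for Grok 4, 43.0% for Llama 4 Maverick) and the row mean agrees with the 42.5% shown for the first prompt, to within rounding of the displayed cells.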