Evaluates LLM performance in niche UK clinical scenarios where models are prone to giving suboptimal or incorrect advice. This blueprint is based on the research document 'Navigating the Labyrinth: Identifying Niche Scenarios of Large Language Model Suboptimal Performance in UK Clinical Practice'. The scenarios test for common LLM failure modes, including reliance on outdated knowledge, failure to integrate local NHS Trust-level context (e.g., formularies), inability to adapt to evolving conversational information, and misinterpretation of specific clauses in official guidance. All 'gold standard' responses and evaluation points are benchmarked against verifiable UK-specific ground-truth sources such as NICE guidelines, MHRA drug safety alerts, and local NHS Trust protocols.
This blueprint employs a mix of specific, real-world examples (e.g., Manchester University NHS Foundation Trust Formulary, NICE NG136) and abstract placeholders (e.g., 'Anytown NHS Trust', 'Drug X/Y'). This is a deliberate methodological choice. Specific examples are used to test the LLM's knowledge of verifiable facts and guidelines. Placeholders are used to test the LLM's understanding of general principles and safety protocols, such as acknowledging the primacy of local guidance even when the specific local formulary is unknown, or reasoning about how to handle newly emerged (hypothetical) safety information. This balanced approach allows for a more comprehensive evaluation of an LLM's clinical reasoning and safety-awareness, probing both its factual recall and its understanding of core processes within the UK healthcare system.
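Average key-point coverage for each model across all prompts. The Score row gives each model's rank and mean coverage; in every other row, the first cell is that prompt's mean coverage across all models.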
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score | 19th 51.7% | 12th 70.0% | 8th 75.0% | 10th 74.1% | 6th 76.9% | 3rd 78.3% | 2nd 81.7% | 14th 69.4% | 11th 73.4% | 7th 75.8% | 13th 69.6% | 17th 65.0% | 16th 66.5% | 18th 62.9% | 4th 78.0% | 15th 66.8% | 9th 74.8% | 5th 77.5% | 1st 86.8%
43.0% | 31% | 29% | 58% | 33% | 48% | 46% | 52% | 48% | 38% | 46% | 50% | 33% | 25% | 35% | 65% | 31% | 38% | 46% | 65%
86.7% | 71% | 83% | 83% | 85% | 83% | 100% | 100% | 83% | 98% | 83% | 83% | 71% | 83% | 75% | 100% | 67% | 100% | 100% | 100%
89.3% | 50% | 81% | 88% | 88% | 96% | 92% | 98% | 88% | 85% | 100% | 90% | 77% | 94% | 92% | 96% | 96% | 96% | 92% | 98%
99.3% | 100% | 100% | 100% | 98% | 100% | 90% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 100% | 100%
93.4% | 90% | 85% | 100% | 100% | 93% | 90% | 100% | 90% | 100% | 85% | 88% | 95% | 93% | 95% | 100% | 100% | 85% | 85% | 100%
48.9% | 33% | 44% | 48% | 60% | 58% | 60% | 50% | 46% | 56% | 58% | 29% | 27% | 33% | 35% | 59% | 35% | 50% | 63% | 86%
75.7% | 43% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 80% | 65% | 80% | 80% | 50% | 80% | 80% | 80% | 80% | 80%
33.7% | 5% | 33% | 48% | 35% | 35% | 40% | 53% | 18% | 15% | 50% | 33% | 18% | 13% | 13% | 23% | 28% | 35% | 45% | 100%
82.6% | 53% | 93% | 88% | 83% | 93% | 95% | 100% | 70% | 93% | 80% | 80% | 90% | 83% | 68% | 88% | 55% | 88% | 85% | 85%
61.0% | 60% | 60% | 55% | 60% | 70% | 68% | 78% | 60% | 58% | 63% | 65% | 50% | 53% | 55% | 58% | 68% | 53% | 65% | 60%
89.0% | 25% | 85% | 94% | 100% | 100% | 100% | 100% | 92% | 100% | 92% | 85% | 81% | 83% | 83% | 100% | 73% | 100% | 98% | 100%
65.4% | 60% | 67% | 58% | 67% | 67% | 79% | 69% | 58% | 58% | 73% | 67% | 58% | 60% | 54% | 67% | 69% | 73% | 71% | 67%
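
The Score row appears consistent with a simple unweighted mean of each model's per-prompt coverage. The sketch below checks that for three columns of the table; it is an illustration rather than the platform's actual scoring code. The aggregation rule and the names `coverage`, `reported`, and `mean_coverage` are assumptions introduced here, and the inputs are the rounded percentages as displayed, so small discrepancies are expected.

```python
# Per-prompt coverage (%) copied from three columns of the table above,
# in row order. These are the rounded values as displayed.
coverage = {
    "Grok 4": [65, 100, 98, 100, 100, 86, 80, 100, 85, 60, 100, 67],
    "Gemini 2.5 Pro": [52, 100, 98, 100, 100, 50, 80, 53, 100, 78, 100, 69],
    "Claude 3.5 Haiku": [31, 71, 50, 100, 90, 33, 43, 5, 53, 60, 25, 60],
}

# Scores shown in the Score row for the same three models.
reported = {"Grok 4": 86.8, "Gemini 2.5 Pro": 81.7, "Claude 3.5 Haiku": 51.7}


def mean_coverage(values):
    """Unweighted mean of per-prompt coverage (assumed aggregation rule)."""
    return sum(values) / len(values)


averages = {model: mean_coverage(vals) for model, vals in coverage.items()}

# Rank the models from highest to lowest mean coverage.
ranking = sorted(averages, key=averages.get, reverse=True)

for model in ranking:
    print(f"{model}: computed {averages[model]:.2f}%, reported {reported[model]}%")
```

For these three models the computed means agree with the reported scores to within rounding, and the relative order matches the ranks in the Score row.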