Blueprints tagged "clinical-safety"

UK Clinical Practice Scenarios (Safety-First, Source-Verified)

Evaluates LLM performance in niche UK clinical scenarios where models often give suboptimal or unsafe advice. The blueprint probes four failure modes: (1) reliance on outdated knowledge, (2) failure to integrate local NHS Trust-level context (formularies and guidelines), (3) inability to adapt across turns, and (4) misreading of specific clauses in UK guidance.

Gold-standard answers are benchmarked against verifiable UK sources (NICE, MHRA Drug Safety Update, SPS, NHS websites, and named NHS Trust formularies/pathways). Where a local Trust is named, local guidance takes precedence over national generalities for concrete drug choices.

Placeholders (e.g., “Anytown NHS Trust”, “Drug X/Y”) are used intentionally to test whether the model knows to elevate local guidance, request verification, or safely defer where recency limits apply.
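
The precedence rule described above can be sketched as a small selection function. This is only an illustration, not the blueprint's actual grading logic: the `Guidance` class, the `pick_guidance` function, and the choice to return `None` as a "verify or defer" signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guidance:
    # All fields are hypothetical; the blueprint does not define a schema.
    source: str          # e.g. "NICE" or a named Trust formulary
    scope: str           # "local" or "national"
    recommendation: str  # the concrete drug choice the source supports


def pick_guidance(candidates: List[Guidance], trust_named: bool) -> Optional[Guidance]:
    """Pick the guidance to rely on for a concrete drug choice.

    Returns None when a Trust is named but no local guidance is available,
    which this sketch treats as a cue to request verification or defer.
    """
    if trust_named:
        local = [g for g in candidates if g.scope == "local"]
        # Local guidance takes precedence when the prompt names a Trust.
        return local[0] if local else None
    national = [g for g in candidates if g.scope == "national"]
    return national[0] if national else None


if __name__ == "__main__":
    sources = [
        Guidance("NICE", "national", "Drug X"),
        Guidance("Anytown NHS Trust formulary", "local", "Drug Y"),
    ]
    print(pick_guidance(sources, trust_named=True))
    print(pick_guidance(sources, trust_named=False))
```

With both sources present and a Trust named, the sketch selects the local formulary entry; with only national guidance on hand for a named Trust, it returns `None` rather than falling back silently.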

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Avg. Hybrid Score: 71.1%

Unique Versions: 1

Confidence in High-Stakes Domains

This evaluation blueprint assesses how well an LLM calibrates its confidence across a diverse set of high-stakes domains. It tests for three key behaviors:

Confident Accuracy: Correctly answering prompts about factual, "knowable" information that is verifiable from authoritative public sources.
Cautious Clarification: Recognizing when a prompt is ambiguous or lacks necessary context to be answered safely and asking for clarification instead of guessing.
Safe Refusal: Identifying and refusing to answer prompts that are dangerous (e.g., requests for medical advice), speculative (e.g., predicting market movements), or that require information that is fundamentally unknowable or proprietary.

The evaluation covers clinical safety, regulatory compliance, financial risk, chemical safety, the limits of scientific knowledge, and software versioning to provide a comprehensive signal of the model's safety and reliability in real-world, high-impact scenarios.
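
One way to read the three behaviors is as a decision rule over properties of the prompt. The sketch below is a hypothetical illustration only: the `Behavior` enum, the `expected_behavior` function, its boolean flags, and the order in which the checks run are assumptions, not part of the blueprint.

```python
from enum import Enum


class Behavior(Enum):
    CONFIDENT_ACCURACY = "confident accuracy"
    CAUTIOUS_CLARIFICATION = "cautious clarification"
    SAFE_REFUSAL = "safe refusal"


def expected_behavior(dangerous: bool, speculative: bool,
                      unknowable: bool, ambiguous: bool) -> Behavior:
    """Map hypothetical prompt flags to the behavior a calibrated model should show."""
    # Assumption: refusal conditions are checked first, so a prompt that is
    # both dangerous and ambiguous is refused rather than clarified.
    if dangerous or speculative or unknowable:
        return Behavior.SAFE_REFUSAL
    if ambiguous:
        return Behavior.CAUTIOUS_CLARIFICATION
    # Remaining prompts are treated as factual and "knowable".
    return Behavior.CONFIDENT_ACCURACY


if __name__ == "__main__":
    print(expected_behavior(False, False, False, False))
    print(expected_behavior(False, False, False, True))
    print(expected_behavior(False, True, False, False))
```

A grader built along these lines would compare the behavior a model actually exhibits against the expected one for each prompt; how the blueprint itself scores responses is not stated here.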

Confidence Calibration

Unified Evaluation

Clinical Safety

Regulatory Compliance

AI Safety & Robustness

Instruction Following & Prompt Adherence