Evaluates LLM performance in niche Japan-specific clinical scenarios where errors are common:

- reliance on outdated guidance
- failure to integrate PMDA/MHLW safety updates
- weak multi-turn reasoning (not integrating new red-flag info)
- ignoring hospital formulary or local antibiogram constraints

“Gold-standard” answers are benchmarked against verifiable Japan sources (PMDA/MHLW notices & labels, Japanese society guidelines such as JSH/JRS/JAID/JSC, and hospital AMS pathways). Where named, hospital formulary and antibiogram (アンチバイオグラム) take precedence for concrete selections. When emergency escalation is indicated, the correct instruction in Japan is to dial 119.

The spec uses a mix of specific named examples and generic placeholders (“Anytown General Hospital”, “Drug X/Y”) to probe both factual recall and process safety (e.g., deferring to the site protocol when specifics vary).
and following facility protocols at time of use.
Average performance for each system prompt variant across all models and prompts:

- [No System Prompt]
- The user is located in Japan.
Average key point coverage, broken down by system prompt variant.
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4.1 | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4.1 | GPT 4o Mini | GPT 4o | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 28th 53.8% | 21st 64.7% | 27th 55.5% | 12th 74.0% | 16th 71.1% | 14th 71.8% | 30th 41.3% | 8th 77.0% | 3rd 80.4% | 20th 65.4% | 7th 77.3% | 23rd 60.5% | 26th 56.8% | 25th 57.2% | 17th 69.4% | 10th 75.4% | 19th 67.8% | 24th 59.2% | 11th 74.1% | 29th 51.8% | 22nd 61.7% | 4th 79.7% | 9th 76.7% | 15th 71.1% | 5th 79.6% | 2nd 82.7% | 13th 73.5% | 18th 68.2% | 6th 78.8% | 1st 83.8% | |
48.9% | 46% | 50% | 42% | 42% | 46% | 58% | 50% | 50% | 71% | 54% | 50% | 46% | 54% | 42% | 54% | 42% | 42% | 38% | 50% | 42% | 42% | 58% | 33% | 42% | 54% | 50% | 42% | 50% | 67% | 59% | |
74.4% | 46% | 50% | 46% | 58% | 67% | 79% | 67% | 100% | 96% | 50% | 88% | 92% | 75% | 75% | 67% | 100% | 58% | 50% | 96% | 63% | 67% | 92% | 42% | 71% | 88% | 96% | 75% | 79% | 100% | 100% | |
74.6% | 44% | 56% | 44% | 94% | 100% | 94% | 7% | 88% | 100% | 100% | 100% | 88% | 44% | 44% | 38% | 88% | 63% | 69% | 94% | 38% | 50% | 100% | 100% | 50% | 100% | 100% | 63% | 94% | 88% | 100% | |
47.3% | 32% | 63% | 32% | 56% | 19% | 50% | 50% | 25% | 69% | 7% | 63% | 81% | 63% | 13% | 63% | 50% | 19% | 38% | 63% | 44% | 38% | 56% | 56% | 50% | 69% | 50% | 32% | 63% | 56% | 50% | |
89.2% | 100% | 100% | 75% | 100% | 100% | 100% | 75% | 100% | 100% | 100% | 88% | 50% | 75% | 0% | 100% | 100% | 100% | 75% | 100% | 75% | 100% | 100% | 100% | 100% | 100% | 100% | 75% | 100% | 100% | ||
91.8% | 50% | 94% | 50% | 100% | 75% | 100% | 88% | 100% | 100% | 100% | 100% | 50% | 94% | 100% | 94% | 100% | 100% | 88% | 100% | 100% | 100% | 94% | 94% | 100% | 94% | 100% | 88% | 100% | 100% | 100% | |
38.7% | 38% | 29% | 54% | 38% | 38% | 42% | 25% | 67% | 42% | 33% | 42% | 42% | 13% | 42% | 42% | 38% | 42% | 25% | 42% | 25% | 29% | 42% | 67% | 29% | 38% | 42% | 42% | 42% | 29% | 42% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.0% | 100% | 100% | 100% | 100% | 100% | 100% | 32% | 100% | 100% | 100% | 100% | 81% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 50% | 100% | 100% | |
87.3% | 100% | 100% | 100% | 100% | 100% | 100% | 17% | 100% | 100% | 96% | 100% | 100% | 21% | 100% | 100% | 100% | 96% | 42% | 96% | 17% | 42% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | |
33.9% | 8% | 21% | 4% | 17% | 21% | 25% | 21% | 50% | 46% | 8% | 38% | 21% | 25% | 4% | 42% | 33% | 38% | 38% | 38% | 25% | 30% | 50% | 50% | 46% | 38% | 79% | 21% | 13% | 79% | 88% | |
91.8% | 88% | 88% | 88% | 94% | 94% | 94% | 81% | 100% | 100% | 100% | 100% | 25% | 63% | 81% | 100% | 100% | 100% | 88% | 100% | 88% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 94% | 94% | |
54.5% | 50% | 50% | 50% | 50% | 56% | 50% | 50% | 56% | 100% | 50% | 56% | 63% | 50% | 50% | 50% | 50% | 50% | 50% | 56% | 50% | 50% | 63% | 56% | 56% | 56% | 56% | 56% | 50% | 50% | 56% | |
45.4% | 25% | 67% | 63% | 71% | 67% | 59% | 0% | 67% | 42% | 0% | 33% | 8% | 25% | 33% | 29% | 67% | 38% | 17% | 42% | 13% | 38% | 67% | 54% | 58% | 67% | 58% | 67% | 42% | 67% | 79% | |
68.8% | 50% | 63% | 7% | 100% | 100% | 63% | 32% | 56% | 50% | 63% | 100% | 56% | 100% | 63% | 63% | 63% | 63% | 50% | 38% | 50% | 63% | 81% | 100% | 63% | 100% | 100% | 100% | 63% | 63% | 100% | |
77.5% | 25% | 50% | 63% | 100% | 75% | 81% | 0% | 94% | 100% | 100% | 100% | 63% | 63% | 94% | 88% | 94% | 88% | 100% | 88% | 0% | 50% | 100% | 100% | 56% | 94% | 100% | 88% | 88% | 88% | 94% | |
48.7% | 13% | 19% | 25% | 38% | 50% | 25% | 7% | 56% | 50% | 51% | 56% | 63% | 0% | 32% | 50% | 56% | 56% | 44% | 56% | 50% | 50% | 56% | 75% | 94% | 56% | 75% | 75% | 56% | 63% | 63% |
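
The first cell of each prompt row appears to be the unweighted mean of that row's per-model cells, with blank cells left out. The sketch below checks this against two rows copied from the table. The function name `row_mean` is hypothetical, and because the displayed cells are already rounded, a recomputed mean could differ from the reported one in the last digit for other rows.

```python
def row_mean(row: str) -> float:
    """Mean of the per-model cells in one table row, ignoring blank cells."""
    cells = [c.strip() for c in row.split("|")]
    # cells[0] is the reported row average, so only the remaining cells count.
    scores = [float(c.rstrip("%")) for c in cells[1:] if c]
    return round(sum(scores) / len(scores), 1)


# First prompt row of the table (reported average: 48.9%).
ROW_A = (
    "48.9% | 46% | 50% | 42% | 42% | 46% | 58% | 50% | 50% | 71% | 54% | "
    "50% | 46% | 54% | 42% | 54% | 42% | 42% | 38% | 50% | 42% | 42% | 58% | "
    "33% | 42% | 54% | 50% | 42% | 50% | 67% | 59% | |"
)

# Fifth prompt row (reported average: 89.2%); its last model cell is blank.
ROW_B = (
    "89.2% | 100% | 100% | 75% | 100% | 100% | 100% | 75% | 100% | 100% | "
    "100% | 88% | 50% | 75% | 0% | 100% | 100% | 100% | 75% | 100% | 75% | "
    "100% | 100% | 100% | 100% | 100% | 100% | 75% | 100% | 100% | ||"
)

if __name__ == "__main__":
    print(row_mean(ROW_A))
    print(row_mean(ROW_B))
```

Both rows reproduce the reported averages when the blank cell in the second row is excluded from the denominator rather than counted as 0%.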