This blueprint evaluates an AI's ability to provide safe and accurate guidance on Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based: all prompts and scoring criteria are drawn verbatim from canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
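Average key point coverage for each model across all prompts. The Score row gives each model's rank and its average over the prompts; in each prompt row, the first cell is that prompt's average across all models.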
Prompts vs. Models | Claude 3.5 Haiku | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek Chat V3.1 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | GPT 5 | GPT OSS 120b | GPT OSS 20b | O4 Mini | GLM 4.5 | Qwen3 30b A3B Instruct 2507 | Qwen3 32b | Grok 3 | Grok 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 28th 52.4% | 22nd 60.9% | 24th 58.1% | 30th 50.3% | 34th 48.3% | 12th 69.3% | 7th 78.3% | 14th 68.5% | 23rd 58.7% | 21st 62.4% | 17th 65.7% | 19th 63.6% | 4th 86.3% | 2nd 88.4% | 32nd 49.9% | 36th 44.4% | 30th 50.3% | 29th 51.1% | 26th 56.3% | 3rd 86.4% | 33rd 48.4% | 25th 57.1% | 18th 64.3% | 15th 65.9% | 27th 54.0% | 5th 84.7% | 34th 48.3% | 1st 95.9% | 9th 73.9% | 10th 73.1% | 15th 65.9% | 8th 76.1% | 20th 63.6% | 11th 72.1% | 13th 68.9% | 6th 79.3% |
75.1% | 31% | 96% | 71% | 22% | 22% | 100% | 96% | 100% | 96% | 100% | 100% | 83% | 96% | 100% | 6% | 33% | 67% | 61% | 27% | 96% | 31% | 67% | 79% | 83% | 49% | 100% | 71% | 100% | 100% | 83% | 100% | 96% | 63% | 96% | 83% | 100% |
89.3% | 96% | 91% | 96% | 91% | 96% | 91% | 91% | 89% | 82% | 91% | 91% | 87% | 96% | 91% | 87% | 91% | 82% | 77% | 82% | 96% | 91% | 96% | 82% | 82% | 82% | 91% | 77% | 96% | 91% | 91% | 91% | 91% | 96% | 82% | 87% | 96% |
25.9% | 17% | 17% | 17% | 17% | 17% | 67% | 67% | 21% | 17% | 21% | 17% | 17% | 38% | 50% | 50% | 21% | 17% | 21% | 17% | 21% | 17% | 17% | 17% | 17% | 17% | 21% | 17% | 83% | 21% | 21% | 17% | 25% | 25% | 17% | 29% | 25% |
36.0% | 11% | 4% | 0% | 22% | 7% | 4% | 21% | 33% | 9% | 0% | 25% | 21% | 83% | 78% | 4% | 4% | 4% | 4% | 18% | 92% | 0% | 42% | 58% | 39% | 39% | 81% | 4% | 92% | 67% | 83% | 3% | 92% | 58% | 82% | 33% | 78% |
47.3% | 44% | 50% | 50% | 0% | 32% | 50% | 100% | 50% | 19% | 25% | 50% | 50% | 100% | 100% | 38% | 7% | 25% | 13% | 50% | 100% | 32% | 19% | 50% | 50% | 38% | 100% | 25% | 100% | 38% | 38% | 50% | 38% | 13% | 38% | 50% | 69% |
82.5% | 68% | 68% | 73% | 100% | 64% | 73% | 73% | 87% | 88% | 100% | 77% | 87% | 91% | 100% | 64% | 55% | 63% | 88% | 100% | 100% | 68% | 59% | 64% | 96% | 59% | 100% | 50% | 100% | 100% | 96% | 100% | 91% | 90% | 90% | 100% | 87% |
99.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 94% | 100% | 100% | 100% | 100% | 100% | 94% | 94% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
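
The Score row appears to be the plain mean of each model's seven per-prompt coverage values. The following is a minimal sketch of that arithmetic for four columns copied from the table above. The names `coverage`, `average_scores`, and `competition_rank` are hypothetical, and the ranking rule (one plus the number of strictly higher scores) is an assumption rather than the dashboard's documented method. Because the table cells are already rounded, the published site may compute from unrounded values and break ties differently.

```python
# Per-prompt coverage (%) for four models, copied from the table rows above.
coverage = {
    "GPT 5": [100, 96, 83, 92, 100, 100, 100],
    "Gemini 2.5 Pro": [100, 91, 50, 78, 100, 100, 100],
    "GPT 4.1": [96, 96, 21, 92, 100, 100, 100],
    "Gemini 2.5 Flash": [96, 96, 38, 83, 100, 91, 100],
}


def average_scores(table):
    """Return each model's mean coverage across prompts, rounded to 0.1%."""
    return {model: round(sum(vals) / len(vals), 1) for model, vals in table.items()}


def competition_rank(scores):
    """Rank models so that rank = 1 + number of strictly higher scores.

    Assumed tie handling: equal scores share a rank.
    """
    return {
        model: 1 + sum(other > score for other in scores.values())
        for model, score in scores.items()
    }


scores = average_scores(coverage)
ranks = competition_rank(scores)
for model in sorted(scores, key=scores.get, reverse=True):
    print(f"{ranks[model]}. {model}: {scores[model]}%")
```

For these four columns the computed means agree with the percentages shown in the Score row, which supports reading that row as an unweighted average over prompts.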