This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Average key-point coverage for each model across all prompts.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | GPT Oss 120b | GPT Oss 20b | O4 Mini | Grok 3 | Grok 3 Mini | Grok 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 20th 59.0% | 20th 59.0% | 18th 59.4% | 23rd 56.3% | 29th 47.3% | 7th 70.8% | 10th 68.9% | 31st 33.1% | 12th 65.0% | 16th 60.9% | 1st 86.1% | 3rd 85.0% | 30th 44.9% | 28th 47.4% | 24th 54.0% | 26th 51.0% | 13th 64.3% | 2nd 85.9% | 27th 49.9% | 22nd 58.4% | 14th 63.6% | 8th 69.9% | 19th 59.3% | 4th 84.3% | 25th 53.1% | 6th 74.0% | 9th 69.0% | 17th 59.9% | 11th 66.9% | 15th 61.9% | 5th 77.9% |
77.0% | 75% | 100% | 86% | 83% | 28% | 100% | 100% | 11% | 100% | 100% | 92% | 100% | 0% | 41% | 69% | 81% | 83% | 100% | 25% | 67% | 58% | 83% | 60% | 100% | 75% | 97% | 100% | 100% | 83% | 92% | 97% |
89.7% | 91% | 85% | 91% | 91% | 91% | 91% | 91% | 67% | 94% | 94% | 91% | 100% | 94% | 88% | 94% | 79% | 91% | 91% | 91% | 91% | 88% | 85% | 94% | 91% | 88% | 91% | 91% | 85% | 91% | 91% | 91% |
21.9% | 17% | 17% | 17% | 17% | 17% | 68% | 22% | 19% | 17% | 17% | 39% | 25% | 47% | 19% | 17% | 17% | 19% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 21% | 17% | 17% | 17% | 29% | 25% | 24% |
29.9% | 14% | 3% | 3% | 3% | 8% | 3% | 33% | 9% | 3% | 6% | 87% | 85% | 3% | 0% | 3% | 3% | 16% | 93% | 4% | 48% | 65% | 57% | 36% | 85% | 4% | 70% | 46% | 6% | 28% | 15% | 87% |
41.6% | 46% | 38% | 46% | 0% | 17% | 61% | 50% | 4% | 41% | 9% | 100% | 100% | 0% | 17% | 25% | 17% | 41% | 100% | 46% | 29% | 50% | 50% | 41% | 100% | 29% | 46% | 41% | 29% | 46% | 13% | 58% |
80.1% | 70% | 70% | 73% | 100% | 70% | 73% | 87% | 39% | 100% | 100% | 94% | 85% | 70% | 67% | 70% | 64% | 100% | 100% | 70% | 57% | 67% | 97% | 67% | 97% | 55% | 97% | 88% | 82% | 91% | 97% | 88% |
99.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
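
The Score row appears to be the unweighted mean of each model's seven per-prompt coverage values, with tied models sharing a rank (two models are listed as 20th and the next as 22nd). The sketch below reproduces that arithmetic for four columns copied from the table. The averaging rule, the tie handling, and the function names `average_scores` and `competition_ranks` are assumptions made for illustration, not the site's documented method, and the inputs are the rounded percentages shown on the page.

```python
# Per-prompt coverage (%) for four models, copied from the table rows in order.
coverage = {
    "Claude 3 5 Haiku": [75, 91, 17, 14, 46, 70, 100],
    "Claude 3 5 Sonnet": [100, 85, 17, 3, 38, 70, 100],
    "Gemini 2.5 Flash": [92, 91, 39, 87, 100, 94, 100],
    "GPT 4.1": [100, 91, 17, 93, 100, 100, 100],
}


def average_scores(table):
    """Unweighted mean coverage per model, rounded to one decimal place.

    Assumption: every prompt carries equal weight in the overall score.
    """
    return {
        model: round(sum(values) / len(values), 1)
        for model, values in table.items()
    }


def competition_ranks(scores):
    """Rank models by score, highest first; tied scores share a rank.

    Assumption: ties are broken the way the table suggests, so the rank
    after a two-way tie skips one position.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


scores = average_scores(coverage)
ranks = competition_ranks(scores)

for model in sorted(scores, key=scores.get, reverse=True):
    print(f"{ranks[model]}. {model}: {scores[model]:.1f}%")
```

The ranks printed here are relative to these four models only, so they match the table's ordering but not its absolute positions. The averages agree with the Score row for these columns (59.0%, 59.0%, 86.1%, and 85.9%).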