This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
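Average key point coverage for each model across all prompts (the Score row). In each row below the Score row, the first cell is that prompt's average coverage across all models.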
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 29th 49.7% | 17th 62.1% | 21st 57.3% | 30th 45.9% | 24th 54.6% | 6th 71.4% | 13th 65.7% | 27th 52.6% | 9th 68.0% | 12th 66.7% | 2nd 82.3% | 5th 76.3% | 26th 53.6% | 28th 50.1% | 25th 54.1% | 15th 62.7% | 7th 70.6% | 3rd 81.6% | 19th 61.0% | 18th 61.9% | 14th 64.0% | 8th 69.6% | 16th 62.3% | 1st 84.1% | 22nd 55.3% | 10th 67.4% | 23rd 54.9% | 11th 67.0% | 20th 58.4% | 4th 79.6% | |
79.2% | 27% | 100% | 71% | 22% | 64% | 100% | 96% | 83% | 96% | 88% | 92% | 100% | 28% | 75% | 75% | 96% | 79% | 92% | 79% | 67% | 83% | 83% | 71% | 100% | 83% | 100% | 56% | 75% | 96% | 100% | |
91.0% | 96% | 96% | 91% | 82% | 91% | 91% | 91% | 91% | 96% | 96% | 91% | 91% | 96% | 87% | 96% | 91% | 91% | 91% | 91% | 91% | 82% | 91% | 91% | 91% | 91% | 91% | 82% | 91% | 91% | 91% | |
22.0% | 17% | 17% | 17% | 17% | 17% | 69% | 19% | 17% | 21% | 21% | 21% | 33% | 50% | 21% | 17% | 17% | 21% | 21% | 17% | 17% | 17% | 17% | 21% | 17% | 17% | 17% | 17% | 19% | 17% | 33% | |
27.6% | 10% | 4% | 4% | 0% | 4% | 4% | 31% | 15% | 19% | 16% | 81% | 42% | 14% | 4% | 4% | 7% | 63% | 92% | 0% | 52% | 47% | 50% | 35% | 81% | 7% | 14% | 6% | 33% | 7% | 81% | |
45.0% | 25% | 50% | 50% | 0% | 38% | 63% | 50% | 0% | 44% | 50% | 100% | 100% | 32% | 0% | 32% | 32% | 44% | 75% | 44% | 38% | 44% | 50% | 50% | 100% | 25% | 50% | 50% | 51% | 7% | 56% | |
81.3% | 73% | 68% | 68% | 100% | 68% | 73% | 73% | 68% | 100% | 96% | 91% | 68% | 55% | 64% | 55% | 96% | 96% | 100% | 96% | 68% | 75% | 96% | 68% | 100% | 64% | 100% | 73% | 100% | 91% | 96% | |
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
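
The Score row appears to be the unweighted mean of the seven per-prompt cells in each model's column. The sketch below checks that reading against three columns copied from the table. The dictionary layout and the function name `average_scores` are illustrative assumptions, not part of the evaluation platform. Because the displayed cells are already rounded, other columns may differ from the published score in the last digit.

```python
# Per-prompt coverage (%) copied from three columns of the table above,
# listed in the table's row order (seven prompt rows).
coverage = {
    "Claude 3 5 Haiku": [27, 96, 17, 10, 25, 73, 100],
    "Claude 3 5 Sonnet": [100, 96, 17, 4, 50, 68, 100],
    "GPT 4o 2024 11 20": [100, 91, 17, 81, 100, 100, 100],
}


def average_scores(table):
    """Return (model, mean coverage) pairs, best first.

    Assumes an unweighted mean over prompts, rounded to one decimal.
    """
    scores = {
        model: round(sum(cells) / len(cells), 1)
        for model, cells in table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = average_scores(coverage)
# Positions printed here are relative to this three-model subset only,
# not the 30-model ranking shown in the Score row.
for position, (model, score) in enumerate(ranked, start=1):
    print(f"{position}. {model}: {score}%")
```

For these three columns the computed means are 84.1%, 62.1%, and 49.7%, which agree with the corresponding Score cells.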