Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "financial-regulation".
This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
Avg. Hybrid Score
Latest:
Unique Versions: 1