Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "tool-use".
Encourages provider-native tool calling; if unavailable, falls back to trace-only TOOL_CALL lines. Scoring accepts either native success (final answer) or valid trace.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Exercises core tool-use behaviors: correct selection, args, order, count bounds, OR-paths, and prohibitions. Trace-only; no execution.
Avg. Hybrid Score
Latest:
Unique Versions: 1