weval

Instances for Run Label: sandbox-run (Blueprint: Table-Format Sensitivity — Combined (11 formats, 30×5/fmt))

Combined blueprint covering 11 data formats. Every format encodes the same seeded dataset of 30 employee records and is tested with 5 questions. Each prompt is scored by exact match on the retrieved numeric value.
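As a rough illustration of exact-match numeric scoring, the sketch below pulls the first number out of a model response and compares it with the expected value. The function names `extract_number` and `exact_match` are hypothetical, and taking the first numeric token as the answer is an assumption; the actual Weval scoring logic is not shown on this page and may differ.

```python
import re


def extract_number(text):
    """Return the first numeric token in `text` as a float, or None if absent.

    Assumption: the answer is the first number in the response. Thousands
    separators such as "85,000" are tolerated.
    """
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def exact_match(response, expected):
    """True only if the extracted number equals `expected` exactly."""
    value = extract_number(response)
    return value is not None and value == float(expected)
```

A response such as "The salary is 85,000." would count as correct for an expected value of 85000, while "About 84,999" would not. There is no partial credit in this sketch.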

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0,0.1,0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER
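The command above is the actual way to regenerate the blueprint. To show what "the same seeded dataset in several formats" could look like, here is a minimal sketch that builds a deterministic list of records and renders it as JSON, CSV, and a Markdown table. Everything in it is an assumption for illustration: the field names (`id`, `age`, `salary`), the seed value, and the helper names are hypothetical and are not taken from `generate_table_format_eval.py`.

```python
import csv
import io
import json
import random


def make_records(num_records, seed=0):
    """Build a deterministic list of hypothetical employee records."""
    rng = random.Random(seed)  # a fixed seed gives the same data on every run
    return [
        {
            "id": i + 1,
            "age": rng.randint(22, 65),
            "salary": rng.randrange(40000, 150000, 1000),
        }
        for i in range(num_records)
    ]


def to_json(records):
    """Render the records as a pretty-printed JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Render the records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_markdown_table(records):
    """Render the records as a Markdown table."""
    headers = list(records[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for rec in records:
        lines.append("| " + " | ".join(str(rec[h]) for h in headers) + " |")
    return "\n".join(lines)


records = make_records(30, seed=42)
```

Because every renderer consumes the same `records` list, any difference in model accuracy between formats can be attributed to the encoding rather than to the underlying data.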

TAGS:

SANDBOX_TEST

Showing all recorded executions for Run Label sandbox-run.

Executed:

Filename: sandbox-run_2025-10-06T03-51-00-905Z_comparison.json

Avg. Hybrid Score: 100.0%

Model Variants

Test Cases