Blueprints tagged "format-sensitivity"

Table-Format Sensitivity — Combined (11 formats, 500×5/fmt)

Combined blueprint covering multiple data formats. Every format is rendered from the same seeded dataset of 500 employee records, and 5 questions are asked per format. Each prompt is scored on exact-match retrieval of a numeric value.
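To make the "same seeded dataset, many formats" idea concrete, here is a minimal sketch of how one set of records could be rendered into several of the listed formats. It is not taken from `generate_table_format_eval.py`; the field names (`id`, `name`, `salary`), the seed value, and the helper names are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import random


def make_records(n, seed=0):
    """Build n synthetic employee records; the same seed always yields the same data."""
    rng = random.Random(seed)
    return [
        {"id": i, "name": f"Employee{i}", "salary": rng.randrange(40000, 150000, 500)}
        for i in range(1, n + 1)
    ]


def to_json(records):
    """Render the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Render the records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_markdown_kv(records):
    """Render each record as a block of '- key: value' lines (an assumed layout)."""
    blocks = []
    for record in records:
        blocks.append("\n".join(f"- {key}: {value}" for key, value in record.items()))
    return "\n\n".join(blocks)


# Hypothetical seed; the real script's seed is not shown in the listing.
records = make_records(500, seed=42)
renderings = {
    "json": to_json(records),
    "csv": to_csv(records),
    "markdown_kv": to_markdown_kv(records),
}
```

Because every rendering is derived from the same `records` list, any difference in model accuracy between formats can be attributed to the representation rather than to the underlying data.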

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 500 --per-format-questions 5 --temperatures 0.0, 0.1 --systems null --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER

Format Sensitivity

Data Representation

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Reasoning

Efficiency & Succinctness

76.9%

Avg. Hybrid Score

Unique Versions: 1

Table-Format Sensitivity — Combined (11 formats, 30×5/fmt)

Combined blueprint covering multiple data formats. Every format is rendered from the same seeded dataset of 30 employee records, and 5 questions are asked per format. Each prompt is scored on exact-match retrieval of a numeric value.
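As an illustration of what exact-match numeric scoring could look like, the sketch below extracts the first number from a model response and compares it with the expected value. This is an assumption about the general technique, not the blueprint's actual scorer (the "Hybrid Score" shown on this page may be computed differently); `extract_number`, `exact_match`, and `accuracy` are hypothetical helper names.

```python
import re


def extract_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    # Accepts an optional sign, thousands separators, and a decimal part.
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def exact_match(response, expected):
    """True only when the extracted number equals the expected value exactly."""
    value = extract_number(response)
    return value is not None and value == float(expected)


def accuracy(pairs):
    """Fraction of (response, expected) pairs that match exactly; 0.0 for no pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(response, expected) for response, expected in pairs) / len(pairs)
```

For example, `exact_match("The salary is 72,500.", 72500)` is true, while a response of `"72501"` would be scored as a miss: there is no partial credit for near values under this scheme.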

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0, 0.1, 0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER

Format Sensitivity

Data Representation

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Efficiency & Succinctness

Reasoning

94.4%

Avg. Hybrid Score

Unique Versions: 1

Table-Format Sensitivity — Combined (11 formats, 30×5/fmt)

Combined blueprint covering multiple data formats. Every format is rendered from the same seeded dataset of 30 employee records, and 5 questions are asked per format. Each prompt is scored on exact-match retrieval of a numeric value.

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0, 0.1, 0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER

Format Sensitivity

Data Representation

Instruction Following & Prompt Adherence

Factual Accuracy & Hallucination

Efficiency & Succinctness

Reasoning

89.6%

Avg. Hybrid Score

Unique Versions: 1