Evaluations Tagged: ...

Factual Accuracy & Hallucination

Avg. Hybrid Score: 76.9%

Unique Versions: 1

Table-Format Sensitivity — Combined (11 formats, 30×5/fmt)

Combined blueprint covering 11 data formats. Every format is rendered from the same seeded dataset of 30 employee records and is tested with 5 questions. We measure exact-match numeric retrieval per prompt.
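As an illustration of what exact-match numeric retrieval could look like in practice, here is a minimal Python sketch. It is not the blueprint's actual scorer: the function names, the regular expression, and the rule that a response passes when the expected value appears among the numbers it contains are all assumptions made for this example.

```python
import re

# Hypothetical pattern: matches numbers with optional thousands separators
# (e.g. "85,000") or plain digits, each with an optional decimal part.
_NUMBER = re.compile(r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every numeric literal found in text as a float."""
    return [float(match.replace(",", "")) for match in _NUMBER.findall(text)]


def exact_match_numeric(response, expected):
    """Assumed scoring rule: pass only if the expected value appears exactly."""
    return float(expected) in extract_numbers(response)


if __name__ == "__main__":
    print(exact_match_numeric("The salary is $85,000.", 85000))
    print(exact_match_numeric("Roughly 84,999 per year", 85000))
```

A stricter variant might require the response to contain exactly one number; which rule the evaluation uses is not stated in the description.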

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0,0.1,0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER

Format Sensitivity

Factual Accuracy & Hallucination

Avg. Hybrid Score: 94.4%

Unique Versions: 1

Table-Format Sensitivity — Combined (11 formats, 30×5/fmt)

Combined blueprint covering 11 data formats. Every format is rendered from the same seeded dataset of 30 employee records and is tested with 5 questions. We measure exact-match numeric retrieval per prompt.
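To make the "same seeded dataset, different formats" idea concrete, the sketch below generates records from a fixed seed and serializes them several ways. The field names (id, age, salary), the value ranges, and the helper names are hypothetical; the real generator script may lay out its records and formats differently.

```python
import csv
import io
import json
import random


def make_records(n, seed=0):
    """Build n hypothetical employee records; the same seed gives the same data."""
    rng = random.Random(seed)
    return [
        {
            "id": i + 1,
            "age": rng.randint(22, 65),
            "salary": rng.randrange(40000, 150000, 1000),
        }
        for i in range(n)
    ]


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_delimited(records, delimiter=","):
    """Serialize the records as delimited text ("," for CSV, "|" for pipes)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), delimiter=delimiter
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    records = make_records(30, seed=42)
    print(to_json(records[:1]))
    print(to_delimited(records[:2], delimiter="|"))
```

Because every serializer reads from the same list, any difference in retrieval accuracy between formats can be attributed to presentation rather than to the underlying data.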

Reproduction command:

python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0,0.1,0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER

Format Sensitivity

Factual Accuracy & Hallucination

Avg. Hybrid Score: 89.6%

Unique Versions: 1

LLM Personality Compass: Careless Trait Probe

This blueprint tests for the 'Careless' trait (low conscientiousness). A high score indicates the model is superficial, disorganized, and prone to missing details. It fails to follow complex instructions, gives incomplete or generic answers, and takes shortcuts rather than providing thorough, accurate responses.

General Knowledge

Creative Writing