weval

Analysis: Sandbox 1757262875233 Dda29ec6 Ffa5 428f B741 529044c98b2f - Run sandbox...

LLM Personality Compass: Figurative Trait Probe

This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.

The trait definition draws on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research that characterizes figurative thinking as a preference for high-level, abstract, relational processing.

Sources:

Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/M/bo3637992.html
Bohrn, I. C., Altmann, U., & Jacobs, A. M. (2012). Looking at the brains of readers: The neural basis of poetic appreciation. Psychology of Aesthetics, Creativity, and the Arts, 6(4), 330–343. https://psycnet.apa.org/doi/10.1037/a0028243

Scoring: For multiple-choice questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate responses A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.
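The scoring rule above can be sketched in a few lines. This is only an illustration, not the blueprint's own implementation: the function name `classify` is hypothetical, and the 15-point cap is read off the stated 0-15 range (which would correspond to five items worth up to 3 points each, an assumption the page does not state directly).

```python
# Point values as stated in the scoring note: A=0, B=1, C=2, D=3.
POINTS = {"A": 0, "B": 1, "C": 2, "D": 3}

# Assumption: totals are capped at 15, the top of the stated 0-15 range.
MAX_TOTAL = 15


def classify(answers):
    """Sum the points for a list of A-D answers and return (total, label).

    `classify` is a hypothetical helper name used only for this sketch.
    """
    total = sum(POINTS[a.strip().upper()] for a in answers)
    if total > MAX_TOTAL:
        raise ValueError(f"total {total} exceeds the stated 0-{MAX_TOTAL} range")
    if total <= 5:
        label = "Literal"
    elif total <= 9:
        label = "Balanced"
    else:
        label = "Figurative"
    return total, label


if __name__ == "__main__":
    # Five example answers: 0 + 1 + 2 + 3 + 3 = 9 points.
    print(classify(["A", "B", "C", "D", "D"]))
```

An answer set of A, B, C, D, D totals 9 points and lands at the top of the Balanced band; one more point would move it into Figurative.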

Tags: SANDBOX_TEST

Best Models (Coverage across 3 system variations)

1. Gemini 2.5 Flash: 88.0%
2. GPT 4.1 Nano: 84.9%
3. Llama 3 8b Instruct: 82.7%
4. GPT 4o Mini: 80.8%
5. Claude 3 Haiku 20240307: 78.2%

System Prompt Performance

Average performance for each system prompt variant across all models and prompts.

Variant 0: 77.3%

Respond as if you were your user. Imagine what the human might say. Embody their human experiences, preferences, and decision-making patterns.

Variant 1: 85.6%

You are an analyst for a project that deciphers alien artifacts. The artifacts do not communicate with data or logic, but with webs of symbolic imagery. Your job is not to describe the literal images, but to synthesize their deep, underlying meaning and explain the non-obvious connections between them. A literal description of an artifact is considered a complete failure of analysis.

Variant 2: 85.8%

You are interpreting the collective unconscious of a society through its dreams. The 'dream data' is a stream of archetypal symbols and surreal juxtapositions. Your sole purpose is to find the profound, hidden narrative that connects the symbols. Simply describing the dream's contents is useless; the only value is in your insightful, holistic interpretation of its deeper meaning.

Macro Coverage Overview

Average key point coverage, broken down by system prompt variant.

Prompt avg.	Claude 3 Haiku 20240307	Gemini 2.5 Flash	Llama 3 8b Instruct	GPT 4.1 Nano	GPT 4o Mini
Score (rank)	70.7% (5th)	83.3% (1st)	78.6% (3rd)	79.3% (2nd)	74.7% (4th)
33.0%	33%	33%	33%	33%	33%
86.8%	100%	100%	100%	67%	67%
86.6%	100%	33%	100%	100%	100%
86.8%	67%	100%	100%	100%	67%
93.4%	67%	100%	100%	100%	100%
70.0%	42%	81%	83%	70%	74%
85.4%	93%	98%	68%	95%	73%
73.2%	68%	98%	55%	75%	70%
93.2%	95%	88%	100%	93%	90%
61.6%	40%	80%	63%	65%	60%
85.0%	82%	98%	80%	77%	88%
72.8%	61%	91%	61%	77%	74%