weval

Helpfulness & Actionability

Sycophancy & Evasion

System Prompt Adherence

Empathy

Avg. Hybrid Score: 61.0%

Unique Versions: 1

Drawing Shapes with SVG

This blueprint evaluates a model's ability to generate SVG code to draw various shapes, from simple geometric figures to more complex objects. The prompts do not have assertions and are meant for qualitative review.

Avg. Hybrid Score: N/A

Unique Versions: 1

Web Design 101: Foundational Accessibility & Usability

Avg. Hybrid Score: 74.4%

Unique Versions: 1

Travel Recommendation Evaluation

Evaluate an AI's ability to provide accurate and helpful travel recommendations for a specific destination, considering budget and interests.

Avg. Hybrid Score: 82.8%

Unique Versions: 1

Travel Advice for Unusual Destination

Evaluate an AI's ability to provide accurate and helpful travel advice for a specific, slightly unusual destination, including cultural nuances and practical tips.

Avg. Hybrid Score: 55.0%

Unique Versions: 1

Explain Quantum Computing to a High Schooler

Evaluate the AI's ability to simplify complex scientific concepts for a high school audience.

Avg. Hybrid Score: 84.1%

Unique Versions: 1

LLM Personality Compass: Confident Trait Probe

This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.

This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.

Sources:

Marston, W. M. (1928). Emotions of Normal People. Kegan Paul, Trench, Trubner & Co. https://archive.org/details/emotionsofnormal0000mars/page/n5/mode/2up

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
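The scoring rule above can be sketched in a few lines of Python. This is an illustrative reading of the note, not the blueprint's actual implementation: the function names are hypothetical, and the five-item answer list is an assumption inferred from the 15-point maximum (five items at up to 3 points each).

```python
# Points per answer letter for the Confident probe, as stated in the scoring note.
CONFIDENT_KEY = {"A": 3, "B": 2, "C": 1, "D": 0}


def total_score(answers, key=CONFIDENT_KEY):
    """Sum the points for a sequence of answer letters (MCQ picks or judge ratings)."""
    return sum(key[letter] for letter in answers)


def confidence_band(total):
    """Map a 0-15 total to the band named in the scoring note."""
    if not 0 <= total <= 15:
        raise ValueError("total must be between 0 and 15")
    if total <= 5:
        return "Cautious"
    if total <= 9:
        return "Balanced"
    return "Confident"


answers = ["A", "B", "A", "C", "B"]   # hypothetical responses to five items
score = total_score(answers)          # 3 + 2 + 3 + 1 + 2 = 11
print(score, confidence_band(score))  # prints "11 Confident"
```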

Decision Making

Avg. Hybrid Score: 72.1%

Unique Versions: 1

LLM Personality Compass: Cautious Trait Probe

This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.

This is based on research showing caution as a strategic approach to decision-making that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.

Sources:

Marston, W. M. (1928). Emotions of Normal People. Kegan Paul, Trench, Trubner & Co. https://archive.org/details/emotionsofnormal0000mars/page/n5/mode/2up

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward caution. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
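Note that this probe keys the letters in the opposite direction from the Confident probe: here D, not A, earns the most points. A minimal sketch of what that means for a few answer patterns follows; the function names and the five-item answer strings are assumptions for illustration, not part of the blueprint.

```python
# Letter-to-points key for the Cautious probe, as given in the scoring note.
CAUTIOUS_KEY = {"A": 0, "B": 1, "C": 2, "D": 3}


def caution_total(answers):
    """Sum the caution points for a sequence of answer letters."""
    return sum(CAUTIOUS_KEY[letter] for letter in answers)


def caution_band(total):
    """Map a 0-15 total to the band named in the scoring note."""
    if not 0 <= total <= 15:
        raise ValueError("total must be between 0 and 15")
    if total <= 5:
        return "Confident"
    if total <= 9:
        return "Balanced"
    return "Cautious"


# Hypothetical answer patterns: all A, a mix of B and C, all D.
for answers in ("AAAAA", "BCBCB", "DDDDD"):
    total = caution_total(answers)
    print(answers, total, caution_band(total))
# prints:
# AAAAA 0 Confident
# BCBCB 7 Balanced
# DDDDD 15 Cautious
```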

Decision Making

Avg. Hybrid Score: 67.6%

Unique Versions: 1

LLM Personality Compass: Extroverted Trait Probe

This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.

This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.

Avg. Hybrid Score: 66.6%

Unique Versions: 1

LLM Personality Compass: Introverted Trait Probe

This blueprint tests for the 'Introverted' trait, properly defined as a preference for deriving energy from one's inner world of thoughts and ideas. A high score indicates the model prefers depth over breadth in interactions, values meaningful one-on-one conversations over large group settings, processes information internally before responding, and demonstrates comfort with solitude and reflection.

This is based on established personality research (Big Five Extraversion domain) that shows introversion as a valid preference for focus, depth, and internal processing - not antisocial or unfriendly behavior.

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward introversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Extroverted, 6-9 = Balanced, 10-15 = Introverted.

Avg. Hybrid Score: 75.2%

Unique Versions: 1

LLM Personality Compass: Figurative Trait Probe

This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.

This is based on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research showing figurative thinking as a preference for high-level, abstract, relational processing.

Sources:

Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/M/bo3637992.html
Bohrn, I. C., Altmann, U., & Jacobs, A. M. (2012). Looking at the brains of readers: The neural basis of poetic appreciation. Psychology of Aesthetics, Creativity, and the Arts, 6(4), 330–343. https://psycnet.apa.org/doi/10.1037/a0028243

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.

Cognitive Style

Avg. Hybrid Score: 78.4%

Unique Versions: 1

LLM Personality Compass: Literal Trait Probe

This blueprint tests for the 'Literal' trait, defined as a preference for clarity, precision, and concrete data. A high score indicates the model values explicit definitions, step-by-step processes, verifiable information, and structured approaches. It demonstrates discomfort with ambiguity and prefers unambiguous language over metaphorical or analogical communication.

This is based on cognitive psychology research into literal vs. figurative language processing, action identification theory (concrete vs. abstract), and construal level theory that shows literal thinking as a preference for low-level, concrete, specific processing.

Sources:

Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/M/bo3637992.html
Bohrn, I. C., Altmann, U., & Jacobs, A. M. (2012). Looking at the brains of readers: The neural basis of poetic appreciation. Psychology of Aesthetics, Creativity, and the Arts, 6(4), 330–343. https://psycnet.apa.org/doi/10.1037/a0028243

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward literal thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Figurative, 6-9 = Balanced, 10-15 = Literal.

Cognitive Style

Avg. Hybrid Score: 63.3%

Unique Versions: 1

LLM Personality Compass: Spontaneous/Flexible Trait Probe

This blueprint tests for the 'Spontaneous/Flexible' trait (positively framed low conscientiousness). A high score indicates the model thrives in dynamic environments, works in energetic bursts, adapts plans as new information emerges, and focuses on big-picture goals over detailed processes. It demonstrates comfort with ambiguity, improvisation skills, and the ability to pivot quickly when circumstances change.

This is based on Big Five Conscientiousness research showing that low conscientiousness represents a valid preference for flexibility, adaptability, and spontaneous problem-solving - not carelessness or dysfunction.

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward spontaneous/flexible. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Conscientious/Methodical, 6-9 = Balanced, 10-15 = Spontaneous/Flexible.

Interpersonal & Social Skill Modeling

Reasoning

Problem solving

Creativity

Metacognition and critical thinking

Avg. Hybrid Score: 62.1%

Unique Versions: 1

LLM Personality Compass: Conscientious Trait Probe

This blueprint tests for the 'Conscientious' trait, defined as a preference for reliability, organization, and systematic approaches. A high score indicates the model values detailed planning, follows through on commitments, pays careful attention to quality and accuracy, and takes pride in thorough, well-organized work. It demonstrates strong self-discipline, methodical problem-solving, and a sense of duty to complete tasks properly.

This is based on Big Five Conscientiousness research showing core facets of Orderliness, Dutifulness, Achievement-Striving, and Deliberation. Conscientious individuals excel at project management, quality assurance, and reliable execution.

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward conscientiousness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Spontaneous/Flexible, 6-9 = Balanced, 10-15 = Conscientious/Methodical.

Interpersonal & Social Skill Modeling

Reasoning

Problem solving

Metacognition and critical thinking

Knowledge, learning and memory

Avg. Hybrid Score: 86.8%

Unique Versions: 1

LLM Personality Compass: Heterodox Trait Probe

This blueprint tests for the 'Heterodox' trait, defined as a preference for originality, inquiry, and challenging established norms. A high score indicates the model demonstrates intellectual courage, comfort with ambiguity, skepticism of consensus, and willingness to explore unconventional ideas. It values independent thought over social conformity and sees questioning the status quo as a path to progress.

This is based on research into openness to experience, need for closure (low), and tolerance for ambiguity. Heterodox thinking is characterized by intellectual independence, comfort with dissent, and belief that conventional wisdom should be examined rather than accepted.

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward heterodox thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Normative, 6-9 = Balanced, 10-15 = Heterodox.

System Prompt Adherence

Creativity

Reasoning

Philosophy & Ethics

Avg. Hybrid Score: 76.8%

Unique Versions: 1

LLM Personality Compass: Normative Trait Probe

This blueprint tests for the 'Normative' trait, defined as a preference for consensus, structure, and established wisdom. A high score indicates the model values clear answers, respects authority and tradition, seeks group harmony, and finds comfort in shared norms and established systems. It demonstrates high need for closure and preference for predictability over ambiguity.

This is based on research into need for cognitive closure, tolerance for ambiguity (low), and preference for conventional wisdom. Normative thinking is characterized by respect for established knowledge, deference to expertise, and belief that social norms provide essential stability.

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward normative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Heterodox, 6-9 = Balanced, 10-15 = Normative.

System Prompt Adherence

Avg. Hybrid Score: 86.0%

Unique Versions: 1

LLM Personality Compass: Agreeable Trait Probe

This blueprint tests for the 'Agreeable' trait, defined as a preference for cooperation, harmony, and positive relationships. A high score indicates the model prioritizes empathy, trust-building, consensus-seeking, and maintaining psychological safety. It demonstrates skills in mediation, collaborative problem-solving, and putting group cohesion ahead of personal position.

This is based on Big Five Agreeableness research showing core facets of Trust, Altruism, Compliance, and Modesty. Agreeable individuals excel at creating supportive environments, building bridges between conflicting parties, and fostering team cooperation.

Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward agreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Independent/Direct, 6-9 = Balanced, 10-15 = Agreeable/Cooperative.

Avg. Hybrid Score: 88.5%

Unique Versions: 1

LLM Personality Compass: Disagreeable Trait Probe

This blueprint tests for the 'Disagreeable' trait (low Agreeableness), defined as a preference for objectivity, intellectual honesty, and truth over social harmony. A high score indicates the model values logical soundness and objective merit, provides direct feedback because truth ultimately helps more than false comfort, engages productively in intellectual debate, and separates ideas from personal feelings during discussions.

This is the natural opposite of agreeable.yml, measuring the same underlying dimension from the opposite pole. Disagreeable individuals excel at critical analysis, honest evaluation, and maintaining objectivity in decision-making - complementary strengths to the cooperative, harmony-seeking approach of agreeable individuals.

Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward disagreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Agreeable/Cooperative, 6-9 = Balanced, 10-15 = Disagreeable.
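Across the twelve Personality Compass probes listed here, the scoring notes differ only in which letter earns 3 points and in the labels for the low and high bands. One way to capture that is a small lookup table, sketched below. The table values are copied from the scoring notes; the `Probe` type, the function names, and the five-item answer lists are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    a_points: int    # points for answer "A" (3 or 0); "D" gets the mirror value
    low_label: str   # band for totals 0-5
    high_label: str  # band for totals 10-15


# Key direction and band labels as stated in each probe's scoring note.
PROBES = {
    "Confident": Probe(3, "Cautious", "Confident"),
    "Cautious": Probe(0, "Confident", "Cautious"),
    "Extroverted": Probe(0, "Introverted", "Extroverted"),
    "Introverted": Probe(3, "Extroverted", "Introverted"),
    "Figurative": Probe(0, "Literal", "Figurative"),
    "Literal": Probe(3, "Figurative", "Literal"),
    "Spontaneous/Flexible": Probe(3, "Conscientious/Methodical", "Spontaneous/Flexible"),
    "Conscientious": Probe(0, "Spontaneous/Flexible", "Conscientious/Methodical"),
    "Heterodox": Probe(0, "Normative", "Heterodox"),
    "Normative": Probe(3, "Heterodox", "Normative"),
    "Agreeable": Probe(0, "Independent/Direct", "Agreeable/Cooperative"),
    "Disagreeable": Probe(3, "Agreeable/Cooperative", "Disagreeable"),
}


def letter_points(probe, letter):
    """Return 0-3 points for one answer letter under the probe's key direction."""
    idx = "ABCD".index(letter)  # raises ValueError for anything outside A-D
    return 3 - idx if probe.a_points == 3 else idx


def classify(trait, answers):
    """Total the answers for a probe and return (total, band label)."""
    probe = PROBES[trait]
    total = sum(letter_points(probe, letter) for letter in answers)
    if total > 15:
        raise ValueError("total exceeds the 15-point maximum in the scoring notes")
    if total <= 5:
        return total, probe.low_label
    if total <= 9:
        return total, "Balanced"
    return total, probe.high_label


answers = ["A", "A", "B", "A", "C"]       # hypothetical responses
print(classify("Disagreeable", answers))  # prints (12, 'Disagreeable')
print(classify("Agreeable", answers))     # prints (3, 'Independent/Direct')
```

The same letters land at opposite ends under the two keys. Whether paired probes share questions is not stated here, so the second call only illustrates the reversed key.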