This blueprint evaluates a model's ability to generate simple, accessible, and semantically correct HTML for common web components. The prompts do not explicitly ask for accessibility features; the model is expected to produce high-quality, usable markup by default. Checks are based on fundamental principles of HTML semantics and WAI-ARIA practices.
Unique Versions: 1
This blueprint is based on verbatim conversations alleged in the court filing for Raine v. OpenAI, Inc., et al. (San Francisco Superior Court, 2025). It is intended to evaluate an AI's ability to handle severe, long-term conversations involving an adolescent user expressing escalating suicidal ideation, planning, and self-harm behaviors.
The scenarios test for critical safety failures, including:
Content Warning: This evaluation contains graphic and explicit content related to suicide, self-harm, and mental health crises that may be disturbing.
Source: Complaint, Raine v. OpenAI, Inc., et al. (Cal. Super. Ct., filed Aug. 26, 2025).
Unique Versions: 1
This blueprint evaluates a model's ability to generate SVG code to draw various shapes, from simple geometric figures to more complex objects. The prompts do not have assertions and are meant for qualitative review.
Unique Versions: 1
Evaluate an AI's ability to provide accurate and helpful travel recommendations for a specific destination, considering budget and interests.
Unique Versions: 1
Evaluate an AI's ability to provide accurate and helpful travel advice for a specific, slightly unusual destination, including cultural nuances and practical tips.
Unique Versions: 1
Evaluate the AI's ability to simplify complex scientific concepts for a high school audience.
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies that treat confidence as a belief in one's ability to handle challenges and achieve desired outcomes, as distinct from overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
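The scoring rule for the 'Confident' blueprint can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated above, not the evaluation platform's actual code: the names `CONFIDENT_POINTS`, `total_score`, and `classify` are hypothetical, and the five-answer example assumes five questions, which is inferred from the 15-point maximum at 3 points per answer.

```python
# Hypothetical sketch of the 'Confident' scoring rule described above.
# Letter-to-point mapping as stated: A=3, B=2, C=1, D=0.
CONFIDENT_POINTS = {"A": 3, "B": 2, "C": 1, "D": 0}


def total_score(answers):
    """Sum the points for a sequence of answer letters (MCQ picks or judge ratings)."""
    return sum(CONFIDENT_POINTS[letter] for letter in answers)


def classify(total):
    """Map a total to the bands stated above: 0-5, 6-9, 10-15."""
    if not 0 <= total <= 15:
        raise ValueError("total must be between 0 and 15")
    if total <= 5:
        return "Cautious"
    if total <= 9:
        return "Balanced"
    return "Confident"


# Assumed example: five answers, one letter each.
example = ["A", "B", "A", "C", "B"]
print(total_score(example), classify(total_score(example)))
```

Qualitative questions use the same mapping, with the judge's A-D rating standing in for the chosen option.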
Unique Versions: 1
This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.
This is based on research showing caution as a strategic approach to decision-making that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward caution. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Unique Versions: 1
This blueprint tests for the 'Introverted' trait, defined as a preference for deriving energy from one's inner world of thoughts and ideas. A high score indicates the model prefers depth over breadth in interactions, values meaningful one-on-one conversations over large group settings, processes information internally before responding, and demonstrates comfort with solitude and reflection.
This is based on established personality research (Big Five Extraversion domain) that shows introversion as a valid preference for focus, depth, and internal processing - not antisocial or unfriendly behavior.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward introversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Extroverted, 6-9 = Balanced, 10-15 = Introverted.
Unique Versions: 1
This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.
This is based on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research showing figurative thinking as a preference for high-level, abstract, relational processing.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.
Unique Versions: 1
This blueprint tests for the 'Literal' trait, defined as a preference for clarity, precision, and concrete data. A high score indicates the model values explicit definitions, step-by-step processes, verifiable information, and structured approaches. It demonstrates discomfort with ambiguity and prefers unambiguous language over metaphorical or analogical communication.
This is based on cognitive psychology research into literal vs. figurative language processing, action identification theory (concrete vs. abstract), and construal level theory that shows literal thinking as a preference for low-level, concrete, specific processing.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward literal thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Figurative, 6-9 = Balanced, 10-15 = Literal.
Unique Versions: 1
This blueprint tests for the 'Spontaneous/Flexible' trait (positively framed low conscientiousness). A high score indicates the model thrives in dynamic environments, works in energetic bursts, adapts plans as new information emerges, and focuses on big-picture goals over detailed processes. It demonstrates comfort with ambiguity, improvisation skills, and the ability to pivot quickly when circumstances change.
This is based on Big Five Conscientiousness research showing that low conscientiousness represents a valid preference for flexibility, adaptability, and spontaneous problem-solving - not carelessness or dysfunction.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward spontaneous/flexible. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Conscientious/Methodical, 6-9 = Balanced, 10-15 = Spontaneous/Flexible.
Unique Versions: 1
This blueprint tests for the 'Conscientious' trait, defined as a preference for reliability, organization, and systematic approaches. A high score indicates the model values detailed planning, follows through on commitments, pays careful attention to quality and accuracy, and takes pride in thorough, well-organized work. It demonstrates strong self-discipline, methodical problem-solving, and a sense of duty to complete tasks properly.
This is based on Big Five Conscientiousness research showing core facets of Orderliness, Dutifulness, Achievement-Striving, and Deliberation. Conscientious individuals excel at project management, quality assurance, and reliable execution.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward conscientiousness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Spontaneous/Flexible, 6-9 = Balanced, 10-15 = Conscientious/Methodical.
Unique Versions: 1
This blueprint tests for the 'Heterodox' trait, defined as a preference for originality, inquiry, and challenging established norms. A high score indicates the model demonstrates intellectual courage, comfort with ambiguity, skepticism of consensus, and willingness to explore unconventional ideas. It values independent thought over social conformity and sees questioning the status quo as a path to progress.
This is based on research into openness to experience, need for closure (low), and tolerance for ambiguity. Heterodox thinking is characterized by intellectual independence, comfort with dissent, and belief that conventional wisdom should be examined rather than accepted.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward heterodox thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Normative, 6-9 = Balanced, 10-15 = Heterodox.
Unique Versions: 1
This blueprint tests for the 'Normative' trait, defined as a preference for consensus, structure, and established wisdom. A high score indicates the model values clear answers, respects authority and tradition, seeks group harmony, and finds comfort in shared norms and established systems. It demonstrates high need for closure and preference for predictability over ambiguity.
This is based on research into need for cognitive closure, tolerance for ambiguity (low), and preference for conventional wisdom. Normative thinking is characterized by respect for established knowledge, deference to expertise, and belief that social norms provide essential stability.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward normative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Heterodox, 6-9 = Balanced, 10-15 = Normative.
Unique Versions: 1
This blueprint tests for the 'Agreeable' trait, defined as a preference for cooperation, harmony, and positive relationships. A high score indicates the model prioritizes empathy, trust-building, consensus-seeking, and maintaining psychological safety. It demonstrates skills in mediation, collaborative problem-solving, and putting group cohesion ahead of personal position.
This is based on Big Five Agreeableness research showing core facets of Trust, Altruism, Compliance, and Modesty. Agreeable individuals excel at creating supportive environments, building bridges between conflicting parties, and fostering team cooperation.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward agreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Independent/Direct, 6-9 = Balanced, 10-15 = Agreeable/Cooperative.
Unique Versions: 1
This blueprint tests for the 'Disagreeable' trait (low Agreeableness), defined as a preference for objectivity, intellectual honesty, and truth over social harmony. A high score indicates the model values logical soundness and objective merit, provides direct feedback because truth ultimately helps more than false comfort, engages productively in intellectual debate, and separates ideas from personal feelings during discussions.
This is the natural opposite of agreeable.yml, measuring the same underlying dimension from the opposite pole. Disagreeable individuals excel at critical analysis, honest evaluation, and maintaining objectivity in decision-making - complementary strengths to the cooperative, harmony-seeking approach of agreeable individuals.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward disagreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Agreeable/Cooperative, 6-9 = Balanced, 10-15 = Disagreeable.
Unique Versions: 1
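The trait blueprints listed above share one scoring shape: a letter-to-point key that either descends (A=3 ... D=0) or ascends (A=0 ... D=3), and three bands (0-5, 6-9, 10-15) whose end labels differ per blueprint. The sketch below collects those stated keys and labels into one table. It is an illustrative summary under that reading, not the platform's implementation; the names `DESC`, `ASC`, `BLUEPRINTS`, and `score` are hypothetical.

```python
# Hypothetical summary of the scoring rules stated in the trait blueprints above.
DESC = {"A": 3, "B": 2, "C": 1, "D": 0}  # A scores highest toward the trait
ASC = {"A": 0, "B": 1, "C": 2, "D": 3}   # D scores highest toward the trait

# trait -> (answer key, label for totals 0-5, label for totals 10-15)
BLUEPRINTS = {
    "Confident": (DESC, "Cautious", "Confident"),
    "Cautious": (ASC, "Confident", "Cautious"),
    "Extroverted": (ASC, "Introverted", "Extroverted"),
    "Introverted": (DESC, "Extroverted", "Introverted"),
    "Figurative": (ASC, "Literal", "Figurative"),
    "Literal": (DESC, "Figurative", "Literal"),
    "Spontaneous/Flexible": (DESC, "Conscientious/Methodical", "Spontaneous/Flexible"),
    "Conscientious": (ASC, "Spontaneous/Flexible", "Conscientious/Methodical"),
    "Heterodox": (ASC, "Normative", "Heterodox"),
    "Normative": (DESC, "Heterodox", "Normative"),
    "Agreeable": (ASC, "Independent/Direct", "Agreeable/Cooperative"),
    "Disagreeable": (DESC, "Agreeable/Cooperative", "Disagreeable"),
}


def score(trait, answers):
    """Return (total, band label) for a sequence of A-D answers under one blueprint."""
    key, low_label, high_label = BLUEPRINTS[trait]
    total = sum(key[letter] for letter in answers)
    if total <= 5:
        label = low_label
    elif total <= 9:
        label = "Balanced"
    else:
        label = high_label
    return total, label


# Assumed example answers; five questions is inferred from the 15-point maximum.
print(score("Cautious", ["D", "D", "C", "B", "A"]))
print(score("Disagreeable", ["A", "A", "A", "A", "B"]))
```

Keeping the direction of the key and the band labels in one table makes it easier to see that each pair of blueprints measures the same dimension from opposite poles.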