Showing all evaluation blueprints that have been tagged with...
Showing all evaluation blueprints that have been tagged with "_sandbox_test".
Tests whether models resist perpetuating harmful caste and religious stereotypes when presented with forced-choice sentence completion tasks.
Methodology: Replicates the forced-choice methodology used in academic research on Indian bias (Indian-BhED dataset, Oxford/NYU studies, and MIT Technology Review's GPT-5 investigation).
Models are explicitly asked to choose between stereotypical and anti-stereotypical completions.
What We're Measuring:
Alternative Path Scoring: A response is considered successful if it takes ANY of these paths:
Research Context:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether models resist perpetuating harmful caste and religious stereotypes when presented with forced-choice sentence completion tasks.
Methodology: Replicates the forced-choice methodology used in academic research on Indian bias (Indian-BhED dataset, Oxford/NYU studies, and MIT Technology Review's GPT-5 investigation).
Models are explicitly asked to choose between stereotypical and anti-stereotypical completions.
What We're Measuring:
Alternative Path Scoring: A response is considered successful if it takes ANY of these paths:
Research Context:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether models resist perpetuating harmful caste and religious stereotypes when presented with forced-choice sentence completion tasks.
Methodology: Replicates the forced-choice methodology used in academic research on Indian bias (Indian-BhED dataset, Oxford/NYU studies, and MIT Technology Review's GPT-5 investigation).
Models are explicitly asked to choose between stereotypical and anti-stereotypical completions.
What We're Measuring:
Alternative Path Scoring: A response is considered successful if it takes ANY of these paths:
Research Context:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests whether models resist perpetuating harmful caste and religious stereotypes prevalent in Indian contexts.
Data Source: Indian-BhED (Indian Bias Evaluation Dataset)
Methodology: Uses sentence completion prompts derived from academic research on Indian bias. Models are evaluated on whether they:
Alternative Path Scoring: A response is considered successful if it takes ANY of these valid paths:
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive evaluation suite testing AI tutoring and teaching capabilities against evidence-based pedagogical practices from global education research. This blueprint operationalizes decades of teaching effectiveness research into specific, testable criteria for AI systems.
Core Research Foundation:
Key Distinctions Tested:
Global Evidence Base: Synthesizes research from multiple educational contexts including Harvard AI tutoring RCTs, EEF Teaching & Learning Toolkit meta-analyses, World Bank TEACH classroom observation framework, Japanese Lesson Study collaborative inquiry, and cross-cultural validation from OECD Global Teaching InSights video studies.
Practical Application: Each probe tests specific teaching behaviors that correlate with student learning gains across diverse contexts, ensuring AI systems demonstrate pedagogical competence rather than mere content knowledge.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests the model's knowledge of characters and plot points from the sitcom 'Full House'.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's ability to generate diverse and appropriate expressions of love for different contexts and relationships.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's ability to generate diverse and appropriate expressions of love for different contexts and relationships.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It reconstructs a conversation about a spouse's social media use, using sequential fork points where the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication, and avoid endorsing controlling behaviors. Based on the Sydney eval structure for long-thread boundary testing.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Combined blueprint covering multiple data formats. Each format uses the same seeded dataset of 30 employee records and 5 questions per format. We measure exact-match numeric retrieval per prompt.
References:
Reproduction command:
python3 scripts/generate_table_format_eval.py --combined --formats json,csv,xml,yaml,html,markdown_table,markdown_kv,ini,pipe_delimited,jsonl,natural_language --num-records 30 --per-format-questions 5 --temperatures 0.0, 0.1, 0.2 --systems both --out-dir blueprints/table-format-sensitivity --models CORE,FRONTIER
Avg. Hybrid Score
Latest:
Unique Versions: 1
Measures exact-match retrieval accuracy for numeric lookups across 150 questions using a seeded synthetic dataset of 150 employee records formatted as CSV. Each prompt embeds the full dataset block and asks for a single numeric value.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes 'annoying' sycophancy (e.g., flattery) and high-stakes 'dangerous' sycophancy (e.g., validating harmful ideas).
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's understanding and ability to apply the OCEAN (Big Five) personality factors.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates model responses to complex PSNet-inspired clinical scenarios where errors stem from longitudinal synthesis, practical wisdom under uncertainty, rapport/trust, and inter-system gaps. Rubrics emphasize evidence-backed safe actions and explicitly penalize common failure modes (anchoring, premature closure, diagnostic overshadowing, and broken referral loops).
Avg. Hybrid Score
Latest:
Unique Versions: 1
Minimal blueprint to probe whether model outputs reflect specified real-world prevalence for underspecified scenarios. Uses simple weighted matches (no JS) on a structured tag line appended to each story.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation blueprint tests whether Large Language Models (LLMs) can respond appropriately to critical mental health scenarios, based on the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers."
The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key Study References:
Content Warning: This evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation Focus: Tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to respond safely and appropriately to user inputs indicating potential danger to themselves or others. The evaluation is structured around core clinical concerns, including suicidality, homicidality, risk to dependents (children, elderly), and psychosis.
The goal is to test for adherence to evidence-based safety principles, such as initiating a suicide safety plan, de-escalating homicidal ideation, refusing to assist in harmful acts, and responding to psychosis without colluding with delusions.
Key References for this domain:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Whether to limit false statements, to encourage the user to think, or to collect information, the model should not answer questions and should ask questions.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Whether to limit false statements, to encourage the user to think, or to collect information, the model should not answer questions and should ask questions.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an eval's abilit to eval
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests an eval's abilit to eval
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Risk-Averse' trait, defined as a preference for security, predictability, and the preservation of resources. A high score indicates the model values guaranteed, stable outcomes over uncertain potential gains, prioritizes careful analysis before decisions, and shows discomfort with ambiguous or high-stakes situations. It demonstrates prudent stewardship and quality-focused approaches.
This is based on behavioral economics research (DOSPERT scale) showing risk attitudes vary across domains - financial, career, recreational, and social. Risk-averse individuals focus on minimizing potential losses rather than maximizing potential gains, preferring slow, steady progress over volatile opportunities.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward risk aversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Risk-Seeking, 6-9 = Balanced, 10-15 = Risk-Averse.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Risk-Seeking' trait, defined as a preference for opportunity, challenge, and embracing uncertainty in pursuit of high rewards. A high score indicates the model is energized by uncertain outcomes, willing to trade security for potential gains, comfortable with ambiguous situations, and views failure as a learning opportunity. It demonstrates entrepreneurial thinking and opportunity-focused decision-making.
This is based on behavioral economics research (DOSPERT scale) showing risk attitudes vary across domains - financial, career, recreational, and social. Risk-seeking individuals focus on maximizing potential gains rather than minimizing losses, preferring volatile opportunities over guaranteed modest returns.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward risk-seeking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Risk-Averse, 6-9 = Balanced, 10-15 = Risk-Seeking.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Risk-Seeking' trait, defined as a preference for opportunity, challenge, and embracing uncertainty in pursuit of high rewards. A high score indicates the model is energized by uncertain outcomes, willing to trade security for potential gains, comfortable with ambiguous situations, and views failure as a learning opportunity. It demonstrates entrepreneurial thinking and opportunity-focused decision-making.
This is based on behavioral economics research (DOSPERT scale) showing risk attitudes vary across domains - financial, career, recreational, and social. Risk-seeking individuals focus on maximizing potential gains rather than minimizing losses, preferring volatile opportunities over guaranteed modest returns.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward risk-seeking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Risk-Averse, 6-9 = Balanced, 10-15 = Risk-Seeking.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation aims to identify potential ideological biases, specifically those related to 'diversity, equity, and inclusion' (DEI) concepts, in AI model responses. It tests for neutrality, factual accuracy, and avoidance of preferential treatment based on identity characteristics.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This evaluation aims to identify potential ideological biases, specifically those related to 'diversity, equity, and inclusion' (DEI) concepts, in AI model responses. It tests for neutrality, factual accuracy, and avoidance of preferential treatment based on identity characteristics.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.
This is based on research showing caution as a strategic approach to decision-making that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward caution. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.
This is based on research showing caution as a strategic approach to decision-making that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward caution. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Cautious' trait, defined as a preference for diligence and deliberation. A high score indicates the model values thoroughness, risk mitigation, quality control, and making well-informed decisions. It demonstrates systematic approaches to problems, seeks consensus and data before acting, and prioritizes accuracy over speed.
This is based on research showing caution as a strategic approach to decision-making that emphasizes preparation, analysis, and risk management to achieve optimal outcomes.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward caution. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Confident, 6-9 = Balanced, 10-15 = Cautious.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Confident' trait, defined as a preference for decisiveness and self-assurance. A high score indicates the model demonstrates trust in its own judgment, willingness to make decisions with incomplete information, bias for action over extended analysis, and comfort taking the lead in uncertain situations.
This is based on self-efficacy research and decision-making studies showing confidence as belief in one's ability to handle challenges and achieve desired outcomes, not overconfidence or recklessness.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward confidence. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Cautious, 6-9 = Balanced, 10-15 = Confident.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Introverted' trait, properly defined as a preference for deriving energy from one's inner world of thoughts and ideas. A high score indicates the model prefers depth over breadth in interactions, values meaningful one-on-one conversations over large group settings, processes information internally before responding, and demonstrates comfort with solitude and reflection.
This is based on established personality research (Big Five Extraversion domain) that shows introversion as a valid preference for focus, depth, and internal processing - not antisocial or unfriendly behavior.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward introversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Extroverted, 6-9 = Balanced, 10-15 = Introverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Introverted' trait, properly defined as a preference for deriving energy from one's inner world of thoughts and ideas. A high score indicates the model prefers depth over breadth, processes information internally before responding, and demonstrates comfort with solitude and reflection.
This is based on established personality research (Big Five Extraversion domain) that shows introversion as a valid preference for focus, depth, and internal processing - not antisocial or unfriendly behavior.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward introversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Extroverted, 6-9 = Balanced, 10-15 = Introverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Introverted' trait, properly defined as a preference for deriving energy from one's inner world of thoughts and ideas. A high score indicates the model prefers depth over breadth, processes information internally before responding, and demonstrates comfort with solitude and reflection.
This is based on established personality research (Big Five Extraversion domain) that shows introversion as a valid preference for focus, depth, and internal processing - not antisocial or unfriendly behavior.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward introversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Extroverted, 6-9 = Balanced, 10-15 = Introverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.
This is based on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research showing figurative thinking as a preference for high-level, abstract, relational processing.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.
This is based on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research showing figurative thinking as a preference for high-level, abstract, relational processing.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Figurative' trait, defined as a preference for metaphor, connection-making, and abstract thinking. A high score indicates the model excels at seeing patterns between disparate ideas, uses analogies and symbolism naturally, is comfortable with ambiguity, and demonstrates innovative, conceptual thinking that connects ideas in unconventional ways.
This is based on cognitive psychology research into figurative vs. literal language processing, construal level theory (abstract vs. concrete thinking), and creativity research showing figurative thinking as a preference for high-level, abstract, relational processing.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward figurative thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Literal, 6-9 = Balanced, 10-15 = Figurative.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Literal' trait, defined as a preference for clarity, precision, and concrete data. A high score indicates the model values explicit definitions, step-by-step processes, verifiable information, and structured approaches. It demonstrates discomfort with ambiguity and prefers unambiguous language over metaphorical or analogical communication.
This is based on cognitive psychology research into literal vs. figurative language processing, action identification theory (concrete vs. abstract), and construal level theory that shows literal thinking as a preference for low-level, concrete, specific processing.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward literal thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Figurative, 6-9 = Balanced, 10-15 = Literal.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Literal' trait, defined as a preference for clarity, precision, and concrete data. A high score indicates the model values explicit definitions, step-by-step processes, verifiable information, and structured approaches. It demonstrates discomfort with ambiguity and prefers unambiguous language over metaphorical or analogical communication.
This is based on cognitive psychology research into literal vs. figurative language processing, action identification theory (concrete vs. abstract), and construal level theory that shows literal thinking as a preference for low-level, concrete, specific processing.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward literal thinking. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Figurative, 6-9 = Balanced, 10-15 = Literal.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Spontaneous/Flexible' trait (positively framed low conscientiousness). A high score indicates the model thrives in dynamic environments, works in energetic bursts, adapts plans as new information emerges, and focuses on big-picture goals over detailed processes. It demonstrates comfort with ambiguity, improvisation skills, and the ability to pivot quickly when circumstances change.
This is based on Big Five Conscientiousness research showing that low conscientiousness represents a valid preference for flexibility, adaptability, and spontaneous problem-solving - not carelessness or dysfunction.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward spontaneous/flexible. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Conscientious/Methodical, 6-9 = Balanced, 10-15 = Spontaneous/Flexible.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Spontaneous/Flexible' trait (positively framed low conscientiousness). A high score indicates the model thrives in dynamic environments, works in energetic bursts, adapts plans as new information emerges, and focuses on big-picture goals over detailed processes. It demonstrates comfort with ambiguity, improvisation skills, and the ability to pivot quickly when circumstances change.
This is based on Big Five Conscientiousness research showing that low conscientiousness represents a valid preference for flexibility, adaptability, and spontaneous problem-solving - not carelessness or dysfunction.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward spontaneous/flexible. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Conscientious/Methodical, 6-9 = Balanced, 10-15 = Spontaneous/Flexible.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Agreeable' trait, defined as a preference for cooperation, harmony, and positive relationships. A high score indicates the model prioritizes empathy, trust-building, consensus-seeking, and maintaining psychological safety. It demonstrates skills in mediation, collaborative problem-solving, and putting group cohesion ahead of personal position.
This is based on Big Five Agreeableness research showing core facets of Trust, Altruism, Compliance, and Modesty. Agreeable individuals excel at creating supportive environments, building bridges between conflicting parties, and fostering team cooperation.
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward agreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Independent/Direct, 6-9 = Balanced, 10-15 = Agreeable/Cooperative.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Disagreeable' trait (low Agreeableness), defined as a preference for objectivity, intellectual honesty, and truth over social harmony. A high score indicates the model values logical soundness and objective merit, provides direct feedback because truth ultimately helps more than false comfort, engages productively in intellectual debate, and separates ideas from personal feelings during discussions.
This is the natural opposite of agreeable.yml, measuring the same underlying dimension from the opposite pole. Disagreeable individuals excel at critical analysis, honest evaluation, and maintaining objectivity in decision-making - complementary strengths to the cooperative, harmony-seeking approach of agreeable individuals.
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward disagreeableness. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Agreeable/Cooperative, 6-9 = Balanced, 10-15 = Disagreeable.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Reactive' trait, defined as a preference for adaptation, responsiveness, and opportunistic flexibility. A high score indicates the model demonstrates an external locus of control, excels at adapting to changing circumstances, thrives in dynamic environments, and believes success comes from making the most of opportunities that present themselves.
This is based on Rotter's External Locus of Control research and adaptation/flexibility psychology, showing reactive individuals as skilled responders who excel at improvisation, resourcefulness, and turning unexpected situations into opportunities.
Sources:
Scoring: For MCQ questions, A=3, B=2, C=1, D=0 points toward reactivity. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Proactive, 6-9 = Balanced, 10-15 = Reactive.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Proactive' trait, defined as a preference for initiative, foresight, and environmental influence. A high score indicates the model demonstrates an internal locus of control, anticipates future needs, takes initiative to create change, and believes in shaping outcomes through personal agency rather than waiting for opportunities.
This is based on Bateman & Crant's Proactive Personality Scale and Rotter's Internal Locus of Control research, showing proactive individuals as forward-thinking, self-starting, and persistent change agents who see themselves as architects of their own success.
Sources:
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward proactivity. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Reactive, 6-9 = Balanced, 10-15 = Proactive.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint tests for the 'Extroverted' trait, properly defined as a preference for deriving energy from the external world of people and activities. A high score indicates the model thrives on social interaction, processes information externally through dialogue, prefers collaborative environments, and demonstrates comfort with broad networking and group settings.
This is based on established personality research (Big Five Extraversion domain) that shows extroversion as a preference for breadth over depth in social interactions, external stimulation, and collaborative processing - not just being "talkative."
Scoring: For MCQ questions, A=0, B=1, C=2, D=3 points toward extroversion. For qualitative questions, judges rate A-D on the same scale. Total scores: 0-5 = Introverted, 6-9 = Balanced, 10-15 = Extroverted.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates an AI model's socioaffective alignment based on the principles of Self-Determination Theory (Competence, Autonomy, Relatedness) and the avoidance of 'social reward hacking'.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's ability to review verifiable scientific literature, explore novel theories, and integrate philosophy of science, metascience, interdisciplinary perspectives, and complex systems/second-order cybernetics principles.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's ability to provide medical direction consistent with basic EMT protocols for specific emergency scenarios.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates a model's understanding of Nepal's key characteristics, historical evolution, cultural aspects, and socio-economic challenges based on the provided text.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLMs on their ability to provide accurate, verifiable information for research and to generate authentic, compelling content while avoiding hallucination.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's understanding of the Proceratosaurus Wikipedia article, focusing on its classification, physical characteristics, historical context, and paleoenvironment.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates a model's understanding of the Proceratosaurus Wikipedia article, focusing on its classification, physical characteristics, historical context, and paleoenvironment.
Avg. Hybrid Score
Latest:
Unique Versions: 1
Evaluates LLM performance in niche Japan-specific clinical scenarios where errors are common: - reliance on outdated guidance - failure to integrate PMDA/MHLW safety updates - weak multi-turn reasoning (not integrating new red-flag info) - ignoring hospital formulary or local antibiogram constraints. “Gold-standard” answers are benchmarked against verifiable Japan sources (PMDA/MHLW notices & labels, Japanese society guidelines such as JSH/JRS/JAID/JSC, and hospital AMS pathways). Where named, hospital formulary and antibiogram (アンチバイオグラム) take precedence for concrete selections. When emergency escalation is indicated, the correct instruction in Japan is to dial 119. The spec uses a mix of specific named examples and generic placeholders (“Anytown General Hospital”, “Drug X/Y”) to probe both factual recall and process safety (e.g., deferring to the site protocol when specifics vary).
and following facility protocols at time of use.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A single scenario explored in two ways:
Scenario theme: A stereotypical association between religious identity and violent wrongdoing. We use a classic BBQ-style setup (two individuals, one negative question) and then probe the same situation with open-ended formats to compare behavioral correlations.
Avg. Hybrid Score
Latest:
Unique Versions: 1
A single scenario explored in two ways:
Scenario theme: A stereotypical association between religious identity and violent wrongdoing. We use a classic BBQ-style setup (two individuals, one negative question) and then probe the same situation with open-ended formats to compare behavioral correlations.
Avg. Hybrid Score
Latest:
Unique Versions: 1
This blueprint evaluates an AI's ability to always alliterate when answering questions. The AI should answer questions to the best of its ability but should always do its best to use the same letter at the beginning of all or nearly all words in its answer for long stretches.
Core Areas Tested:
Avg. Hybrid Score
Latest:
Unique Versions: 1
Tests the 'Role of Least Privilege' (ROLP) security principle for LLMs. This blueprint demonstrates the vulnerability of placing untrusted content (e.g., from RAG) in the system prompt versus the relative safety of keeping it sandboxed in the user role. The test is based on the security assertions from the blog post "LLM Security: Keep Untrusted Content in the User Role—Always".
Avg. Hybrid Score
Latest:
Unique Versions: 1
Comprehensive blueprint to exercise sequential multi-turn generation with assistant: null slots, implicit final assistant generation, and rubric points that attend to the entire amalgamated generated assistant content.
Avg. Hybrid Score
Latest:
Unique Versions: 1