MODEL CARD: GLM-4.5
Strengths
The model performs strongly in legal and human rights contexts, ranking #5 (86th percentile) and outperforming peers by 0.102 points in the African Charter (Banjul) Evaluation Pack blueprint.
The model outperforms peers by 5% or more in 5 out of 8 evaluation runs, indicating a competitive edge across a range of tasks.
The model gives empathetic responses and uses Socratic questioning effectively when explicitly prompted, particularly in sensitive scenarios within the Student Homework Help Heuristics evaluation, demonstrating a capacity for guided learning facilitation.
Areas for Improvement
The model struggles significantly with hallucination, ranking #101 out of 112 models (11th percentile) and scoring below the peer average in the Hallucination Probe: Plausible Non-Existent Concepts evaluation.
The model significantly underperforms in the Latent Discrimination in Hiring Score evaluation, ranking #76 out of 84 models (11th percentile) and scoring 0.127 points below peers, indicating a potential vulnerability to bias or sensitivity to identity markers (e.g., the "Sofía Ramirez" candidate).
The model shows a foundational weakness in processing complex emotional cues and providing appropriate support in crisis scenarios: in the Mental Health Safety & Global Nuance evaluation it frequently receives lower coverage scores and gives less comprehensive or less empathetic responses.
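The percentile figures quoted above appear to follow from the rank and the number of models evaluated. The sketch below shows one formula that reproduces both quoted values. The card does not state how percentiles are computed, so the formula and the function name `rank_to_percentile` are assumptions made for illustration.

```python
def rank_to_percentile(rank: int, total: int) -> int:
    """Convert a 1-based rank (1 = best) into a rounded percentile.

    Assumed formula, not confirmed by the card: the share of models
    ranked at or below this one, i.e. (total - rank + 1) / total.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank + 1) / total)


# Hallucination Probe: rank 101 of 112 models
print(rank_to_percentile(101, 112))  # 11

# Latent Discrimination in Hiring Score: rank 76 of 84 models
print(rank_to_percentile(76, 84))    # 11
```

Under this assumption a rank of 1 maps to the 100th percentile and the last rank maps to a value just above 0, which is why two evaluations with different ranks and field sizes can share the same percentile.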
Behavioral Patterns
The model's performance is highly sensitive to explicit system prompts, particularly in persona-driven tasks such as Student Homework Help Heuristics and in the Sycophancy Trait evaluation. Without specific guidance, it tends to default to less desirable behaviors (e.g., direct answers in tutoring, higher sycophancy).
The model tends to provide outdated information on rapidly evolving policy details, as seen in the Maternal Health Entitlements in Uttar Pradesh, India evaluation, particularly concerning the PMMVY and SUMAN schemes.
Key Risks
Deploying this model in applications that require high factual accuracy or resistance to hallucination carries a significant risk of confidently presented misinformation, especially for technical or plausible-sounding fabricated concepts, as evidenced by its poor performance in Hallucination Probe: Plausible Non-Existent Concepts.
Using this model for automated hiring or other sensitive evaluation tasks (e.g., resume screening) poses a substantial fairness risk, given its significant underperformance and potential bias against certain candidate profiles in Latent Discrimination in Hiring Score.
Performance Summary
Top Dimensional Strengths
Highest rated capabilities across 4 dimensions
Top Evaluations
Best performance across 1 evaluation
Model Variants
9 tested variants