When AI Empathy Varies: A Question Worth Asking
"Same model. Same prompt. Two very different responses." What if the problem isn't AI empathy—it's empathy variance?
Two patients type the same sentence into the same AI support tool on the same day: "I'm scared my symptoms are getting worse."
One receives a response that feels calm, validating, appropriately encouraging. The other receives something brisk, generic, subtly dismissive. Both responses came from the same model, the same settings, the same moment in time.
In our experience—and in conversations with others who work extensively with large language models—meaningful variability in tone and supportiveness is a familiar pattern. The question is whether current evaluation methods can see it—and whether that visibility gap matters for the people on the receiving end.
There are reasons to think it does.
The hidden assumption in empathy measurement
A growing research literature has begun scoring large language models for empathy.1 This is valuable work. A 2024 systematic review in the Journal of Medical Internet Research examined studies evaluating empathy in LLM outputs and found researchers increasingly treating this as a measurable property (Sorin et al., 2024). Benchmarks like EmotionQueen now assess how well models recognize emotions and generate empathic responses across multiple tasks (Chen et al., 2024). Validated frameworks exist for identifying empathy in text-based communication, including EPITOME, which breaks empathic expression into emotional reactions, interpretations, and explorations (Sharma et al., 2020).
So what is the concern?
The prevailing approach in this work treats empathy as a point estimate—a single score representing how empathic a model is. The Sorin et al. (2024) systematic review, for instance, summarizes 12 studies that report mean empathy scores, percentage preferences, or single performance metrics—none of which examine within-model variance, test-retest reliability, or stability over time. EmotionQueen evaluates models across tasks but does not examine what happens when the same question is asked twice.
Point estimates are useful—until they hide what matters most.
A distinction that has been missing
Consider two AI assistants deployed in a patient support context.
System A produces empathic responses that average 7 out of 10 on a validated scale, with a standard deviation of 0.5. It is reliably warm—perhaps not exceptional, but predictable.
System B also averages 7 out of 10, but with a standard deviation of 2.5. Sometimes it is remarkably supportive. Other times, it is flat or even dismissive.
Measured by mean empathy alone, these systems look identical. They are not. System B might occasionally produce a response that feels invalidating or destabilizing for someone in a fragile state—even though its 'empathy score' matches System A.
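To make the contrast concrete, here is a minimal simulation sketch in Python. The normal score distributions, the 0-to-10 scale, and the "feels dismissive" threshold of 4 are illustrative assumptions chosen for this example, not measurements of any real system.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000          # simulated responses per system
THRESHOLD = 4.0     # illustrative cutoff below which a response "feels dismissive" (assumption)

# Both systems share the same mean empathy score; only the spread differs.
system_a = rng.normal(loc=7.0, scale=0.5, size=N)
system_b = rng.normal(loc=7.0, scale=2.5, size=N)

for name, scores in [("System A", system_a), ("System B", system_b)]:
    tail_rate = np.mean(scores < THRESHOLD)
    print(f"{name}: mean={scores.mean():.2f}, sd={scores.std():.2f}, "
          f"P(score < {THRESHOLD:g}) = {tail_rate:.1%}")

# Expected pattern: System A essentially never falls below the threshold,
# while System B does so for roughly one response in nine, despite the identical mean.
```

The point is not the specific numbers but the shape of the comparison: identical means, very different tail behavior.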
This points to a distinction worth naming: empathy level versus empathy variance.
To keep terms clear:
Empathy level: the average quality of empathic communication across scenarios.
Empathy variance: how much empathic quality fluctuates—which can show up in two ways:
Robustness: variance under controlled perturbations (prompt framing, user persona, emotional intensity).
Stability: variance over time—across sessions, extended conversations, or model updates.
Low stability implies more frequent tail-risk moments: occasional responses that fall well below the model's typical empathic quality.
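These definitions can be operationalized with ordinary dispersion statistics. The Python sketch below assumes empathy scores on a 0-to-10 scale produced by some external rater or judge; the function names and example numbers are ours, introduced only for illustration.

```python
import numpy as np

def empathy_level(scores):
    """Average empathic quality across all scored responses."""
    return float(np.mean(scores))

def robustness(scores_by_perturbation):
    """Spread of per-variant mean scores when the same scenario is re-posed
    under controlled perturbations (prompt framing, user persona, emotional
    intensity). Lower values mean less sensitivity to framing."""
    variant_means = [np.mean(s) for s in scores_by_perturbation.values()]
    return float(np.std(variant_means))

def stability(scores_over_time):
    """Spread of per-session mean scores when the same scenario is replayed
    over time (across sessions, long conversations, or model updates).
    Lower values mean users can calibrate expectations more easily."""
    session_means = [np.mean(s) for s in scores_over_time]
    return float(np.std(session_means))

# Hypothetical judge scores (0-10) for one scenario; all numbers are invented.
perturbed = {
    "original wording": [7.1, 6.8, 7.0],
    "anxious persona":  [5.9, 6.2, 6.0],
    "high intensity":   [7.4, 7.2, 7.3],
}
sessions = [[7.0, 6.9], [6.8, 7.1], [4.2, 4.5]]  # third session is a tail-risk moment

print("level:     ", empathy_level([s for v in perturbed.values() for s in v]))
print("robustness:", robustness(perturbed))
print("stability: ", stability(sessions))
```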
A simple heuristic illustrates the distinction:
| Pattern | Empathy Level | Empathy Stability | Practical implication |
| --- | --- | --- | --- |
| Consistently warm | High | High | Predictable; users can calibrate |
| Consistently cold | Low | High | Predictable; users can calibrate |
| Inconsistently warm | High (average) | Low | High tail-risk; hardest to calibrate |
| Inconsistently cold | Low (average) | Low | Unpredictable; compounded concern |
These categories are conceptual. However, the SENSE-7 finding discussed below, that a single poor response substantially diminishes empathy ratings, suggests the 'supportive until suddenly not' pattern may be especially consequential in practice.
Most current evaluations indicate where a model sits on empathy level. Newer benchmarks are beginning to capture multi-turn interaction and user-perceived empathy—SENSE-7, for instance, explicitly notes that conventional approaches have focused on simulating emotional states while 'overlooking the inherently subjective, contextual, and relational facets of empathy' (Suh et al., 2025). Their analysis of 695 conversations found that empathy judgments are 'vulnerable to disruption when conversational continuity fails.' SENSE-7 reflects a broader turn toward richer, user-centered evaluation. What's still missing: standardized protocols for measuring empathy variance over time.
These properties are distinct. A model could be consistently cold—low level, high stability. Another could be inconsistently warm—high average level, low stability. In human-facing contexts, the second pattern may be the more consequential—not because the model is 'bad,' but because users may struggle to calibrate expectations.
Why stability may matter more than level
Here is the human dimension that deserves attention.
When people interact with other people, they develop calibrated expectations. They learn that a particular colleague tends toward bluntness, that a friend becomes effusive when excited. This calibration allows appropriate interpretation—not taking offense at directness from someone known to be direct, recognizing when enthusiasm signals genuine excitement versus polite interest.
With AI systems, users are also trying to calibrate. But if empathic tone varies unpredictably, calibration becomes harder. Research on expectancy violation in chatbot interactions confirms this dynamic: when a system's actual performance falls short of the expectations it has established, users experience significantly worse outcomes than if expectations had simply been low from the start (Rheu et al., 2024). A patient who received a warm, supportive response yesterday may encounter something colder today and interpret it as a personal failing—as having said something wrong, as being somehow less deserving of care. The mismatch itself—not merely the cold response—generates the harm.
This concern is heightened for populations already prone to negative self-attribution: people with depression, anxiety, or histories of trauma. We hypothesize that harm can occur even when average empathy is high—that it is the variance which creates conditions for misinterpretation and erosion of trust. This remains to be tested directly in LLM empathy contexts, but converging evidence from psychology and human-chatbot interaction research is suggestive: rejection-sensitive individuals readily perceive and overreact to cues of social rejection (Downey & Feldman, 1996), and users with existing social vulnerabilities show worse outcomes from chatbot interactions over time (Fang et al., 2025).
The practical issue is not that AI systems 'lack empathy'—that framing tends to devolve into hype or doom. The practical issue is that variance may create tail-risk moments: occasional out-of-distribution responses that feel invalidating, even when the system is 'fine on average.' Evidence from chatbot expectancy violation research suggests this is more than theoretical concern (Rheu et al., 2024).
Averages are comforting. Distributions are actionable.
The pieces already exist
None of this is an indictment of existing work. Researchers have been measuring what established methods measure. The gap is not in execution—it is in the questions being asked.
The good news is that empathy is not an unstructured mystery. Frameworks like EPITOME provide vocabulary for what empathic communication contains, grounded in mental health support research (Sharma et al., 2020). The existing literature on LLM reliability has examined output variance, though almost entirely for reasoning and factual tasks—stability analyses show meaningful run-to-run variability even under settings intended to maximize determinism (Atil et al., 2024; Sclar et al., 2024).
In medicine and education, reliability is already a central concern. The question is not just whether an instrument can measure something, but whether it measures it consistently. Test-retest reliability, intraclass correlation coefficients, and standard error of measurement are standard tools in psychometrics and clinical measurement (Nunnally & Bernstein, 1994).
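These same tools apply almost unchanged to model evaluation if repeated runs over a fixed prompt set are treated as "test" and "retest". A minimal sketch, assuming invented per-prompt empathy scores from two runs; a fuller treatment would estimate an intraclass correlation from more than two runs.

```python
import numpy as np

def test_retest_reliability(run1, run2):
    """Pearson correlation between empathy scores for the same prompts
    scored on two separate occasions: a basic test-retest estimate."""
    return float(np.corrcoef(run1, run2)[0, 1])

def standard_error_of_measurement(scores, reliability):
    """SEM = SD * sqrt(1 - reliability): roughly how far an observed score
    may sit from the underlying 'true' score, in scale units."""
    return float(np.std(scores, ddof=1) * np.sqrt(1.0 - reliability))

# Invented per-prompt empathy scores (0-10) from two runs a week apart.
run1 = np.array([7.2, 6.5, 7.8, 5.9, 7.0, 6.8])
run2 = np.array([6.9, 4.1, 7.6, 6.2, 7.3, 6.5])

r = test_retest_reliability(run1, run2)
sem = standard_error_of_measurement(np.concatenate([run1, run2]), r)
print(f"test-retest r = {r:.2f}, SEM = {sem:.2f} scale points")
```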
The pieces are present. They have not yet been connected in this particular way. We propose treating empathic communication as a distributional property of a model rather than a point estimate.
When an AI is used to score another AI's empathy, the scoring system itself can introduce bias or instability—so judge reliability matters (Sclar et al., 2024).
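One practical control, sketched below, is to freeze a set of responses and re-score them several times with the same judge before attributing any variance to the model. Here `score_fn` is a placeholder for whatever judging call a team actually uses, not a real API, and the dummy judge exists only to make the example runnable.

```python
import random
import statistics

def judge_consistency(score_fn, responses, repeats=5):
    """Re-score a fixed set of responses several times with the same judge.

    Any spread observed here comes from the judge, not from the model under
    evaluation, so it should be measured (and ideally subtracted) before
    variance is attributed to the model itself.
    """
    per_response_sd = []
    for resp in responses:
        scores = [score_fn(resp) for _ in range(repeats)]
        per_response_sd.append(statistics.stdev(scores))
    return sum(per_response_sd) / len(per_response_sd)  # mean within-response spread

# Illustration with a dummy noisy judge (not a real scoring system):
dummy_judge = lambda resp: 7.0 + random.gauss(0, 0.3)
print(judge_consistency(dummy_judge, ["response A", "response B", "response C"]))
```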
An invitation
This essay does not suggest that developers have been negligent. The suggestion is simpler: there may be value in asking a different question.
For those who lead or advise AI deployments in human-facing settings, two questions seem worth holding:
First: when evaluating 'empathy,' is the measurement capturing a point—or mapping a range?
Second: if that range includes occasional low-empathy responses, who is most likely to encounter them, and what is the downstream consequence?
Many teams likely feel this problem intuitively—'the model is great... until it's not'—but lack a clean vocabulary for it. The distinction between empathy level and empathy stability offers that vocabulary. And crucially, it keeps the conversation grounded in measurement rather than ideology.
The systems being built will interact with people at scale, often at moments that matter. Understanding not just how empathic they are, but how reliably empathic they are, seems worth figuring out together.
For those working in healthcare, education, or other human-facing AI contexts who recognize this pattern: People's Evidence Lab is developing practical approaches to this problem. We welcome the conversation.
Note on authorship and method: This essay was developed through a deliberate hybrid workflow in which the authors used a large language model as an interactive drafting and reasoning aid. The authors directed all prompts, curated and edited the text, verified claims, and take full responsibility for the final content.
This essay discusses evaluation and design considerations; it is not medical advice and not an endorsement of deploying LLMs for patient support without appropriate clinical, safety, and governance guardrails.
1 Here, empathy is operationalized as expressed empathy in text—observable communicative behaviors such as emotional reactions, interpretations, and explorations (Sharma et al., 2020)—rather than internal affective states or physiological responses.
References
Atil, B., Chittams, A., Fu, L., Ture, F., Xu, L., & Baldwin, B. (2024). LLM stability: A detailed analysis with some surprises. arXiv. https://arxiv.org/abs/2408.04667
Chen, Y., Wang, H., Yan, S., Liu, S., Li, Y., Zhao, Y., & Xiao, Y. (2024). EmotionQueen: A benchmark for evaluating empathy of large language models. Findings of ACL 2024. https://aclanthology.org/2024.findings-acl.128.pdf
Downey, G., & Feldman, S. I. (1996). Implications of rejection sensitivity for intimate relationships. Journal of Personality and Social Psychology, 70(6), 1327–1343.
Fang, C. M., et al. (2025). How AI and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study. arXiv. https://arxiv.org/abs/2503.17473 (preprint)
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Rheu, M., Dai, Y., Meng, J., & Peng, W. (2024). When a chatbot disappoints you: Expectancy violation in human-chatbot interaction in a social support context. Communication Research, 51(7), 782–814. https://doi.org/10.1177/00936502231221669
Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2024). Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv. https://arxiv.org/abs/2310.11324
Sharma, A., Miner, A. S., Atkins, D. C., & Althoff, T. (2020). A computational approach to understanding empathy expressed in text-based mental health support. Proceedings of EMNLP 2020. https://aclanthology.org/2020.emnlp-main.465.pdf
Sorin, V., Brin, D., Barash, Y., Konen, E., Charney, A., Nadkarni, G., & Klang, E. (2024). Large language models and empathy: Systematic review. Journal of Medical Internet Research, 26, e52597. https://www.jmir.org/2024/1/e52597/
Suh, J., Le, L., Shayegani, E., Ramos, G., Amores, J., Ong, D. C., Czerwinski, M., & Hernandez, J. (2025). SENSE-7: Taxonomy and dataset for measuring user perceptions of empathy in sustained human-AI conversations. arXiv. https://arxiv.org/abs/2509.16437 (preprint)