Multi-Source Emotion Annotation in Children's Language: When LLM Consensus Diverges from Human Judgment

Farida Said1 and Jeanne Villaneau2
1LMBA, 2IRISA, Université de Bretagne Sud


Abstract

Automated emotion annotation increasingly relies on inter-LLM agreement as a proxy for label quality. We test this assumption on 2,106 clause-level segments from interviews with French-speaking children (ages 6-11) about parental roles, a setting where affect is often implicit rather than lexically explicit. Using a 500-segment expert gold standard, we show that internal consensus can be seriously misleading: Dawid-Skene, a probabilistic label aggregation method, estimates GPT-5.2 valence accuracy at 90.7%, whereas evaluation against the human gold standard yields 71.0%, revealing substantial overestimation driven by a shared neutralization bias. Conversely, Dawid-Skene underestimates Claude Sonnet 4, reversing the model ranking. Majority Vote, Dawid-Skene, and MACE produce near-identical consensus labels, suggesting that the main source of error lies in shared annotator bias rather than in the aggregation rule itself. We release the expert gold subset and the probabilistic corpus to support future work. Our results show that high inter-LLM agreement cannot replace external human validation for affect annotation.
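For readers unfamiliar with the aggregation method named above, the following is a minimal sketch of Dawid-Skene EM over categorical labels. It is an illustration only, not the implementation used in this paper; the array shapes, iteration count, and smoothing constant are assumptions.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """EM estimation of true labels from multiple annotators (Dawid-Skene).

    labels: (n_items, n_annotators) int array of observed class indices.
    Returns a (n_items, n_classes) array of posterior class probabilities.
    """
    n_items, n_annot = labels.shape

    # Initialize posteriors T from annotator vote proportions (majority-vote init).
    T = np.zeros((n_items, n_classes))
    for a in range(n_annot):
        T[np.arange(n_items), labels[:, a]] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per annotator,
        # with a small additive-smoothing constant (assumed value).
        pi = T.mean(axis=0)
        conf = np.full((n_annot, n_classes, n_classes), 1e-6)
        for a in range(n_annot):
            for i in range(n_items):
                conf[a, :, labels[i, a]] += T[i]
            conf[a] /= conf[a].sum(axis=1, keepdims=True)

        # E-step: update posteriors given priors and confusion matrices.
        logT = np.tile(np.log(pi), (n_items, 1))
        for a in range(n_annot):
            # conf[a][:, labels[:, a]] has shape (n_classes, n_items).
            logT += np.log(conf[a][:, labels[:, a]].T)
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T
```

On clean toy data the posterior argmax coincides with majority vote, which mirrors the paper's observation that the aggregation rule itself is not the main source of error when annotators share a bias.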