ArXiv TLDR

I'm Fine, But My Voice Isn't: Cross-Modal Affective Dissonance Detection for Reflective Journaling

arXiv:2604.27517

Sumin Lee

cs.HC

TLDR

This paper introduces Cross-Modal Affective Dissonance Detection (CADD) to identify emotional mismatches between text and voice in digital journaling.

Key contributions

  • Formalizes Cross-Modal Affective Dissonance Detection (CADD) for text-voice emotional mismatches.
  • Creates CADD-Journal, a 1,800-sample TTS dataset isolating acoustic signals from textual content.
  • Proposes DACM, a dual-encoder model with asymmetric cross-modal attention, achieving 0.711 macro-F1.
  • Identifies a significant domain gap between TTS-trained models and real speech, guiding future data collection.
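
The "asymmetric cross-modal attention" named in the DACM bullet above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the feature dimensions, sequence lengths, and the choice of which modality queries the other are assumptions made purely for illustration — the defining property is that attention flows in only one direction between the modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(acoustic, text):
    """One direction of cross-modal attention: acoustic frames act as
    queries over text token embeddings (keys/values). The reverse
    direction is deliberately absent -- that is the asymmetry."""
    d_k = text.shape[-1]
    scores = acoustic @ text.T / np.sqrt(d_k)  # (T_a, T_t) similarities
    weights = softmax(scores, axis=-1)         # each frame's distribution over tokens
    return weights @ text                      # text-conditioned acoustic features

rng = np.random.default_rng(0)
ac = rng.standard_normal((50, 64))  # 50 acoustic frames, dim 64 (assumed shapes)
tx = rng.standard_normal((12, 64))  # 12 text tokens, dim 64
fused = asymmetric_cross_attention(ac, tx)
print(fused.shape)  # (50, 64)
```

In a symmetric (or pooled) fusion both modalities would attend to each other or be averaged; the ablation cited above attributes most of the gain (+0.242 macro-F1) to restricting attention to a single direction.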

Why it matters

This work addresses the authenticity gap in digital journaling by detecting when the emotions expressed in text don't match the writer's vocal tone. It contributes new methods and a dataset to advance cross-modal emotion detection, which is crucial for mental-health applications, and the identified domain gap highlights key challenges for real-world deployment.

Original Abstract

Digital journaling creates an authenticity gap: users consciously translate raw emotions into text, often sanitizing narratives even in private writing. We formalize this as Cross-Modal Affective Dissonance Detection (CADD), a directional three-way classification distinguishing Masking (positive text, negative acoustics), Coping (negative text, positive acoustics), and Congruent utterances, grounded in Gross's process model of emotion regulation. We present three further contributions: (i) CADD-Journal, a 1,800-sample TTS dataset with a shared-sentence-pool design that provably isolates acoustic signal from textual content; (ii) DACM, a dual-encoder model with asymmetric cross-modal attention that resolves a gradient degeneracy in pooled fusion, achieving macro-F1 0.711, with a four-step ablation demonstrating that asymmetric attention is the dominant driver (+0.242) while the DIM is effective only on cross-modal features (+0.033); and (iii) a domain gap quantification: zero-shot evaluation across three naturalistic corpora reveals a substantial gap between TTS-trained models and real speech, and we identify two concrete requirements for future in-the-wild corpus construction. ReflectJournal, a proof-of-concept iOS application, operationalizes the framework and provides a deployment platform for naturalistic data collection.
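
The directional three-way labeling described in the abstract can be made concrete with a toy decision rule. The paper trains a classifier to predict these labels; the sketch below only illustrates the label semantics, assuming hypothetical per-modality valence scores in [-1, 1] and a hypothetical neutrality threshold `tau` — none of these appear in the paper.

```python
def cadd_label(text_valence: float, acoustic_valence: float, tau: float = 0.0) -> str:
    """Toy illustration of the CADD label semantics: compare the sign of
    emotional valence in each modality. Directionality matters -- which
    modality is positive decides Masking vs. Coping."""
    if text_valence > tau and acoustic_valence < -tau:
        return "Masking"    # positive text, negative acoustics
    if text_valence < -tau and acoustic_valence > tau:
        return "Coping"     # negative text, positive acoustics
    return "Congruent"      # both modalities agree (or are near-neutral)

print(cadd_label(0.8, -0.5))  # Masking
print(cadd_label(-0.6, 0.4))  # Coping
print(cadd_label(0.3, 0.2))   # Congruent
```

The asymmetry between Masking and Coping is why the task is called directional: swapping the two valences flips the label rather than leaving it unchanged.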
