Multimodal Emotion Diarization: Frame-Wise Integration of Text and Audio Representations

Tue 18.03 09:00 - 09:30

Abstract: Speech Emotion Diarization (SED) segments an audio stream into intervals of continuous emotional states, providing a finer-grained analysis than traditional Speech Emotion Recognition (SER), which assigns a single emotion label per utterance. This seminar presents a multimodal SED framework that integrates frame-level text and audio embeddings through temporal synchronization and direct concatenation, improving emotion tracking over time. The approach leverages WavLM for audio representations and EmoBERTa for text embeddings, aligned at the word level. To produce smoother predictions, a context-aware sliding-window mechanism refines the frame-wise emotion classification. The method is evaluated with the Emotion Diarization Error Rate (EDER), where it achieves 25%, a significant improvement over score-fusion and cross-attention baselines. This work advances emotion diarization by combining multimodal fusion with structured smoothing, paving the way for more accurate emotion tracking in real-world interactions.
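The abstract outlines three mechanical steps: word-aligned text embeddings are broadcast to the audio frame rate, concatenated with the audio frame embeddings, and the resulting frame-wise labels are smoothed with a sliding window. The sketch below illustrates these steps with synthetic stand-ins for WavLM and EmoBERTa outputs; all function names, dimensions, the 50 frames/sec rate, and the window size are illustrative assumptions, not the speaker's implementation.

```python
# Hypothetical sketch (not the speaker's code): frame-level fusion of precomputed
# audio and text embeddings, followed by sliding-window smoothing of frame labels.
import numpy as np

def upsample_word_embeddings(word_embs, word_spans, num_frames, frame_rate=50):
    """Broadcast each word-level text embedding to the audio frames it spans.

    word_embs  : (W, D_t) array, one embedding per word (e.g. from a text encoder).
    word_spans : list of (start_sec, end_sec) per word, e.g. from forced alignment.
    num_frames : number of audio frames (WavLM-style encoders emit ~50 frames/sec).
    """
    frame_text = np.zeros((num_frames, word_embs.shape[1]))
    for emb, (start, end) in zip(word_embs, word_spans):
        lo = int(start * frame_rate)
        hi = min(int(end * frame_rate) + 1, num_frames)
        frame_text[lo:hi] = emb  # every frame inside the word gets that word's embedding
    return frame_text

def fuse_frames(audio_frames, text_frames):
    """Direct concatenation of temporally aligned audio and text features."""
    assert audio_frames.shape[0] == text_frames.shape[0]
    return np.concatenate([audio_frames, text_frames], axis=-1)

def smooth_predictions(frame_labels, window=9):
    """Replace each frame label by the majority label inside a sliding window,
    suppressing spurious single-frame emotion switches (one simple way to
    realize 'context-aware' smoothing)."""
    half = window // 2
    smoothed = frame_labels.copy()
    for t in range(len(frame_labels)):
        ctx = frame_labels[max(0, t - half): t + half + 1]
        smoothed[t] = np.bincount(ctx).argmax()
    return smoothed

if __name__ == "__main__":
    T, D_a, D_t = 200, 768, 768            # illustrative dimensions only
    audio = np.random.randn(T, D_a)        # stand-in for WavLM frame embeddings
    words = np.random.randn(3, D_t)        # stand-in for EmoBERTa word embeddings
    spans = [(0.0, 1.2), (1.2, 2.5), (2.5, 4.0)]
    fused = fuse_frames(audio, upsample_word_embeddings(words, spans, T))
    print(fused.shape)                     # (200, 1536) -> input to a frame-wise classifier
    noisy = np.random.randint(0, 4, size=T)
    print(smooth_predictions(noisy, window=9)[:20])
```

Majority voting over a fixed window is only one possible instantiation of the context-aware smoothing mentioned in the abstract; the talk may use a different mechanism.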

Speaker

Ziv Tamir

Technion

  • Advisor: Oren Kurland

  • Academic Degree: M.Sc.