
Multimodal Speech Emotion Recognition Using Audio and Text

Cited 148 times in Web of Science; cited 215 times in Scopus
Authors

Yoon, Seunghyun; Byun, Seokhyun; Jung, Kyomin

Issue Date
2018-12
Publisher
IEEE
Citation
2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), pp.112-118
Abstract
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
ISSN
2639-5479
URI
https://hdl.handle.net/10371/186820
DOI
https://doi.org/10.1109/SLT.2018.8639583
Files in This Item:
There are no files associated with this item.