A Transformer-Based Pitch Contour Model for Speech Emotion Recognition

Abstract: As interaction between humans and artificial intelligence increases, speech emotion recognition (SER) has attracted growing attention. Transformer-based speech recognition models tend to lose pitch information in their upper layers. To compensate for this limitation, this study proposes the Pitch Contour Model (PCM), which explicitly integrates pitch information into the upper layers of the Transformer.
Dimensional speech emotion recognition is the task of predicting emotion along three dimensions: Valence (positive versus negative), Arousal (level of activation), and Dominance (degree of control). This study addresses the dimensional task in order to capture and analyze subtle changes in emotional state. Transformer-based models such as Wav2Vec 2.0 and HuBERT are strong at processing contextual information, but acoustic features, pitch in particular, weaken in their upper layers. Integrating pitch information was expected to improve Valence and Arousal performance, and the proposed model was also validated on Korean speech.
Experiments were conducted on the IEMOCAP, MSP-Podcast v1.11, and Aihub multimodal video datasets. The Pitch Contour Model improved overall on existing Transformer-based models across the Valence, Arousal, and Dominance dimensions and achieved state-of-the-art (SOTA) performance on some datasets. The benefit of the added pitch information was most pronounced for Valence and Arousal, and it also held on the Korean dataset.
In conclusion, this study demonstrates that pitch information is an important acoustic cue for speech emotion recognition and proposes an approach that compensates for a limitation of Transformer-based models. It also suggests a design direction for emotion recognition models applicable to both English and Korean speech.
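For readers unfamiliar with the term, a pitch contour is the frame-by-frame track of fundamental frequency (F0) over an utterance. The abstract does not describe how PCM extracts or fuses pitch, so the following is only an illustrative sketch under stated assumptions: the autocorrelation estimator, the function names `pitch_contour` and `append_pitch`, every parameter value, and the simple concatenation step are hypothetical and are not taken from the thesis.

```python
import numpy as np


def pitch_contour(signal, sr, frame_len=1024, hop=256, fmin=60.0, fmax=400.0):
    """Estimate a frame-wise F0 contour in Hz with a plain autocorrelation peak picker.

    Assumption: this estimator is a stand-in for illustration, not the thesis's extractor.
    """
    lag_min = int(sr / fmax)  # shortest period searched (highest pitch)
    lag_max = int(sr / fmin)  # longest period searched (lowest pitch)
    contour = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        if np.allclose(frame, 0.0):
            contour.append(0.0)  # silent frame: report 0 Hz (treated as unvoiced)
            continue
        # Autocorrelation, keeping only non-negative lags.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        contour.append(sr / lag)
    return np.array(contour)


def append_pitch(features, contour):
    """Resample the contour to the feature frame count and append it as one extra channel.

    Assumption: concatenation is one hypothetical fusion choice; `features` stands in
    for upper-layer Transformer outputs of shape (frames, dims).
    """
    src = np.linspace(0.0, 1.0, num=len(contour))
    dst = np.linspace(0.0, 1.0, num=features.shape[0])
    resampled = np.interp(dst, src, contour)
    return np.concatenate([features, resampled[:, None]], axis=1)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 200.0 * t)      # synthetic 200 Hz test tone
    f0 = pitch_contour(tone, sr)
    fused = append_pitch(np.zeros((49, 768)), f0)  # placeholder feature matrix
    print(f0[:3], fused.shape)
```

On a synthetic tone the estimated contour should stay close to the tone's frequency, and the fused array gains exactly one channel. A real system would likely need voicing detection and pitch normalization, neither of which is sketched here.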

Language: kor

URI: https://hdl.handle.net/10371/222130

https://dcollection.snu.ac.kr/common/orgView/000000189115

Files in This Item:

000000189115.pdf 0.88 MB

Appears in Collections:

College of Humanities (인문대학)
- Program in Cognitive Science (협동과정-인지과학전공)
  - Theses (Master's Degree_협동과정-인지과학전공)
