Vowel Duration and Fundamental Frequency Prediction for Automatic Transplantation of Native English Prosody onto Korean-accented Speech
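Vowel Duration and Fundamental Frequency Prediction for Automatic Prosody Transplantation (Korean title: 자동 운율 복제를 위한 모음 길이와 기본 주파수 예측)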
- Sabaleuski Matsvei
- College of Humanities, Program in Cognitive Science
- Issue Date
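- Graduate School, Seoul National University
- Thesis (Master's) -- Graduate School, Seoul National University: College of Humanities, Program in Cognitive Science, August 2018. 정민화.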
- The use of computers to help people improve their pronunciation of a foreign language has increased rapidly in recent decades. However, the majority of such Computer-Assisted Pronunciation Training (CAPT) systems have focused only on teaching the correct pronunciation of segments, while prosody has received much less attention. One of the new approaches to prosody training is self-imitation learning: prosodic features from a native utterance are transplanted onto the learner's own speech, which is then given back as corrective feedback. The main drawback is that this technique requires two identical sets of native and non-native utterances, which makes its actual implementation cumbersome and inflexible.
As preliminary research toward developing a new method of prosody transplantation, the first part of the study surveys previous related work and points out its advantages and drawbacks. We also compare the prosodic systems of Korean and English, identify the major areas in which Korean learners of English tend to make mistakes, and analyze the acoustic features that these mistakes are correlated with. We suggest that transplantation of vowel duration and fundamental frequency will be the most effective for self-imitation learning by Korean speakers of English.
The second part of this study introduces a newly proposed model for prosody transplantation. Instead of transplanting acoustic values from a pre-recorded utterance, we propose using a deep neural network (DNN) based system to predict them. Three different models are built and described: a baseline recurrent neural network (RNN), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. The models were trained on the Boston University Radio Speech Corpus using a minimal set of relevant input features. The models were compared with each other, as well as with state-of-the-art prosody prediction systems from speech synthesis research.
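To make the recurrent prediction idea concrete, the following is a minimal sketch of a GRU stepping over a sequence of per-vowel feature vectors and emitting two values per vowel (a duration and an F0 value). It is not the thesis implementation: the feature dimension, hidden size, random untrained weights, linear output layer, and all names (`GRUCell`, `predict_prosody`, `features`) are assumptions made for illustration, and the abstract does not specify the actual input features or network configuration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from a corpus."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard gated recurrent unit: update gate z, reset gate r, candidate n."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One (input weights, recurrent weights, bias) triple per gate.
        self.params = {
            gate: (
                rand_matrix(n_hidden, n_in, rng),
                rand_matrix(n_hidden, n_hidden, rng),
                [0.0] * n_hidden,
            )
            for gate in ("z", "r", "n")
        }

    def _affine(self, gate, x, h):
        W, U, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]
        r = [sigmoid(v) for v in self._affine("r", x, h)]
        # The reset gate scales the previous state before the candidate is computed.
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._affine("n", x, reset_h)]
        # The update gate interpolates between the old state and the candidate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def predict_prosody(cell, W_out, b_out, features):
    """Run the GRU over the vowel sequence; return (duration, f0) per vowel."""
    h = [0.0] * cell.n_hidden
    outputs = []
    for x in features:
        h = cell.step(x, h)
        y = [v + b for v, b in zip(matvec(W_out, h), b_out)]
        outputs.append((y[0], y[1]))
    return outputs


rng = random.Random(0)
cell = GRUCell(n_in=4, n_hidden=8, rng=rng)
W_out = rand_matrix(2, 8, rng)  # hypothetical linear output layer
b_out = [0.0, 0.0]

# Hypothetical feature vectors for three vowels (e.g. stress flag, position).
features = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 1.0, 0.2],
    [0.0, 0.0, 1.0, 0.9],
]
predictions = predict_prosody(cell, W_out, b_out, features)
```

Because the weights here are untrained, the numbers in `predictions` carry no meaning; the sketch only shows how a recurrent state lets each vowel's prediction depend on the vowels before it.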
The implementation of the proposed prediction model in automatic prosody transplantation is described, and the results are analyzed. A perceptual evaluation by native speakers was carried out, in which accentedness and comprehensibility ratings of modified and original non-native utterances were compared. The results showed that duration transplantation can lead to improvements in comprehensibility scores. This study lays the groundwork for a fully automatic self-imitation prosody training system, and its results can be used to help Korean learners master problematic areas of English prosody, such as sentence stress.
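As a rough illustration of what transplanting predicted values onto learner speech involves, the sketch below computes per-vowel modification ratios from measured learner values and target values. The function name `modification_ratios` and all numbers are hypothetical; the abstract does not state how the thesis performs resynthesis, so this shows only the ratio computation that a separate resynthesis step would consume.

```python
def modification_ratios(learner_values, target_values):
    """Return target/learner ratios, one per aligned vowel.

    Applied to durations, a ratio above 1 means the vowel should be
    lengthened; applied to F0, it means the pitch should be raised.
    """
    if len(learner_values) != len(target_values):
        raise ValueError("learner and target sequences must be aligned")
    ratios = []
    for learner, target in zip(learner_values, target_values):
        if learner <= 0:
            raise ValueError("learner values must be positive")
        ratios.append(target / learner)
    return ratios


# Hypothetical vowel durations in milliseconds for a three-vowel utterance.
learner_durations_ms = [80, 120, 100]
predicted_durations_ms = [120, 60, 100]
duration_ratios = modification_ratios(learner_durations_ms, predicted_durations_ms)

# Hypothetical mean F0 per vowel in Hz.
learner_f0_hz = [200, 200, 160]
predicted_f0_hz = [250, 150, 160]
f0_ratios = modification_ratios(learner_f0_hz, predicted_f0_hz)
```

In this example the first vowel would be lengthened by half and raised in pitch, the second shortened and lowered, and the third left unchanged, which is the kind of adjustment that could shift perceived sentence stress.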