Style-Adaptive Speech Synthesis Utilizing Supplementary Information (Korean title: 보완 정보를 이용한 스타일 적응 음성 합성 기법)
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 김남수 (Nam Soo Kim) | - |
dc.contributor.author | June Sig Sung | - |
dc.date.accessioned | 2017-07-13T06:57:21Z | - |
dc.date.available | 2017-07-13T06:57:21Z | - |
dc.date.issued | 2013-02 | - |
dc.identifier.other | 000000009677 | - |
dc.identifier.uri | https://hdl.handle.net/10371/118900 | - |
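dc.description | Thesis (Ph.D.) -- Seoul National University Graduate School: School of Electrical and Computer Engineering, February 2013. Advisor: 김남수 (Nam Soo Kim). | - |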
dc.description.abstract | Parameter adaptation is needed to reduce the mismatch between training and test conditions caused by speaker variability, channel differences and environmental effects. In particular, speech expressing various emotional states and singing voice with different musical notes exhibit temporal and spectral characteristics quite different from those of reading-style speech. However, it is practically difficult to collect a large amount of data for every speaking (or singing) style. For that reason, a small amount of data obtained from the target condition is usually used to adapt the parameters of speech recognition or speech synthesis systems.
This thesis improves the performance of conventional maximum likelihood linear regression (MLLR) adaptation by utilizing supplementary information. Because conventional MLLR cannot use the pitch and duration information of singing voice or the level of emotional intensity of expressive speech, its performance degrades when it is applied to singing voice or expressive speech synthesis. Furthermore, conventional MLLR does not guarantee good performance when the mapping is nonlinear, because it assumes a linear mapping between an initial model set and the adaptation data. In the first approach, we extend conventional MLLR to the factored MLLR (FMLLR) framework, in which each MLLR parameter is defined as a function of a control parameter vector. We further improve FMLLR by generalizing its structure beyond the diagonal assumption. We also present a training method, based on the expectation-maximization (EM) algorithm, that estimates the generalized FMLLR parameters, and performance evaluation demonstrates the effectiveness of the proposed technique. In the second approach, to enhance the adaptation performance of conventional MLLR, we propose a nonlinear generalization of MLLR for HMM-based speech synthesis systems. The algorithm uses kernel methods to perform nonlinear regression between the mean vectors of the base model and the corresponding mean vectors of the adaptation data. Both exponential and polynomial kernels are considered, and a technique for learning their weights is also presented. One advantage of the proposed algorithm is that, apart from the nonlinear mapping of the base model, all remaining operations are linear. Its parameters can therefore be obtained by solving a system of linear equations in the same way as in conventional MLLR, which makes it easy to implement and keeps the computational load modest. | - |
dc.description.tableofcontents | Abstract; Contents; List of Figures; List of Tables
1 Introduction: 1.1 Background; 1.2 Scope of thesis; 1.3 Organization of thesis
2 HMM-based speech synthesis system: 2.1 Overview of HMM-based speech synthesis system; 2.2 Multi Space Probability Distribution; 2.3 Decision tree based context clustering (2.3.1 Decision tree; 2.3.2 Construction of decision tree); 2.4 Parameter generation algorithm (2.4.1 Maximizing $P(O \mid Q, \lambda)$ with respect to $O$)
3 Previous researches: 3.1 Excitation enhancement (3.1.1 Waveform Interpolation; 3.1.2 Excitation Modeling based on PCA); 3.2 Adaptation for speech synthesis (3.2.1 MLLR)
4 Factored MLLR adaptation: 4.1 Introduction; 4.2 MLLR adaptation; 4.3 MRHSMM combined with MLLR; 4.4 factored MLLR; 4.5 Experiments on speech synthesis (4.5.1 Experiments on Singing Voice Synthesis; 4.5.2 Experiments on Expressive Speech Synthesis); 4.6 Conclusions; 4.7 appendix
5 Non-linear generalization of MLLR with kernel learning for HMM-based speech synthesis: 5.1 Introduction; 5.2 MLLR; 5.3 Kernel MLLR; 5.4 Experiments (5.4.1 Objective performance evaluation); 5.5 Conclusion
6 Conclusion
A Statistical Approaches to Excitation Modeling in HMM-based Speech Synthesis: A.1 Waveform Interpolation; A.2 Time domain zero-padded WI; A.3 Low Dimensional Representation of CWs (A.3.1 Principal Component Analysis; A.3.2 Non-negative Matrix Factorization); A.4 Experiments for FDZ; A.5 Experiments for comparison between FDZ and TDZ (A.5.1 Performance of low-dimensional representation; A.5.2 Evaluation of objective measure; A.5.3 Subjective tests); A.6 Conclusions
Bibliography | - |
dc.format | application/pdf | - |
dc.format.extent | 3720399 bytes | - |
dc.format.medium | application/pdf | - |
dc.language.iso | en | - |
dc.publisher | Seoul National University Graduate School (서울대학교 대학원) | - |
dc.subject | HMM-based speech synthesis | - |
dc.subject | parameter adaptation | - |
dc.subject | MLLR | - |
dc.subject | factored MLLR | - |
dc.subject | kernel-based MLLR | - |
dc.subject | singing voice synthesis | - |
dc.subject | expressive speech synthesis | - |
dc.subject.ddc | 621 | - |
dc.title | Style-Adaptive Speech Synthesis Utilizing Supplementary Information | - |
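dc.title.alternative | 보완 정보를 이용한 스타일 적응 음성 합성 기법 (Korean title; literally "Style-adaptive speech synthesis technique using supplementary information") | - |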
dc.type | Thesis | - |
dc.contributor.AlternativeAuthor | 성준식 (June Sig Sung, in Korean script) | - |
dc.description.degree | Doctor | - |
dc.citation.pages | 118 | - |
dc.contributor.affiliation | College of Engineering, School of Electrical and Computer Engineering (공과대학 전기·컴퓨터공학부) | - |
dc.date.awarded | 2013-02 | - |
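
The abstract's central point is that conventional MLLR, its factored extension and the kernel generalization all reduce to solving a linear system once each base-model mean vector has been mapped to a feature vector. The following is a minimal sketch of that idea under loud simplifying assumptions: identity covariances, a single regression class and equal occupancy for every Gaussian, so that maximum-likelihood estimation collapses to ordinary least squares on (base mean, adaptation mean) pairs. The function names, the additive form W(c) = Σ_k c_k W_k used for the factored variant, and the exponential-plus-polynomial kernel with fixed weights are hypothetical illustrations, not the thesis's exact formulation (the abstract describes learning the kernel weights, which is not shown here).

```python
import numpy as np


def solve_transform(features, targets):
    """Estimate W so that targets[m] ~= W @ features[m] for every Gaussian m.

    Sketch assumption: identity covariances, one regression class and equal
    occupancy, so the ML transform is an ordinary least-squares solution.
    """
    X = np.asarray(features, dtype=float)  # (M, P): one feature row per Gaussian
    Y = np.asarray(targets, dtype=float)   # (M, D): adaptation-data mean per Gaussian
    # SVD-based least squares; numerically safer than forming X^T X explicitly.
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef.T                          # W has shape (D, P)


def mllr_features(means):
    """Conventional MLLR: extended mean vector xi = [1, mu], so mu_hat = W xi."""
    means = np.asarray(means, dtype=float)
    return np.hstack([np.ones((means.shape[0], 1)), means])


def factored_features(means, controls):
    """Hypothetical factored form W(c) = sum_k c_k W_k.

    Because W(c) xi = [W_1 ... W_K] (c kron xi), the stacked transforms are
    still found by linear least squares on Kronecker-product features.
    """
    xi = mllr_features(means)                  # (M, P)
    c = np.asarray(controls, dtype=float)      # (M, K): control vector per Gaussian
    return np.einsum("mk,mp->mkp", c, xi).reshape(xi.shape[0], -1)


def kernel_features(means, centers, weights=(0.5, 0.5), gamma=0.5, degree=2):
    """Hypothetical kernel map [1, mu, k(mu, z_1), ..., k(mu, z_J)].

    k is a fixed weighted sum of an exponential (Gaussian) kernel and a
    polynomial kernel. The linear MLLR terms are kept, so the least-squares
    fit on the same data is never worse than plain MLLR.
    """
    means = np.asarray(means, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sq_dist = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    k_exp = np.exp(-gamma * sq_dist)               # (M, J)
    k_poly = (1.0 + means @ centers.T) ** degree   # (M, J)
    gram = weights[0] * k_exp + weights[1] * k_poly
    return np.hstack([mllr_features(means), gram])


def fit_mse(features, targets):
    """Mean squared error of the least-squares fit on its own training pairs."""
    W = solve_transform(features, targets)
    return float(np.mean((features @ W.T - targets) ** 2))


# Usage on synthetic mean pairs (illustrative data, not results from the thesis).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                   # base-model mean vectors
A, b = rng.normal(size=(3, 3)), rng.normal(size=3)
W_mllr = solve_transform(mllr_features(base), base @ A.T + b)
print(np.allclose(W_mllr, np.hstack([b[:, None], A])))  # prints True

nonlinear = 2.0 * np.tanh(base) + 0.5              # a mismatch that is not linear
err_linear = fit_mse(mllr_features(base), nonlinear)
err_kernel = fit_mse(kernel_features(base, base[:40]), nonlinear)
print(err_linear, err_kernel)
```

Since every variant shares `solve_transform`, only the feature map changes between them, which mirrors the abstract's remark that the kernel method keeps MLLR's linear solve and confines the nonlinearity to the mapping of the base model.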
- Appears in Collections:
- College of Engineering/Engineering Practice School (공과대학/대학원) > Dept. of Electrical and Computer Engineering (전기·정보공학부) > Theses (Ph.D. / Sc.D._전기·정보공학부)
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.