
Model-Based and Data-Driven Techniques for Environment-Robust Automatic Speech Recognition
주변 환경에 강인한 음성인식을 위한 모델 및 데이터기반 기법

DC Field | Value | Language
dc.contributor.advisor | 김남수 | -
dc.contributor.author | 강신재 | -
dc.date.accessioned | 2017-07-13T07:11:23Z | -
dc.date.available | 2017-07-13T07:11:23Z | -
dc.date.issued | 2015-08 | -
dc.identifier.other | 000000067007 | -
dc.identifier.uri | http://hdl.handle.net/10371/119121 | -
dc.description | Thesis (Ph.D.) -- Seoul National University Graduate School : Department of Electrical and Computer Engineering, 2015. 8. 김남수. | -
dc.description.abstract | In this thesis, we propose model-based and data-driven techniques for environment-robust automatic speech recognition. The model-based technique is a feature enhancement method for reverberant noisy environments that improves the performance of a Gaussian mixture model (GMM)-hidden Markov model (HMM) system. It is based on the interacting multiple model (IMM) algorithm, which was originally developed for the single-channel scenario. We extend the single-channel IMM algorithm so that it can handle multi-channel inputs under a Bayesian framework. The multi-channel IMM algorithm is capable of tracking time-varying room impulse responses and background noises by updating the relevant parameters in an on-line manner. To limit the growth in computation as the number of microphones increases, a computationally efficient algorithm is also devised. The performance gain of the proposed method has been confirmed in various simulated and real environmental conditions.
The data-driven techniques are based on a deep neural network (DNN)-HMM hybrid system. To enhance the performance of the DNN-HMM system in adverse environments, we propose three techniques. Firstly, we propose a novel supervised pre-training technique for the DNN-HMM system to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. To achieve this, we first derive the abstract features from a DNN model that has been fine-tuned beforehand on a clean speech database. Using the derived abstract features as the target values, the standard error back-propagation algorithm with stochastic gradient descent is performed to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on the Aurora-4 DB, and better results were observed compared to a number of conventional pre-training methods.
Secondly, new DNN-based robust speech recognition approaches that take advantage of noise estimates are proposed. The novel part of the proposed approaches is that time-varying noise estimates are applied to the DNN as additional inputs. For this, we extract the noise estimates in a frame-by-frame manner from the IMM algorithm, which is known to perform well in tracking slowly-varying background noise. The performance of the proposed approaches was evaluated on the Aurora-4 DB, and better performance was observed compared to conventional DNN-based robust speech recognition algorithms.
Finally, a new approach to DNN-based robust speech recognition using soft target labels is proposed. Soft target labeling means that each target value of the DNN output is not restricted to 0 or 1 but takes a non-negative value in (0, 1), with the values summing to 1. In this study, the soft target labels are obtained from the forward-backward algorithm, which is well known in HMM training. The proposed method makes DNN training more robust in noisy and unseen conditions. The performance of the proposed approach was evaluated on the Aurora-4 DB and in various mismatched noise test conditions, and it was found to outperform the conventional hard target labeling method.
Furthermore, an integrated technique that combines the above three data-driven algorithms with the model-based technique is described, and its performance in matched and mismatched noise conditions is discussed. In matched noise conditions, the initialization method for the DNN was effective in enhancing recognition performance. In mismatched noise conditions, the combination of using the noise estimates as a DNN input and soft target labels showed the best recognition results among all the tested combinations of the proposed techniques. | -
dc.description.tableofcontents | Abstract i
Contents iv
List of Figures viii
List of Tables x
1 Introduction 1
2 Experimental Environments and Database 7
2.1 ASR in Hands-Free Scenario and Feature Extraction 7
2.2 Relationship between Clean and Distorted Speech in Feature Domain 10
2.3 Database 12
2.3.1 TI Digits Corpus 13
2.3.2 Aurora-4 DB 15
3 Previous Robust ASR Approaches 17
3.1 IMM-Based Feature Compensation in Noise Environment 18
3.2 Single-Channel Reverberation and Noise-Robust Feature Enhancement Based on IMM 24
3.3 Multi-Channel Feature Enhancement for Robust Speech Recognition 26
3.4 DNN-Based Robust Speech Recognition 27
4 Multi-Channel IMM-Based Feature Enhancement for Robust Speech Recognition 31
4.1 Introduction 31
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 33
4.3 Multi-Channel Feature Enhancement in a Bayesian Framework 35
4.3.1 A Priori Clean Speech Model 37
4.3.2 A Priori Model for RIR 38
4.3.3 A Priori Model for Background Noise 39
4.3.4 State Transition Formulation 40
4.3.5 Function Linearization 41
4.4 Feature Enhancement Algorithm 42
4.5 Incremental State Estimation 48
4.6 Experiments 52
4.6.1 Simulation Data 52
4.6.2 Live Recording Data 54
4.6.3 Computational Complexity 55
4.7 Summary 56
5 Supervised Denoising Pre-Training for Robust ASR with DNN-HMM 59
5.1 Introduction 59
5.2 Deep Neural Networks 61
5.3 Supervised Denoising Pre-Training 63
5.4 Experiments 65
5.4.1 Feature Extraction and GMM-HMM System 66
5.4.2 DNN Structures 66
5.4.3 Performance Evaluation 68
5.5 Summary 69
6 DNN-Based Frameworks for Robust Speech Recognition Using Noise Estimates 71
6.1 Introduction 71
6.2 DNN-Based Frameworks for Robust ASR 73
6.2.1 Robust Feature Enhancement 74
6.2.2 Robust Model Training 75
6.3 IMM-Based Noise Estimation 77
6.4 Experiments 78
6.4.1 DNN Structures 78
6.4.2 Performance Evaluations 79
6.5 Summary 82
7 DNN-Based Robust Speech Recognition Using Soft Target Labels 83
7.1 Introduction 83
7.2 DNN-HMM Hybrid System 85
7.3 Soft Target Label Estimation 87
7.4 Experiments 89
7.4.1 DNN Structures 89
7.4.2 Performance Evaluation 90
7.4.3 Effects of Control Parameter ξ 91
7.4.4 An Integration with SDPT and ESTN Methods 92
7.4.5 Performance Evaluation on Various Noise Types 93
7.4.6 DNN Training and Decoding Time 95
7.5 Summary 96
8 Conclusions 99
Bibliography 101
Abstract (in Korean) 108 | -
dc.format | application/pdf | -
dc.format.extent | 3306514 bytes | -
dc.format.medium | application/pdf | -
dc.language.iso | en | -
dc.publisher | Seoul National University Graduate School | -
dc.subject | Robust speech recognition | -
dc.subject | multi-channel | -
dc.subject | interacting multiple model (IMM) | -
dc.subject | dereverberation | -
dc.subject | pre-training | -
dc.subject | denoising | -
dc.subject | background noise estimation | -
dc.subject | deep neural network (DNN) | -
dc.subject | DNN-based regression | -
dc.subject | back-propagation | -
dc.subject | soft target labels | -
dc.subject.ddc | 621 | -
dc.title | Model-Based and Data-Driven Techniques for Environment-Robust Automatic Speech Recognition | -
dc.title.alternative | 주변 환경에 강인한 음성인식을 위한 모델 및 데이터기반 기법 | -
dc.type | Thesis | -
dc.description.degree | Doctor | -
dc.citation.pages | xii, 110 | -
dc.contributor.affiliation | College of Engineering, Department of Electrical and Computer Engineering | -
dc.date.awarded | 2015-08 | -
Appears in Collections:
College of Engineering/Engineering Practice School (공과대학/대학원) > Dept. of Electrical and Computer Engineering (전기·정보공학부) > Theses (Ph.D. / Sc.D._전기·정보공학부)

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
