Deep Learning Approach for Robust Voice Activity Detection and Speech Enhancement

강태균

Korean title: 잡음에 강인한 음성 구간 검출과 음성 향상을 위한 딥 러닝 기반 기법 연구 (A Study on Deep Learning-Based Techniques for Noise-Robust Voice Activity Detection and Speech Enhancement)

Authors: 강태균

Advisor: 김남수

Major: College of Engineering, Department of Electrical and Computer Engineering

Issue Date: 2017-02

Publisher: Seoul National University Graduate School

Keywords: Voice activity detection ; speech enhancement ; deep neural network

Description: Thesis (Ph.D.) -- Seoul National University Graduate School : Department of Electrical and Computer Engineering, 2017. 2. 김남수.

Abstract: Over the past decades, a number of approaches have been proposed to improve the performance of voice activity detection (VAD) and speech enhancement algorithms, which are crucial for speech communication and speech signal processing systems. In particular, the increasing use of machine learning-based techniques has led to more robust algorithms in low signal-to-noise ratio (SNR) conditions. Among these techniques, the deep neural network (DNN) has been one of the most popular.
While DNN-based techniques have been successfully applied to these tasks, the characteristics of the VAD and speech enhancement tasks are not fully incorporated into the DNN structures and objective functions. In this thesis, we propose novel training schemes and a post-filter for DNN-based VAD and speech enhancement. Unlike algorithms built on a basic DNN framework, the proposed algorithms combine knowledge from the signal processing and machine learning communities to improve DNN-based VAD and speech enhancement. In the following chapters, the environmental mismatch problem in VAD is addressed by applying multi-task learning to the DNN-based VAD. In addition, a DNN-based framework is proposed for speech enhancement, and a novel objective function and post-filter, both derived from characteristics of human auditory perception, improve the DNN-based speech enhancement algorithm.
First, in the VAD task, a DNN-based algorithm was recently proposed and outperformed traditional and other machine learning-based VAD algorithms. However, its performance sometimes deteriorates when the training and test environments do not match. To improve the performance of the DNN-based VAD in unseen environments, we adopt a multi-task learning (MTL) framework that consists of a primary VAD task and a subsidiary feature enhancement task. With the MTL framework, the DNN learns a denoising function in the shared hidden layers, which helps maintain VAD performance in mismatched noise conditions.
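The MTL idea in the paragraph above can be pictured as a single training loss that sums a primary VAD term and a weighted subsidiary feature enhancement term. The following is a minimal sketch under stated assumptions, not the thesis's formulation: the function name `mtl_loss`, the use of binary cross-entropy and mean square error, and the weight `aux_weight` are all hypothetical choices made for illustration.

```python
import math


def mtl_loss(vad_prob, vad_label, enhanced_feat, clean_feat, aux_weight=0.1):
    """Per-frame multi-task loss (illustrative assumption, not the thesis's loss).

    vad_prob      -- predicted speech-presence probability from the VAD output
    vad_label     -- ground-truth label, 1 for speech and 0 for non-speech
    enhanced_feat -- feature vector predicted by the subsidiary enhancement output
    clean_feat    -- clean reference feature vector
    aux_weight    -- hypothetical weight of the subsidiary task
    """
    eps = 1e-12
    p = min(max(vad_prob, eps), 1.0 - eps)  # clamp to keep log() finite
    # Primary task: binary cross-entropy on the VAD decision.
    ce = -(vad_label * math.log(p) + (1 - vad_label) * math.log(1.0 - p))
    # Subsidiary task: mean square error between enhanced and clean features.
    mse = sum((e - c) ** 2 for e, c in zip(enhanced_feat, clean_feat)) / len(clean_feat)
    return ce + aux_weight * mse
```

Because both terms are backpropagated through the shared hidden layers, the subsidiary term pushes those layers toward a denoised representation, while only the VAD output would be needed at test time.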
Second, the DNN-based framework is applied to speech enhancement by treating it as a regression task. The proposed DNN estimates the encoding vector of the conventional nonnegative matrix factorization (NMF)-based algorithm, and its performance is compared with that of the conventional NMF-based algorithm.
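To make the role of the encoding vector concrete, the sketch below shows how encoding vectors are commonly used in NMF-based enhancement once they are available: speech and noise spectra are reconstructed from their bases, and a Wiener-type gain is applied to the noisy magnitude spectrum. This is an assumption about the surrounding pipeline, not a description of the thesis's implementation. In the proposed approach a DNN would supply the encoding vectors; here they are simply passed in, and all names (`matvec`, `nmf_enhance`, the basis layout) are hypothetical.

```python
def matvec(basis, encoding):
    """Multiply a basis matrix (list of rows, one row per frequency bin) by an encoding vector."""
    return [sum(b * h for b, h in zip(row, encoding)) for row in basis]


def nmf_enhance(noisy_mag, speech_basis, noise_basis, speech_enc, noise_enc, eps=1e-12):
    """Apply a Wiener-type gain built from NMF reconstructions to one noisy frame.

    speech_enc and noise_enc stand in for the encoding vectors that a DNN
    would estimate in the regression setting (assumption for illustration).
    """
    speech_est = matvec(speech_basis, speech_enc)  # reconstructed speech spectrum
    noise_est = matvec(noise_basis, noise_enc)     # reconstructed noise spectrum
    # Per-bin gain: estimated speech share of the total estimated spectrum.
    return [y * s / (s + n + eps) for y, s, n in zip(noisy_mag, speech_est, noise_est)]
```

For example, with one speech basis vector concentrated in the first bin and one noise basis vector concentrated in the second, the gain keeps the first bin and suppresses the second.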
Third, a perceptually motivated objective function is proposed for DNN-based speech enhancement. In the proposed technique, the DNN is trained with a new objective function that consists of the Mel-scale weighted mean square error and the similarities of temporal and spectral variations between the enhanced and clean speech. The proposed objective function computes the gradients on a perceptually motivated non-linear frequency scale and alleviates the over-smoothing of the estimated speech.
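One way to picture the Mel-scale weighted mean square error term is a per-bin weight that follows the slope of the Mel scale, so that errors at low frequencies count more than errors at high frequencies. The sketch below is an assumption made for illustration: the abstract does not give the weighting formula, so the weights here are taken proportional to the derivative of the common mapping mel(f) = 2595 log10(1 + f/700), and the temporal and spectral variation terms are omitted. The names `mel_weights` and `mel_weighted_mse` and the 16 kHz default are hypothetical.

```python
def mel_weights(num_bins, sample_rate=16000):
    """Per-bin weights proportional to the slope of the Mel scale (assumed form).

    The slope of mel(f) = 2595 * log10(1 + f / 700) is proportional to
    1 / (700 + f); the weights are normalized to have mean 1.
    Requires num_bins >= 2.
    """
    nyquist = sample_rate / 2.0
    freqs = [k * nyquist / (num_bins - 1) for k in range(num_bins)]
    raw = [1.0 / (700.0 + f) for f in freqs]
    mean = sum(raw) / num_bins
    return [r / mean for r in raw]


def mel_weighted_mse(enhanced, clean, weights):
    """Weighted mean square error over the frequency bins of one frame."""
    return sum(w * (e - c) ** 2 for w, e, c in zip(weights, enhanced, clean)) / len(clean)
```

With these weights, a unit error in the lowest bin yields a larger loss (and hence a larger gradient) than the same error in the highest bin.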
Furthermore, a post-filter that adjusts the variance over frequency bins further compensates for the lack of contrast between spectral peaks and valleys in the enhanced speech. Conventional global variance (GV) equalization post-filters do not consider the spectral dynamics over frequency bins. To account for the contrast between spectral peaks and valleys in each enhanced speech frame, the proposed algorithm matches the variance over coefficients in the log-power spectral domain.
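The variance matching described above can be sketched for a single frame as follows: the log-power spectrum is rescaled around its mean so that its variance over frequency bins equals a target value. This is a minimal sketch under stated assumptions, not the thesis's post-filter. The function name `equalize_frame_variance` is hypothetical, and where the target variance comes from (for instance, statistics of clean speech) is an assumption that the abstract does not specify.

```python
import math


def equalize_frame_variance(log_power_frame, target_variance):
    """Rescale one log-power spectrum frame so its variance over bins matches a target.

    The frame mean is preserved; only the spread around the mean changes,
    which sharpens (or softens) the contrast between peaks and valleys.
    """
    n = len(log_power_frame)
    mean = sum(log_power_frame) / n
    var = sum((x - mean) ** 2 for x in log_power_frame) / n
    if var == 0.0:
        # A flat frame has no peaks or valleys to rescale.
        return list(log_power_frame)
    scale = math.sqrt(target_variance / var)
    return [mean + scale * (x - mean) for x in log_power_frame]
```

An over-smoothed frame has a smaller variance over bins than clean speech typically does, so the scale factor exceeds 1 and the spectral peaks and valleys move further apart.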
Finally, for the speech enhancement task, an integrated technique using the proposed perceptually motivated objective function and the post-filter is described. The performance of the conventional and proposed algorithms is discussed for matched and mismatched noise conditions, and the results of a subjective preference test on these algorithms are also provided.

Language: English

URI: https://hdl.handle.net/10371/119272

Files in This Item:

000000141796.pdf 3.88 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Electrical and Computer Engineering
  - Theses (Ph.D. / Sc.D., Dept. of Electrical and Computer Engineering)
