
Environmental Sound Classification and Disentangled Factor Learning for Speech Enhancement: 음성 향상을 위한 환경음 분류 및 팩터 분리 학습

dc.contributor.advisor: 김남수
dc.contributor.author: 배수현
dc.date.accessioned: 2020-05-19T08:04:23Z
dc.date.available: 2020-05-19T08:04:23Z
dc.date.issued: 2020
dc.identifier.other: 000000160062
dc.identifier.uri: https://hdl.handle.net/10371/168036
dc.identifier.uri: http://dcollection.snu.ac.kr/common/orgView/000000160062 (ko_KR)
dc.description: Ph.D. dissertation, 서울대학교 대학원 공과대학 전기·컴퓨터공학부, February 2020. Advisor: 김남수.
dc.description.abstract: Sounds carry a large amount of information about our everyday environment, the most prominent example being human speech. Environmental sound, however, can also be an important cue for understanding the surrounding environment in user-customized services. To applications that extract speech information, environmental sound is noise to be removed; to applications that extract environmental information, it is the object to be recognized. From this perspective, we propose deep learning-based acoustic environment classification and speech enhancement techniques.

The goal of acoustic scene classification is to classify a test recording into one of several predefined acoustic scene classes. In the last few years, deep neural networks (DNNs) have achieved great success in various learning tasks and have also been applied to the classification of environmental sounds. While DNNs show their potential in this classification task, they cannot fully utilize temporal information. In this thesis, we propose a neural network architecture that exploits sequential information. Long short-term memory (LSTM) layers extract sequential information from consecutive audio features, convolutional neural network (CNN) layers learn spectro-temporal locality from spectrogram images, and fully connected layers combine the outputs of the two networks to take advantage of their complementary features. Using the proposed combination structure, we achieve higher performance than the conventional DNN, CNN, and LSTM architectures.
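As a rough illustration of this parallel combination, a minimal PyTorch sketch is given below; the layer sizes, feature dimensions, and class count are illustrative assumptions rather than the configuration used in the thesis.

    import torch
    import torch.nn as nn

    class ParallelLSTMCNN(nn.Module):
        """Parallel LSTM and CNN branches fused by fully connected layers."""

        def __init__(self, n_mels=40, n_classes=15):
            super().__init__()
            # LSTM branch: consumes the sequence of frame-level audio features.
            self.lstm = nn.LSTM(input_size=n_mels, hidden_size=128,
                                num_layers=2, batch_first=True)
            # CNN branch: treats the spectrogram as a one-channel image and
            # learns spectro-temporal locality.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)))
            # Fully connected layers summarize the two branch outputs.
            self.fc = nn.Sequential(
                nn.Linear(128 + 64 * 4 * 4, 256), nn.ReLU(),
                nn.Linear(256, n_classes))

        def forward(self, spec):                 # spec: (batch, frames, mels)
            h_seq, _ = self.lstm(spec)           # sequential information
            h_lstm = h_seq[:, -1, :]             # summary at the last time step
            h_cnn = self.cnn(spec.unsqueeze(1))  # spectro-temporal features
            return self.fc(torch.cat([h_lstm, h_cnn.flatten(1)], dim=1))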

Overlapping acoustic event classification is the task of estimating multiple acoustic events in a mixed source. For non-overlapping event classification, many approaches have achieved great success using various feature extraction methods and deep learning models. In most real-life situations, however, acoustic events overlap, and different events may share similar properties, so detecting mixed sources simultaneously is a challenging problem. In this thesis, we propose a classification method for overlapping acoustic events that incorporates joint training with a source separation framework. Since overlapping acoustic events are mixed from multiple sources, we train a source separation model together with a multi-label classification model that estimates the types of the overlapping events. The source separation model is trained to reconstruct the target sources while minimizing the interference of overlapping events, and joint training achieves end-to-end optimization between the acoustic event source separation and the multi-label estimation.
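The joint training step can be sketched as a single end-to-end update over both models, as in the PyTorch sketch below; the separator and classifier networks, the number of event sources, the loss weight alpha, and all tensor shapes are hypothetical placeholders. The point is that the multi-label classification gradients flow back through the source separation model.

    import torch
    import torch.nn as nn

    # Hypothetical components: a separator mapping a mixture spectrogram frame
    # to per-event source estimates, and a multi-label event classifier.
    separator = nn.Sequential(nn.Linear(257, 512), nn.ReLU(),
                              nn.Linear(512, 257 * 4))     # 4 event sources
    classifier = nn.Sequential(nn.Linear(257 * 4, 256), nn.ReLU(),
                               nn.Linear(256, 4))          # 4 event labels

    recon_loss = nn.MSELoss()            # source reconstruction objective
    label_loss = nn.BCEWithLogitsLoss()  # multi-label estimation objective
    params = list(separator.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)

    def joint_step(mixture, target_sources, target_labels, alpha=0.5):
        # mixture: (batch, 257); target_sources: (batch, 4, 257);
        # target_labels: (batch, 4) multi-hot floats. alpha weights the
        # separation loss against the classification loss (assumed value).
        est_sources = separator(mixture)
        logits = classifier(est_sources)
        loss = (alpha * recon_loss(est_sources, target_sources.flatten(1))
                + (1 - alpha) * label_loss(logits, target_labels))
        optimizer.zero_grad()
        loss.backward()  # gradients reach both models: end-to-end optimization
        optimizer.step()
        return loss.item()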

Speech enhancement techniques aim to improve the quality and intelligibility of a speech signal degraded by additive background noise. Most recently proposed deep learning-based speech enhancement techniques have focused on designing the neural network architecture as a black box. However, it is often beneficial to understand what kinds of hidden representations the model has learned. Since real-world speech data are drawn from a generative process involving multiple entangled factors, disentangling the speech factor can lead the trained model to better speech enhancement performance. Building on the recent success of learning disentangled representations with neural networks, we explore a framework for disentangling speech and noise, which has not been exploited in conventional speech enhancement algorithms. In this thesis, we propose a novel noise-invariant speech enhancement method that manipulates the latent features to distinguish between speech and noise features in the intermediate layers using an adversarial training scheme. Experimental results show that our model successfully disentangles the speech and noise latent features. Consequently, the proposed model not only achieves better enhancement performance but also offers a more robust noise-invariant property than conventional speech enhancement techniques.
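One common way to realize such an adversarial scheme, in line with the domain adversarial training concept covered in Chapter 4, is a gradient reversal layer: a noise classifier is trained on the latent features while reversed gradients push the encoder toward noise-invariant representations. The PyTorch sketch below is a minimal illustration under that assumption; the encoder, enhancer, noise discriminator, noise-class count, and weight lam are placeholders, not the thesis architecture.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity on the forward pass; negated, scaled gradient on backward.
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None

    encoder = nn.Sequential(nn.Linear(257, 256), nn.ReLU())  # shared encoder
    enhancer = nn.Linear(256, 257)          # clean-speech (or mask) estimator
    noise_clf = nn.Linear(256, 10)          # predicts noise type (10 assumed)

    def disentangle_loss(noisy, clean, noise_label, lam=0.1):
        # noisy, clean: (batch, 257); noise_label: (batch,) long tensor.
        z = encoder(noisy)
        enh = nn.functional.mse_loss(enhancer(z), clean)
        # Reversed gradients train the encoder to fool the noise classifier,
        # encouraging latent speech features that are invariant to the noise.
        adv = nn.functional.cross_entropy(
            noise_clf(GradReverse.apply(z, lam)), noise_label)
        return enh + adv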
dc.description.abstract: Sounds occurring around us carry a great deal of information, the most representative example being human speech. Environmental sounds other than speech, however, can also be an important cue for understanding the surrounding environment in user-customized services. To applications that extract speech information, such environmental sounds act as noise to be removed; conversely, to applications that need to understand the surrounding environment, they are the object to be recognized. From this perspective, this thesis proposes deep learning-based acoustic environment classification and speech enhancement techniques.

First, for acoustic environment classification, we propose a model that combines a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Conventional deep neural network (DNN)-based models have the drawback of not exploiting the temporal information of acoustic signals. To overcome this, we use an LSTM structure to capture temporal information and combine it with a CNN structure that exploits the local time-frequency correlations of the acoustic signal. By training the two different models on complementary information, the combined model improves acoustic environment classification performance over conventional techniques.

Second, for classifying overlapped acoustic events, we propose a technique that applies source separation. In real life, different sound sources frequently overlap, which makes classification more difficult. To address this, we train a model that separates the overlapped acoustic events into sources, separately train a model that classifies each separated event, and finally combine the two models and retrain them jointly (joint training). The trained model effectively separates the overlapped sounds and thereby classifies each event more accurately.

Finally, we propose a speech enhancement technique that applies disentangled factor learning. While the techniques proposed above are applications that recognize environmental sounds, speech enhancement aims to remove all environmental sound other than speech. The proposed technique treats speech and noise as distinct factors, separates the two factors in the latent space, and estimates clean speech from the speech factor with the noise factor removed. This speech enhancement approach based on disentangled factor learning outperformed conventional deep learning-based speech enhancement techniques on several evaluation metrics. We also examined how environmental sound aware training, which exploits environmental sound classification information in advance, affects speech enhancement performance.
dc.description.tableofcontents:
1 Introduction 1
  1.1 Environmental Sound Classification 1
  1.2 Speech Enhancement 2
  1.3 Disentangled Factor Learning 4
  1.4 Outline of the Thesis 4
2 Deep Learning Models for Acoustic Scene Classification 7
  2.1 Introduction 7
  2.2 Long Short-Term Memory 9
  2.3 Parallel Combination of LSTM and CNN 10
    2.3.1 Feature Extraction 10
    2.3.2 LSTM Layers 13
    2.3.3 CNN Layers 13
    2.3.4 Connected Layer of LSTM and CNN 14
  2.4 Experiments 14
    2.4.1 Dataset and Measurement 14
    2.4.2 Neural Networks Setup 15
    2.4.3 Results and Discussion 17
  2.5 Summary 19
3 Overlapping Acoustic Event Classification Based on Joint Training with Source Separation 21
  3.1 Introduction 21
  3.2 Source Separation of Overlapping Acoustic Events 22
  3.3 Proposed Method Using Joint Training 24
    3.3.1 Source Separation Model 24
    3.3.2 Multi-Label Classification Model 26
    3.3.3 Joint Training Method 27
  3.4 Experiments 27
    3.4.1 Dataset and Data Augmentation 27
    3.4.2 Experimental Setup 28
    3.4.3 Evaluation of Source Separation 30
    3.4.4 Acoustic Event Classification Results 32
  3.5 Summary 35
4 Disentangled Feature Learning for Noise-Invariant Speech Enhancement 37
  4.1 Introduction 37
  4.2 Masking-Based Speech Enhancement 39
  4.3 Concept of Domain Adversarial Training 40
  4.4 Disentangling Speech and Noise Factors 42
    4.4.1 Neural Network Architecture 42
    4.4.2 Training Objectives 45
    4.4.3 Adversarial Training for Disentangled Features 46
  4.5 Experiments and Results 48
    4.5.1 Dataset and Feature Extraction 48
    4.5.2 Network Setup 49
    4.5.3 Objective Measures 52
    4.5.4 Performance Evaluation 52
    4.5.5 Analysis of Noise-Invariant Speech Enhancement 58
    4.5.6 Disentangled Feature Representations 59
  4.6 Summary 61
5 Conclusions 63
Bibliography 65
Abstract (in Korean) 78
Acknowledgements 81
dc.language.iso: eng
dc.publisher: 서울대학교 대학원
dc.subject.ddc: 621.3
dc.title: Environmental Sound Classification and Disentangled Factor Learning for Speech Enhancement
dc.title.alternative: 음성 향상을 위한 환경음 분류 및 팩터 분리 학습
dc.type: Thesis
dc.type: Dissertation
dc.contributor.department: 공과대학 전기·컴퓨터공학부
dc.description.degree: Doctor
dc.date.awarded: 2020-02
dc.identifier.uci: I804:11032-000000160062
dc.identifier.holdings: 000000000042▲000000000044▲000000160062▲