Speech Enhancement Using Noise-Robust Speech Representation Model

김지원


잡음에 강건한 발화 표상 모델을 활용한 음성 향상 : Speech Enhancement Using Noise-Robust Speech Representation Model


Authors: 김지원

Advisor: 이교구

Issue Date: 2022

Publisher: Seoul National University Graduate School (서울대학교 대학원)

Keywords: speech enhancement ; self-supervised speech representation ; speech synthesizer ; teacher model

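Description: Thesis (Master's) -- Seoul National University Graduate School : Graduate School of Convergence Science and Technology, Dept. of Intelligence and Information, 2022. 8. 이교구.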

Abstract: Deep-learning speech enhancement in the time domain has commonly adopted an autoencoder architecture, which encodes noisy speech into features and decodes them into denoised speech. Recently, models pre-trained with self-supervised learning have come into use in many fields. In the speech domain, such pre-trained models are used for tasks such as automatic speech recognition and emotion classification. Studies built on pre-trained models outperform earlier approaches, yet few have applied pre-trained models to speech enhancement. In this study, instead of an autoencoder with a structurally matched encoder and decoder, we propose a new speech enhancement architecture that combines a pre-trained self-supervised speech representation model with a speech synthesizer. The proposed model fine-tunes the pre-trained self-supervised speech representation model to extract an enhanced speech representation with noise features removed, and a speech synthesizer then synthesizes enhanced speech from that representation. We show that speech can be re-synthesized by a speech synthesizer from the clean speech representation output by a pre-trained self-supervised speech representation model. In addition, to train a target model that removes noise features from noisy speech representations and outputs enhanced representations from which clean speech can be synthesized, we use a pre-trained model that outputs high-quality speech representations as a teacher model. We discuss training methods that make the enhanced representations output by the target model match the clean representations output by the teacher model. Finally, we report the performance of the resulting speech enhancement model and discuss its limitations and the future work needed to overcome them.
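The teacher/target training idea in the abstract can be illustrated with a small sketch. The abstract does not say which distance is used to compare the two representations, so the L1 and cosine terms below, the weighting `alpha`, and all function names are assumptions made for illustration only, not the thesis's actual objective. The lists of frames stand in for the outputs of a teacher model on clean speech and a target model on noisy speech.

```python
import math
import random


def l1_distance(a, b):
    """Mean absolute difference between two equal-shape frame sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for xa, xb in zip(frame_a, frame_b):
            total += abs(xa - xb)
            count += 1
    return total / count


def cosine_distance(a, b):
    """Mean over frames of (1 - cosine similarity)."""
    dists = []
    for frame_a, frame_b in zip(a, b):
        dot = sum(x * y for x, y in zip(frame_a, frame_b))
        norm_a = math.sqrt(sum(x * x for x in frame_a))
        norm_b = math.sqrt(sum(y * y for y in frame_b))
        dists.append(1.0 - dot / (norm_a * norm_b + 1e-8))
    return sum(dists) / len(dists)


def representation_loss(target_repr, teacher_repr, alpha=0.5):
    """Hypothetical objective: alpha * L1 + (1 - alpha) * cosine distance."""
    return (alpha * l1_distance(target_repr, teacher_repr)
            + (1.0 - alpha) * cosine_distance(target_repr, teacher_repr))


random.seed(0)

# Stand-in for teacher(clean_speech): 4 frames of 3-dimensional features.
teacher_repr = [
    [1.0, 0.0, 2.0],
    [0.5, 1.5, -1.0],
    [2.0, 2.0, 0.0],
    [-1.0, 0.5, 0.5],
]

# Stand-in for target(noisy_speech) before fine-tuning:
# the teacher features perturbed by Gaussian noise.
target_repr = [[x + random.gauss(0.0, 0.3) for x in frame]
               for frame in teacher_repr]

loss_before = representation_loss(target_repr, teacher_repr)
loss_matched = representation_loss(teacher_repr, teacher_repr)
print("loss with noisy target representation:", loss_before)
print("loss when target matches teacher:", loss_matched)
```

In a real setup the teacher would stay frozen, the target would be fine-tuned by gradient descent on a loss of this kind, and the resulting representation would be passed to the speech synthesizer. This sketch only shows that the loss shrinks toward zero as the target representation approaches the teacher's.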

Language: kor

URI: https://hdl.handle.net/10371/188303

https://dcollection.snu.ac.kr/common/orgView/000000173118

Files in This Item:

000000173118.pdf 9.79 MB

Appears in Collections:

Graduate School of Convergence Science and Technology (융합과학기술대학원)
- Dept. of Intelligence and Information (지능정보융합학과)
  - Theses (Master's Degree_지능정보융합학과)
