Speech Synthesis System Using a Reformer Network

임형래

Seoul National University Library

DC Field	Value	Language
dc.contributor.advisor	김남수	-
dc.contributor.author	임형래	-
dc.date.accessioned	2020-10-13T02:52:54Z	-
dc.date.available	2020-10-13T02:52:54Z	-
dc.date.issued	2020	-
dc.identifier.other	000000161576	-
dc.identifier.uri	https://hdl.handle.net/10371/169291	-
dc.identifier.uri	http://dcollection.snu.ac.kr/common/orgView/000000161576	ko_KR
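dc.description	Thesis (Master's) -- Seoul National University Graduate School : College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. 김남수.	-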
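dc.description.abstract	Recently, neural-network-based end-to-end models have shown strong performance in speech synthesis. In particular, attention-based sequence-to-sequence models successfully perform acoustic modeling together with the alignment between text and spectrogram. Transformer-based speech synthesis models have also been reported to generate speech signals close to the human voice. However, these sequence-to-sequence models require large amounts of memory and computation, because the attention energy for each query is computed over the entire key sequence. To alleviate this problem, this thesis proposes speech synthesis based on the Reformer network. The Reformer network uses locality-sensitive hashing and reversible residual networks, so a model can be trained with more efficient memory use than with the Transformer. Through experiments, this thesis confirms that the Reformer network can train a speech synthesis model using almost half the memory of the Transformer network. Memory usage and objective and subjective performance evaluations were used to assess the experiments.	-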
dc.description.abstract	Recent end-to-end text-to-speech (TTS) systems based on deep neural networks (DNNs) have shown state-of-the-art performance in speech synthesis. In particular, attention-based sequence-to-sequence models have successfully improved the quality of the alignment between text and spectrogram. Leveraging this improvement, speech synthesis using a Transformer network was reported to generate human-like speech audio. However, such sequence-to-sequence models require intensive computing power and memory during training: the attention scores are calculated over the entire key sequence for every query, which increases memory usage. To mitigate this issue, we propose a speech synthesis model based on the Reformer network, which uses locality-sensitive hashing attention and a reversible residual network. As a result, we show that the Reformer network consumes almost half as much memory as the Transformer, which leads to fast convergence when training an end-to-end TTS system. We demonstrate these advantages with memory usage and with objective and subjective performance evaluations.	-
dc.description.tableofcontents	Chapter 1 Introduction 5; Chapter 2 Background 8; 2.1 Transformer TTS 8; 2.1.1 Feature extraction 8; 2.1.2 Attention based encoder and decoder 9; 2.1.3 Postnet 11; 2.1.4 Loss function 12; 2.2 Reformer network 13; 2.2.1 Locality-sensitive hashing attention 13; 2.2.2 Reversible residual network 15; 2.3 Forward attention 17; Chapter 3 Proposed method 20; 3.1 Memory efficient Reformer TTS 20; 3.1.1 Feature extraction 22; 3.1.2 Encoder 22; 3.1.3 Decoder 23; 3.1.4 PostNet 26; Chapter 4 Experiments 27; 4.1 Experimental setup 27; 4.2 Evaluation 27; Chapter 5 Conclusion and discussion 31; Abstract 35; Acknowledgments 36	-
dc.language.iso	kor	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	speech synthesis	-
dc.subject	attention-based text-to-speech	-
dc.subject	reformer network	-
dc.subject	음성합성 (speech synthesis)	-
dc.subject	어텐션 기반 종단형 음성합성 (attention-based end-to-end speech synthesis)	-
dc.subject	리포머 네트워크 (Reformer network)	-
dc.subject.ddc	621.3	-
dc.title	리포머 네트워크를 이용한 음성합성 시스템 (Speech Synthesis System Using a Reformer Network)	-
dc.title.alternative	Speech synthesis using reformer network	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.department	College of Engineering, Department of Electrical and Computer Engineering	-
dc.description.degree	Master	-
dc.date.awarded	2020-08	-
dc.identifier.uci	I804:11032-000000161576	-
dc.identifier.holdings	000000000043▲000000000048▲000000161576▲	-
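
The abstract in the record above names two Reformer components: locality-sensitive hashing (LSH) attention and the reversible residual network. The following is a minimal NumPy sketch of the ideas behind them, not the thesis implementation. The function names (`lsh_buckets`, `rev_forward`, `rev_inverse`), the array sizes, and the toy sub-layers `f` and `g` are all assumptions made for illustration. The hashing step uses the random-projection scheme commonly associated with the Reformer, where a vector's bucket is the argmax over the concatenation of its projections and their negations.

```python
import numpy as np


def lsh_buckets(vectors, n_buckets, rng):
    """Assign each row of `vectors` to one of `n_buckets` angular LSH buckets.

    Hypothetical helper: projects onto n_buckets // 2 random directions R and
    takes argmax over [xR, -xR], so vectors pointing in similar directions
    tend to share a bucket. Attention can then be restricted to each bucket
    instead of the full key sequence.
    """
    if n_buckets % 2 != 0:
        raise ValueError("n_buckets must be even")
    dim = vectors.shape[-1]
    projections = vectors @ rng.standard_normal((dim, n_buckets // 2))
    return np.argmax(np.concatenate([projections, -projections], axis=-1), axis=-1)


def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover the block inputs from its outputs by undoing the two sums."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


# Toy usage with made-up sizes: 6 query vectors of dimension 16, 4 buckets.
rng = np.random.default_rng(0)
queries = rng.standard_normal((6, 16))
print("bucket ids:", lsh_buckets(queries, 4, rng))

# f and g stand in for the attention and feed-forward sub-layers.
x1, x2 = rng.standard_normal((6, 16)), rng.standard_normal((6, 16))
y1, y2 = rev_forward(x1, x2, np.tanh, np.sin)
r1, r2 = rev_inverse(y1, y2, np.tanh, np.sin)
print("inputs recovered:", np.allclose(r1, x1) and np.allclose(r2, x2))
```

Because the inputs of a reversible block can be recomputed from its outputs, intermediate activations do not have to be kept for the backward pass, which is one source of the memory saving the abstract describes.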

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Master's Degree_전기·정보공학부)

Files in This Item:

000000161576.pdf 4.39 MB
