Deep Generative Model for Waveform Synthesis

Abstract: A deep generative model (DGM) is a generative model parameterized by deep neural networks (DNNs) and trained with stochastic gradient-based optimization to approximate the probability distribution of real-world data. DGMs have been applied to conditional waveform synthesis, where the model is trained to generate a time-domain audio waveform that corresponds to given conditioning information. A neural vocoder is a conditional waveform synthesizer that generates a waveform conditioned on structured information, such as a spectrogram.

In this dissertation, we provide a comprehensive overview and analysis of the landscape of DGMs for conditional waveform synthesis, together with its open challenges, covering autoregressive models (ARMs), normalizing flows (NFs), diffusion probabilistic models (DPMs), and generative adversarial networks (GANs). We raise three main issues in DGMs for audio waveforms: parameter efficiency, accelerated training and synthesis, and generalization.

To address parameter efficiency, we propose NanoFlow, a method that significantly reduces the network size required for NF-based DGMs. NanoFlow provides an alternative parameterization for building NFs based on parameter decomposition, weight sharing, and a novel method called flow indication embedding. Applied to flow-based generative models for audio waveforms, the proposed method demonstrates competitive quality with significantly smaller DNNs.
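
The idea of sharing one estimator across flow steps, with a small per-step embedding to tell the steps apart, can be illustrated with a toy example. The sketch below is an assumption-laden illustration and not the dissertation's architecture: it uses a generic affine-coupling flow in NumPy, and the class name `SharedCouplingFlow`, its layer sizes, and the way the embedding is added to the hidden layer are all hypothetical choices.

```python
import numpy as np


class SharedCouplingFlow:
    """Toy affine-coupling flow whose steps share one estimator network.

    Hypothetical illustration only: each flow step k reuses the same weights
    (w1, w2) and differs only through a small learned vector emb[k].
    """

    def __init__(self, dim, hidden, num_flows, rng):
        assert dim % 2 == 0, "dim must be even so the two halves match"
        self.half = dim // 2
        # Shared estimator weights, used by every flow step.
        self.w1 = rng.standard_normal((self.half, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 2 * self.half)) * 0.1
        # One small embedding per flow step (the only per-step parameters).
        self.emb = rng.standard_normal((num_flows, hidden)) * 0.1

    def _scale_shift(self, xa, k):
        h = np.tanh(xa @ self.w1 + self.emb[k])
        out = h @ self.w2
        log_s, t = out[..., : self.half], out[..., self.half:]
        return np.tanh(log_s), t  # bounded log-scale for numerical stability

    def forward(self, x):
        logdet = np.zeros(x.shape[:-1])
        for k in range(len(self.emb)):
            xa, xb = x[..., : self.half], x[..., self.half:]
            log_s, t = self._scale_shift(xa, k)
            yb = xb * np.exp(log_s) + t
            x = np.concatenate([yb, xa], axis=-1)  # swap halves between steps
            logdet = logdet + log_s.sum(axis=-1)
        return x, logdet

    def inverse(self, z):
        for k in reversed(range(len(self.emb))):
            yb, xa = z[..., : self.half], z[..., self.half:]
            log_s, t = self._scale_shift(xa, k)
            xb = (yb - t) * np.exp(-log_s)
            z = np.concatenate([xa, xb], axis=-1)
        return z

    def num_params(self):
        return self.w1.size + self.w2.size + self.emb.size


rng = np.random.default_rng(0)
flow = SharedCouplingFlow(dim=8, hidden=16, num_flows=4, rng=rng)
x = rng.standard_normal((3, 8))
z, logdet = flow.forward(x)
x_rec = flow.inverse(z)  # recovers x up to floating-point error
```

In this toy setup the shared estimator has 192 weights and each extra flow step adds only a 16-dimensional embedding, whereas giving each of the four steps its own estimator would need 768 weights.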

For accelerated training and synthesis, we propose PriorGrad, a novel approach that enables faster convergence of DPMs during training and faster sampling at inference. PriorGrad constructs a data-dependent adaptive prior for training and sampling from DPM-based conditional waveform synthesis models. We show that, compared to the commonly used standard Gaussian prior, this informative prior distribution yields faster training and sampling for diffusion-based generative models for audio.
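
The contrast between a standard Gaussian prior and a data-dependent one can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dissertation's method: the abstract does not say how the prior is derived from the conditioning, so the sketch assumes a zero-mean Gaussian with a diagonal covariance whose per-sample standard deviation follows a normalized frame-level energy. The function names, `hop_length`, and `floor` are hypothetical.

```python
import numpy as np


def adaptive_prior_std(frame_energy, hop_length, floor=0.1):
    """Per-sample prior std from frame-level energy (assumed recipe).

    Energies are scaled to [0, 1], clipped below at `floor`, and repeated
    `hop_length` times so each waveform sample gets a std value.
    """
    energy = np.asarray(frame_energy, dtype=float)
    scaled = energy / max(energy.max(), 1e-8)
    return np.repeat(np.clip(scaled, floor, 1.0), hop_length)


def diffuse(x0, alpha_bar_t, std, rng):
    """Forward diffusion step with noise drawn from N(0, diag(std**2)).

    Passing std = 1 everywhere gives the standard Gaussian prior.
    """
    eps = rng.standard_normal(x0.shape) * std
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


rng = np.random.default_rng(0)
std = adaptive_prior_std([0.0, 0.5, 1.0], hop_length=2)
x0 = rng.standard_normal(std.shape)
x_t, eps = diffuse(x0, alpha_bar_t=0.5, std=std, rng=rng)
```

With this choice, quiet frames receive low-variance noise and loud frames receive noise closer to unit variance, so the prior already carries coarse information about the target waveform.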

To achieve generalization in DGMs for waveforms, we present BigVGAN, which enables the challenging goal of universal neural waveform synthesis, with an unprecedented level of generalization and robustness to conditions unobserved during training. We propose the anti-aliased multi-periodicity composition (AMP) module, which uses a periodic activation that provides a proper inductive bias for modeling complex waveforms, combined with an anti-aliasing filter that suppresses high-frequency aliasing artifacts. We also propose methods that enable large-scale training of GAN-based waveform synthesis by fixing failure modes during training without the need for regularization. BigVGAN achieves state-of-the-art audio quality in diverse conditions, including unseen speakers, novel languages, singing voice, music, and instrumental audio in varied unseen recording environments.
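
The combination of a periodic activation with an anti-aliasing filter can be sketched as follows. This is an illustrative sketch under assumptions, not the AMP module itself: the abstract does not name the activation, so the code assumes the form x + sin^2(alpha * x) / alpha, and it uses a simple windowed-sinc FIR filter with 2x oversampling. All function names and the filter length are hypothetical.

```python
import numpy as np


def snake(x, alpha=1.0):
    """Periodic activation x + sin^2(alpha * x) / alpha (assumed form)."""
    return x + np.sin(alpha * x) ** 2 / alpha


def lowpass_kernel(cutoff, num_taps=15):
    """Windowed-sinc low-pass FIR; cutoff is a fraction of the sample rate."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(num_taps)
    return h / h.sum()  # unit gain at DC


def anti_aliased_snake(x, alpha=1.0, ratio=2):
    """Apply the activation at a higher sample rate, then filter and decimate.

    Expects a 1-D signal with len(x) * ratio >= the kernel length, so that
    np.convolve(..., mode="same") keeps the signal length.
    """
    h = lowpass_kernel(0.5 / ratio)
    # 1) Upsample: zero-stuff, then low-pass (gain `ratio` restores amplitude).
    up = np.zeros(len(x) * ratio)
    up[::ratio] = x
    up = ratio * np.convolve(up, h, mode="same")
    # 2) Nonlinearity at the higher rate, where new harmonics have headroom.
    y = snake(up, alpha)
    # 3) Low-pass to remove content above the original Nyquist, then decimate.
    y = np.convolve(y, h, mode="same")
    return y[::ratio]


t = np.arange(64)
x = np.sin(2 * np.pi * 0.2 * t)
y = anti_aliased_snake(x, alpha=2.0)
```

The nonlinearity creates harmonics above the input's bandwidth; running it at the higher rate and low-pass filtering before decimation keeps those harmonics from folding back into the audible band.
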
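
Abstract (Korean version, translated): A deep generative model (DGM) is a deep learning-based approach to generative modeling that is parameterized by deep neural networks (DNNs) and trained with gradient-based optimization to approximate the probability distribution of real data. DGMs have been successfully applied to conditional audio waveform synthesis models, which are trained to generate, in the time domain, the audio waveform corresponding to given information. A neural vocoder is a conditional audio waveform synthesizer that generates a waveform from structured information such as a spectrogram.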

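This dissertation presents an overview, an in-depth analysis, and the main challenges of DGMs for conditional audio waveform synthesis, including autoregressive models (ARMs), normalizing flows (NFs), diffusion probabilistic models (DPMs), and generative adversarial networks (GANs). The three main difficulties of DGMs for audio waveforms are parameter efficiency, acceleration of training and synthesis, and generalization.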

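To address parameter efficiency, this dissertation proposes NanoFlow, a new method that greatly reduces the DNN size required for NF-based DGMs. NanoFlow presents an alternative parameterization for building NFs based on parameter decomposition, weight sharing, and a new method called flow indication embedding. When applied to NF-based audio waveform generative models, the proposed method achieves high-quality audio synthesis with a comparatively very small DNN.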

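For accelerated training and synthesis, this dissertation proposes PriorGrad, a new approach that enables fast training convergence and improves sampling speed at inference. PriorGrad constructs a data-dependent adaptive prior distribution for training and sampling in DPM-based conditional waveform synthesis models. Compared with the commonly used standard Gaussian prior, PriorGrad is shown to achieve faster training and sampling in diffusion-based audio waveform generative models by using a more informative prior.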

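To achieve generalization of DGM-based audio waveform models, this dissertation presents BigVGAN, which attains an unprecedented level of generalization and robustness to conditions not observed during training and enables universal neural audio synthesis. BigVGAN uses a periodic activation function to provide a suitable inductive bias for modeling complex waveforms, applies an anti-aliasing filter that suppresses high-frequency aliasing artifacts, and on this basis proposes the anti-aliased multi-periodicity composition (AMP) module. It also proposes methods that enable large-scale training by mitigating the training instability of GAN-based audio waveform synthesis. BigVGAN achieves the best audio quality across diverse conditions, including new speakers, new languages, vocals, music, and instrumental audio, in varied recording environments not used for training.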

Language: eng

URI: https://hdl.handle.net/10371/193240

https://dcollection.snu.ac.kr/common/orgView/000000175080

Files in This Item:

000000175080.pdf 21.08 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
