Singing Voice Separation Using Lyrics Information

Abstract: Singing voice separation, the task of isolating the singing voice from a mixture of vocals and accompaniment, is one of the most actively studied problems in audio signal processing, because it has high commercial value and can serve as a pre-processing step for a wide range of music information retrieval tasks.
Research based on conventional machine learning algorithms has proposed various ways to improve separation performance by exploiting side information beyond the audio itself, but few such attempts have been reported for source separation based on deep learning. Of the two most representative kinds of side information available for singing voice separation, lyrics and pitch, this work therefore uses lyrics, which are comparatively easy to obtain, to improve separation performance.
We assume that the lyrics are already aligned to the timing of the singing, and propose a separation network that exploits this aligned lyrics information. Specifically, we combine the open-unmix network, which achieves high performance in music source separation, with a lyrics encoder network that takes the aligned lyrics as an auxiliary input. We quantitatively confirm and analyze that the resulting improvement in separation performance is attributable not only to the timing information in the aligned lyrics but also to their phoneme information.
Furthermore, we propose pre-training a speech synthesis network and transferring it to the proposed lyrics encoder network (transfer learning), which further improves the separation performance of the proposed network.
Finally, we show that the lyrics can be aligned automatically with a publicly available audio-to-text alignment tool and then used with the proposed lyrics-encoder-based separation network.
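
As an illustration of what "aligned lyrics as an auxiliary input" can mean in practice, the following is a minimal sketch and not the implementation used in the thesis. It assumes the alignment is given as (phoneme, start, end) triples in seconds and expands it into one phoneme id per spectrogram frame. The names `alignment_to_frames`, `timing_only`, `SILENCE_ID`, and the phoneme vocabulary are hypothetical. The `timing_only` helper shows one possible way to keep only the timing information, which is the kind of contrast needed to ask whether phoneme identity adds anything beyond timing.

```python
from typing import Dict, List, Tuple

SILENCE_ID = 0  # hypothetical id reserved for frames with no singing


def alignment_to_frames(
    alignment: List[Tuple[str, float, float]],
    phoneme_to_id: Dict[str, int],
    n_frames: int,
    hop_seconds: float,
) -> List[int]:
    """Expand (phoneme, start_sec, end_sec) triples into one phoneme id per frame."""
    frames = [SILENCE_ID] * n_frames
    for phoneme, start, end in alignment:
        # Convert times to frame indices and clamp them to the valid range.
        first = max(0, round(start / hop_seconds))
        last = min(n_frames, round(end / hop_seconds))
        for t in range(first, last):
            frames[t] = phoneme_to_id[phoneme]
    return frames


def timing_only(frames: List[int]) -> List[int]:
    """Discard phoneme identity, keeping only a sung / not-sung flag per frame."""
    return [0 if f == SILENCE_ID else 1 for f in frames]


# Toy example with an assumed two-phoneme vocabulary and 0.1 s frame hop.
vocab = {"h": 1, "i": 2}
aligned = [("h", 0.2, 0.4), ("i", 0.4, 0.8)]
frame_ids = alignment_to_frames(aligned, vocab, n_frames=10, hop_seconds=0.1)
voiced_mask = timing_only(frame_ids)
```

A frame-level id sequence like `frame_ids` has the same time resolution as the mixture spectrogram, so a lyrics encoder could embed it and combine it with the audio features frame by frame. How the thesis actually fuses the two is not specified in the abstract.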

Language: kor

URI: https://hdl.handle.net/10371/175882

https://dcollection.snu.ac.kr/common/orgView/000000164430

Files in This Item:

000000164430.pdf 21.59 MB

Appears in Collections:

Graduate School of Convergence Science and Technology (융합과학기술대학원)
- Dept. of Intelligence and Information (지능정보융합학과)
  - Theses (Master's Degree_지능정보융합학과)
