End-to-End Optical Music Recognition of Monophonic Scores with Chord Annotations

오설아

코드가 표기된 단성 악보의 종단 간 광학적 음악 인식 : End-to-End Optical Music Recognition of Monophonic Scores with Chord Annotations

Authors: 오설아

Advisor: 박종헌

Issue Date: 2022

Publisher: Graduate School, Seoul National University

Keywords: optical music recognition ; chord recognition ; dataset construction ; data augmentation

Description: Thesis (Master's) -- Graduate School, Seoul National University : College of Engineering, Department of Industrial Engineering, 2022. 8. 박종헌.

Abstract: Optical music recognition (OMR) studies how to decode the musical notation in score images into a machine-readable form. Deep learning has substantially improved recognition performance, but most existing optical music recognition studies recognize only musical symbols such as key signatures, time signatures, and notes. Chords are also a basic building block of music and are as important as those symbols, yet few studies treat them as a recognition target.

This thesis proposes an end-to-end deep learning method that recognizes chords and musical symbols in a single pass from monophonic score images in which the chords are written as text. To this end, we build a dataset by labeling both chords and musical symbols on real score images. The model adopts the Convolutional Recurrent Neural Network (CRNN) structure, which achieves state-of-the-art (SOTA) results in optical music recognition, and combines a pre-trained ResNet-152 with a bidirectional Long Short-Term Memory (LSTM) network. During training, data augmentation is applied to the input images to improve generalization and to keep the model robust on data outside the training dataset. Most optical music recognition studies train end-to-end models with the Connectionist Temporal Classification (CTC) loss function. Because the label set of chords and musical symbols is very large and symbol frequencies differ widely, we instead use a focal CTC loss to alleviate the resulting data imbalance. Finally, to decode the model output into a sequence of symbols, we use beam search decoding rather than simple greedy decoding to improve prediction accuracy.
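The difference between greedy and beam search decoding, and the effect of focal weighting on a CTC loss, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the blank index, and the `alpha` and `gamma` defaults are hypothetical, and the focal form used here, alpha * (1 - p)^gamma * L_CTC with p = exp(-L_CTC), is one common formulation that the thesis may or may not follow. The input is assumed to be a list of per-frame probability distributions over symbols, as a CRNN with a softmax output would produce.

```python
import math
from collections import defaultdict

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def greedy_decode(probs):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for s in best:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def beam_search_decode(probs, beam_width=4):
    """CTC prefix beam search: sums probability over alignments of each prefix."""
    # Each prefix maps to (prob. of ending in blank, prob. of ending in non-blank).
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == BLANK:
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and s == prefix[-1]:
                    # A repeated symbol stays in the same prefix unless a
                    # blank separated the two occurrences.
                    nxt[prefix][1] += p_nb * p
                    nxt[prefix + (s,)][1] += p_b * p
                else:
                    nxt[prefix + (s,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {k: tuple(v) for k, v in ranked[:beam_width]}
    best = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best[0], sum(best[1])


def focal_ctc_loss(p_target, alpha=0.25, gamma=2.0):
    """Focal weighting of a CTC loss (hypothetical alpha and gamma values).

    p_target is the CTC probability of the ground-truth label sequence,
    so the plain CTC loss is -log(p_target).
    """
    ctc = -math.log(p_target)
    return alpha * (1.0 - p_target) ** gamma * ctc


# Two frames over the toy vocabulary {0: blank, 1: one symbol}.
toy_probs = [[0.6, 0.4], [0.6, 0.4]]
greedy_result = greedy_decode(toy_probs)            # blank wins in every frame
beam_result, beam_prob = beam_search_decode(toy_probs)
```

In this toy input the blank is the most likely symbol in both frames, so greedy decoding returns an empty sequence, while beam search returns the one-symbol sequence because its three alignments together carry more probability (about 0.64) than the all-blank path (0.36). The focal term shrinks the loss of sequences the model already predicts with high probability, so rare and poorly predicted symbols contribute relatively more to training.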

We further vary how musical symbols and chords are represented in the labels and compare the resulting label representations experimentally. A second set of experiments compares input image preprocessing choices, including whether data augmentation is applied. To validate the proposed method, we compare models trained with and without chords in the training dataset and show that the model learns to recognize chords and musical symbols together without degrading recognition of the musical symbols. We also test on a public dataset and, through quantitative comparison with previous studies, confirm that the proposed model generalizes well. Data augmentation plays a key role in this result: compared with training without it, the model performs robustly regardless of the type of dataset.

Language: kor

URI: https://hdl.handle.net/10371/187656

https://dcollection.snu.ac.kr/common/orgView/000000172437

Files in This Item:

000000172437.pdf 9.19 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Master's Degree_산업공학과)
