Exploring LLMs for Enhanced Music Captioning Evaluation

Abstract: Music captioning aims to generate a descriptive textual caption for a given music audio clip. The evaluation of generated content in multi-modal domains, particularly language and music, has traditionally relied on metrics designed for language or image-language evaluation, such as BLEU, SPICE, METEOR, and CIDEr, although their suitability for this new domain has not been justified. Moreover, these metrics do not adequately capture the nuances of musical content, which can lead to inaccurate evaluation. This thesis addresses this gap by developing a novel evaluation method, customized to the music captioning domain, that aligns more closely with human judgment.

We first constructed a music-specific Named Entity Recognition (NER) dataset with the help of Large Language Models (LLMs) and performed NER tasks within the music domain. Using the results of these tasks, we developed MusicSim, a similarity score that reflects the unique attributes of musical pieces. Our experiments, conducted on a subset of the MusicCaps evaluation dataset, demonstrate that the proposed methods exhibit a higher correlation with human evaluations than traditional metrics.

Furthermore, our approach shows potential for broader application to other domain-specific and multi-modal generation tasks involving language, providing a foundation for more accurate, domain-specific caption evaluation.
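
Music captioning is the task of describing a given music audio clip in natural language. In this music-language multi-modal domain, performance evaluation has relied on existing metrics from the language or language-image fields, such as BLEU, SPICE, METEOR, and CIDEr, yet no justification has been given for applying these metrics to the entirely new multi-modal domain of language and music. Because these metrics do not sufficiently reflect the varied musical information that a music caption may contain, they may not be a sound evaluation method. This thesis therefore develops a new evaluation method, suited to the language-music domain, that agrees more closely with human evaluation.

We first built a music-domain Named Entity Recognition (NER) dataset with the help of Large Language Models (LLMs) and performed NER using a BERT-based language model. Using these results, we developed MusicSim, a similarity score that reflects the attributes of music in music captioning. We ran experiments on the MusicCaps evaluation dataset and computed the Pearson correlation with human evaluations, showing that the proposed evaluation method correlates more highly with human evaluations than traditional evaluation methods.

Furthermore, our approach shows potential for broader application to other domain-specific and multi-modal generation tasks involving language, and lays a foundation for more accurate caption evaluation in other domains.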

Language: eng

URI: https://hdl.handle.net/10371/215298

https://dcollection.snu.ac.kr/common/orgView/000000185303

Files in This Item:

000000185303.pdf 5.75 MB

Appears in Collections:

Graduate School of Convergence Science and Technology (융합과학기술대학원)
- Dept. of Intelligence and Information (지능정보융합학과)
  - Theses (Master's Degree_지능정보융합학과)
