Applying a Dynamic Masking Method to Mitigate Modality Imbalance in Vision-Language Multimodal Models

Abstract: The importance of pretrained vision-language multimodal models is growing as the spread of the Internet and smartphones gives rise to a variety of multimodal tasks. The performance of multimodal models improves in proportion to the amount of multimodal data and computational power, but it is not clear whether the improvement comes from proper reasoning.
In this paper, we measure the importance of each token from each modality using saliency maps, a traditional method for analyzing models, and confirm the modality imbalance problem in multimodal models. Consistent with previous work, we also find that recent multimodal models tend to focus more on the image modality than on text.
We then propose a method to address the modality imbalance problem of pretrained multimodal models. The model first computes importance scores for the input tokens under text-only masking. Then, according to these importance scores, image-modality tokens are additionally masked. Training on dynamically masked input data can mitigate modality imbalance, because masking can counteract inappropriate relations between tokens.
After the dynamic masking policy is applied to a pretrained multimodal model, the importance score of the text modality increases compared to the baseline model. We therefore claim that our new masking policy alleviates the modality imbalance problem of recent multimodal models.
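Pretrained vision-language multimodal models, which have recently drawn considerable attention, are becoming increasingly important as the development of the Internet and smartphones generates diverse multimodal data. The performance of multimodal models is increasing in proportion to data and computing power, but because of the shortcut problem and the modality imbalance problem, it is not certain whether they go through a proper reasoning process.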
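In this thesis, we measure the importance of the tokens of each modality using the gradient-based analysis method used in saliency maps and related techniques, and thereby confirm the imbalance problem in the model. Consistent with the results of previous studies, our experiments confirm that multimodal models tend to focus on the tokens of one modality, in particular the image modality.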
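
As an illustration of how per-token gradient scores might be aggregated into a modality-level importance measure, here is a minimal Python sketch. It assumes that the gradient of the loss with respect to each token embedding is already available and that the L2 norm of that gradient serves as the saliency score; the function names `token_saliency` and `modality_share`, the choice of norm, and the toy numbers are hypothetical and are not taken from the thesis.

```python
import math

def token_saliency(gradients):
    """Score each token by the L2 norm of its gradient vector (an assumed choice)."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in gradients]

def modality_share(saliency, modalities):
    """Return the fraction of total saliency carried by each modality."""
    totals = {}
    for score, modality in zip(saliency, modalities):
        totals[modality] = totals.get(modality, 0.0) + score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {m: 0.0 for m in totals}
    return {m: t / grand_total for m, t in totals.items()}

# Toy gradients for two text tokens and two image tokens (made-up numbers).
gradients = [[0.3, 0.4], [0.0, 0.5], [3.0, 4.0], [0.0, 5.0]]
modalities = ["text", "text", "image", "image"]
share = modality_share(token_saliency(gradients), modalities)
```

In this toy input the image tokens carry roughly 91% of the total saliency, which is the kind of skew the modality imbalance problem refers to.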
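To address the modality imbalance problem, this thesis proposes a dynamic masking technique. For a data sample given as a text-image pair, we first randomly mask only the tokens of the text modality and use the pretrained model to measure the importance scores of the tokens. We then mask tokens of the image modality according to these scores and train the model, with the aim of resolving the modality imbalance problem. Intuitively, gradient-based analysis can identify tokens on which the model depends excessively, and masking can adjust that dependence, so the imbalance problem can be reduced.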
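
The three steps described above (text-only random masking, importance scoring, then score-driven image masking) can be sketched roughly as follows. This is a hedged illustration, not the thesis implementation: `dynamic_mask`, `importance_fn`, the `[MASK]` placeholder, the masking rates, and the rule of masking the highest-scoring image tokens are all assumptions made for the example. In practice `importance_fn` would wrap a pretrained model and a gradient-based scorer.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumed)

def dynamic_mask(tokens, modalities, importance_fn,
                 text_mask_prob=0.15, image_mask_ratio=0.25, rng=None):
    """Mask text tokens at random, then mask the most important image tokens."""
    rng = rng or random.Random()
    # Step 1: randomly mask tokens of the text modality only.
    masked = [MASK if m == "text" and rng.random() < text_mask_prob else t
              for t, m in zip(tokens, modalities)]
    # Step 2: score every token on the text-masked input.
    scores = importance_fn(masked)
    # Step 3: additionally mask the top-scoring image tokens (assumed rule).
    image_idx = [i for i, m in enumerate(modalities) if m == "image"]
    k = int(len(image_idx) * image_mask_ratio)
    for i in sorted(image_idx, key=lambda j: scores[j], reverse=True)[:k]:
        masked[i] = MASK
    return masked

# Usage with fixed, made-up scores standing in for a real model.
tokens = ["a", "dog", "r0", "r1", "r2", "r3"]
modalities = ["text", "text", "image", "image", "image", "image"]
fixed_scores = [0.0, 0.0, 0.1, 0.9, 0.5, 0.2]
result = dynamic_mask(tokens, modalities, lambda masked: fixed_scores,
                      text_mask_prob=0.0, image_mask_ratio=0.5)
```

With text masking disabled and half of the image tokens selected, the two image regions with the highest scores (`r1` and `r2`) are replaced by the mask symbol, while the remaining tokens are left unchanged.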
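When we applied the dynamic masking technique to a pretrained multimodal model and measured the importance of each modality again, the model trained with the improved masking technique showed a decrease in the importance of the image modality and an increase in the importance of the text modality compared with a baseline model that masks all tokens randomly, partially alleviating the imbalance problem.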

Language: kor

URI: https://hdl.handle.net/10371/183333

https://dcollection.snu.ac.kr/common/orgView/000000170256

Files in This Item:

000000170256.pdf 1.14 MB

Appears in Collections:

Graduate School of Data Science (데이터사이언스 대학원)
- Theses (Master's Degree_데이터사이언스학과)
