Cross-Modal Representation Learning: Joint and Distributed Embedding

Abstract: In this dissertation, we propose two methods that address problems arising in cross-modal representation learning. First, existing joint-embedding models have difficulty learning relations among data from heterogeneous modalities. To overcome this, we propose a cross-modal representation learning model based on distributed embedding. The model first learns intra-modal association by training a specialized embedding space for each modality independently, using single-modal representation learning. It then learns cross-modal association by training an associator, a module that connects the embedding spaces of the modalities. To keep the two stages separate, the parameters involved in intra-modal association are not updated while the cross-modal association is trained. This two-step process lets the model learn cross-modal representations well even among heterogeneous modalities, and it also allows unpaired data to be used for training. We validate the method on cross-modal data generation between the visual and auditory modalities, one example of a heterogeneous modal relationship, where it achieves better performance than existing joint-embedding models.
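
The two-step procedure described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the dissertation's implementation: the encoders `encode_visual` and `encode_audio` are hypothetical fixed maps standing in for pretrained single-modal models, the associator is a plain linear map trained by gradient descent on squared error, and the paired data are synthetic.

```python
# Stage 1 stand-ins: per-modality encoders, assumed already trained and now frozen.
# (Hypothetical fixed maps; a real model would use learned networks here.)
def encode_visual(x):
    return [x[0] + x[1], x[0] - x[1]]

def encode_audio(y):
    return [2.0 * y[0], y[1]]

def train_associator(src_codes, dst_codes, lr=0.1, epochs=200):
    """Stage 2: fit a linear associator W so that W applied to a source code
    approximates the paired target code. Only W is updated."""
    dim_out, dim_in = len(dst_codes[0]), len(src_codes[0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for s, d in zip(src_codes, dst_codes):
            pred = [sum(W[i][j] * s[j] for j in range(dim_in)) for i in range(dim_out)]
            for i in range(dim_out):
                err = pred[i] - d[i]
                for j in range(dim_in):
                    W[i][j] -= lr * err * s[j]  # gradient step on squared error
    return W

def associate(W, code):
    """Map a code from the source embedding space into the target space."""
    return [sum(w * c for w, c in zip(row, code)) for row in W]

# Toy paired data: each audio sample is the visual sample with coordinates swapped.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
audio = [[x[1], x[0]] for x in visual]

W = train_associator([encode_visual(x) for x in visual],
                     [encode_audio(y) for y in audio])
```

Because the encoders are never touched in the second stage, unpaired samples could still be used to train each encoder in the first stage, while only the paired samples are needed to fit the associator.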

Second, although paired cross-modal data is essential for cross-modal representation learning, collecting enough pairs is difficult in practical applications. To mitigate this data shortage, we propose an active learning method for cross-modal representation learning. In particular, we target image-text retrieval, one of the most popular applications of cross-modal representation learning. Because the existing active learning scenario for image-text retrieval cannot be applied to recent image-text retrieval benchmarks, we first propose a scenario that is feasible for them. In the existing scenario, human experts are queried for the category label of a given image-text pair. In the proposed scenario, unpaired images or texts are given, and human experts are asked to supply the data of the other modality to complete each pair. We also propose an active learning algorithm for this scenario. The algorithm selects the data expected to have the most influence on the max-hinge triplet loss, the loss function most commonly adopted in recent image-text retrieval methods. To this end, we define the condition under which a data item can influence the loss function and, based on this condition, estimate an influence score (referred to as HN-Score) for each item. The algorithm selects items in descending order of score and requests their counterparts from human experts. Experiments on recent image-text retrieval benchmarks show that the proposed algorithm achieves better performance for a given amount of training data than acquiring pairs at random.
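
As a rough illustration of the quantities involved, the sketch below computes a max-hinge triplet loss over a similarity matrix, using the hardest negative in each direction, and then ranks unpaired candidates by a score. The abstract does not define HN-Score, so `margin_violation_score` is a hypothetical stand-in (it counts how many paired captions a candidate image would violate the margin for) and is not the dissertation's score. The function names and the margin value of 0.2 are likewise assumptions.

```python
def max_hinge_triplet_loss(sim, margin=0.2):
    """sim[i][j] is the similarity of image i and caption j; (i, i) are true pairs.
    Only the hardest negative in each direction contributes to the loss."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_caption = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_caption - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total

def margin_violation_score(sims_to_captions, positive_sims, margin=0.2):
    """Hypothetical stand-in score (NOT HN-Score): number of paired captions for
    which an unpaired image would give a non-zero hinge term as a negative."""
    return sum(1 for s, p in zip(sims_to_captions, positive_sims)
               if margin + s - p > 0)

def select_queries(scores, budget):
    """Indices of the `budget` highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]
```

A call such as `select_queries(scores, budget)` returns the candidates that would be sent to human annotators, who then provide the missing image or text for each one.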

Language: eng

URI: https://hdl.handle.net/10371/181116

https://dcollection.snu.ac.kr/common/orgView/000000169190

Files in This Item:

000000169190.pdf 7.00 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
