Dual Composition Network in Interactive Image Retrieval

Abstract: We present an approach named Dual Composition Network (DCNet) for interactive image retrieval, which retrieves the best target image from an image database given a reference image and a natural language query.
To solve this task, existing methods have focused on learning a composite representation of the reference image and the text query to be as close to the target image feature as possible.
We refer to this approach as the Composition Network.
In this work, we propose the Correction Network that models the difference between the reference and target image in a feature space and matches it with the text query feature.
That is, we consider two cyclic directional mappings for triplets of (reference image, text query, target image) by using both Composition Network and Correction Network.
We also propose a joint training loss that can further improve the robustness of multimodal representation learning.
We evaluate the proposed model on four benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, Fashion200K, and CSS.
Our experiments show that DCNet achieves new state-of-the-art performance on all four datasets,
and that adding the Correction Network consistently improves multiple existing methods that are based solely on the Composition Network.
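This work proposes the Dual Composition Network (DCNet), a method for the interactive image retrieval problem.
Interactive image retrieval is the problem of retrieving, from an image database, the target image that best matches the description, given a reference image and a natural language query that serves as user feedback.
To solve this problem, existing methods build a composite representation that combines the information in the reference image and the natural language query, and train it to be close to the target image feature.
The module that combines this information is called the Composition Network.
In addition to this approach, this work proposes the Correction Network, a new way of solving the problem that computes a difference representation modeling the difference between the reference image and the target image and trains it to be close to the natural language query embedding.
With these two methods, the proposed model can perform retrieval while considering the various combinations that appear within a (reference image, natural language query, target image) triplet.
Moreover, jointly training the Composition Network and the Correction Network yields better representations than training the two separately.
To validate the proposed method, experiments on four interactive image retrieval benchmark datasets (Fashion-IQ, Shoes, Fashion 200k, and CSS) confirm that it outperforms existing methods.
Joint training with the Correction Network is also shown to improve performance consistently across a variety of settings.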
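
The two directions described in the abstract can be illustrated with a minimal sketch. It is not the thesis's architecture: the additive fusion in `compose`, the feature subtraction in `correct`, the `weight` parameter, and the idea of summing the two similarities at retrieval time are all assumptions chosen for illustration, and every function name here is hypothetical. The sketch only shows how a composition score (composite feature vs. target feature) and a correction score (difference feature vs. text feature) could be combined to rank candidates.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compose(ref, text):
    # Hypothetical stand-in for a Composition Network:
    # fuse reference-image and text features by simple addition.
    return l2_normalize([r + t for r, t in zip(ref, text)])


def correct(ref, candidate):
    # Hypothetical stand-in for a Correction Network:
    # model the reference-to-candidate change as a feature difference.
    return l2_normalize([c - r for r, c in zip(ref, candidate)])


def joint_score(ref, text, candidate, weight=0.5):
    # Composition direction: composite (ref, text) feature vs. candidate feature.
    comp = dot(compose(ref, text), l2_normalize(candidate))
    # Correction direction: (ref, candidate) difference feature vs. text feature.
    corr = dot(correct(ref, candidate), l2_normalize(text))
    # Assumed combination rule: weighted sum of the two cosine similarities.
    return comp + weight * corr


def retrieve(ref, text, database):
    # Return the index of the highest-scoring candidate in the database.
    return max(range(len(database)),
               key=lambda i: joint_score(ref, text, database[i]))


# Toy 3-dimensional features (made up for illustration).
ref = [1.0, 0.0, 0.0]
text = [0.0, 1.0, 0.0]
database = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best = retrieve(ref, text, database)
print(best)  # prints 1
```

In this toy setup the second candidate wins because it agrees with the query in both directions: it matches the composite feature, and its difference from the reference matches the text feature.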

Language: eng

URI: https://hdl.handle.net/10371/175391

https://dcollection.snu.ac.kr/common/orgView/000000164044

Files in This Item:

000000164044.pdf 4.00 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
