Image Retrieval with a Controllable Ratio Between Visual and Semantic Similarity

Abstract: Image-to-image retrieval is the task of finding images similar to a query image, and it has been studied mainly along two lines: visual similarity and semantic similarity. Because image search is performed in a wide variety of situations and contexts, searching by a single criterion limits how flexibly a system can respond to the user's intent.
In this paper, we propose a method that considers both the visual and the semantic aspects and lets users freely adjust the weight between them according to their purpose. The model extracts visual and semantic features from a scene graph representing an image with a graph convolutional network (GCN), then interpolates between them to control the ratio used for retrieval. For training, surrogate relevance is derived from visual features extracted from the image by a pre-trained ResNet-152 model and from semantic features extracted from human-written captions by a pre-trained Sentence-BERT (SBERT) model.
Retrieval with the learned visual and semantic features achieved high normalized discounted cumulative gain (nDCG) with respect to the corresponding surrogate relevance, indicating that each feature was successfully extracted at the graph level. A quantitative comparison with previous studies on the human agreement score, which measures how closely an algorithm matches human judgments, confirmed strong retrieval performance. In addition, a comparison against a variant in which the visual and semantic features are replaced with ResNet and caption SBERT features, respectively, showed that interpolating features extracted at the same graph level performs better. Qualitative results from retrieving images while varying the ratio further confirm that the model can search with an adjustable ratio of visual and semantic similarity.
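
The abstract does not specify how the interpolation or the nDCG evaluation is implemented, so the following is only a minimal sketch under stated assumptions: the visual and semantic features are taken to be plain vectors of equal dimension, the interpolation is assumed to be a linear blend of L2-normalized vectors, ranking is assumed to use cosine similarity, and nDCG is assumed to use a linear gain. The function names (`interpolate`, `retrieve`, `ndcg_at_k`) and the convention that `alpha = 1` means purely visual are hypothetical and not taken from the thesis.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical representation: each image is a (visual, semantic) feature pair.
FeaturePair = Tuple[Vector, Vector]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def interpolate(visual: Vector, semantic: Vector, alpha: float) -> List[float]:
    """Blend two features linearly (assumed scheme, not confirmed by the source).

    alpha = 1.0 keeps only the visual feature, alpha = 0.0 only the semantic one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(visual) != len(semantic):
        raise ValueError("features must share one dimension")
    v, s = l2_normalize(visual), l2_normalize(semantic)
    return l2_normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(v, s)])


def retrieve(query: FeaturePair, gallery: Sequence[FeaturePair],
             alpha: float, k: int = 5) -> List[int]:
    """Return indices of the k gallery items most similar to the query."""
    q = interpolate(query[0], query[1], alpha)
    scored = []
    for idx, (vis, sem) in enumerate(gallery):
        g = interpolate(vis, sem, alpha)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, g)), idx))
    # Highest similarity first; ties broken by gallery index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:k]]


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """nDCG@k with linear gain; relevances are listed in ranked order."""
    def dcg(rels: Sequence[float]) -> float:
        return sum(r / math.log2(rank + 2) for rank, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(list(relevances)) / ideal if ideal > 0 else 0.0
```

Under these assumptions, sweeping `alpha` from 0 to 1 moves the ranking from purely semantic to purely visual, and `ndcg_at_k` can score a ranking against whichever surrogate relevance (ResNet-based or SBERT-based) is of interest.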

Language: kor

URI: https://hdl.handle.net/10371/193144

https://dcollection.snu.ac.kr/common/orgView/000000174487

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Master's Degree_산업공학과)
