Multimodal Self-Attention Network for Visual Reasoning

류성원

Multimodal Self-Attention Network for Visual Reasoning : 시각적 추론을 위한 멀티모달 셀프어텐션 네트워크

Authors: 류성원

Advisor: 조성준

Issue Date: 2019-08

Publisher: Seoul National University Graduate School

Keywords: Visual Question Answering ; Visual Reasoning ; Self-Attention

Description: Thesis (Master's)--Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2019. 8. 조성준.

Abstract: Visual reasoning is more difficult than visual question answering because it requires sophisticated control of information from both the image and the question. Information extracted from one source is used to extract information from the other, and this process alternates. This is natural, since even humans need multiple glimpses of the image and the question to solve a complicated natural language question that requires multi-step reasoning. One needs to retain information from earlier steps and use it in later steps to reach the answer. Because of this difference, results on the two tasks tend not to correlate closely.
In this paper, we propose the Multimodal Self-attention Network (MUSAN) to solve the visual reasoning task. Our model uses the Transformer encoder of [22] to promote intimate interactions between the image and the question at a fine-grained level. MUSAN achieved state-of-the-art performance on the CLEVR dataset from raw pixels, without prior knowledge or a pretrained feature extractor. MUSAN also ranked 8th in the 2019 GQA challenge without using functional or graph information. Attention visualization shows that MUSAN performs stepwise reasoning with its own logic.
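The abstract's central idea, letting image and question elements attend to each other inside one Transformer encoder, can be illustrated with a minimal sketch. This is not the thesis's implementation: the single attention head, the token counts, the embedding size `d`, and the helper name `joint_self_attention` are all assumptions chosen for illustration. It only shows how concatenating the two modalities into one sequence lets every image token attend to every question token and vice versa.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(image_tokens, question_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenation of both modalities.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    # One sequence holding image tokens first, then question tokens.
    tokens = np.concatenate([image_tokens, question_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention across all tokens of both modalities.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 16                                     # embedding size (assumed)
image_tokens = rng.normal(size=(9, d))     # e.g. a 3x3 grid of image features (assumed)
question_tokens = rng.normal(size=(5, d))  # 5 word embeddings (assumed)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

output, weights = joint_self_attention(image_tokens, question_tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (14, 16) (14, 14)
```

In this sketch, the block `weights[:9, 9:]` holds how strongly each image token attends to each question token, which is the kind of cross-modal attention map that a visualization of stepwise reasoning would inspect.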

Language: eng

URI: https://hdl.handle.net/10371/161026

http://dcollection.snu.ac.kr/common/orgView/000000157483

Files in This Item:

000000157483.pdf 4.19 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Master's Degree_산업공학과)
