Motion-Appearance Synergistic Networks for Video Question Answering

서아정

비디오 질의응답을 위한 동작-모양 시너지 네트워크 : Motion-Appearance Synergistic Networks for Video Question Answering

Authors: 서아정

Advisor: 장병탁

Issue Date: 2021

Publisher: Seoul National University Graduate School (서울대학교 대학원)

Keywords: Video Question Answering ; Motion information ; Appearance information ; Motion-Appearance fusion ; Attention Mechanism

Abstract: Video question answering is a task that requires an AI agent to answer questions grounded in a video. The task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intention. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct a qualitative analysis by visualizing the inference results of MASN.
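Video question answering is the problem of having an AI agent answer questions about a given video. Solving it requires addressing three challenges: (1) understanding the intention of diverse questions, (2) identifying the various elements of the given video (e.g., objects, actions, causality), and (3) inferring the answer through a cross-modal representation built on the correlation between the two modalities, language and vision. This thesis therefore proposes Motion-Appearance Synergistic Networks, which generate two cross-modal representations based on motion and appearance information and combine them as a weighted sum according to the intention of the question.
The proposed model consists of three modules: a motion module, an appearance module, and a motion-appearance fusion module. The motion module generates a cross-modal representation that fuses the question with action information, while the appearance module generates a representation that focuses on the appearance aspect of the given video. Finally, the motion-appearance fusion module fuses the two encoded representations based on the content of the question. In experiments, the proposed model achieved state-of-the-art performance on TGIF-QA and MSVD-QA, two large-scale video question answering datasets. The thesis also presents qualitative evaluation results for the proposed model.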
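
The abstract describes the fusion step only as a question-guided weighted combination of the motion and appearance representations; it does not give the formula. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes each representation is a plain vector of the same length, that relevance is scored by a dot product with the question vector, and that the two scores are normalized with a softmax. The function names `softmax`, `dot`, and `question_guided_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def question_guided_fusion(question, motion, appearance):
    """Fuse two representations as a weighted sum guided by the question.

    Assumption (not from the thesis): relevance is a dot product between
    the question vector and each representation, softmaxed into weights.
    """
    weights = softmax([dot(question, motion), dot(question, appearance)])
    fused = [weights[0] * m + weights[1] * a
             for m, a in zip(motion, appearance)]
    return fused, weights


# Toy vectors: the question is aligned with the motion representation,
# so the motion branch should receive the larger weight.
fused, weights = question_guided_fusion(
    question=[1.0, 0.0], motion=[1.0, 0.0], appearance=[0.0, 1.0]
)
print(weights, fused)
```

In this sketch a question that aligns more closely with the motion vector shifts weight toward the motion branch, which mirrors the abstract's claim that the two features are used selectively depending on the question's intention.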

Language: kor

URI: https://hdl.handle.net/10371/178554

https://dcollection.snu.ac.kr/common/orgView/000000166520

Files in This Item:

000000166520.pdf 1.02 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Program in Brain Science (협동과정-뇌과학전공)
  - Theses (Master's Degree_협동과정-뇌과학전공)
