Co-attentional Transformers for Video Story Understanding

Abstract: Inspired by recent trends in vision and language learning, we explore co-attention mechanisms for visiolingual fusion in the setting of video story understanding. Like other video question answering (QA) tasks, video story understanding requires agents to grasp complex temporal dependencies. However, because it focuses on the narrative aspect of video, it also requires understanding the interactions between characters, as well as their actions and motivations.

In this thesis, we introduce essential concepts from natural language processing (e.g. multi-head attention) and carry out a comprehensive survey of relevant work from adjacent fields such as visual question answering, visiolingual representation learning, video representation learning and video question answering. Based on our findings, we propose a novel co-attentional Transformer model to better capture the long-term dependencies seen in visual stories such as dramas, and we measure its performance on the video story understanding task in a video question answering setting.

We evaluate our approach on the recently introduced DramaQA dataset, which features character-centered video story understanding questions. Our model outperforms the baseline model by 6 percentage points in overall accuracy and by at least 3.8 and up to 12.1 percentage points on every difficulty level, and it beats the winner of the DramaQA challenge.
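
Korean abstract (English translation): Inspired by recent trends in vision and language learning, we explore the application of co-attention mechanisms for visiolingual fusion within video story understanding. As with other video question answering (QA) tasks, video story understanding requires an agent to grasp complex, intertwined semantic dependencies over time. However, because it focuses on the narrative aspect of video, the agent must also understand the interactions between characters as well as their actions and motivations. In this thesis, we introduce essential concepts from natural language processing (e.g. multi-head attention) and carry out a comprehensive survey of related work in adjacent fields such as visual question answering, visiolingual representation learning, video representation learning and video question answering. Based on our findings, we propose a novel co-attentional Transformer model to better capture the long-term semantic dependencies that unfold over time in visual stories such as dramas, and to measure its performance on the video story understanding task in a video question answering setting. We evaluate our new model on the recently introduced DramaQA dataset, which features character-centered video story understanding questions. Our model (~77% accuracy) is more than 6 percentage points more accurate overall than the baseline model (~71% accuracy), and it exceeds all other models by at least 3.8 and up to 12.1 percentage points on questions of every difficulty level. We confirm that this surpasses the performance of all the top models previously submitted to the DramaQA challenge.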
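
As a rough illustration of the co-attention idea named in the abstract, the following minimal sketch lets each modality attend over the other using scaled dot-product attention. It is not the architecture from the thesis: the function names (`cross_attention`, `co_attention`), the toy feature vectors, and the omission of learned projections, multiple heads, residual connections and layer normalization are all simplifying assumptions made here for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys/values.

    Hypothetical helper; a real Transformer layer would first apply learned
    linear projections to queries, keys and values.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def co_attention(video_feats, text_feats):
    """Two-way fusion: video attends over text, and text attends over video."""
    video_attended = cross_attention(video_feats, text_feats, text_feats)
    text_attended = cross_attention(text_feats, video_feats, video_feats)
    return video_attended, text_attended


# Toy inputs (assumed): 3 video-frame vectors and 2 word vectors, dimension 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
v_out, t_out = co_attention(video, text)
```

Each output keeps the sequence length of the modality that supplied the queries, so `v_out` has one fused vector per video frame and `t_out` one per word. The key point is only that queries come from one modality while keys and values come from the other.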

Language: eng

URI: https://hdl.handle.net/10371/177526

https://dcollection.snu.ac.kr/common/orgView/000000167056

Files in This Item:

000000167056.pdf 11.25 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)