Compositional Graph Neural Networks for Learning the Spatio-Temporal Structure of Video Data

Abstract: Human cognitive abilities depend on structured representations that express the world as entities and their relations and that build compositional hierarchies among those entities. With advances in machine learning, especially deep neural networks, artificial intelligence has approached human-level performance in key areas such as computer vision and natural language processing. However, simulating the human ability to learn structured representations remains a challenging problem. Recently, Graph Neural Networks (GNNs), which learn representations of graph-structured data, have been proposed for structured representation learning. Nevertheless, conventional GNNs have difficulty learning an overall compositional hierarchy, because they focus on learning representations of entities (nodes) with "flat" operators such as message passing.
In this dissertation, we postulate that learning compositional and hierarchical structure from data is a fundamental problem, and we study methods for learning such compositional hierarchies. To this end, we divide the problem into two levels, the spatial level and the temporal level, and propose two compositional graph neural networks that learn structured representations of spatial and temporal data, respectively. We also propose a multilevel Video QA dataset to unify spatial-level and temporal-level compositional structured representation learning.
First, to learn spatial-level compositional structured representations, we propose a novel graph pooling algorithm, Spectrally Similar Graph Pooling (SSGPool). The main idea of SSGPool is to learn a coarsening matrix that maps the nodes of an original graph to a smaller number of nodes in a coarsened graph. The coarsening matrix is learned from the feature vectors of the nodes while preserving the spectral characteristics of the original graph in the coarsened one. To validate the effectiveness of SSGPool, we compare it against strong baselines on various graph benchmarks. We also evaluate the approach on a real-world problem, image retrieval with visual scene graphs.
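The coarsening step described above can be illustrated with a minimal sketch. This is a generic soft-assignment coarsening, not the SSGPool algorithm itself: the function names (`softmax_rows`, `matmul`, `transpose`, `coarsen`) and the hand-written `scores` matrix are hypothetical, the scores stand in for values a trained model would produce from node features, and the spectral-similarity objective mentioned in the abstract is not implemented here.

```python
import math


def softmax_rows(scores):
    """Turn each row of raw scores into a probability distribution."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def coarsen(adj, feats, scores):
    """Coarsen a graph with a soft assignment (coarsening) matrix S.

    adj:    n x n adjacency matrix of the original graph
    feats:  n x d node feature matrix
    scores: n x k raw assignment scores (hypothetical stand-in for
            the output of a learned, feature-based scoring function)
    Returns S (n x k), S^T A S (k x k) and S^T X (k x d).
    """
    s = softmax_rows(scores)
    st = transpose(s)
    pooled_adj = matmul(matmul(st, adj), s)
    pooled_feats = matmul(st, feats)
    return s, pooled_adj, pooled_feats


# Toy example: a 4-node path graph coarsened to 2 nodes.
adj = [[0.0, 1.0, 0.0, 0.0],
       [1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0],
       [0.0, 0.0, 1.0, 0.0]]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]

assignment, pooled_adj, pooled_feats = coarsen(adj, feats, scores)
```

Because every row of the assignment matrix sums to one, the total edge weight of the coarsened graph equals that of the original graph. A method that also preserves spectral characteristics, as the abstract describes, would need an additional training objective on top of this step.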
Next, to learn temporal-level compositional structured representations, we propose Cut-Based Graph Learning Networks (CB-GLNs). CB-GLNs learn representations of video data by discovering complex dependency structures that capture variable-length semantic flows and their composition. To this end, a video is expressed as a graph whose nodes and edges correspond to the frames of the video and the dependencies between them, respectively. CB-GLNs find the compositional hierarchy of the video in the form of a multilevel graph, using a parameterized kernel with graph cuts and a message passing framework. For evaluation, we conduct experiments on two representative video understanding tasks: video theme classification and video question answering.
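The idea of treating frames as nodes and cutting the resulting graph into semantic segments can be sketched as follows. This is an illustration under stated assumptions, not the CB-GLN method: a fixed Gaussian kernel replaces the learned parameterized kernel, a single normalized cut over contiguous segments replaces the multilevel procedure, and the names `gaussian_affinity`, `normalized_cut`, `best_temporal_split`, and the toy `frames` are all hypothetical.

```python
import math


def gaussian_affinity(frames, bandwidth=1.0):
    """Build a dense frame graph: edge weight = Gaussian kernel on features.

    Assumption: a fixed Gaussian kernel stands in for a learned kernel.
    """
    n = len(frames)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(frames[i], frames[j]))
                adj[i][j] = math.exp(-d2 / (2 * bandwidth ** 2))
    return adj


def normalized_cut(adj, split):
    """Normalized cut value for segments [0, split) and [split, n)."""
    n = len(adj)
    left, right = range(split), range(split, n)
    cut = sum(adj[i][j] for i in left for j in right)
    vol_left = sum(adj[i][j] for i in left for j in range(n))
    vol_right = sum(adj[i][j] for i in right for j in range(n))
    return cut / vol_left + cut / vol_right


def best_temporal_split(adj):
    """Pick the boundary between contiguous segments with the smallest cut."""
    return min(range(1, len(adj)), key=lambda s: normalized_cut(adj, s))


# Toy "video": three similar frames followed by two frames of a new scene.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [3.1, 3.0]]
frame_graph = gaussian_affinity(frames)
boundary = best_temporal_split(frame_graph)
```

Applying the same split recursively inside each segment would give a hierarchy of segments, which is one simple way to picture the multilevel graph the abstract refers to.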
Finally, we propose a multilevel Video Question Answering (Video QA) dataset to unify spatial and temporal compositional structured representation learning. For multilevel QA, we define two criteria, memory capacity and logical complexity, and use them to propose hierarchical difficulty levels for questions. These difficulty levels are aligned with the spatial and temporal levels proposed above. The dataset is built upon the TV drama "Another Miss Oh" (또 오해영) and contains QA pairs with multilevel difficulties and video clips of various lengths. To evaluate unified spatio-temporal compositional structure learning, we combine the two compositional graph neural networks proposed above and conduct a multilevel Video QA experiment.

Language: eng

URI: https://hdl.handle.net/10371/169336

http://dcollection.snu.ac.kr/common/orgView/000000162128

Files in This Item:

000000162128.pdf 28.03 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Computer Science and Engineering
  - Theses (Ph.D. / Sc.D., Computer Science and Engineering)
