Improving Efficiency in Large-Scale Self-Supervised Video Representation Learning

Abstract: Video is an attractive data source for computer vision and machine learning because it provides dynamic, multimodal signals to learn from. Because annotating videos is costly in both time and money, self-supervised video representation learning has gained significant attention for video understanding. However, self-supervised learning is typically carried out at large scale and therefore requires substantial compute and memory resources. Furthermore, real-world videos are usually noisy, so finding video data suitable for learning requires human verification, which hinders large-scale data collection. In this thesis, we examine these problems in self-supervised video representation learning and propose three solutions to improve learning efficiency. First, we investigate how to learn from unlabeled videos without decoding them. Videos are usually stored in a compressed format, e.g., MPEG, and decoding them requires significant compute resources. Our novel architecture and proposed pretext tasks allow us to learn directly from unlabeled compressed videos with minimal performance degradation, and skipping the decoding step enables fast video processing. Second, we introduce a multimodal bidirectional Transformer architecture for self-supervised learning of contextualized audio-visual representations from unlabeled videos. End-to-end training of multimodal Transformers at scale has been challenging because of the large memory requirement of the Transformer architecture. With our novel parameter reduction technique, based on matrix decomposition with low-rank approximation, we reduce the size of our multimodal Transformer, train it end to end, and achieve competitive results on various downstream tasks. Lastly, we propose an automatic and scalable data collection pipeline for self-supervised audio-visual representation learning. The pipeline filters noisy video data using a subset selection algorithm based on mutual information (MI). Audio and visual models trained on the resulting datasets yield competitive or better performance than those trained on existing, manually verified datasets. Using this pipeline, we release ACAV100M, a large-scale open-domain video dataset of 100 million clips for audio-visual representation learning.
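
The parameter reduction described in the abstract rests on low-rank matrix factorization. As a rough illustration only, and not the thesis's actual implementation, the following Python sketch shows how replacing a full weight matrix W (d_in x d_out) with two factors U (d_in x r) and V (r x d_out) changes the parameter count. The function names, the layer sizes, and the rank are hypothetical values chosen for this example.

```python
import random


def dense_params(d_in: int, d_out: int) -> int:
    """Parameter count of a full d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count when W is replaced by U (d_in x rank) and V (rank x d_out)."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows: int, cols: int, rng: random.Random):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes only (assumed, not the configuration used in the thesis).
D_IN, D_OUT, RANK = 768, 768, 64

# Toy factors to show how the factorized layer is applied:
# x is 2 x 4, U is 4 x 2, V is 2 x 3, so the output is 2 x 3.
rng = random.Random(0)
x = random_matrix(2, 4, rng)
u = random_matrix(4, 2, rng)
v = random_matrix(2, 3, rng)
y = matmul(matmul(x, u), v)  # apply U then V; the full matrix U @ V is never formed

if __name__ == "__main__":
    print(dense_params(D_IN, D_OUT))           # 589824
    print(low_rank_params(D_IN, D_OUT, RANK))  # 98304
```

With these assumed sizes the factorized layer stores one sixth of the parameters of the dense layer. The saving grows as the rank shrinks, at the cost of restricting the layer to matrices of rank at most r.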

Language: eng

URI: https://hdl.handle.net/10371/193338

https://dcollection.snu.ac.kr/common/orgView/000000174295

Files in This Item:

000000174295.pdf 25.36 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)
