Multimodal Representation Learning for Text-grounded Visual Story Understanding

Abstract: Story understanding is crucial for computational models tasked with generating narrative-driven long sequences, such as comics or movies. Due to the wide and dynamic contextual gaps within stories, story understanding poses significant challenges in AI domains, including computer vision and natural language processing. Effective multimodal representation learning methods are needed to bridge these gaps.
Existing research often interprets these contextual gaps as inherent alignment gaps between multimodal representations, and successful approaches include computing a contrastive loss within a single timestep. However, these methods often lack scalability and applicability to sequences, because the gaps exist along the spatiotemporal axes of both the vision and text modalities.
This dissertation investigates multimodal representation learning methods that bridge the contextual gaps by using grounding representations, which are crucial for generating narrative-driven visual sequences. As an intermediate step toward generating longer videos, it focuses on using the textual and visual modalities to enhance scalability through four key studies: text grounding, attention mechanisms, a dual-level approach, and loosening the text representation by skipping layers. First, M2FN is proposed, a model that leverages the text modality through augmentation with language models, and it demonstrates the effectiveness of forming text-grounded representations. This process enables a multimodal attention mechanism that is effective at capturing relationships within and between representations, and it influenced the subsequent studies. M2FN improves on the previous state-of-the-art prediction performance in image assessment by up to 38%.
Inspired by computational neuroscience, dual-level approaches are introduced, exemplified by GLAC Net, a novel architecture that comprehends context through distinct abstraction layers, ranging from local to global and from low to high levels. In human evaluation, GLAC Net surpasses the results of the first-place model in the visual storytelling challenge.
Furthermore, DramaGAN is presented, a model that successfully performs visual story generation by integrating all of the aforementioned studies. StoryDiffusion is then introduced; it uses recent advances in auto-regressive latent diffusion models with loosely connected linguistic representations and delivers superior results. Compared with previous studies, DramaGAN and StoryDiffusion improve the fidelity of visual story generation by up to 71% and 28%, respectively.
Together, these studies underscore the importance of understanding text-grounded visual contexts. These developments are anticipated to contribute to the production of cartoons tailored to user preferences, support the pre-production phase of movie-making, and aid in creating educational materials with visualizations, all grounded in text-based inputs. This research has also influenced neuroscience studies, inspiring a deeper understanding of the structure of episodic memory.
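Abstract (translated from the Korean): Story understanding is important for computational models tasked with generating long, narrative-driven sequences such as comics or movies. Because of the wide and dynamic contextual gaps within stories, story understanding poses significant challenges in AI fields, including computer vision and natural language processing. Effective multimodal representation learning methods are needed to bridge these gaps.
Existing research has treated these contextual gaps as alignment gaps that arise inherently between multimodal representations, and has shown successful approaches such as computing a contrastive loss within a single timestep. However, because these gaps exist along the spatiotemporal axes of the visual and text modalities, existing methods often lack scalability and applicability to sequences.
This dissertation investigates multimodal representation learning methods that bridge these contextual gaps by using grounding representations, which are essential for generating story-based visual sequences. As an intermediate step toward generating longer videos, this research focuses on using the text and visual modalities to strengthen scalability and conducts four main studies: text grounding, attention mechanisms, a dual-level approach, and loosening the text representation by skipping layers. The M2FN model, which leverages the text modality through language models, is proposed and shows the effect of forming text-grounded representations. This process uses a multimodal attention mechanism that is effective for understanding relationships between representations, and it influences the subsequent studies. M2FN shows an improvement of up to 38% over the previous best image assessment performance.
Inspired by computational neuroscience, a dual-level approach is introduced: a new structure, exemplified by GLAC Net, that understands context through various abstraction layers, from local to global and from low level to high level. GLAC Net surpasses the human evaluation results of the model that took first place in the visual storytelling challenge.
In addition, the DramaGAN model is presented, which successfully performs visual story generation by combining all of the above studies, followed by StoryDiffusion, which uses recent advanced auto-regressive latent diffusion models and delivers superior results through loosely connected linguistic representations. Compared with previous studies, the two models show improvements of up to 71% and 28%, respectively, in the fidelity of visual story generation.
These studies emphasize the importance of understanding text-grounded visual context. These advances are expected to contribute to the production of webtoons drawn according to user preferences, to support for the pre-production phase of filmmaking through concept visualization, and to the production of text-based educational materials that aid student understanding. This research also influences neuroscience research, inspiring efforts toward a deeper understanding of the structure of episodic memory.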

Language: eng

URI: https://hdl.handle.net/10371/210111

https://dcollection.snu.ac.kr/common/orgView/000000180681

Files in This Item:

000000180681.pdf 56.87 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Program in Brain Science (협동과정-뇌과학전공)
  - Theses (Ph.D. / Sc.D._협동과정-뇌과학전공)
