A Multimodal, Multispeaker Abstractive Summarization Dataset of Discussion Threads

Abstract: With recent advances in artificial intelligence and large language models, automatic summarization of documents such as news articles, dialogues, and online discussions has been improving rapidly. However, most of these improvements have been limited to text-only summarization and do not account for the fact that online discussions are increasingly multimodal, consisting of not only text but also videos and images. While the growing number of multimodal online discussions necessitates automatic summarization to save time and reduce content overload, existing summarization datasets do not sufficiently cover this domain.
To address this, we present mRedditSum, the first multimodal discussion summarization dataset. It consists of 3,033 discussion threads in which a post solicits advice regarding an issue described with an image and text, and the accompanying comments express diverse opinions. We annotate each thread with a human-written summary that captures both the essential information from the text and the details available only in the image. Experiments show that popular summarization models (GPT-3.5, BART, and T5) consistently improve in performance when visual information is incorporated.
We also introduce a novel method, cluster-based multi-stage summarization, that outperforms existing baselines and serves as a competitive baseline for future work.
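Abstract (translated from Korean): Driven by advances in artificial intelligence and large language models, automatic summarization of news, dialogues, and discussions has also progressed rapidly. However, most automatic summarization techniques are limited to summarizing text alone, and techniques for the many online discussions that involve videos and images have scarcely been explored. Current summarization datasets likewise consist only of text, and datasets that cover this multimodal domain are insufficient.
To address this, we present mRedditSum, the first multimodal discussion summarization dataset. Comprising 3,033 high-quality discussion threads collected from Reddit subreddits, the dataset consists of posts that seek advice based on an image and text, together with comments that respond with diverse opinions. In keeping with its multimodal nature, the summary for each thread was written by a human and combines information from the text with information available only in the image. We ran experiments with large language models commonly used for automatic summarization (T5, BART, and GPT-3) and showed that summarization performance improves when image captions or a vision-text fusion layer are used.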

Language: eng

URI: https://hdl.handle.net/10371/196485

https://dcollection.snu.ac.kr/common/orgView/000000177567

Files in This Item:

000000177567.pdf 5.52 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
