Automatic Story Extraction for Photo Stream via Coherence Recurrent Convolutional Neural Network
- College of Engineering, Dept. of Computer Science and Engineering
- Issue Date
- Graduate School, Seoul National University
- Deep learning; Recurrent Neural Network; Convolutional Neural Network; Photo stream; Story extraction; Coherence; Image captioning; Natural Language Processing
- Thesis (Master's) -- Graduate School, Seoul National University: Dept. of Computer Science and Engineering, 2017. 2. Gunhee Kim (김건희).
- Owing to advances in computing power, data collection, and the work of researchers, artificial intelligence has improved considerably. Research related to images in particular has progressed very quickly. Computers now approach human-level ability on some visual tasks and can do many things that people do through vision. It has become possible for them to see, understand, and express. Among these abilities, we focus on visual understanding and natural language expression. Various studies have been conducted on understanding visual information and expressing it in natural language. One task on which performance approaches that of a person is image caption generation on the Flickr30K and MS COCO datasets. However, this progress is still limited to simple data and tasks.
In this dissertation, we propose an approach for retrieving a sequence of natural sentences for an image stream. We deal with more complex, unrefined data than previous work. This dissertation extends the preliminary work of Park and Kim, and a revised version of it was submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
Since general users often take a series of pictures of their experiences, much online visual information exists in the form of image streams, for which it is better to take the whole image stream into consideration when producing natural language descriptions. While almost all previous studies have dealt with the relation between a single image and a single natural sentence, our work extends both the input and the output dimension to a sequence of images and a sequence of sentences. To this end, we propose a multimodal neural architecture called coherence recurrent convolutional network (CRCN), which consists of convolutional neural networks, bidirectional long short-term memory (LSTM) networks, and an entity-based local coherence model. Our approach learns directly from a vast user-generated resource of blog posts used as text-image parallel training data. We collect more than 22K unique blog posts with 170K associated images for the topics of NYC, Disneyland, Australia, and Hawaii. We demonstrate that our approach outperforms other state-of-the-art image captioning candidate methods, using both quantitative measures and user studies via Amazon Mechanical Turk.
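To make the retrieval setting concrete, the following is a minimal sketch of choosing one sentence sequence for an image stream. It is not the CRCN itself. The vectors are toy values standing in for learned image and sentence embeddings, the summed dot product is a simple stand-in for a learned compatibility score, and the coherence model is left out entirely. The names `sequence_score` and `retrieve_best` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def sequence_score(images: List[Vector], sentences: List[Vector]) -> float:
    """Sum of per-position image/sentence similarities.

    Assumption: one sentence per image, compared position by position.
    This toy score ignores coherence between neighbouring sentences.
    """
    if len(images) != len(sentences):
        raise ValueError("need exactly one sentence per image")
    return sum(dot(img, sent) for img, sent in zip(images, sentences))


def retrieve_best(images: List[Vector], candidates: List[List[Vector]]) -> int:
    """Return the index of the highest-scoring candidate sentence sequence."""
    if not candidates:
        raise ValueError("no candidate sequences given")
    scores = [sequence_score(images, cand) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: a stream of two images and two candidate sentence sequences.
image_stream = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[0.0, 1.0], [1.0, 0.0]],  # sentences in an order that mismatches the images
    [[0.9, 0.1], [0.2, 0.8]],  # sentences roughly aligned with the images
]

if __name__ == "__main__":
    print(retrieve_best(image_stream, candidates))
```

With these toy vectors the second candidate is selected, because its sentences line up with the images position by position. A model like the one described in the abstract would additionally need learned embeddings and a term rewarding coherent transitions between consecutive sentences.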