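Retrieval of Twitter messages without an explicit query term by means of serialization and discourse segmentation : 트위터 게시물간 발화 연쇄·담화 분절 탐지 및 질의어 비포함 트윗 검색에의 활용 [Detection of utterance serialization and discourse segmentation across Twitter posts and its application to retrieving tweets that do not contain the query term]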

Authors

박수지

Advisor
신효필
Major
College of Humanities, Department of Linguistics
Issue Date
2014-08
Publisher
Seoul National University Graduate School
Keywords
Centering theory; Discourse marker; Information retrieval; Social media; Twitter
Description
Thesis (Master's) -- Seoul National University Graduate School : Department of Linguistics, 2014. 8. 신효필.
Abstract
This thesis describes a phenomenon where multiple tweets constitute a single discourse segment, and builds two rule-based models to detect whether two consecutive tweets under the same authorship convey a single message. Given the length limit of 140 characters, a tweet should be interpreted as an element of a larger unit rather than an individual document. Considering such a larger unit as a discourse segment and a tweet as an utterance, this study makes the following assumptions based on Centering Theory:
(a) A tweet has at most one topic.
(b) In non-initial tweets of a discourse segment, a topic word is realized as an anaphor, in particular as a zero form in Korean.
(c) Coherence between two tweets written by the same author is considered only if there is no tweet between them.
(d) In two consecutive tweets, continuing the topic is preferred to shifting it.

To predict tweet serialization and discourse segmentation, two criteria are used: temporal proximity and discourse markers. Temporal proximity indicates whether the time interval between two tweets is below a threshold, which can be either a constant or a user-specific value. Discourse markers are classified into continuation markers and shift markers. Continuation markers include web-specific ones such as '>>', '(continued)', and numbers, and linguistic ones such as conjunctions and referring expressions. Shift markers include web-specific ones such as 'RT' and URLs, and linguistic ones such as interjections and temporal adverbs. The two models treat these factors differently. The Strict Serialization (SS) model regards two tweets as serialized only if their interval is extremely short or they have a continuation marker. In contrast, the Serialization Plus Discourse Segmentation (SPDS) model, following assumption (d) that continuation is preferred to shifting, considers two tweets serialized if their interval is not too long, and terminates a discourse segment only if the current tweet has a shift marker.
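The two decision rules can be illustrated with a minimal Python sketch. It is one possible reading of the description above, not the thesis's implementation: the marker lists are a small illustrative subset, the thresholds `SHORT_GAP` and `LONG_GAP` are hypothetical constants (the abstract gives no values), and the names `Tweet`, `ss_serialized`, `spds_serialized`, and `segment` are invented for this example. In particular, the SPDS rule is read here as "serialized when the gap is not too long and the current tweet has no shift marker".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative subsets only; the full marker inventories are not given in the abstract.
CONTINUATION_MARKERS = (">>", "(continued)")
SHIFT_MARKERS = ("RT @", "http://", "https://")

# Hypothetical thresholds; the abstract does not state the actual values.
SHORT_GAP = timedelta(minutes=1)   # "extremely short" interval for SS
LONG_GAP = timedelta(minutes=30)   # "not too long" interval for SPDS


@dataclass
class Tweet:
    text: str
    created_at: datetime


def has_marker(text, markers):
    """Return True if any marker string occurs in the tweet text."""
    return any(marker in text for marker in markers)


def ss_serialized(prev, curr):
    """Strict Serialization: very short gap OR a continuation marker."""
    gap = curr.created_at - prev.created_at
    return gap <= SHORT_GAP or has_marker(curr.text, CONTINUATION_MARKERS)


def spds_serialized(prev, curr):
    """SPDS (assumed reading): gap not too long AND no shift marker."""
    gap = curr.created_at - prev.created_at
    return gap <= LONG_GAP and not has_marker(curr.text, SHIFT_MARKERS)


def segment(tweets, serialized):
    """Group one author's time-ordered tweets into discourse segments."""
    segments = []
    for tweet in tweets:
        if segments and serialized(segments[-1][-1], tweet):
            segments[-1].append(tweet)   # continue the current segment
        else:
            segments.append([tweet])     # start a new segment
    return segments
```

Under these assumptions, SS starts a new segment whenever positive evidence of continuation is missing, while SPDS keeps extending a segment until it sees evidence of a shift, which mirrors the preference for continuation stated in assumption (d).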

To verify that the proposed models are useful, an information retrieval task is implemented. As assumption (b) predicts and the data confirm, topic words are implicit in some tweets within discourse segments consisting of multiple tweets. The current search system cannot retrieve such tweets and thus fails to satisfy users' need to find diverse opinions on Twitter. By using the discourse segments compiled by the proposed models, the system can retrieve tweets that belong to the same discourse segment as an explicitly relevant one, without retrieving too many irrelevant tweets. Consequently, the proposed models achieve higher mean precision than the Query Matching model and the TF-IDF Weighting model. Furthermore, since the SPDS model outperforms the SS model, the principle that topic continuation is unmarked also seems to hold for social media. Lastly, this thesis finds that linguistic markers such as interjections, which have typically been treated as stopwords in information retrieval, are useful for discourse segment detection.
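The retrieval idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the evaluated system: tweets are plain strings, relevance is approximated by case-insensitive substring matching, and the function names `query_matching` and `segment_expanded` are hypothetical. The segments are assumed to have been produced beforehand by a model such as SS or SPDS.

```python
def query_matching(tweets, query):
    """Baseline: return only tweets that contain the query term explicitly."""
    q = query.lower()
    return [t for t in tweets if q in t.lower()]


def segment_expanded(segments, query):
    """Return every tweet in any segment that has at least one explicit match."""
    q = query.lower()
    results = []
    for seg in segments:
        if any(q in t.lower() for t in seg):
            results.extend(seg)   # include tweets whose topic word is implicit
    return results


# Toy data: the second tweet continues the first without naming the topic.
segments = [
    ["The new phone came out today.", "Looks great but is too expensive."],
    ["Lunch was good."],
]
flat = [t for seg in segments for t in seg]

baseline_hits = query_matching(flat, "phone")
expanded_hits = segment_expanded(segments, "phone")
```

In this toy data the baseline finds only the tweet that names the topic, while the segment-based variant also returns its continuation; the unrelated single-tweet segment is left out in both cases.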
Language
English
URI
https://hdl.handle.net/10371/131950
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.