Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation

Han Kyul Kim

DC Field	Value	Language
dc.contributor.advisor	조성준 (Sungzoon Cho)	-
dc.contributor.author	Han Kyul Kim	-
dc.date.accessioned	2017-07-14T03:24:58Z	-
dc.date.available	2017-07-14T03:24:58Z	-
dc.date.issued	2016-08	-
dc.identifier.other	000000135948	-
dc.identifier.uri	https://hdl.handle.net/10371/123602	-
dc.description	Thesis (Master's) -- Seoul National University Graduate School: Department of Industrial Engineering, Data Mining major, August 2016. Advisor: 조성준.	-
dc.description.abstract	Two document representation methods are mainly used in solving text mining problems. Known for its intuitive and simple interpretability, the bag-of-words method represents a document as a vector of its word frequencies. However, this method suffers from the curse of dimensionality and fails to preserve accurate proximity information as the number of unique words increases. Furthermore, it assumes every word to be independent, disregarding the impact of semantically similar words on preserving document proximity. On the other hand, doc2vec, a basic neural network model, creates low-dimensional vectors that successfully preserve proximity information. However, it loses interpretability, as the meaning behind each feature is indescribable. This paper proposes the bag-of-concepts method as an alternative document representation method that overcomes the weaknesses of these two methods. The proposed method creates concepts by clustering word vectors generated from word2vec and uses the frequencies of these concept clusters to represent document vectors. Through these data-driven concepts, the proposed method effectively incorporates the impact of semantically similar words on preserving document proximity. With an appropriate weighting scheme such as concept frequency-inverse document frequency, the proposed method provides better document representation than previously suggested methods and also offers intuitive interpretability of the generated document vectors. Text mining models subsequently built on the proposed representation, such as decision trees, can also provide interpretable and intuitive reasons why certain collections of documents differ from others.	-
dc.description.tableofcontents	Chapter 1. Introduction (p. 1); Chapter 2. Related Work (p. 5); 2.1 Bag-of-Words (p. 5); 2.2 Word2Vec (p. 6); 2.3 Doc2Vec (p. 10); Chapter 3. Proposed Method (p. 13); Chapter 4. Data Set Description (p. 15); Chapter 5. Experiment Result (p. 17); 5.1 Representation Effectiveness (p. 17); 5.2 Representation Interpretability (p. 24); 5.3 Model Explainability (p. 31); Chapter 6. Conclusion (p. 37); Bibliography (p. 39); Abstract (p. 44)	-
dc.format	application/pdf	-
dc.format.extent	1833395 bytes	-
dc.format.medium	application/pdf	-
dc.language.iso	en	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	bag-of-concepts	-
dc.subject	interpretable document representation	-
dc.subject	word2vec clustering	-
dc.subject.ddc	670	-
dc.title	Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation	-
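dc.title.alternative	Bag-of-Concepts: 단어에 대한 분산표상의 군집화를 통한 해석 가능한 문서 표현법 (Bag-of-Concepts: An Interpretable Document Representation Method through Clustering Distributed Representations of Words)	-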
dc.type	Thesis	-
dc.contributor.AlternativeAuthor	김한결	-
dc.description.degree	Master	-
dc.citation.pages	44	-
dc.contributor.affiliation	College of Engineering, Department of Industrial Engineering	-
dc.date.awarded	2016-08	-
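
The pipeline summarized in the abstract (cluster word vectors into concepts, count concept occurrences per document, then apply concept frequency-inverse document frequency weighting) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the two-dimensional vectors in `WORD_VECTORS` are hand-made stand-ins for trained word2vec embeddings, plain k-means with fixed seed words stands in for whatever clustering the thesis uses, and `cf_idf` is defined by analogy with TF-IDF.

```python
import math
from collections import Counter

# Hypothetical 2-D "word vectors"; a real pipeline would train word2vec instead.
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "puppy": (0.85, 0.15),
    "car": (0.1, 0.9), "truck": (0.2, 0.8), "engine": (0.15, 0.95),
}


def kmeans(points, init_centroids, iterations=10):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = [tuple(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def build_concepts(word_vectors, seed_words):
    """Map every word to a concept (cluster) id, seeding centroids at seed_words."""
    words = sorted(word_vectors)
    labels = kmeans(
        [word_vectors[w] for w in words],
        [word_vectors[w] for w in seed_words],
    )
    return dict(zip(words, labels))


def bag_of_concepts(tokens, word_to_concept, n_concepts):
    """Count how often each concept occurs in one tokenized document."""
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    return [counts.get(c, 0) for c in range(n_concepts)]


def cf_idf(doc_counts):
    """Weight concept counts by analogy with TF-IDF (assumed formulation)."""
    n_docs = len(doc_counts)
    n_concepts = len(doc_counts[0])
    # Document frequency: number of documents in which each concept appears.
    df = [sum(1 for d in doc_counts if d[c] > 0) for c in range(n_concepts)]
    weighted = []
    for d in doc_counts:
        total = sum(d) or 1  # avoid division by zero for empty documents
        weighted.append([
            (d[c] / total) * math.log(n_docs / df[c]) if df[c] else 0.0
            for c in range(n_concepts)
        ])
    return weighted


docs = [
    ["cat", "dog", "puppy", "car"],
    ["car", "truck", "engine"],
    ["dog", "cat"],
]
word_to_concept = build_concepts(WORD_VECTORS, seed_words=("cat", "car"))
counts = [bag_of_concepts(d, word_to_concept, n_concepts=2) for d in docs]
weights = cf_idf(counts)
```

Because each feature is a cluster of related words, a document vector from this sketch can be read directly: here concept 0 groups the animal words and concept 1 the vehicle words, so a large weight on concept 1 says the document is mostly about vehicles.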

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Industrial Engineering
  - Theses (Master's Degree, Industrial Engineering)

Files in This Item:

000000135948.pdf 1.75 MB