Exploration on utilization of word embedding for topic modeling in Korean data

백시온

한국어 토픽모델링을 위한 단어 임베딩 활용 가능성 탐색 : Exploration on utilization of word embedding for topic modeling in Korean data

Authors: 백시온

Advisor: 이준환

Major: Interdisciplinary Program in Cognitive Science, College of Humanities

Issue Date: 2019-02

Publisher: Seoul National University Graduate School

Description: Thesis (Master's) -- Seoul National University Graduate School: Interdisciplinary Program in Cognitive Science, College of Humanities, 2019. 2. 이준환.

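Abstract: This study aims to explore how word embeddings can be used for topic modeling of Korean data. Topic modeling is a methodology for finding a set of topics in unstructured documents, and it is a useful text-mining technique because it can effectively cluster documents in large-scale data so that their content can be understood. Word embedding is a method of representing words as vectors, and a number of embedding methods that exploit the structure of individual languages have recently been studied. As an attempt to apply these word-embedding results to topic modeling, this study qualitatively compares and analyzes the outputs of LDA (Latent Dirichlet Allocation), which is widely used in topic modeling, and LDA2VEC, a model that combines LDA with word embeddings. The experiments used four recently proposed types of word embedding, Word2Vec, GloVe, FastText, and SISG(jm), and topic modeling was performed on online articles from the society section of Naver News and on data from BIGKinds. The output of each model was evaluated based on the frequency and weight of the associated words within each topic. The study then proposes a way to easily refine topic-model output by using the degree of agreement among the outputs of multiple models: topic modeling is performed on the same data with LDA2VEC using different word embeddings, and the words and topics that overlap across the models' outputs are adopted as the final result. Comparing the output refined in this way with the LDA results suggests that a more refined topic-model output can be obtained. This study offers methodological implications in that it provides an alternative to LDA for topic modeling of Korean data and, at the same time, proposes a way to easily identify core keywords in topic-model output.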
The goal of this research is to explore appropriate methodologies for topic modeling of Korean data. Topic modeling is widely used in various research areas because of its capability for document clustering and topic extraction. Word embedding, on the other hand, is known for its effective representation of natural language, and recent research in the field has developed language-specific representations. LDA2VEC is an approach that exploits the benefits of both LDA and word embeddings by using a context vector. This research qualitatively analyzes the results of LDA2VEC models combined with four different types of word embeddings in comparison to LDA. Based on this qualitative analysis, the research suggests a method to fine-tune topic model results based on the agreement of different models and thus provides an alternative to LDA. This method is expected to be especially useful for researchers dealing with text data without prior knowledge, or with a task requiring identification of keywords.
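
The agreement-based refinement described in the abstract can be illustrated with a minimal Python sketch. This record does not specify how topics from different models are matched, so everything here is an assumption made for illustration: the function name `refine_by_agreement`, the use of Jaccard similarity to pair topics, the choice of the first model as the reference, and the `min_overlap` threshold. The input word lists are toy examples, not results from the thesis, and training the LDA2VEC models themselves is outside the scope of the sketch.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine_by_agreement(
    model_topics: Dict[str, List[List[str]]],
    min_overlap: int = 2,  # hypothetical threshold, not taken from the thesis
) -> List[Set[str]]:
    """Keep only the words on which all models agree.

    For each topic of the first (reference) model, find the most similar
    topic in every other model and intersect their top words. Topics whose
    shared word set has fewer than `min_overlap` words are dropped.
    """
    names = list(model_topics)
    reference, others = names[0], names[1:]
    refined: List[Set[str]] = []
    for ref_topic in model_topics[reference]:
        ref_words = set(ref_topic)
        shared = set(ref_words)
        for name in others:
            # Assumed matching rule: pair topics by highest Jaccard similarity.
            best = max(
                model_topics[name],
                key=lambda topic: jaccard(ref_words, set(topic)),
                default=[],
            )
            shared &= set(best)
        if len(shared) >= min_overlap:
            refined.append(shared)
    return refined


# Toy top-word lists per model (illustrative only).
topics = {
    "word2vec": [["police", "court", "trial", "weather"],
                 ["school", "student", "exam", "bus"]],
    "fasttext": [["school", "teacher", "student", "exam"],
                 ["court", "police", "trial", "judge"]],
    "glove": [["trial", "court", "police", "prison"],
              ["student", "exam", "rain", "school"]],
}
refined = refine_by_agreement(topics)
for words in refined:
    print(sorted(words))
```

In this toy input, words that only one model places in a topic (such as "weather" or "bus") fall out of the intersection, which mirrors the idea of keeping only the keywords that several embedding variants agree on.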

Language: kor

URI: https://hdl.handle.net/10371/151565

Files in This Item:

000000155233.pdf 1.68 MB

Appears in Collections:

College of Humanities (인문대학)
- Program in Cognitive Science (협동과정-인지과학전공)
  - Theses (Master's Degree_협동과정-인지과학전공)
