Detecting Detailed Topic Changes with Unsupervised Word Embedding Based Topic Modeling

안병은

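Detecting Detailed Topic Changes with Unsupervised Word Embedding Based Topic Modeling (original title: 비지도 단어 임베딩 기반 토픽 모델링 상세 주제 변화 탐지) : Unsupervised Word Embedding Based Topic Modeling Extracts Latent Biomedical Knowledge from Korean Gov. Research Proposals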

Authors: 안병은

Advisor: 김용대

Issue Date: 2020

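Publisher: Seoul National University Graduate School

Description: Thesis (Master's)--Seoul National University Graduate School: Graduate School of Engineering Practice, Department of Applied Engineering (응용공학과), 2020. 2. 김용대.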

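Abstract: Written text is one of the most efficient ways to record and convey information. Academic knowledge in particular is overwhelmingly expressed in text, and the latest results are shared as papers that carry their date of writing. As the volume of information grows, however, efficient analysis has become necessary, and because text is unstructured data, its meaning is hard to extract. Earlier analyses relied on supervised learning, which classifies data using a large body of labeled documents, but that approach is limited when the goal is to derive meaning that reflects changing circumstances.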
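This study aims to analyze the main topics of a large document collection and the detailed changes in those topics over time, using word embeddings learned without supervision. To this end, government-run research projects on biopharmaceutical drug development from 2006 to 2017 were analyzed with topic modeling, a method for clustering natural-language text. During preprocessing, NPMI was applied to improve the recognition of technical terms, extracting proper nouns and compound nouns specific to the bio industry. For topic modeling, LDA (Latent Dirichlet Allocation), an unsupervised machine learning method, was used to explore macro-level topic changes across the documents. Next, LDA2vec, which combines conventional LDA with word embedding (a semi-supervised topic model based on GloVe word embeddings trained on 150,000 biotechnology-related documents), was used to detect not only the topics of new drug research over time but also, going beyond existing methods, the relationships between words within a topic. Given a specific word, this made it possible to identify the similarity between topics from different periods and to analyze how the words surrounding that word move within the similar topics. Because the influence among words within a topic can be taken into account, this opens up the possibility of detailed content analysis.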
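The period-by-period comparison described above can be illustrated with a minimal sketch. It assumes that each period yields word vectors and topic vectors in one shared space, and that similarity is measured with cosine similarity; the function names, the two-dimensional toy vectors, and the vocabulary are all hypothetical and are not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_topic(word_vec, topic_vecs):
    """Name of the topic vector most similar to the given word vector."""
    return max(topic_vecs, key=lambda name: cosine(word_vec, topic_vecs[name]))

def nearest_words(target, word_vecs, k=2):
    """The k words whose vectors are closest to the target word."""
    others = [w for w in word_vecs if w != target]
    others.sort(key=lambda w: cosine(word_vecs[target], word_vecs[w]), reverse=True)
    return others[:k]

def neighbor_shift(target, early_vecs, late_vecs, k=2):
    """Words that entered or left the target's neighborhood between periods."""
    before = set(nearest_words(target, early_vecs, k))
    after = set(nearest_words(target, late_vecs, k))
    return {"gained": sorted(after - before), "lost": sorted(before - after)}

# Hypothetical toy embeddings for two periods (illustration only).
early = {"antibody": [1.0, 0.0], "vaccine": [0.9, 0.1],
         "protein": [0.8, 0.3], "genome": [0.0, 1.0]}
late = {"antibody": [0.6, 0.8], "vaccine": [1.0, 0.0],
        "protein": [0.7, 0.6], "genome": [0.3, 0.9]}
# Hypothetical topic vectors living in the same space as the words.
topics = {"topic_immunology": [1.0, 0.1], "topic_genomics": [0.1, 1.0]}

shift = neighbor_shift("antibody", early, late)
early_topic = closest_topic(early["antibody"], topics)
late_topic = closest_topic(late["antibody"], topics)
```

With these toy vectors, the neighborhood of "antibody" loses "vaccine" and gains "genome", and its closest topic moves from `topic_immunology` to `topic_genomics`. Real embeddings would have many more dimensions, and comparing separately trained periods may also require aligning their vector spaces first.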
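This study showed that potential future flows of information can be extracted from existing data. Because topic vectors and word vectors learned from a large body of scholarly literature can provide information on detailed topic changes over time, the approach is expected to be useful for setting future research directions and in the many fields that require natural-language analysis.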
Extracting latent knowledge from overwhelming volumes of text has long been a major challenge in natural language processing (NLP). In recent years, advances in deep learning have brought breakthroughs to NLP, with new top scores on a range of language tasks. To date, however, understanding information and creating new knowledge remain the territory of human intelligence.
In this study, topic modeling algorithms were applied to extract latent information from a large document collection: Korean government research proposals from 2006 to 2017, focusing on new drug development. To improve the detection of domain-specific vocabulary during preprocessing, NPMI (Normalized Pointwise Mutual Information) was used to merge pairs of nouns into compound nouns. For topic modeling, LDA (Latent Dirichlet Allocation), an unsupervised machine learning algorithm, was used to explore the overall topic distributions. In addition, LDA2vec, a semi-supervised deep learning model that trains topic vectors alongside word embedding vectors in the same vector space, was applied to observe correlations among specific words within a topic.
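As a rough illustration of the NPMI preprocessing step, the sketch below scores adjacent token pairs with NPMI(a, b) = ln(p(a, b) / (p(a) p(b))) / (-ln p(a, b)) and merges high-scoring pairs into one token. The tokenized sentences, the `min_count` filter, the 0.7 threshold, and the function names are assumptions made for this example; the thesis's actual settings are not given in the abstract.

```python
import math
from collections import Counter

def npmi_scores(sentences, min_count=2):
    """NPMI for each adjacent token pair seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())  # one normalizer for all probabilities
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / total
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        pmi = math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = pmi / -math.log(p_ab)  # close to 1: strongly bound pair
    return scores

def merge_compounds(tokens, scores, threshold=0.7):
    """Join adjacent tokens whose NPMI meets the (assumed) threshold."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, -1.0) >= threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical, already tokenized toy corpus (illustration only).
sentences = [
    ["stem", "cell", "therapy"],
    ["stem", "cell", "culture"],
    ["cancer", "cell", "therapy"],
    ["stem", "cell", "line"],
]
scores = npmi_scores(sentences)
compound = merge_compounds(["stem", "cell", "therapy"], scores)
```

On this toy corpus the pair ("stem", "cell") scores about 0.79 and is merged into `stem_cell`, while ("cell", "therapy") scores about 0.61 and stays separate. For Korean text, a morphological analyzer would be needed first to produce the noun tokens.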
Without any labeled data or injected biochemical knowledge, word embedding vectors trained together with topic vectors provide further interpretable information. In addition, three novel methods are proposed to extract latent features from words and topics and to observe future research trends. These findings emphasize the possibility of precise knowledge understanding. Expected fields of application include new drug investigation, business intelligence, and the analysis of a variety of text data.

Language: kor

URI: http://dcollection.snu.ac.kr/common/orgView/000000160297

Files in This Item:

000000160297.pdf 1.77 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Graduate School of Engineering Practice (공학전문대학원)
  - Theses (Master's Degree_공학전문대학원)