Knowledge Graph Construction through Relation Extraction within Korean Sentences (한국어 문장 내 관계 추출을 통한 지식 그래프 구축)

김석기

한국어 문장 내 관계 추출을 통한 지식 그래프 구축 : Knowledge Graph Construction through Relation Extraction within Korean Sentences

Authors: 김석기

Advisor: 신효필

Issue Date: 2022

Publisher: Seoul National University Graduate School (서울대학교 대학원)

Keywords: knowledge graph ; relation extraction ; relation representation ; BERT

Description: Thesis (Master's) -- Seoul National University Graduate School : Graduate School of Data Science, Department of Data Science, 2022.2. 신효필.

Abstract: With the rapid growth of computing performance, many studies have applied artificial intelligence to a wide range of societal problems. Recent work has shown that incorporating a knowledge graph (KG) into AI tasks improves performance, so constructing a knowledge graph is valuable in itself. However, manual construction is costly because it requires a great deal of human effort. Accordingly, automatic KG construction has been actively studied. One approach is to extract relational information from sentences using natural language processing.
In this study, we train a relation extraction model and propose a method for constructing a Korean KG. Many studies address English KG construction, but comparatively few address Korean, which makes building a Korean KG difficult. To help lay a foundation for this under-studied area, we demonstrate the process of constructing a KG by training a Korean relation extraction model.
Before training the Korean relation extraction model, we ran two preliminary experiments aimed at improving BERT-based relation extraction. First, we tested whether factors related to sentence structure affect performance. We chose the distance between the two entities and the length of the sentence as the factors, and sought the conditions under which the relation representation is learned best. Neither factor had a significant effect on performance, and we took this result into account when collecting sentences for the Korean relation extraction dataset. Second, we compared relation extraction performance across the BERT output representations used as the relation representation. The Korean relation extraction model is trained with the output representation that performed best in this experiment.
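The two sentence-structure factors named above can be made concrete with a small sketch. The abstract does not say how the factors were measured, so the whitespace tokenization, the `Example` container, and the definition of distance as the number of tokens between the two entity spans are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive (assumed convention)


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one sentence with two marked entities."""
    tokens: List[str]
    subj: Span
    obj: Span


def sentence_length(ex: Example) -> int:
    """Sentence length, measured here in tokens (an assumption)."""
    return len(ex.tokens)


def entity_distance(ex: Example) -> int:
    """Number of tokens strictly between the two entity spans.

    Returns 0 when the spans are adjacent or overlap. The order of
    subject and object in the sentence does not matter.
    """
    first, second = sorted([ex.subj, ex.obj])
    return max(0, second[0] - first[1])


# Toy sentence, pre-split on whitespace for illustration only.
example = Example(
    tokens="세종대왕 은 1443 년 에 한글 을 창제 했다".split(),
    subj=(0, 1),  # 세종대왕
    obj=(5, 6),   # 한글
)
print(sentence_length(example), entity_distance(example))  # prints: 9 4
```

Values like these could be used to bucket sentences before comparing model performance per bucket; a subword tokenizer would give different counts than the whitespace split used here.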
To train the model, we constructed a Korean relation extraction dataset. The Korean Wikipedia corpus served as the large-scale text corpus and Wikidata as the large-scale knowledge graph. We extracted entities from the Korean Wikipedia corpus and linked them to Wikidata. The dataset is published on GitHub.
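The abstract does not state the rule used to assign relation labels once entities are linked to Wikidata. A common choice in this setting is distant supervision: if two linked entities in a sentence appear as head and tail of a knowledge-graph triple, the sentence is labeled with that triple's relation. The sketch below assumes that heuristic; the function name and the identifiers `Q_KOREA`, `Q_SEOUL`, and `P_CAPITAL` are placeholders, not real Wikidata IDs.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head id, relation id, tail id)


def label_sentence(
    sentence: str,
    linked_entities: Dict[str, str],
    triples: Set[Triple],
) -> List[Dict[str, str]]:
    """Label a sentence by distant supervision (assumed heuristic).

    linked_entities maps each mention in the sentence to a knowledge-graph
    entity id. Every ordered pair of mentions whose ids form a known
    (head, relation, tail) triple yields one labeled example.
    """
    # Index relations by (head, tail) for constant-time lookup.
    by_pair: Dict[Tuple[str, str], List[str]] = {}
    for head, relation, tail in triples:
        by_pair.setdefault((head, tail), []).append(relation)

    examples = []
    for (m1, e1), (m2, e2) in permutations(linked_entities.items(), 2):
        for relation in sorted(by_pair.get((e1, e2), [])):
            examples.append(
                {"sentence": sentence, "subj": m1, "obj": m2, "relation": relation}
            )
    return examples


# Placeholder ids and a toy sentence, for illustration only.
toy_triples = {("Q_KOREA", "P_CAPITAL", "Q_SEOUL")}
toy_links = {"대한민국": "Q_KOREA", "서울": "Q_SEOUL"}
labeled = label_sentence("대한민국의 수도는 서울이다.", toy_links, toy_triples)
print(labeled)
```

Distant supervision is known to produce noisy labels, since a sentence can mention both entities without expressing the relation; whether and how the thesis filters such cases is not stated in the abstract.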
Using this dataset, we trained the Korean relation extraction model by fine-tuning KR-BERT-MEDIUM, a BERT model pre-trained on Korean. After evaluating the model, we built a Python package, korre, which performs Korean relation extraction, and released it on GitHub.
Finally, we used the model to construct a Korean KG for the newspaper article domain and visualized it with neo4j. Based on this, we present the limitations of the trained model and suggest directions for future research on Korean KG construction to overcome them.
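How the extracted triples were loaded into neo4j is not described in the abstract. As one possible approach, the sketch below turns a (head, relation, tail) triple into a Cypher `MERGE` statement plus a parameter dictionary. The `Entity` label, the `name` property, and the function name are assumptions made for illustration.

```python
import re
from typing import Dict, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def triple_to_cypher(triple: Triple) -> Tuple[str, Dict[str, str]]:
    """Build a Cypher MERGE query and its parameters for one triple.

    Node names are passed as parameters. The relationship type is embedded
    in the query text, because older Cypher versions do not accept a
    parameter in that position, so it is reduced to word characters and
    backtick-quoted first.
    """
    head, relation, tail = triple
    rel_type = re.sub(r"\W+", "_", relation).strip("_")
    if not rel_type:
        raise ValueError(f"relation has no usable characters: {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:`{rel_type}`]->(t)"
    )
    return query, {"head": head, "tail": tail}


# Toy triple for illustration only.
query, params = triple_to_cypher(("서울", "country", "대한민국"))
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` keeps repeated triples from producing duplicate nodes or edges. The returned pair could then be handed to a Neo4j client for execution.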
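As computing performance has advanced rapidly, efforts have emerged to solve various societal problems with artificial intelligence. Amid this trend, recent studies report that introducing a knowledge graph (KG) into AI tasks is effective in improving performance. Building a knowledge graph is therefore meaningful in itself. At the same time, building one takes the effort of many people, so it is inefficient in terms of cost. For this reason, research on constructing knowledge graphs automatically is being actively pursued. One known way to do so is to extract relational information from sentences using natural language processing.
This study trains a model that extracts relations from Korean sentences and presents a method for using it to construct a Korean knowledge graph. Much research exists on constructing English knowledge graphs, but for Korean the number of studies is relatively small, which makes constructing a knowledge graph difficult. To help lay the groundwork for the scarce research on Korean knowledge graph construction, this study sets out to show the process of constructing a knowledge graph by training a Korean relation extraction model.
Before training the Korean relation extraction model, this study conducted two preliminary experiments to improve the performance of the relation extraction task using BERT. The first examined whether factors related to sentence structure affect relation extraction performance. The distance between the two entities in a sentence and the length of the sentence itself were selected as the factors, with the aim of identifying the conditions under which the relation representation is learned better. According to the results, the two factors did not have a large effect on relation extraction performance, and the sentences for the Korean relation extraction dataset were collected with this result in mind. The second compared performance according to the output representation used as the relation representation when performing relation extraction with BERT. Reflecting the results, the Korean relation extraction model is trained using the output representation of the best-performing model.
To train the Korean relation extraction model, this study built its own Korean relation extraction dataset. Korean Wikipedia corpus data was used as the large-scale corpus, and Wikidata as the large-scale knowledge graph. Entities were extracted from the Korean Wikipedia corpus data through named entity recognition and linked to Wikidata. The resulting Korean relation extraction dataset was released through GitHub.
Using the dataset built in this study, the Korean relation extraction model was trained by fine-tuning KR-BERT-MEDIUM, one of the BERT models pre-trained on Korean. After evaluating the model's performance, a Python package, korre, that performs the Korean relation extraction task with the trained model was created and released through GitHub.
Using the trained Korean relation extraction model, this study constructed a knowledge graph for the newspaper article domain and visualized it with neo4j. On this basis, it presents the limitations of the trained model and the directions Korean knowledge graph construction research should take to overcome them.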

Language: kor

URI: https://hdl.handle.net/10371/183640

https://dcollection.snu.ac.kr/common/orgView/000000170155

Files in This Item:

000000170155.pdf 4.58 MB

Appears in Collections:

Graduate School of Data Science (데이터사이언스 대학원)
- Theses (Master's Degree_Department of Data Science)
