Knowledge Extraction and Integration for Knowledge Base Construction using Machine Learning

Abstract: Knowledge bases have been successfully applied to many real-world applications such as question answering, recommender systems, and natural language understanding. However, building a large knowledge base from human annotations takes considerable time, effort, and money. Moreover, it is almost impossible to manually add the large volume of newly created relational facts in a timely manner. Accordingly, automated knowledge base construction has attracted a lot of attention over the last decade.
Knowledge fusion is a method for automatically constructing knowledge bases from the entire web. It first extracts information from many web pages using multiple relation extractors. Since the collected information is usually noisy due to extraction errors, knowledge fusion then identifies the correct information using truth discovery techniques and adds it to the knowledge base. In this dissertation, we focus on extending the coverage and improving the accuracy of the knowledge fusion process.
With the development of deep learning, many recent works on relation extraction use deep learning techniques to improve accuracy. Since neural relation extraction models require a large amount of training data, they usually rely on distant supervision, which automatically generates training data by assuming that if a relation between a pair of entities exists in a knowledge base, all sentences that contain these two entities express this relation. However, distant supervision inevitably suffers from the wrong labeling problem, which degrades the accuracy of relation extraction. We develop a method that trains relation extraction models more effectively by also using human-annotated data. To extend the coverage of relation extraction, we also investigate the problem of extracting information about the topic entity. The topic entity of a document is the entity that is mainly described in the document. Since the topic entity is often omitted from sentences, existing relation extraction models often fail to find the relations that involve it. To extract those relations, we propose a topic-aware relation extraction model.
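The distant supervision assumption and its wrong labeling problem can be made concrete with a small sketch. This is only an illustration, not the dissertation's training method: the toy knowledge base `KB`, the sentences, and the helper `distant_labels` are all hypothetical, and entity matching is reduced to plain substring search.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (head entity, tail entity) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Honolulu"): "born_in",
}

def distant_labels(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a KB fact
    with that fact's relation (the distant supervision assumption)."""
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            # Simplification: entity mentions are found by substring match.
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled

SENTENCES = [
    "Barack Obama was born in Honolulu.",        # really expresses born_in
    "Barack Obama visited Honolulu last week.",  # does not, but is labeled anyway
    "Honolulu is the capital of Hawaii.",        # only one entity, so skipped
]

LABELED = distant_labels(SENTENCES, KB)
for sentence, head, tail, relation in LABELED:
    print(f"{relation}({head}, {tail}) <- {sentence}")
```

The second sentence receives the `born_in` label even though it says nothing about a birthplace. This kind of noise is what additional human-annotated data is meant to counteract.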
After the relations are extracted from web pages, a truth discovery algorithm resolves the conflicts in the extracted information and identifies the correct information. Existing works on truth discovery usually assume that claimed values are mutually exclusive and that only one of them is correct. However, many claimed values are not mutually exclusive because they form a hierarchy, so the hierarchical structure must be taken into account to infer the truths. We propose a probabilistic model that infers the truth by considering the hierarchical structure of the claimed values. Nevertheless, if many relation extractors generate similar errors, some of the errors might not be corrected by unsupervised truth discovery algorithms. Thus, we take advantage of human cognitive abilities by crowdsourcing the refinement of extracted information. We present a task assignment algorithm that maximizes the accuracy improvement under a fixed crowdsourcing budget.
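To see why hierarchy matters for truth discovery, consider a simple vote-counting sketch. It is a deliberately simplified stand-in for the probabilistic model mentioned above, not that model: the `PARENT` hierarchy, the claims, the `min_share` threshold, and the rule "pick the most specific value whose hierarchical support reaches the threshold" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical value hierarchy: child -> parent (a city is inside a state, etc.).
PARENT: Dict[str, str] = {
    "Honolulu": "Hawaii",
    "Hawaii": "USA",
    "Chicago": "Illinois",
    "Illinois": "USA",
}

def ancestors(value: str, parent: Dict[str, str]) -> List[str]:
    """Return all ancestors of a value, nearest first."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain

def flat_vote(claims: List[str]) -> str:
    """Baseline: treat values as mutually exclusive and take the majority."""
    return Counter(claims).most_common(1)[0][0]

def hierarchical_vote(
    claims: List[str], parent: Dict[str, str], min_share: float = 0.5
) -> Optional[str]:
    """A claim for a value also counts as support for each of its ancestors.
    Return the most specific value supported by at least min_share of claims."""
    support: Counter = Counter()
    for value in claims:
        for node in [value] + ancestors(value, parent):
            support[node] += 1
    threshold = min_share * len(claims)
    candidates = [v for v, s in support.items() if s >= threshold]
    if not candidates:
        return None
    # Prefer deeper (more specific) values; break ties by support.
    return max(candidates, key=lambda v: (len(ancestors(v, parent)), support[v]))

# Hypothetical claimed birthplaces from eight extractions.
CLAIMS = ["Honolulu", "Honolulu", "Hawaii", "Hawaii", "USA",
          "Chicago", "Chicago", "Chicago"]

print(flat_vote(CLAIMS))                  # Chicago
print(hierarchical_vote(CLAIMS, PARENT))  # Hawaii
```

Under flat voting the three "Chicago" claims win because the compatible claims "Honolulu", "Hawaii", and "USA" split the vote. Once a claim is allowed to support its ancestors, "Hawaii" gathers four of the eight claims and is returned instead.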
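Korean abstract (translated): Knowledge bases are used successfully in many areas, including question answering systems, recommender systems, and natural language understanding. However, building a large knowledge base by hand costs a great deal of time, effort, and money. Moreover, it is nearly impossible for people to promptly add the many facts that are constantly being created. Accordingly, research on automatically constructing knowledge bases has attracted much attention over the past ten years.
Knowledge fusion is a representative method for building or extending a knowledge base by collecting data from the entire web, with no restriction on domain. Knowledge fusion first extracts information from many web pages using several relation extraction techniques. Because of the limited accuracy of relation extractors, among other causes, incorrect facts are often extracted. Knowledge fusion then uses truth discovery techniques to find the correct information among the extracted information and add it to the knowledge base. Because knowledge fusion consists of these two stages, relation extraction and truth discovery, their performance can be seen as determining the coverage and accuracy of the knowledge base. In this dissertation, to improve the coverage and accuracy of knowledge bases, we study how to raise the accuracy of relation extraction and truth discovery and how to extend the coverage of relation extraction.
With advances in deep learning, many recent studies apply deep learning techniques to relation extraction. Because deep learning requires a large amount of training data, the training data is usually generated automatically through distant supervision. However, data generated through distant supervision inevitably contains many wrong labels, which lowers relation extraction accuracy. We propose a method that raises relation extraction accuracy by additionally using a small amount of human-labeled training data. To broaden the coverage of relation extraction, we study the extraction of information related to the topic entity. The topic entity is the entity mainly described in a document. In some sentences the topic entity is replaced by a pronoun or omitted, and in such cases existing models often fail to extract the information. To extract such information, we propose a relation extraction model that takes the topic entity into account.
In the knowledge fusion process, truth discovery removes conflicting parts of the extracted information and finds the correct information. Existing studies on truth discovery assumed that different values for a single object are mutually exclusive, so that only one of them is true. However, because hierarchical relationships exist among these values, they are often not mutually exclusive. We therefore need to consider the hierarchical relationships to find the truth. We propose a probabilistic model that takes the hierarchical structure into account, together with a corresponding inference algorithm. Nevertheless, when many relation extractors make similar errors, it is hard to correct the wrong information with truth discovery, which is unsupervised. To correct such errors with the help of human cognitive abilities, we use crowdsourcing. We introduce a task assignment algorithm for obtaining the largest possible accuracy gain when the crowdsourcing budget is limited.
In addition, we present an efficient filtering technique for reducing latency during task assignment.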
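Budget-constrained task assignment can be illustrated with a simple uncertainty-based heuristic. The sketch below is not the dissertation's algorithm: it assumes each extracted fact already has an estimated probability of being correct, and it spends the budget on the facts with the highest binary entropy. The fact identifiers, confidence values, and the function `select_tasks` are hypothetical.

```python
import math
from typing import Dict, List

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_tasks(confidence: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` facts whose correctness is most uncertain,
    on the assumption that a crowd answer helps most there."""
    ranked = sorted(
        confidence,
        key=lambda fact: binary_entropy(confidence[fact]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical confidence estimates produced by a truth discovery step.
CONFIDENCE = {
    "fact_a": 0.98,  # almost certainly correct
    "fact_b": 0.55,  # very uncertain
    "fact_c": 0.10,  # probably wrong
    "fact_d": 0.40,  # uncertain
}

SELECTED = select_tasks(CONFIDENCE, budget=2)
print(SELECTED)  # ['fact_b', 'fact_d']
```

A more faithful treatment would estimate the expected accuracy gain of each question rather than rank by entropy alone, and would account for worker reliability. The heuristic here only shows the shape of the selection problem.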

Language: English

URI: https://hdl.handle.net/10371/174441

http://dcollection.snu.ac.kr/common/orgView/000000164183

Files in This Item:

000000164183.pdf 6.96 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)
