Deep Learning-Based Check-worthy claim detection for Automated Fact-checking

박성민

자동화된 팩트체크 과정에서 딥러닝을 이용한 주장(claim) 판별 기법 연구 : Deep Learning-Based Check-worthy claim detection for Automated Fact-checking

Authors: 박성민

Advisor: 이준환

Issue Date: 2022

Publisher: Seoul National University Graduate School

Keywords: fact check ; fact-checking ; fact-checking automation ; deep learning ; natural language processing ; dataset ; language model ; check-worthy claims

Description: Thesis (Master's) -- Seoul National University Graduate School : College of Social Sciences, Department of Communication, 2022.2. 이준환.

Abstract: This study aims to automate the step of the fact-checking process in which check-worthy claims are identified. To this end, we constructed a dataset for classifying check-worthy claims, trained a pre-trained deep learning language model on it to build classification models that can identify such claims in practice, and evaluated those models.
The dataset consists of 118,119 sentences drawn from past utterances of Korean politicians (including websites, SNS, and minutes of the National Assembly). Each sentence carries two binary labels, "Verification Possibility" and "Verification Importance," which are the two criteria used to decide whether a claim is check-worthy. Sentences that make a factual claim were labeled true for verification possibility. Among these, sentences whose claims would be the subject of an actual fact check were labeled true for verification importance.
Using this dataset and text classification techniques, we built classification models that detect check-worthy claims for Korean fact-checking. The models were built by fine-tuning KoBERT, a deep learning language model pre-trained on a Korean corpus. A separate binary classification model was trained for each of the two labels, and performance was evaluated by accuracy, precision, recall, and F1-score. The verification possibility model performed well on all metrics. The verification importance model also performed well on most metrics, although its precision was relatively low.
This study is the first to construct a dataset of meaningful size for detecting check-worthy claims in Korean. The dataset can be used not only for fact-checking automation models but also as a dataset for Korean natural language processing more broadly. In addition, the classification models we built achieved considerably high accuracy, which suggests that the technique can be put to practical use for detecting check-worthy claims in automated fact-checking.
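The two-label scheme and the per-label evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's code: the class `LabeledSentence`, the function `binary_metrics`, the field names, and the example sentences and predictions are all hypothetical, and the rule that an important claim must also be verifiable is an assumption read from the abstract's wording. In the thesis the predictions would come from the two fine-tuned KoBERT classifiers; here they are hard-coded toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSentence:
    """One dataset row: a sentence plus the two binary labels (hypothetical schema)."""
    text: str
    verifiable: bool  # "Verification Possibility": the sentence makes a factual claim
    important: bool   # "Verification Importance": the claim merits an actual fact check

    def __post_init__(self):
        # Assumption: important claims are selected from among verifiable ones.
        if self.important and not self.verifiable:
            raise ValueError("important=True requires verifiable=True")


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive (True) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    # Undefined ratios (zero denominator) are reported as 0.0 by convention here.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy gold data, invented for illustration only.
gold = [
    LabeledSentence("Unemployment fell to 3% last year.", True, True),
    LabeledSentence("The bill passed on a Tuesday.", True, False),
    LabeledSentence("We must build a better future.", False, False),
    LabeledSentence("The budget doubled in two years.", True, True),
]

# Hard-coded stand-ins for the outputs of the two binary classifiers.
pred_verifiable = [True, True, True, True]
pred_important = [True, True, False, False]

# Each label is scored separately, mirroring one binary model per label.
verifiable_scores = binary_metrics([s.verifiable for s in gold], pred_verifiable)
importance_scores = binary_metrics([s.important for s in gold], pred_important)

print("verification possibility:", verifiable_scores)
print("verification importance:", importance_scores)
```

Scoring each label on its own makes it easy to see a pattern like the one the abstract reports, where one model is strong across the board while the other trails on a single metric such as precision.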

Language: kor

URI: https://hdl.handle.net/10371/183453

https://dcollection.snu.ac.kr/common/orgView/000000171350

Files in This Item:

000000171350.pdf 1.33 MB

Appears in Collections:

College of Social Sciences
- Dept. of Communication
  - Theses (Master's Degree, Dept. of Communication)
