Diagnosing and Improving False Positive Bias in Hate Speech Classifier

오주현

Diagnosing and Improving False Positive Bias in Hate Speech Classifier : 혐오 발언 분류 모델의 거짓 양성 편향 진단 및 개선 연구

Authors: 오주현

Advisor: 신효필

Issue Date: 2022

Publisher: Seoul National University Graduate School (서울대학교 대학원)

Keywords: Hate speech ; Hate speech dataset ; Dataset construction ; False positives ; Bias measurement ; Bias mitigation

Description: Thesis (Master's) -- Seoul National University Graduate School : Department of Linguistics, College of Humanities, 2022.2. 신효필.

Abstract: As the damage caused by hate speech in anonymous online spaces has been growing significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown great performance, but they tend to fail to generalize to out-of-domain data. I focus on the problem of False Positive detection and build adversarial test sets from three different domains to diagnose this issue. I illustrate that a BERT-based classification model trained on an existing Korean hate speech corpus exhibits False Positives due to over-sensitivity to specific words that are highly correlated with hate speech in the training datasets. Next, I present two different approaches to address the problem: a data-centric approach that adds data to correct the imbalance of the training datasets and a model-centric approach that regularizes the model using post-hoc explanations. Both methods reduce False Positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to a test set can be a cost-efficient way of greatly reducing False Positives. Using Sampling and Occlusion (Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model better utilize contextual information.
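Abstract (translated from the Korean): As the damage caused by hate speech in online and other anonymous spaces grows, research on classifying and detecting hate speech is being actively conducted. Deep learning-based hate speech classifiers have recently shown good performance, but they struggle to generalize to out-of-domain data. This study focuses on the problem of the model producing False Positives and, to diagnose it, builds test sets from adversarial data in three different domains. With these test sets, it shows that a BERT-based classification model trained on existing Korean hate speech datasets predicts False Positives because it reacts sensitively to specific words that are highly correlated with hate speech in the training data. Next, two methods are presented to address the problem: a data-centric method that adds data to correct the imbalance of the training dataset, and a model-centric method that regularizes the model using its post-hoc explanations for specific words. Both approaches are shown to reduce the rate of False Positives without harming overall model performance. In addition, when the characteristics of the test domain are known, adding samples from a similar domain to correct the imbalance of the training data is shown to greatly reduce the model's False Positives at low cost. Finally, Sampling and Occlusion (Jin et al., 2020) explanations qualitatively confirm that both approaches lead the model to make better use of contextual information.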
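
The diagnosis step described in the abstract depends on measuring False Positives separately for each test domain. The following is a minimal sketch of that measurement, not the thesis's evaluation code: the function name, the toy examples, the domain names, and the keyword-based stand-in classifier are all hypothetical. It computes the false positive rate FP / (FP + TN) over the non-hate examples of each domain.

```python
from collections import defaultdict


def false_positive_rate_by_domain(examples, predict):
    """Return {domain: FP / (FP + TN)} over gold-negative examples.

    examples: iterable of (text, domain, gold_label), where
              gold_label is 1 for hate speech and 0 otherwise.
    predict:  callable mapping a text to a predicted label (0 or 1).
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for text, domain, gold in examples:
        if gold == 0:  # only gold-negative examples can be false positives
            negatives[domain] += 1
            if predict(text) == 1:
                false_positives[domain] += 1
    return {d: false_positives[d] / n for d, n in negatives.items()}


# Hypothetical stand-in for a classifier that is over-sensitive to one word.
def keyword_classifier(text):
    return 1 if "trigger" in text else 0


# Toy data, invented for illustration only.
examples = [
    ("a neutral sentence containing the trigger word", "news", 0),
    ("another neutral sentence", "news", 0),
    ("a harmless use of trigger", "forum", 0),
    ("a hateful trigger sentence", "forum", 1),
]

rates = false_positive_rate_by_domain(examples, keyword_classifier)
print(rates)
```

In this toy setup the stand-in classifier flags every sentence containing the chosen word, so neutral sentences with that word count as False Positives, which mirrors the over-sensitivity the abstract describes.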

Language: eng

URI: https://hdl.handle.net/10371/181195

https://dcollection.snu.ac.kr/common/orgView/000000170064

Files in This Item:

000000170064.pdf 2.48 MB

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)
