Clustering Words from Biased Contexts using Dimensionality Reduction

Catherine Sullivan

Alternative title (Korean): 차원 축소를 이용한 편향적 문맥에서의 단어 클러스터링

Authors: Catherine Sullivan

Advisor: Hyopil Shin (신효필)

Issue Date: 2020

Publisher: Seoul National University Graduate School (서울대학교 대학원)

Keywords: bias ; bias neutralization ; clustering ; k-means ; dimensionality reduction ; PCA ; word embeddings

Description: Thesis (Master's) -- Seoul National University Graduate School : College of Humanities, Department of Linguistics, 2020. 8. Hyopil Shin.

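Abstract (translated from the Korean): Bias can be defined as weight given disproportionately to one side among things, people, or groups. Recently, interest has been growing in the problem of bias in machine learning and in research that seeks to mitigate such bias in natural language processing. The goal of this study is to identify the bias present in language and to examine how that bias is represented in word embeddings.
The data used in this study is the Wikipedia Neutrality Corpus (WNC), and its word embeddings were obtained with the bias-neutralizing modular model of Pryzant et al. (2019). K-means clustering was used to visualize the word embeddings before and after the addition of the v vector, which contains bias information, and Principal Component Analysis (PCA) was used to improve clustering performance.
This study confirmed that, just as word embeddings cluster according to linguistic features, biased words also cluster according to the type of bias (epistemological bias, framing bias, demographic bias, and so on). Moreover, because word embeddings come to contain diverse linguistic information when combined with the modular model's unique v vector, this line of research should help not only with the task of recognizing and removing bias but also with understanding context.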
Abstract (English): Bias can be defined as disproportionate weight in favor of or against one thing, person, or group compared with another. Recently, bias in machine learning, and how to de-bias natural language processing, has become a topic of increasing interest. This research examines bias in language, the effect of context on bias judgments, and the clustering of words judged biased or neutral that are taken from biased contexts.
The data for this study comes from the Wikipedia Neutrality Corpus (WNC), and its word-embedding representation comes from the bias-neutralizing modular model of Pryzant et al. (2019). The embeddings are visualized using K-means clustering to compare them before and after the addition of the v vector, which holds bias information. Principal Component Analysis (PCA) is also used in an attempt to boost clustering performance.
This study finds that, because the word embeddings cluster according to linguistic features, the biased words also cluster according to bias type: epistemological bias, framing bias, and demographic bias. It also presents evidence that the word embeddings, after being combined with the unique v vector from the modular model, contain discrete linguistic information that helps not only in detecting and neutralizing bias but also in recognizing context.
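
The comparison described in the abstract (K-means on embeddings before and after the v vector is added, with optional PCA) might be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's code: the embeddings, the v vector, and the per-word bias probabilities are random stand-ins, the helper `cluster_embeddings` is hypothetical, the combination rule `embedding + p * v` is assumed for illustration, and the silhouette score is an assumed quality measure that the abstract does not name. It uses scikit-learn's `KMeans`, `PCA`, and `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_embeddings(embeddings, n_clusters=3, n_components=None, seed=0):
    """Optionally reduce with PCA, then run K-means.

    Returns the cluster labels and the silhouette score computed
    in the same feature space that K-means was fitted on.
    """
    features = embeddings
    if n_components is not None:
        pca = PCA(n_components=n_components, random_state=seed)
        features = pca.fit_transform(embeddings)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(features)
    return labels, silhouette_score(features, labels)


# Synthetic stand-ins (assumptions): real inputs would be word embeddings
# and the v vector taken from the modular model.
rng = np.random.default_rng(0)
n_words, dim = 300, 64
embeddings = rng.normal(size=(n_words, dim))
v = rng.normal(size=dim)                      # stand-in for the v vector
bias_prob = rng.uniform(size=(n_words, 1))    # stand-in per-word bias probability

# Assumed combination rule: add v scaled by each word's bias probability.
embeddings_with_v = embeddings + bias_prob * v

for name, emb in [("before v", embeddings), ("after v", embeddings_with_v)]:
    for n_components in (None, 2):
        labels, score = cluster_embeddings(emb, n_components=n_components)
        space = "full space" if n_components is None else f"PCA-{n_components}"
        print(f"{name:9s} {space:10s} silhouette={score:.3f}")
```

With real embeddings, inspecting which words fall into each cluster would be the step that connects the clusters to bias types; the random data here only exercises the pipeline.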

Language: eng

URI: https://hdl.handle.net/10371/170587

http://dcollection.snu.ac.kr/common/orgView/000000162626

Files in This Item:

000000162626.pdf 2.00 MB

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)
