Clustering Words from Biased Contexts using Dimensionality Reduction

Catherine Sullivan

DC Field	Value	Language
dc.contributor.advisor	신효필	-
dc.contributor.author	Catherine Sullivan	-
dc.date.accessioned	2020-10-13T03:55:38Z	-
dc.date.available	2020-10-13T03:55:38Z	-
dc.date.issued	2020	-
dc.identifier.other	000000162626	-
dc.identifier.uri	https://hdl.handle.net/10371/170587	-
dc.identifier.uri	http://dcollection.snu.ac.kr/common/orgView/000000162626	ko_KR
dc.description	Thesis (Master's) -- Seoul National University Graduate School : College of Humanities, Department of Linguistics, 2020. 8. 신효필.	-
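dc.description.abstract	Bias can be defined as weight given disproportionately to one side of a thing, person, or group. Recently, interest has grown in the problem of bias in machine learning and in research that seeks to mitigate such bias in natural language processing. The goal of this study is to identify bias present in language and to examine how that bias is represented in word embeddings. The data used in this study is the Wikipedia Neutrality Corpus (WNC), and its word embeddings are taken from the bias-neutralizing modular model of Pryzant et al. (2019). K-means clustering was used to visualize the word embeddings before and after the addition of the v vector, which carries bias information, and Principal Component Analysis (PCA) was used to improve clustering performance. The study confirms that, just as word embeddings cluster according to linguistic features, biased words also cluster according to type of bias (epistemological bias, framing bias, demographic bias, and so on). In addition, because word embeddings combined with the modular model's unique v vector come to contain a variety of linguistic information, this line of research should help not only with the task of recognizing and removing bias but also with understanding context.	-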
dc.description.abstract	Bias can be defined as disproportionate weight in favor of or against one thing, person, or group compared with another. Recently, bias in machine learning and the de-biasing of natural language processing have been topics of increasing interest. This research examines bias in language, the effect of context on bias judgements, and the clustering of biased- and neutral-judged words taken from biased contexts. The data for this study comes from the Wikipedia Neutrality Corpus (WNC), and its representation as word embeddings is taken from the bias-neutralizing modular model by Pryzant et al. (2019). The embeddings are visualized using K-means clustering to compare them before and after the addition of the v vector, which holds bias information. Principal Component Analysis (PCA) is also used in an attempt to improve clustering performance. This study finds that, because the word embeddings cluster according to linguistic features, the biased words also cluster according to bias type: epistemological bias, framing bias, and demographic bias. It also presents evidence that the word embeddings, after being combined with the unique v vector from the modular model, contain discrete linguistic information that helps not only in the task of detecting and neutralizing bias but also in recognizing context.	-
dc.description.tableofcontents	1. Introduction 1 1.1. What is Bias? 1 1.2. De-biasing Techniques 7 1.3. Purpose and Significance of this Study 11 2. Background Information 13 2.1. Previous Research 13 2.2. Wikipedia Neutrality Corpus 18 2.3. Modular Model 20 2.4. Methodology 24 2.2.1. Clustering 25 2.2.2. Dimensionality Reduction Algorithm 28 3. Experiment 35 4. Results 43 4.1. Clustering of Entire Data Set 43 4.1.1. Most Frequently Biased-Judged Words 52 4.1.2. Cosine Similarity 58 4.2. Clustering of Small Random Sample 66 4.3. Significance of Results 69 5. Conclusion 71 References 73 Appendix 80 Abstract in Korean 83	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	bias	-
dc.subject	bias neutralization	-
dc.subject	clustering	-
dc.subject	k-means	-
dc.subject	dimensionality reduction	-
dc.subject	PCA	-
dc.subject	word embeddings	-
dc.subject	편향성	-
dc.subject	편향성 제거	-
dc.subject	클러스터링	-
dc.subject	차원축소	-
dc.subject	주성분분석	-
dc.subject	워드 임베딩	-
dc.subject.ddc	401	-
dc.title	Clustering Words from Biased Contexts using Dimensionality Reduction	-
dc.title.alternative	차원 축소를 이용한 편향적 문맥에서의 단어 클러스터링	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	서린	-
dc.contributor.department	College of Humanities, Department of Linguistics	-
dc.description.degree	Master	-
dc.date.awarded	2020-08	-
dc.identifier.uci	I804:11032-000000162626	-
dc.identifier.holdings	000000000043▲000000000048▲000000162626▲	-
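
The abstract describes reducing word embeddings with PCA and then grouping them with K-means. The following is a minimal sketch of that general pipeline, not the thesis's implementation: the toy vectors, the function names `pca` and `kmeans`, the power-iteration PCA, and the farthest-point initialization are all assumptions made for illustration. It does not model the WNC data, the modular model, or the v vector.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pca(rows, k, iters=200, seed=0):
    """Project rows onto their top-k principal directions.

    Uses power iteration with deflation (an illustrative choice; a
    library PCA would normally be used instead).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    residual = [r[:] for r in centered]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            scores = [dot(r, v) for r in residual]  # X v
            w = [sum(s * r[j] for s, r in zip(scores, residual))
                 for j in range(d)]                 # X^T (X v)
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:                        # no variance left
                v = [0.0] * d
                break
            v = [x / norm for x in w]
        components.append(v)
        # Deflate: remove the part of each row explained by v.
        residual = [[x - dot(r, v) * vj for x, vj in zip(r, v)]
                    for r in residual]
    return [[dot(r, c) for c in components] for r in centered]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0][:]]
    while len(centers) < k:
        farthest = max(points,
                       key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(farthest[:])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda i: sqdist(p, centers[i]))
                  for p in points]
        # Move each center to the mean of its members.
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centers[i] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels

# Toy 5-dimensional "embeddings" (made up for illustration only).
vectors = [
    [1.0, 0.9, 0.0, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [1.1, 1.0, 0.0, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.9, 0.0],
    [0.1, 0.0, 0.9, 1.0, 0.1],
    [0.0, 0.0, 1.0, 1.1, 0.0],
]

reduced = pca(vectors, 2)     # 5 dimensions down to 2
labels = kmeans(reduced, 2)   # cluster in the reduced space
print(labels)
```

With real embeddings, a comparison like the one the abstract describes would run the same two steps once on the embeddings alone and once on the embeddings combined with the bias vector, then compare the resulting cluster assignments.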

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)

Files in This Item:

000000162626.pdf 2.00 MB
