Cross-lingual Transfer Learning for Offensive Language Identification in Low-resource Settings

Abstract: Offensive content is pervasive on social media, and its damaging effects on users incur considerable social cost. Accordingly, automatic methods for identifying offensive language on social media have attracted growing interest from both academia and industry. However, the clear majority of recent studies deal with English, because the resources required to build such systems, such as annotated data, are unavailable in most other languages. In this context, cross-lingual transfer learning is a promising research field that aims to bridge the resource gap between languages. Generally, knowledge learned from annotated datasets and models in a resource-rich transfer language is transferred to a resource-poor target language, reducing the need for annotated data in the target language.

In this thesis, we take a cross-lingual transfer learning approach to offensive language identification in resource-poor languages, where labeled data in the target language (Danish, in our case) are either insufficient or unavailable. Concretely, we harness the abundance of labeled data in English by applying multilingual contextual word embeddings and language-adversarial training to improve task performance in Danish. Experimental results validate the effectiveness of language-adversarial training compared to other cross-lingual transfer methods under various resource conditions. We further investigate why language-adversarial training helps, using qualitative analysis and visualization to confirm that the training procedure encourages alignment between the two language spaces.
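
Abstract (Korean, translated): Offensive content such as malicious comments occurs frequently on the internet, particularly among social media users, and has emerged as a serious problem in modern society. The psychological harm that malicious posts inflict on users is enormous, and the resulting social cost is also considerable. Accordingly, techniques for automatically identifying malicious posts with derogatory intent, such as hate speech and offensive utterances, have been actively studied in both academia and industry. However, because the technical resources required to build automatic identification systems are mostly available in English, the large majority of related research has naturally been conducted only on English. In this context, cross-lingual transfer learning is an actively studied area within natural language processing that aims to resolve the resource imbalance between languages. Cross-lingual transfer learning mainly seeks to transfer knowledge learned in a resource-rich transfer language to a model for a resource-poor target language, and has the advantage of achieving high performance even when little labeled data exists in the target language.

In this thesis, we apply cross-lingual transfer learning to multilingual offensive language detection in order to improve performance in a low-resource target language. Specifically, the goal is to use labeled English data to improve performance in Danish, where labeled data are scarce or nonexistent, using recently published multilingual pre-trained word embeddings and language-adversarial training. Experiments conducted under various resource conditions verified the effectiveness of cross-lingual transfer learning and further confirmed that this effect is amplified by the language-adversarial training procedure. Additional analyses, such as visualization of the representation space, showed how language-adversarial training helps improve performance in the target language.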

Language: eng

URI: https://hdl.handle.net/10371/175190

https://dcollection.snu.ac.kr/common/orgView/000000165350

Files in This Item:

000000165350.pdf 5.65 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Master's Degree_산업공학과)
