The Construction of a Korean Pre-Trained Model and an Enhanced Application on Sentiment Analysis

Abstract: As interest in the Bidirectional Encoder Representations from Transformers (BERT) model has grown, Natural Language Processing (NLP) research based on it has become increasingly active. Such sentence-level contextualized embedding models are generally understood to capture lexical, syntactic, and semantic information in sentences during training. As a result, models such as ELMo, GPT, and BERT function as universal models that perform well across a wide range of NLP tasks.
This study proposes a monolingual BERT model trained on Korean text. The first released BERT model able to handle Korean was Google Research's multilingual BERT (M-BERT), which was built with training data and a vocabulary covering 104 languages, including Korean and English, and can process text in any of those languages with a single model. Despite the advantages of multilingualism, however, this model does not fully reflect the characteristics of each language, so its text processing performance in a given language is lower than that of a monolingual model. To mitigate these shortcomings, we built monolingual models using training data and a vocabulary organized to better capture the linguistic knowledge in Korean text.
Accordingly, we built a model named KR-BERT from training data consisting of Korean Wikipedia text and news articles, and released it on GitHub for use in processing Korean text. We also trained a KR-BERT-MEDIUM model on an expanded dataset that adds comments and legal texts to the KR-BERT training data. Each model uses as its vocabulary a list of tokens composed mainly of Hangul characters, built with the WordPiece algorithm from the corresponding training data. These models performed well on various Korean NLP tasks, including Named Entity Recognition, Question Answering, Semantic Textual Similarity, and Sentiment Analysis.
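To illustrate how a WordPiece vocabulary of this kind is applied to text, the following is a minimal sketch of the greedy longest-match-first segmentation used by BERT-style WordPiece tokenizers. It shows only how an existing vocabulary splits a word, not how the vocabulary is built, and the toy vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical, not taken from KR-BERT.

```python
# Toy vocabulary for illustration only (an assumption, not the KR-BERT vocabulary).
# Pieces that continue a word carry the "##" prefix, as in BERT's WordPiece.
TOY_VOCAB = {"한국", "##어", "모델", "감정", "##분석", "[UNK]"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a word-internal piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


if __name__ == "__main__":
    print(wordpiece_tokenize("한국어", TOY_VOCAB))    # ['한국', '##어']
    print(wordpiece_tokenize("감정분석", TOY_VOCAB))  # ['감정', '##분석']
```

A vocabulary dominated by Hangul pieces lets more Korean words resolve to meaningful subwords instead of falling back to the unknown token.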
In addition, we added sentiment features to the BERT model to specialize it for sentiment analysis. We constructed a sentiment-combined model whose features are the polarity and intensity values assigned to each token in the training data according to the Korean Sentiment Analysis Corpus (KOSAC). The features assigned to each token form polarity and intensity embeddings, which are added to the basic BERT input embeddings. Training the BERT model with these combined embeddings yields the sentiment-combined model.
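The combination of embeddings described above can be sketched as an element-wise sum per token. This is a minimal illustration under stated assumptions: the function name `embed_with_sentiment`, the lookup-table layout, and the toy values are hypothetical, and BERT's position and segment embeddings are left out for brevity.

```python
def embed_with_sentiment(token_ids, polarity_ids, intensity_ids,
                         token_table, polarity_table, intensity_table):
    """Return one input vector per token: token + polarity + intensity embedding.

    Each *_table is a list of equal-length vectors indexed by id. The label
    inventories behind polarity_ids and intensity_ids are assumed here, not
    taken from KOSAC.
    """
    if not (len(token_ids) == len(polarity_ids) == len(intensity_ids)):
        raise ValueError("each token needs exactly one polarity id and one intensity id")
    vectors = []
    for tok, pol, inten in zip(token_ids, polarity_ids, intensity_ids):
        # Element-wise sum of the three embedding vectors for this token.
        summed = [t + p + i for t, p, i in zip(token_table[tok],
                                               polarity_table[pol],
                                               intensity_table[inten])]
        vectors.append(summed)
    return vectors


if __name__ == "__main__":
    # Toy 2-dimensional tables; id 0 stands for "no sentiment annotation".
    token_table = [[0.0, 0.0], [1.0, 2.0]]
    polarity_table = [[0.0, 0.0], [0.5, 0.5]]
    intensity_table = [[0.0, 0.0], [0.25, 0.0]]
    print(embed_with_sentiment([1, 0], [1, 0], [1, 0],
                               token_table, polarity_table, intensity_table))
    # [[1.75, 2.5], [0.0, 0.0]]
```

Because the sentiment vectors are summed rather than concatenated, the input dimension stays the same as in a plain BERT model, which is consistent with keeping the model configuration unchanged.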
We trained a model named KR-BERT-KOSAC that incorporates sentiment features while keeping the same training data, vocabulary, and model configuration as KR-BERT, and distributed it through GitHub. We then analyzed the effect of the sentiment features by comparing its performance with that of KR-BERT on language modeling during training and on sentiment analysis tasks. We also measured how much the polarity and intensity features each contribute to model performance by building separate models that use only one of the two features. Using both sentiment features yielded some improvement in language modeling and sentiment analysis performance over models with other feature compositions. The sentiment analysis tasks were binary polarity classification of movie reviews and hate speech detection in offensive comments.
Training these embedding models, however, requires a great deal of time and hardware resources. This study therefore proposes a simple model fusion method that requires relatively little time. We trained a smaller sentiment-combined model, with fewer encoder layers, fewer attention heads, and a smaller hidden size, for a small number of steps, and combined it with an existing pre-trained BERT model. Because such pre-trained models are expected to handle a variety of NLP problems on the strength of good language modeling, this combination should allow two models with different advantages to interact and achieve better text processing capabilities. Experiments on sentiment analysis problems confirmed that combining the two models is efficient in training time and hardware usage, while producing more accurate predictions than single models that do not include sentiment features.
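The abstract does not specify how the two models are combined. As one possible reading, the sketch below concatenates a sentence vector from the large pre-trained model with one from the small sentiment-combined model and passes the result through a single linear classification layer with softmax. The concatenation strategy, the function name `fuse_and_classify`, and all values are assumptions for illustration, not the method confirmed by the thesis.

```python
import math


def fuse_and_classify(pretrained_vec, sentiment_vec, weights, bias):
    """Concatenate two sentence vectors and return class probabilities.

    weights holds one row per class, each as long as the concatenated vector;
    bias holds one value per class. Both are hypothetical classifier parameters.
    """
    fused = list(pretrained_vec) + list(sentiment_vec)
    if len(weights) != len(bias) or any(len(row) != len(fused) for row in weights):
        raise ValueError("classifier shape does not match the fused vector")
    # Linear layer: one logit per class.
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A 2-dim vector from the large model and a 1-dim vector from the small one.
    probs = fuse_and_classify([1.0, 0.0], [0.5],
                              weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]],
                              bias=[0.0, 0.0])
    print(probs)  # both logits equal 1.0, so the two classes get equal probability
```

Under this reading, only the small model and the classification layer need substantial training, which matches the stated goal of saving time and hardware.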
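With the recent growth of interest in the Bidirectional Encoder Representations from Transformers (BERT) model, research building on it is being actively pursued in natural language processing. Models of this kind, which produce sentence-level embeddings, are generally known to capture and model the lexical, syntactic, and semantic information within sentences during training. ELMo, GPT, BERT, and similar models therefore serve in themselves as universal models that can solve a variety of natural language processing problems.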
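This study proposes a monolingual BERT model trained on Korean data. The first publicly released BERT model able to handle Korean was Google Research's multilingual BERT (M-BERT). It was trained with training data and a vocabulary covering 104 languages, including Korean and English, and a single model can process text in every language it includes. Despite the advantages of its multilingual nature, however, it does not sufficiently reflect the characteristics of each language, and so shows lower text processing performance in each language than a monolingual model. This study set out to build a model that mitigates these shortcomings, using data and a vocabulary organized to better capture the linguistic information contained in text.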
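Accordingly, this study implemented the KR-BERT model using data composed of Korean Wikipedia text and news articles, and released it through GitHub so that it can be used for Korean information processing. The KR-BERT-MEDIUM model was then trained on text expanded by adding comment data, statutes, and court rulings to that training data. This model uses as its vocabulary a Hangul-centered token list built from the corresponding training data with the WordPiece algorithm. These models were applied to various Korean natural language processing problems, such as named entity recognition, question answering, sentence similarity judgment, and sentiment analysis, and reported strong performance.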
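This study also added sentiment features to the BERT model so that it serves as an extended model specialized for sentiment analysis. A separate embedding model was trained with sentiment features, which assign to each token in a sentence the sentiment polarity and intensity values corresponding to the Korean Sentiment Analysis Corpus (KOSAC). The features assigned to each token form a polarity embedding and an intensity embedding, which are added to the token embeddings that BERT uses by default. The model trained on the resulting embeddings is the sentiment-combined model.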
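KR-BERT-KOSAC, a model that combines sentiment features while keeping the same training data and model configuration as KR-BERT, was implemented and distributed through GitHub. Its performance on language modeling during training and on sentiment analysis tasks was then compared with that of KR-BERT to examine the effect of adding sentiment features. Separate models applying only the polarity values or only the intensity values were also built to determine how much each feature contributes to performance. When both sentiment features were added, some improvement in language modeling and sentiment analysis performance was observed relative to the other models. The sentiment analysis problems were positive/negative classification of movie reviews and classification of comments as malicious or not.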
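Pre-training embedding models of this kind, however, demands a great deal of time and resources such as hardware. This study therefore presents a simple model combination method that uses relatively little time and few resources. A sentiment-combined model with a small number of encoder layers, a small number of attention heads, and a small embedding dimension is trained for only a small number of steps and then combined with an existing embedding model that has been pre-trained at large scale. Because existing pre-trained models are expected to handle a variety of language processing problems through sufficient language modeling, this combination should allow two models with different strengths to interact and achieve better natural language processing ability. Experiments on sentiment analysis problems confirmed that combining the two models is efficient in training time while making more accurate predictions than models without sentiment features.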

Language: eng

URI: https://hdl.handle.net/10371/175838

https://dcollection.snu.ac.kr/common/orgView/000000165135

Files in This Item:

000000165135.pdf 5.48 MB

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Ph.D. / Sc.D._언어학과)
