English to Korean Multilingual Transfer Learning with Sentence-BERT

Abstract: This study constructs a Korean Sentence-BERT model with a novel method, student-teacher knowledge distillation. The limitations of BERT have been well explored in previous publications. BERT has proven ineffective at deriving sentence-level embeddings and impractical in situations that require large numbers of sentence-level embeddings, such as document classification and clustering. Sentence-BERT was developed to alleviate these issues and to derive sentence embeddings efficiently and accurately.
This study explores a transfer learning method for Sentence-BERT that allows models for low-resource languages such as Korean to leverage the power of models trained on high-resource languages such as English. Using translated sentence pairs in the source and target languages, the student model learns, through a simple mean squared error loss, to map each translated sentence to the same point in the vector space as the teacher model. In this experiment, an English model was used as the teacher model and a cross-lingual model was used as the student model. To the author's knowledge, no Korean Sentence-BERT model had been trained using this method as of the publication date of this paper.
This knowledge distillation for Sentence-BERT requires a large number of translated sentence pairs in the source and target languages. Datasets available on the web were collected and augmented with crawled web data, which was then aligned using a novel method and pre-processed for cleaning. This research evaluates the model trained on this data with the knowledge distillation method on sentence-level tasks and multilingual tasks. The model performs well on all tasks, demonstrating its wide applicability and cross-lingual ability.
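The distillation objective described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the linear "student", the one-hot input features, the fixed teacher vectors, and all function names are hypothetical stand-ins for the transformer encoders used in practice. The sketch also trains the student on the source sentence as well as its translation, which is a common formulation of this objective and an assumption here, since the abstract only mentions the translated sentence.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def student_embed(weights, features):
    """Toy student: a linear map from input features to the embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def distillation_loss(weights, pairs):
    """Average of MSE(student(src), teacher) + MSE(student(tgt), teacher)."""
    total = 0.0
    for src_feat, tgt_feat, teacher_vec in pairs:
        total += mse(student_embed(weights, src_feat), teacher_vec)
        total += mse(student_embed(weights, tgt_feat), teacher_vec)
    return total / len(pairs)

def distill_step(weights, src_feat, tgt_feat, teacher_vec, lr):
    """One gradient-descent step on the two-term MSE loss for a single pair."""
    dim = len(teacher_vec)
    grads = [[0.0] * len(row) for row in weights]
    # Accumulate gradients for both sentences before touching the weights.
    for feat in (src_feat, tgt_feat):
        pred = student_embed(weights, feat)
        for i in range(dim):
            g = 2.0 * (pred[i] - teacher_vec[i]) / dim
            for j, f in enumerate(feat):
                grads[i][j] += g * f
    for i, row in enumerate(weights):
        for j in range(len(row)):
            row[j] -= lr * grads[i][j]

def train(pairs, in_dim, out_dim, epochs=200, lr=0.1, seed=0):
    """Train the toy student; epochs=0 returns the untrained weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    for _ in range(epochs):
        for src_feat, tgt_feat, teacher_vec in pairs:
            distill_step(weights, src_feat, tgt_feat, teacher_vec, lr)
    return weights

# Hypothetical data: (source features, translation features, teacher embedding).
# The teacher embedding is fixed; only the student is updated.
TOY_PAIRS = [
    ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, -0.2]),
    ([0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [-0.3, 0.8]),
]

untrained = train(TOY_PAIRS, in_dim=4, out_dim=2, epochs=0)
trained = train(TOY_PAIRS, in_dim=4, out_dim=2)
print("loss before:", distillation_loss(untrained, TOY_PAIRS))
print("loss after: ", distillation_loss(trained, TOY_PAIRS))
```

Because both sentences in a pair are pulled toward the same teacher vector, the student ends up placing a sentence and its translation at nearly the same point, which is the property that gives the student its cross-lingual behavior.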

Language: eng

URI: https://hdl.handle.net/10371/181300

https://dcollection.snu.ac.kr/common/orgView/000000169722


Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)
