A Small-Scale Korean-Specific BERT Language Model

이상아; 장한솔; 백연미; 박수지; 신효필

doi:10.5626/JOK.2020.47.7.682


Title: 소규모 데이터 기반 한국어 버트 모델 (A Small-Scale Korean-Specific BERT Language Model)


Authors: 이상아; 장한솔; 백연미; 박수지; 신효필

Issue Date: 2020-07

Publisher: Korean Institute of Information Scientists and Engineers (한국정보과학회)

Citation: Journal of KIISE (정보과학회논문지), Vol. 47, No. 7, pp. 682-692

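Abstract: Recent sentence-level embedding models in natural language processing rely on huge corpora and parameter counts, so they demand large hardware and large datasets and take a long time to train. This raises the need for a model that, even at a modest scale, uses its training data economically while reaching comparable performance. This study builds a syllable-level (character-level) Korean vocabulary and a jamo-level (sub-character-level) Korean vocabulary, and introduces sub-character-level training together with a bidirectional WordPiece tokenizer. As a result, we were able to build the KR-BERT model, which uses training data one-tenth the size of that used by existing models and an appropriately sized vocabulary, and thereby achieves similar performance with fewer parameters and less computation. This confirms that a model for a language such as Korean, which has its own writing system, is morphologically complex, and has few resources, must reflect the linguistic phenomena specific to that language.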

Recent sentence-embedding models use huge corpora and large numbers of parameters. They require massive data and large hardware, and pre-training them takes extensive time. This tendency raises the need for a model that achieves comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses Korean dictionaries ranging from the sub-character level to the character level, together with a BidirectionalWordPiece Tokenizer. Our KR-BERT model performs comparably to, and in some cases better than, existing pre-trained models while using one-tenth of their training data. This demonstrates that, for a morphologically complex and low-resource language, the sub-character level and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the Multilingual BERT model misses.
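The distinction between character-level (syllable) and sub-character-level (jamo) units can be illustrated with a minimal sketch. It assumes that Unicode canonical decomposition (NFD) is an acceptable way to split precomposed Hangul syllables into conjoining jamo; this record does not state how the paper itself performs the split, and the function names below are hypothetical.

```python
import unicodedata


def to_sub_characters(text: str) -> str:
    """Split precomposed Hangul syllables into conjoining jamo (Unicode NFD).

    Assumption: NFD stands in for the sub-character preprocessing; the
    paper's actual procedure is not described in this record.
    """
    return unicodedata.normalize("NFD", text)


def to_characters(text: str) -> str:
    """Recompose conjoining jamo into syllable blocks (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)


word = "한국어"                      # three syllable-level characters
jamo = to_sub_characters(word)       # eight sub-character units

print(len(word), len(jamo))          # 3 8
print([f"U+{ord(ch):04X}" for ch in jamo])

# The decomposition is lossless: recomposing restores the original text.
assert to_characters(jamo) == word
```

A sub-character vocabulary built on such units needs far fewer base symbols than a syllable vocabulary, at the cost of longer input sequences.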

ISSN: 2383-630X

URI: https://hdl.handle.net/10371/190981

DOI: https://doi.org/10.5626/JOK.2020.47.7.682


Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Journal Papers (저널논문_언어학과)
