Linguistically Explicit BERT with Part-of-Speech Information

백연미

Linguistically Explicit BERT with Part-of-Speech Information : A Linguistic BERT Model Incorporating Part-of-Speech Embedding Information

DC Field	Value	Language
dc.contributor.advisor	신효필	-
dc.contributor.author	백연미	-
dc.date.accessioned	2021-11-30T04:36:41Z	-
dc.date.available	2021-11-30T04:36:41Z	-
dc.date.issued	2021-02	-
dc.identifier.other	000000164745	-
dc.identifier.uri	https://hdl.handle.net/10371/175835	-
dc.identifier.uri	https://dcollection.snu.ac.kr/common/orgView/000000164745	ko_KR
dc.description	Thesis (Master's) -- Seoul National University Graduate School : College of Humanities, Department of Linguistics, 2021. 2. 신효필.	-
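dc.description.abstract	This study combines part-of-speech (POS), a kind of linguistic information, with the BERT model to improve the model's performance, and analyzes the result linguistically. BERT is a powerful model in its own right, but ongoing research suggests that its performance can improve further when linguistic information is explicitly injected into the model. In addition, although research analyzing what linguistic knowledge language models have learned is active, probing-classifier studies that interpret the linguistic representations of pre-trained models remain scarce for Korean. For the experiments, POS embedding information was added to the input embedding of the original BERT model in several ways during pre-training: (1) adding the POS embedding (addPOS), (2) multiplying and then adding the POS embedding (multiaddPOS), and (3) masking the POS embedding (maskPOS). Korean Wikipedia and news articles were used as the pre-training corpus. POS tags were assigned with the MeCab morphological analyzer, whose output also served as the unit by which the model tokenizes the corpus. The pre-trained models were then evaluated on five Korean downstream tasks (NSMC, NER, KorQuAD, KorNLI, KorSTS). The models that explicitly incorporate POS, and the maskPOS model in particular, outperformed the model given no POS information, but scored lower than state-of-the-art models. A linguistic analysis was then carried out on the models trained with POS embedding information. To examine the syntactic information the model had learned, the structural probe proposed by Hewitt and Manning (2019) was applied to a Korean dataset. The results confirmed that the model given explicit linguistic information through POS embeddings had learned Korean syntactic information. Additional experiments aimed at further raising the performance of the POS models led to the conclusion that there is room for improvement. This study presents a new method for explicitly combining linguistic information with a pre-trained BERT model for Korean. It is also the first to apply a probe, a method for interpreting a model's linguistic representations, to a Korean model. Finally, by combining deep learning techniques from computer engineering with linguistic theory, it suggests a direction for future Korean natural language processing.	-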
dc.description.abstract	This study incorporates part-of-speech (POS) information, one of the best-known linguistic features, into the input embedding of the BERT model to strengthen the language model, and investigates what linguistic knowledge the model learns from pre-training. Although BERT performs strongly on many downstream Natural Language Processing tasks, many studies have reported that injecting explicit linguistic knowledge improves its performance. Several studies have also inspected the linguistic representations encoded in BERT using probing classifiers. Probing, however, has not yet been conducted on Korean datasets. In this study, we fuse a POS embedding into the input embedding of the BERT model during pre-training by (1) adding the POS embedding to the BERT embedding (addPOS), (2) multiplying and then adding it to the input embedding (multiaddPOS), and (3) masking the POS of the masked token while adding it to the input representation (maskPOS). We use Korean Wikipedia and news data as the corpus, and MeCab as both the POS tagger and the tokenizer. In fine-tuning, we evaluate on five Korean downstream tasks (NSMC, NER, KorQuAD, KorNLI, KorSTS). The proposed POS models, especially the maskPOS model, outperform the base MeCab-tokenized model, which does not fuse POS information. Compared with state-of-the-art models, however, the POS models perform worse on these tasks. We then conduct a linguistic analysis of the maskPOS model. To identify the syntactic information encoded in the model, the structural probe (Hewitt and Manning, 2019) is applied to Korean datasets. The probe results show that the proposed POS model embeds syntax trees, encoding linguistic knowledge in its word representations. Further experiments are conducted to improve the POS models on the downstream tasks, and we conclude that there is room for improvement.
This study suggests new methods for fusing linguistic information into a Korean pre-trained BERT model and, to the best of our knowledge, is the first to apply a probe to Korean datasets with a Korean-specific model. It integrates deep learning architectures with linguistic theory, suggesting directions for future Korean NLP research.	-
dc.description.tableofcontents	1. Introduction; 2. Literature Review; 2.1. Embeddings; 2.2. Models with Linguistic Information; 2.3. Interpretation of Linguistic Knowledge of a Model; 3. Transformer Architectures; 3.1. Transformer; 3.2. Bidirectional Encoder Representations from Transformer (BERT); 4. Part-of-Speech Models; 4.1. Model Structure (Input Representation); 4.1.1. addPOS; 4.1.2. multiaddPOS; 4.1.3. maskPOS; 5. Experiments; 5.1. Pre-training; 5.1.1. Data; 5.1.2. Tokenizer; 5.1.3. Vocabulary; 5.1.4. Part-of-Speech Tag Vocabulary; 5.1.5. Training Details; 5.2. Pre-training Results; 5.3. Downstream Tasks; 5.3.1. Tasks; 5.3.2. Evaluation Metrics; 5.4. Downstream Task Results; 5.5. Analysis; 5.5.1. Correlation Heatmap; 5.5.2. Limitations; 6. Linguistic Analysis; 6.1. Syntactic Probing Analysis; 6.1.1. The Structural Probe; 6.1.2. Experiment Details; 6.1.3. Probe Evaluation Metrics; 6.1.4. Probe Results; 6.2. Further Analysis; 6.2.1. POS Tag Combination; 6.2.2. Vocabulary Size; 6.2.3. POS Tagging; 7. Conclusion; References; Appendix; Abstract in Korean	-
dc.format.extent	vi, 60	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	Natural Language Processing	-
dc.subject	Language Modeling	-
dc.subject	BERT	-
dc.subject	Word Embeddings	-
dc.subject	Part-of-Speech	-
dc.subject	Interpretability	-
dc.subject	Probe	-
dc.subject	Parse Tree	-
dc.subject	Language Model	-
dc.subject	Embeddings	-
dc.subject	Model Interpretation	-
dc.subject.ddc	401	-
dc.title	Linguistically Explicit BERT with Part-of-Speech Information	-
dc.title.alternative	A Linguistic BERT Model Incorporating Part-of-Speech Embedding Information	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	Baik, Yunmee	-
dc.contributor.department	College of Humanities, Department of Linguistics	-
dc.description.degree	Master	-
dc.date.awarded	2021-02	-
dc.identifier.uci	I804:11032-000000164745	-
dc.identifier.holdings	000000000044▲000000000050▲000000164745▲	-
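
The three input-embedding variants named in the abstract (addPOS, multiaddPOS, maskPOS) can be illustrated with a small sketch. This is only one possible reading of the abstract, not the thesis implementation: the record does not give the exact formulas, so the element-wise form `x + x * p` for multiaddPOS, the special "masked" POS id, the table sizes, and all function and variable names below are assumptions made for illustration. Position and segment embeddings, which BERT also adds, are left out, and plain Python lists are used to keep the example dependency-free.

```python
import random

# Hypothetical sizes and ids, chosen only for illustration (not from the thesis).
VOCAB_SIZE, POS_VOCAB_SIZE, HIDDEN = 10, 5, 4
MASK_TOKEN_ID = 0  # assumed id of the [MASK] token
MASK_POS_ID = 0    # assumed id of a special "masked" POS tag

rng = random.Random(0)


def make_table(rows, cols):
    """Random embedding table: one vector of length `cols` per id."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


TOKEN_EMB = make_table(VOCAB_SIZE, HIDDEN)
POS_EMB = make_table(POS_VOCAB_SIZE, HIDDEN)


def add_pos(token_ids, pos_ids):
    """addPOS: token embedding plus POS embedding, element-wise."""
    return [[t + p for t, p in zip(TOKEN_EMB[tok], POS_EMB[pos])]
            for tok, pos in zip(token_ids, pos_ids)]


def multiadd_pos(token_ids, pos_ids):
    """multiaddPOS, read here as x + x * p (an assumed interpretation)."""
    return [[t + t * p for t, p in zip(TOKEN_EMB[tok], POS_EMB[pos])]
            for tok, pos in zip(token_ids, pos_ids)]


def mask_pos(token_ids, pos_ids):
    """maskPOS: hide the POS tag wherever the token is masked, then add."""
    hidden_pos = [MASK_POS_ID if tok == MASK_TOKEN_ID else pos
                  for tok, pos in zip(token_ids, pos_ids)]
    return add_pos(token_ids, hidden_pos)


token_ids = [3, MASK_TOKEN_ID, 7]  # the second token is masked
pos_ids = [1, 2, 4]                # tagger output, including the masked slot
masked_input = mask_pos(token_ids, pos_ids)
```

In this sketch, maskPOS differs from addPOS only at masked positions, where the tagger's POS id is replaced before the lookup so that the tag cannot reveal the hidden token.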

Appears in Collections:

College of Humanities
- Linguistics
  - Theses (Master's Degree, Linguistics)

Files in This Item:

000000164745.pdf 6.53 MB
