Parameter-Efficient Knowledge Distillation on Transformer

Hyojin Jeon (전효진)

Parameter-Efficient Knowledge Distillation on Transformer : 파라미터 효율적인 트랜스포머 지식 증류

DC Field	Value	Language
dc.contributor.advisor	강유	-
dc.contributor.author	전효진	-
dc.date.accessioned	2023-06-29T02:00:03Z	-
dc.date.available	2023-06-29T02:00:03Z	-
dc.date.issued	2023	-
dc.identifier.other	000000176565	-
dc.identifier.uri	https://hdl.handle.net/10371/193347	-
dc.identifier.uri	https://dcollection.snu.ac.kr/common/orgView/000000176565	ko_KR
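dc.description	Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: 강유.	-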
dc.description.abstract	How can we obtain a small, computationally efficient Transformer model while maintaining the performance of a large model? Transformers have shown strong performance in recent years. However, their large model size, expensive computation cost, and long inference time prevent them from being deployed on resource-restricted devices. Existing Transformer compression methods have mainly focused on reducing only the encoder, although the decoder accounts for most of the long inference time. In this paper, we propose PET (Parameter-Efficient Knowledge Distillation on Transformer), an efficient Transformer compression method that reduces the size of both the encoder and the decoder. PET improves knowledge distillation for the Transformer by designing an efficient compressed structure for both the encoder and the decoder and by enhancing the performance of the small model through an efficient pre-training task. Experiments show that PET achieves memory and time efficiency gains of 81.20% and 45.20%, respectively, while keeping the accuracy drop below 1%p. It outperforms its competitors on various machine translation datasets.	-
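dc.description.abstract	(Translated from the Korean abstract.) How can we obtain a small, computationally efficient Transformer model while maintaining the performance of a large model? Transformer models have shown outstanding results over the past few years across diverse fields such as natural language processing and computer vision. The main trend in recent research is to improve performance by increasing model size, but growing models without bound is undesirable in practice. For a technique to be applied in real services, it must account not only for high performance but also for memory efficiency, fast inference speed, and energy consumption, and large models mostly struggle to satisfy these requirements. An effective Transformer compression technique is therefore needed to obtain a small, fast model that maintains the performance of a large one. Most existing Transformer compression studies target Transformer encoder-based models, with BERT compression as the representative example. Applying existing encoder compression techniques to models that contain both an encoder and a decoder, such as machine translation models, resulted in a large loss of accuracy. A decoder is larger than an encoder with the same embedding size and number of layers and is the main cause of long inference time, so decoder compression is essential for building practical Transformer models. This thesis proposes PET (Parameter-Efficient Knowledge Distillation on Transformer) to compress both the encoder and the decoder of the Transformer. The proposed method improves the performance of Transformer knowledge distillation through effective model structure design and an improved initialization technique. It also proposes an additional optimization technique that further raises the accuracy of the compressed model. Experiments confirm that the proposed method outperforms competing models on various machine translation datasets. On a German-English translation dataset, it is 45.2% faster with 18.30% of the original model's parameters (9.51% excluding the embedding layer) while keeping the accuracy drop within 1%p.	-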
dc.description.tableofcontents	I. Introduction; II. Background and Related Works; 2.1 Transformers; 2.1.1 The Architecture of Transformer; 2.1.2 The Output Structure of Transformer; 2.1.3 Multi-head Attention; 2.2 Knowledge Distillation on Transformers; 2.2.1 Knowledge Distillation on Transformer Encoders; 2.2.2 Knowledge Distillation on Transformer Decoders; 2.2.3 Knowledge Distillation on Transformer Encoders and Decoders; III. Proposed Method; 3.1 Finding Replaceable Pairs in Encoder and Decoder; 3.2 Warmup with Simplified Task; 3.2.1 Simplified Task by Reducing the Number of Target Classes; 3.2.2 Modeling the Prediction Probabilities to Simplified Task Labels; 3.3 Layer-wise Attention Head Sampling; IV. Experiments; 4.1 Experimental Settings; 4.1.1 Dataset; 4.1.2 Competitors; 4.1.3 Evaluation Metric; 4.2 Translation Accuracy of PET; 4.3 Translation Speed of PET; 4.4 Effectiveness of Replaceable Pair; 4.5 Effectiveness of Simplified Task; 4.6 Effectiveness of Layer-wise Attention Head Sampling; 4.7 Sensitivity Analysis; V. Conclusion; References; Abstract in Korean	-
dc.format.extent	vi, 34	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	Model compression	-
dc.subject	Transformer	-
dc.subject	Knowledge Distillation	-
dc.subject.ddc	621.39	-
dc.title	Parameter-Efficient Knowledge Distillation on Transformer	-
dc.title.alternative	파라미터 효율적인 트랜스포머 지식 증류	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	Hyojin Jeon	-
dc.contributor.department	College of Engineering, Department of Computer Science and Engineering	-
dc.description.degree	Master's	-
dc.date.awarded	2023-02	-
dc.identifier.uci	I804:11032-000000176565	-
dc.identifier.holdings	000000000049▲000000000056▲000000176565▲	-
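
The abstract describes PET as a knowledge distillation method, in which a small student model is trained to match a large teacher. As a rough illustration of that general idea only, the following Python sketch computes a generic soft-target distillation loss for a single token prediction. It is not taken from the thesis and does not implement PET's replaceable pairs, simplified-task warmup, or layer-wise attention head sampling. The function names and the `temperature` and `alpha` parameters are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss (illustrative assumption,
    not the loss used in the thesis).

    alpha weights the soft term (teacher vs. student distributions);
    1 - alpha weights the hard term (ground-truth cross-entropy).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the student against the ground-truth class.
    hard = -math.log(softmax(student_logits)[target_index])
    # The T^2 factor keeps the soft-term gradient scale comparable
    # across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(distillation_loss(student, teacher, target_index=0))
```

When the student's logits equal the teacher's, the soft term is zero and only the ground-truth cross-entropy remains. A sequence-level loss would typically average this quantity over all target tokens.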

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)

Files in This Item:

000000176565.pdf 2.42 MB
