SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression

박태임

SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression : SensiMix: BERT 압축을 위한 민감도 인식 8비트 색인과 1비트 값 혼합 정밀도 양자화

Authors: 박태임

Advisor: 강유

Issue Date: 2021

Publisher: Seoul National University Graduate School

Keywords: Deep learning ; BERT ; Model compression ; Mixed precision quantization

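Description: Thesis (Master's) -- Seoul National University Graduate School : College of Engineering, Department of Computer Science and Engineering, 2021.8. 강유.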

Abstract: Given a pre-trained BERT, how can we compress it into a fast and lightweight model while maintaining its accuracy? Pre-trained language models such as BERT are effective at improving performance on natural language processing (NLP) tasks. However, heavy models like BERT suffer from high memory cost and long inference time.
In this paper, we propose SensiMix (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that takes into account the sensitivity of BERT's different modules. SensiMix applies 8-bit index quantization to the sensitive parts of BERT and 1-bit value quantization to the insensitive parts, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization (ABWR), Prioritized Training (PT), and Inverse Layer-wise Fine-tuning (ILF). Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on four GLUE downstream tasks show that SensiMix compresses the original BERT model into an equally effective but lightweight one, reducing the model size by a factor of 8 and the inference time by around 80% without a noticeable accuracy drop.
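
Abstract (translated from the Korean): Given a pre-trained BERT model, how can we compress the heavy model to obtain a fast and lightweight one? Pre-trained language models such as BERT are effective at improving natural language processing tasks. However, these models have the drawbacks of high memory usage and long inference time.

In this thesis, we propose SensiMix (sensitivity-aware mixed precision quantization), a new quantization-based BERT compression method. SensiMix efficiently combines 8-bit index quantization and 1-bit value quantization, taking into account the sensitivity of BERT's modules. Furthermore, to improve the accuracy of the compressed model, we propose three training methods for 1-bit quantization: ABWR, PT, and ILF. In addition, for fast inference, we replace FP32 operations by introducing FP16 general matrix multiplication and XNOR-COUNT general matrix multiplication for the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on GLUE tasks, the most widely used benchmark for evaluating NLP models, show that SensiMix compresses the model size by a factor of 8 and shortens inference time by a factor of about 5 or more while maintaining the accuracy of the original BERT-base model.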
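
Note on XNOR-Count GEMM: the abstract names this operation but does not describe its implementation. The following is a minimal Python sketch of the general XNOR-popcount idea only, not the thesis's kernel. It assumes both operands are already binarized to +1/-1 (the abstract does not say how activations are handled), and the names `pack_signs`, `xnor_popcount_dot`, and `binary_matvec` are hypothetical helpers introduced for illustration.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int bitmask (bit i set means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given as packed bitmasks.

    XNOR marks the positions where the signs agree. Each agreement
    contributes +1 and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1                 # keep only the n valid bit positions
    agree = ~(a_bits ^ b_bits) & mask   # XNOR restricted to n bits
    matches = bin(agree).count("1")     # population count
    return 2 * matches - n


def binary_matvec(weight_rows, x):
    """Multiply a +1/-1 matrix (list of rows) by a +1/-1 vector via XNOR-popcount."""
    n = len(x)
    x_bits = pack_signs(x)
    return [xnor_popcount_dot(pack_signs(row), x_bits, n) for row in weight_rows]
```

For example, `binary_matvec([[1, 1, -1], [-1, -1, -1]], [1, -1, -1])` gives the same result as the ordinary integer matrix-vector product. In a real kernel the bits would be packed into machine words and counted with a hardware popcount instruction, which is where the speedup over floating-point multiplication comes from.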

Language: eng

URI: https://hdl.handle.net/10371/178953

https://dcollection.snu.ac.kr/common/orgView/000000166739

Files in This Item:

000000166739.pdf 6.48 MB

Appears in Collections:

College of Engineering (공과대학)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
