Detailed Information

SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

DC Field                          Value
dc.contributor.author             Piao, Tairen
dc.contributor.author             Cho, Ikhyun
dc.contributor.author             Kang, U
dc.date.accessioned               2022-06-23T03:59:43Z
dc.date.available                 2022-06-23T03:59:43Z
dc.date.created                   2022-05-09
dc.date.issued                    2022-04
dc.identifier.citation            PLoS ONE, Vol.17 No.4, p. e0265621
dc.identifier.issn                1932-6203
dc.identifier.uri                 https://hdl.handle.net/10371/183704
dc.description.abstract           (see Abstract below)
dc.language                       English
dc.publisher                      Public Library of Science
dc.title                          SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
dc.type                           Article
dc.identifier.doi                 10.1371/journal.pone.0265621
dc.citation.journaltitle          PLoS ONE
dc.identifier.wosid               000791258900033
dc.identifier.scopusid            2-s2.0-85128527851
dc.citation.number                4
dc.citation.startpage             e0265621
dc.citation.volume                17
dc.description.isOpenAccess       N
dc.contributor.affiliatedAuthor   Kang, U
dc.type.docType                   Article
dc.description.journalClass       1

Abstract

© 2022 Piao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Given a pre-trained BERT, how can we compress it to a fast and lightweight one while maintaining its accuracy? Pre-trained language models such as BERT are effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT suffer from large memory cost and long inference time. In this paper, we propose SENSIMIX (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. SENSIMIX effectively applies 8-bit index quantization and 1-bit value quantization to the sensitive and insensitive parts of BERT, respectively, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on four GLUE downstream tasks show that SENSIMIX compresses the original BERT model into an equally effective but lightweight one, reducing the model size by a factor of 8× and shrinking the inference time by around 80% without a noticeable accuracy drop.
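The abstract above names two quantization modes: 8-bit index quantization for sensitive modules and 1-bit value quantization with XNOR-Count GEMM for insensitive ones. The following is a minimal NumPy sketch of those two ideas in isolation, not the authors' implementation: the uniform codebook, the single mean-absolute-value scaling factor, and all helper names are assumptions made purely for illustration, and the 1-bit part shows only the XNOR-Count style dot product that such a GEMM is built from.

```python
# Illustrative sketch only (not the paper's code); helper names and the
# exact codebook / scaling schemes are assumptions for illustration.
import numpy as np

def index_quantize_8bit(w, n_codes=256):
    """8-bit index quantization: replace each weight by an 8-bit index into a
    small shared codebook (here a uniform grid; a learned codebook also works)."""
    codebook = np.linspace(w.min(), w.max(), n_codes).astype(np.float32)
    idx = np.abs(w[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, codebook                      # store uint8 indices + 256 floats

def binarize_1bit(w):
    """1-bit value quantization: keep only the sign plus one scaling factor."""
    alpha = np.abs(w).mean()
    return np.sign(np.where(w == 0, 1.0, w)), alpha

def xnor_count_dot(a_sign, b_sign):
    """Dot product of two {-1, +1} vectors via XNOR + bit counting:
    dot = 2 * (#matching signs) - n."""
    matches = np.sum(a_sign == b_sign)
    return 2 * matches - a_sign.size

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=64).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

# "Sensitive" part: 8-bit index quantization, dequantized for an FP16 GEMM.
idx, codebook = index_quantize_8bit(w)
w_8bit = codebook[idx].astype(np.float16)

# "Insensitive" part: 1-bit quantization with an XNOR-Count style dot product.
w_sign, aw = binarize_1bit(w)
x_sign, ax = binarize_1bit(x)
approx_1bit = aw * ax * xnor_count_dot(x_sign, w_sign)

print("exact:      ", float(x @ w))
print("8-bit index:", float(x @ w_8bit))
print("1-bit xnor: ", float(approx_1bit))
```

Running the sketch shows the trade-off the abstract describes: the 8-bit index reconstruction tracks the exact result closely, while the 1-bit approximation is coarser but needs only signs and bit counts at inference time.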
Files in This Item:
There are no files associated with this item.


Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
