Quantization Algorithm and Methodology for Efficient Deep Neural Network

Park, Eunhyeok

Quantization Algorithm and Methodology for Efficient Deep Neural Network : 효율적인 심층 신경망을 위한 양자화 알고리즘 및 방법론

DC Field	Value	Language
dc.contributor.advisor	유승주	-
dc.contributor.author	Park, Eunhyeok	-
dc.date.accessioned	2020-05-19T08:03:08Z	-
dc.date.available	2020-05-19T08:03:08Z	-
dc.date.issued	2020	-
dc.identifier.other	000000160741	-
dc.identifier.uri	https://hdl.handle.net/10371/167999	-
dc.identifier.uri	http://dcollection.snu.ac.kr/common/orgView/000000160741	ko_KR
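dc.description	Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. 유승주.	-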
dc.description.abstract	Deep neural networks (DNNs) are becoming increasingly popular and are widely adopted for various applications. The energy efficiency of neural networks is critically important for both edge devices and servers. It is imperative to optimize neural networks in terms of both speed and energy consumption while maintaining the accuracy of the network. Quantization is one of the most effective optimization techniques. By reducing the bit-width of activations and weights, both speed and energy consumption can be improved, because more computations are executed using the same amount of memory access and computational resources (e.g. silicon chip area and battery). It is expected that computations with 4-bit and lower precision will contribute to the energy-efficient and real-time characteristics of future deep learning applications. One major drawback of quantization is the drop in accuracy that results from the reduced degrees of freedom of the data representation. Recently, several studies have demonstrated that DNN inference can be done accurately using 8-bit precision. However, many studies show that networks quantized to 4-bit or lower precision suffer from significant quality degradation. In particular, state-of-the-art networks cannot be quantized easily because of their optimized structure. In this dissertation, several methods are proposed that use different approaches to minimize the reduction in accuracy of quantized DNNs. Weighted-entropy-based quantization is designed to fully utilize the limited number of quantization levels by maximizing the weighted information of the quantized data. This work shows the potential of multi-bit quantization for both activations and weights. Value-aware quantization, or outlier-aware quantization, is designed to support sub-4-bit quantization while keeping a small fraction (1-3%) of large values in high precision. This helps the quantized data preserve the statistics of the full-precision data, e.g. mean and variance, thus minimizing the accuracy drop after quantization. A dedicated hardware accelerator, called OLAccel, is also proposed to maximize the performance of networks quantized by outlier-aware quantization. The hardware takes advantage of reduced precision, i.e. 4-bit, with the minimal accuracy drop afforded by the proposed quantization algorithm. Precision highway is a structural concept that forms an end-to-end high-precision information flow while performing ultra-low-precision computations. This minimizes the accumulated quantization error, which helps to improve the accuracy of the network even at extremely low precision. BLast, a training methodology, and differentiable and unified quantization (DuQ), a novel quantization algorithm, are designed to support sub-4-bit quantization for optimized mobile networks, e.g. MobileNet-v3. These methods allow the MobileNet-v3 network to be quantized to 4-bit for both activations and weights with negligible accuracy loss.	-
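dc.description.abstract	Deep neural networks (DNNs) are steadily broadening their range of use and are being applied in a variety of fields. Neural networks are widely used not only on servers but also on embedded devices, which makes improving their efficiency increasingly important. Optimizing neural networks to run faster and consume less energy while maintaining accuracy has now become essential. Quantization is one of the most effective optimization techniques. By reducing the number of bits needed to store neuron activations and trained weights, more computations can be performed with the same amount of data access and computation cost (e.g. chip area and energy consumption), so that speed and energy consumption can be optimized at the same time. Quantized computation at 4-bit or lower precision is expected to contribute greatly to meeting the energy efficiency and computation speed that future deep learning applications are predicted to require. However, one of the most important drawbacks of quantization is the loss of accuracy that arises because restricting the data representation reduces its degrees of freedom. Various studies are under way to address this drawback. Some recent studies report almost no accuracy loss when inference is performed with neural networks at 8-bit precision. On the other hand, various other studies report that the accuracy of many networks is severely degraded when quantization is applied at 4-bit or lower precision. This phenomenon is aggravated in recently proposed networks, because the optimized structures introduced to improve their performance are hard to quantize. This dissertation proposes various methods for minimizing the accuracy loss of quantized DNNs. Weighted-entropy-based quantization is designed to quantize in the direction that maximizes the information content of the quantized data, so as to make the most of the limited number of quantization levels. This work showed that quantization of both activations and weights is applicable even to very deep networks. Value-aware quantization, or outlier-aware quantization, is an algorithm designed to store infrequent but large-valued data in high precision while applying quantization of 4 bits or fewer to the remaining data. This helps characteristics of the original data, such as mean and variance, to be preserved after quantization, which contributes to maintaining the accuracy of the quantized network. In addition, a specialized accelerator named OLAccel is proposed. By accelerating networks quantized with the value-aware quantization algorithm, this accelerator maximizes the performance gain of low precision while minimizing the accuracy loss. The precision highway structure modifies the network structure to create a high-precision information path while performing ultra-low-precision computations. This mitigates the accumulation of errors caused by quantization and contributes to improving accuracy at very low precision. BLast, a training technique, and the differentiable and unified quantization algorithm (DuQ) are proposed for optimized mobile-oriented networks such as MobileNet-v3. With these methods, both the activations and the weights of MobileNet-v3 were successfully quantized to 4-bit precision with only negligible accuracy loss.	-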
dc.description.tableofcontents	Chapter 1. Introduction; Chapter 2. Background and Related Work; Chapter 3. Weighted-entropy-based Quantization; 3.1 Introduction; 3.2 Motivation; 3.3 Quantization based on Weighted Entropy; 3.3.1 Weight Quantization; 3.3.2 Activation Quantization; 3.3.3 Integrating Weight/Activation Quantization into the Training Algorithm; 3.4 Experiment; 3.4.1 Image Classification: AlexNet, GoogLeNet and ResNet-50/101; 3.4.2 Object Detection: R-FCN with ResNet-50; 3.4.3 Language Modeling: An LSTM; 3.5 Conclusion; Chapter 4. Value-aware Quantization for Training and Inference of Neural Networks; 4.1 Introduction; 4.2 Motivation; 4.3 Proposed Method; 4.3.1 Quantized Back-Propagation; 4.3.2 Back-Propagation of Full-Precision Loss; 4.3.3 Potential of Further Reduction in Computation Cost; 4.3.4 Local Sorting in Data Parallel Training; 4.3.5 ReLU and Value-aware Quantization (RV-Quant); 4.3.6 Activation Annealing; 4.3.7 Quantized Inference; 4.4 Experiments; 4.4.1 Training Results; 4.4.2 Inference Results; 4.4.3 LSTM Language Model; 4.5 Conclusions; Chapter 5. Energy-efficient Neural Network Accelerator Based on Outlier-aware Low-precision Computation; 5.1 Introduction; 5.2 Proposed Architecture; 5.2.1 Overall Structure; 5.2.2 Dataflow; 5.2.3 PE Cluster; 5.2.4 Normal PE Group; 5.2.5 Outlier PE Group and Cluster Output Tri-buffer; 5.3 Evaluation Methodology; 5.4 Experimental Results; 5.5 Conclusion; Chapter 6. Precision Highway for Ultra Low-Precision Quantization; 6.1 Introduction; 6.2 Proposed Method; 6.2.1 Precision Highway on Residual Network; 6.2.2 Precision Highway on Recurrent Neural Network; 6.2.3 Practical Issues with Precision Highway; 6.3 Training; 6.3.1 Linear Weight Quantization based on Laplace Distribution Model; 6.3.2 Fine-tuning for Weight/Activation Quantization; 6.4 Experiments; 6.4.1 Experimental Setup; 6.4.2 Analysis of Accumulated Quantization Error; 6.4.3 Loss Surface Analysis of Quantized Model Training; 6.4.4 Evaluating the Accuracy of Quantized Model; 6.4.5 Hardware Cost Evaluation of Quantized Model; 6.5 Conclusion; Chapter 7. Towards Sub-4-bit Quantization of Optimized Mobile Networks; 7.1 Introduction; 7.2 BLast Training; 7.2.1 Notation; 7.2.2 Observation; 7.2.3 Activation Instability Metric; 7.2.4 BLast Training; 7.3 Differentiable and Unified Quantization; 7.3.1 Rounding and Truncation Errors; 7.3.2 Limitations of State-of-the-Art Methods; 7.3.3 Proposed Method: DuQ; 7.3.4 Handling Negative Values; 7.4 Experiments; 7.4.1 Accuracy on ImageNet Dataset; 7.4.2 Discussion on Fused-BatchNorm; 7.4.3 Ablation Study; 7.5 Conclusion; Chapter 8. Conclusion; Bibliography; Abstract in Korean; Acknowledgements	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject.ddc	621.39	-
dc.title	Quantization Algorithm and Methodology for Efficient Deep Neural Network	-
dc.title.alternative	효율적인 심층 신경망을 위한 양자화 알고리즘 및 방법론	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	박은혁	-
dc.contributor.department	College of Engineering, Department of Computer Science and Engineering	-
dc.description.degree	Doctor	-
dc.date.awarded	2020-02	-
dc.identifier.uci	I804:11032-000000160741	-
dc.identifier.holdings	000000000042▲000000000044▲000000160741▲	-
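
The abstract describes outlier-aware quantization as keeping a small fraction (1-3%) of large values in high precision while quantizing the rest to 4 bits or fewer. The following is a minimal sketch of that idea in plain Python, not the dissertation's implementation: the symmetric uniform quantizer, the magnitude-based outlier selection, and the names `quantize_uniform`, `quantize_outlier_aware`, and `mse` are all assumptions made for illustration.

```python
import random


def quantize_uniform(values, bits, max_abs):
    """Symmetric linear quantization to signed `bits`-bit levels (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4-bit
    if max_abs == 0:
        return [0.0 for _ in values]
    step = max_abs / levels
    out = []
    for v in values:
        q = round(v / step)
        q = max(-levels, min(levels, q))  # clip to the representable range
        out.append(q * step)
    return out


def quantize_outlier_aware(values, bits=4, outlier_ratio=0.03):
    """Keep the largest-magnitude fraction of values in full precision,
    and quantize the remaining values with a range fitted to them."""
    n_out = int(len(values) * outlier_ratio)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = set(order[:n_out])
    normal = [values[i] for i in range(len(values)) if i not in outliers]
    max_abs = max((abs(v) for v in normal), default=0.0)
    normal_q = iter(quantize_uniform(normal, bits, max_abs))
    return [values[i] if i in outliers else next(normal_q)
            for i in range(len(values))]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic data: mostly small Gaussian values plus a few large ones.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
data[::100] = [25.0] * 10  # inject 10 large values (1% of the data)

plain = quantize_uniform(data, 4, max(abs(v) for v in data))
aware = quantize_outlier_aware(data, bits=4, outlier_ratio=0.03)

print("4-bit, range set by outliers:", mse(data, plain))
print("4-bit, outliers kept exact:  ", mse(data, aware))
```

In this toy setting the few large values stretch the quantization range of the plain 4-bit quantizer, so most small values collapse to zero. Excluding them lets the same 4 bits cover a much narrower range, which is the intuition the abstract gives for preserving statistics such as mean and variance.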

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)

Files in This Item:

000000160741.pdf 4.73 MB
