Characterization and Optimization of Quantized Deep Neural Networks

Abstract: Deep neural networks (DNNs) have achieved impressive performance on various machine learning tasks. However, performance improvements are usually accompanied by increased network complexity, which incurs vast numbers of arithmetic operations and memory accesses. In addition, the recent increase in demand for DNNs on resource-limited devices has led to a plethora of explorations in model compression and acceleration. Among them, network quantization is one of the most cost-efficient implementation methods for DNNs. Network quantization converts the precision of parameters and signals from 32-bit floating-point to 8-, 4-, or 2-bit fixed-point precision. Weight quantization directly compresses DNNs by reducing the number of representation levels of the parameters. Activation outputs can also be quantized to reduce the computational cost and working memory footprint. However, severe quantization degrades the performance of the network. Many previous studies focused on developing optimization methods for the quantization of given models without considering the effects of quantization on DNNs. Therefore, extensive simulation is required to find a quantization precision that maintains performance on different models or datasets.
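As a rough illustration of the precision conversion described above, the following sketch applies a symmetric uniform quantizer to a handful of floating-point values at 8, 4, and 2 bits. The function names, the symmetric max-abs scaling, and the sample weights are assumptions made for this example; the dissertation's actual quantizer may differ.

```python
def quantize_uniform(values, bits, max_abs=None):
    """Symmetric uniform quantizer (illustrative assumption, not the thesis's exact scheme)."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    if max_abs is None:
        # Scale by the largest magnitude; fall back to 1.0 for an all-zero input.
        max_abs = max(abs(v) for v in values) or 1.0
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    step = max_abs / levels               # spacing between representable values
    quantized = []
    for v in values:
        index = round(v / step)                       # nearest integer level
        index = max(-levels, min(levels, index))      # clip to the representable range
        quantized.append(index * step)                # back to a real value for comparison
    return quantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights, chosen only to show how the error grows as bits shrink.
weights = [0.82, -0.41, 0.07, -0.93, 0.30, 0.55]
errors = {bits: mean_squared_error(weights, quantize_uniform(weights, bits))
          for bits in (8, 4, 2)}
```

For these sample values the error grows as the bit-width shrinks, which is the trade-off the abstract refers to when it says that severe quantization degrades performance.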
In this dissertation, we attempt to measure the per-parameter capacity of DNN models and interpret the results to obtain insights on the optimum quantization of parameters. Uniform random vectors are sampled and used to train generic forms of fully connected DNNs, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). We conduct memorization and classification tests to study the effects of the number and precision of the parameters on performance. The model and per-parameter capacities are assessed by measuring the mutual information between the input and the classified output. To gain insight into parameter quantization for real tasks, the training and test performances are compared.
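The capacity measurement can be pictured with a small empirical estimate of mutual information between the assigned labels and the classifier's outputs. This is only a sketch under assumptions: the plug-in estimator, the binary toy labels, and the division by a hypothetical parameter count are illustrative and are not taken from the dissertation.

```python
import math
from collections import Counter


def mutual_information_bits(targets, predictions):
    """Plug-in estimate of I(target; prediction) in bits per sample."""
    n = len(targets)
    joint = Counter(zip(targets, predictions))
    target_counts = Counter(targets)
    prediction_counts = Counter(predictions)
    total = 0.0
    for (t, p), count in joint.items():
        # p(t, p) * log2( p(t, p) / (p(t) * p(p)) ), written with raw counts
        total += (count / n) * math.log2(
            count * n / (target_counts[t] * prediction_counts[p]))
    return total


# Toy memorization test: random binary labels assigned to 8 random inputs.
targets = [0, 1, 0, 1, 1, 0, 0, 1]
memorized = list(targets)                  # a model that recalls every label
unrelated = [0, 0, 1, 1, 0, 0, 1, 1]       # outputs independent of the labels

bits_memorized = mutual_information_bits(targets, memorized)
bits_unrelated = mutual_information_bits(targets, unrelated)

# Assumed definition for illustration: total stored bits divided by parameter count.
num_parameters = 4                         # hypothetical model size
per_parameter_bits = len(targets) * bits_memorized / num_parameters
```

A model that recalls every random label carries one bit per sample here, while outputs unrelated to the labels carry none; dividing the total by the number of parameters gives one plausible reading of "per-parameter capacity".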
In addition, we analyze and demonstrate that the quantization noises of weights and activations behave differently during inference. Synthesized data is designed to visualize the effects of weight and activation quantization. The results indicate that deeper models are more vulnerable to activation quantization, while wider models are more resilient to both weight and activation quantization. Considering the characteristics of the quantization errors, we propose a holistic approach to the optimization of quantized DNNs (QDNNs), which comprises QDNN training methods as well as quantization-friendly architecture design.
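To make the distinction between the two error sources concrete, the sketch below runs a small random ReLU network three ways: in floating point, with quantized weights only, and with quantized activations only. Everything here (the network, its sizes, the clipping range, the 4-bit setting) is an assumption for illustration. It shows only where each kind of error enters the computation and is not meant to reproduce the reported trends.

```python
import random


def quantize(value, bits, max_abs):
    """Symmetric uniform quantizer with clipping (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    return max(-levels, min(levels, round(value / step))) * step


def forward(x, layers, weight_bits=None, act_bits=None, act_clip=4.0):
    """Fully connected ReLU network with optional weight/activation quantization."""
    h = list(x)
    for weights in layers:
        if weight_bits is not None:
            # Weight quantization: a fixed perturbation of the stored parameters.
            w_max = max(abs(w) for row in weights for w in row)
            weights = [[quantize(w, weight_bits, w_max) for w in row]
                       for row in weights]
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]
        if act_bits is not None:
            # Activation quantization: error injected again after every layer,
            # so it is carried forward through all later layers.
            h = [quantize(v, act_bits, act_clip) for v in h]
    return h


def mean_abs_error(a, b):
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


rng = random.Random(0)                     # fixed seed so the run is repeatable
width, depth = 8, 4                        # hypothetical network shape
layers = [[[rng.uniform(-1, 1) for _ in range(width)] for _ in range(width)]
          for _ in range(depth)]
x = [rng.uniform(0, 1) for _ in range(width)]

reference = forward(x, layers)
weight_error = mean_abs_error(reference, forward(x, layers, weight_bits=4))
activation_error = mean_abs_error(reference, forward(x, layers, act_bits=4))
```

Varying `depth` and `width` in such a toy setup is one simple way to explore how the two errors respond to model shape.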
Based on the observation that activation quantization induces noisy predictions, we propose Stochastic Precision Ensemble training for QDNNs (SPEQ). SPEQ is a form of teacher-student learning in which the teacher and the student share the model parameters. We obtain the teacher's soft labels by stochastically changing the bit precision of the activations at each layer of the forward-pass computation. The student model is trained with these soft labels to reduce the activation quantization noise. Instead of the KL divergence, a cosine-distance loss is employed for the knowledge distillation (KD) training. Since the teacher model changes continuously through random bit-precision assignment, training exploits the effect of stochastic ensemble KD. The SPEQ method improves the performance of QDNNs on various tasks, such as image classification, question answering, and transfer learning, without requiring cumbersome teacher networks.
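The training signal described above can be sketched for a single step as follows. This is a minimal illustration, not the authors' implementation: the candidate bit-widths, the clipping range, the toy network, and the choice to compare softmax outputs are all assumptions, and the gradient update is omitted.

```python
import math
import random


def quantize_act(value, bits, clip=4.0):
    """Unsigned uniform quantizer for post-ReLU activations (clip range assumed)."""
    levels = 2 ** bits - 1
    step = clip / levels
    return min(levels, max(0, round(value / step))) * step


def forward(x, layers, act_bits):
    """Shared weights; one activation bit-width per hidden layer."""
    h = list(x)
    for index, weights in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) for row in weights]
        if index < len(layers) - 1:        # hidden layers: ReLU, then quantize
            h = [quantize_act(max(0.0, v), act_bits[index]) for v in h]
    return h                               # logits from the final layer


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm


rng = random.Random(0)
sizes = [6, 8, 8, 4]                       # hypothetical layer widths
layers = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
          for n_in, n_out in zip(sizes, sizes[1:])]
x = [rng.uniform(0, 1) for _ in range(sizes[0])]

student_bits = [2, 2]                      # the low precision being trained for
# Teacher: same weights, but a randomly drawn precision at every layer,
# resampled at each training step (candidate set is an assumption).
teacher_bits = [rng.choice([2, 4, 8]) for _ in student_bits]

teacher_soft = softmax(forward(x, layers, teacher_bits))   # treated as a fixed target
student_prob = softmax(forward(x, layers, student_bits))
loss = cosine_distance(student_prob, teacher_soft)         # used in place of KL divergence
```

In an actual training loop this loss would be backpropagated through the student path only, and a new `teacher_bits` draw at every step is what would give the ensemble-like effect the abstract mentions.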
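Recently, deep neural networks (DNNs) have shown very impressive performance in a variety of fields. However, as network complexity grows, the costs of computation and memory access keep increasing. Quantization of neural networks is one of the effective ways to reduce the operating cost of DNNs. In general, the weights and activation outputs of a neural network have 32-bit floating-point precision. Fixed-point quantization reduces the size and computational cost of the network by representing them at a lower precision. However, networks quantized to very low precision, such as 1 or 2 bits, show a large performance drop compared with floating-point networks. Previous studies present optimization methods for given data and models without analyzing the quantization error. To apply such results to other models and data, numerous simulations must be run to find the limit of quantization precision that maintains performance.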
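This study analyzes the characteristics of quantization in neural networks and presents the causes of the performance degradation that quantization brings. Neural network quantization is broadly divided into weight quantization and activation quantization. First, to analyze the characteristics of weight quantization, we generate random training samples and quantify the memorization capacity of a network while training it on this data. After training the network so that it fully uses its memorization capacity, we analyze the limit of quantization precision at which performance begins to drop. The analysis confirms that the quantization precision at which the weights start to lose information is unrelated to the number of parameters. Furthermore, the limiting quantization precision that can retain the information stored in the parameters depends on the model structure.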
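In addition, this study analyzes the differences between the errors caused by activation quantization and by weight quantization. We generate synthesized data, quantize the models trained on it, and visualize the quantization error. The analysis shows that weight quantization reduces the capacity of the network, and that increasing the number of parameters reduces the weight quantization error. In contrast, activation quantization induces noise during inference, and the activation error is amplified as the network becomes deeper. Based on the differences between the two quantization errors, we propose a comprehensive fixed-point optimization method that includes quantization-friendly architecture design and fixed-point training methods.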
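Furthermore, we propose the SPEQ training method as a way to improve the resilience of networks with quantized activations. The proposed method is based on knowledge distillation (KD) and uses information from a different teacher model at each training step. The teacher model's parameters are identical to the student's, and the teacher's soft labels are generated by stochastically selecting the activation quantization precision. The teacher therefore provides knowledge that accounts for the quantization noise arising in the student model. Because the student is trained at every step with knowledge that reflects a different kind of quantization noise, it gains the effect of ensemble training. The proposed SPEQ training method greatly improved the performance of quantized networks in a variety of fields.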

Language: eng

URI: https://hdl.handle.net/10371/169278

http://dcollection.snu.ac.kr/common/orgView/000000162353

Files in This Item:

000000162353.pdf 10.68 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Electrical and Computer Engineering
  - Theses (Ph.D. / Sc.D., Dept. of Electrical and Computer Engineering)
