Pruning Deep Convolutional Neural Networks for Fast Inference


Anwar, Sajid

Prof. Wonyong Sung
College of Engineering, Department of Electrical and Computer Engineering
Issue Date
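Graduate School, Seoul National University
Deep learning, Convolutional Neural Networks, Pruning, Quantization
Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Electrical and Computer Engineering, February 2017. Wonyong Sung.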
Deep learning algorithms have recently achieved human-level classification performance on several diverse classification benchmarks, including object and speech recognition. However, these algorithms are computationally very expensive, especially for resource-limited portable machines. Several researchers have proposed ways to lower this cost, and this dissertation addresses the same problem. We propose pruning and fixed-point optimization techniques to reduce the computational complexity of deep neural networks. Pruning is a promising technique in which a problem is first approximated with a large network and unimportant parameters are then removed.
The proposed work induces sparsity in a deep convolutional neural network (CNN) at three levels: feature map, kernel, and intra-kernel. Feature map pruning removes a large number of kernels, directly reduces the width of a layer, and does not require any sparse representation. The resulting network is therefore thinner and runs faster than the unpruned original. However, feature map pruning removes all the incoming and outgoing kernels of a feature map and thus affects a large number of parameters, so it may not achieve high pruning ratios. Kernel pruning eliminates k × k kernels and is neither too fine nor too coarse. It can change the dense kernel connectivity pattern to a sparse one. Each convolution connection involves W × H × k × k multiply-and-accumulate (MAC) operations, where W, H, and k represent the feature map width, the feature map height, and the kernel size, respectively. The sparse representation for kernel pruning is also very simple: a single flag is enough to represent one convolution connection. Intra-kernel pruning removes scalar weights at the finest scale. Conventional pruning techniques induce irregular sparsity at the finest granularity by zeroing scalar weights. This sparsity can be induced at much higher rates but requires a sparse representation in order to be translated into computational speedups in VLSI or parallel-computer-based implementations. Coarse pruning granularities demand only a very simple sparse representation, but higher pruning ratios are comparatively difficult to achieve. In contrast, fine-grained pruning granularities can achieve higher pruning ratios, but the sparse representation is more complicated. In this dissertation, we propose pruning techniques at the aforementioned three pruning granularities. We further show that various pruning granularities can be applied in combination to compress the network as far as possible.
The scalar weights inside a kernel are generally pruned in an irregular pattern. In this dissertation, we propose intra-kernel strided sparsity (IKSS), which prunes scalar weights at strided indices. We further impose the condition that all the outgoing kernels from a feature map must have the same stride and offset. This has a direct impact on the sizes of the matrices when convolutions are unrolled for matrix-matrix multiplication. The sparse representation for the constrained IKSS is only two numbers (stride, offset) per feature map.
During pruning, it is important to locate the least adversarial pruning candidates. We propose three techniques for pruning candidate selection: a particle filter, selecting the best of N random pruning masks, and activation sum voting for feature map pruning. The dissertation discusses the best-of-N random masks technique extensively and provides a detailed analysis. We obtain pruning ratios of more than 80% with various pruning granularities. Moreover, the pruned networks can be further compressed by quantizing the weights and signals. This dissertation discusses our fixed-point optimization algorithm for deep convolutional neural networks, in which the network weights and signals are represented with 3 to 8 bits of precision. We also discuss the layer-wise sensitivity analysis of deep convolutional neural networks. Thus, we reduce the computational complexity of a CNN with pruning and fixed-point optimization.
The proposed pruning techniques fit graphics processing units (GPUs) well: IKSS reduces the size of the matrices, and GPUs are efficient at multiplying matrices. In addition, FFT-based CNN implementations can benefit from kernel-level pruning. For VLSI implementations, the fixed-point optimization techniques reduce the memory requirements, so the networks can be hosted in on-chip memory. Thus the proposed techniques can be exploited on a broad range of modern computing platforms.
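The effect of the constrained IKSS on the unrolled convolution matrices can be illustrated with a minimal sketch. This is not the dissertation's implementation. The function names (ikss_pruned_indices, ikss_mask, unrolled_rows) are hypothetical, and the sketch assumes that a k × k kernel is flattened in row-major order and that the weights at flat indices offset, offset + stride, offset + 2·stride, ... are the ones pruned. The abstract does not spell out the indexing convention, so this is one plausible reading; under the opposite convention the mask would simply be inverted.

```python
def ikss_pruned_indices(k, stride, offset):
    """Flat (row-major) indices of a k x k kernel that are pruned.

    Assumption of this sketch: pruning hits indices
    offset, offset + stride, offset + 2 * stride, ...
    """
    if stride < 1 or offset < 0:
        raise ValueError("stride must be >= 1 and offset must be >= 0")
    return list(range(offset, k * k, stride))


def ikss_mask(k, stride, offset):
    """k x k binary mask: 0 marks a pruned weight, 1 a retained weight."""
    pruned = set(ikss_pruned_indices(k, stride, offset))
    return [[0 if r * k + c in pruned else 1 for c in range(k)]
            for r in range(k)]


def unrolled_rows(k, patterns):
    """Row counts of the unrolled (im2col-style) input matrix for one layer.

    patterns holds one entry per input feature map: a (stride, offset)
    pair, or None if that feature map is left unpruned. Because every
    outgoing kernel of a feature map shares the same pattern, the pruned
    kernel positions can be dropped from the unrolled matrix entirely.
    Returns (dense_rows, rows_after_ikss).
    """
    dense = len(patterns) * k * k
    kept = 0
    for pattern in patterns:
        if pattern is None:
            kept += k * k
        else:
            stride, offset = pattern
            kept += k * k - len(ikss_pruned_indices(k, stride, offset))
    return dense, kept


if __name__ == "__main__":
    for row in ikss_mask(3, stride=2, offset=0):
        print(row)
    # [0, 1, 0]
    # [1, 0, 1]
    # [0, 1, 0]
    print(unrolled_rows(3, [(2, 0), (2, 1), None]))  # (27, 18)
```

In this example, three input feature maps with 3 × 3 kernels unroll to 27 rows when dense and to 18 rows once two of the maps carry a shared (stride, offset) pattern. The whole sparse description of the layer is the list of pairs passed to unrolled_rows, which matches the two-numbers-per-feature-map representation described in the abstract.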
