Memory Layout and Computing Techniques for High Performance Neural Networks

Abstract: Although the demand for neural network computation is steadily increasing, deep neural networks (DNNs) entail excessive memory and computation costs, which raises many design challenges. This dissertation studies several new techniques for processing DNN inference operations effectively.
First, we attempt to overcome the limitation that the maximal computation speedup is bounded by the total number of non-zero bits in the weights. Specifically, this work, based on signed-digit encoding, (1) proposes a transformation technique that converts the two's complement representation of every weight into a set of signed-digit representations with the minimal number of essential bits; (2) formulates the problem of selecting, for each weight, the signed-digit representation that maximizes the parallelism of bit-level multiplication as a multi-objective shortest path problem, so as to achieve maximal digit-index-by-digit-index (i.e., column-wise) compression of the weights, and solves it efficiently with an approximation algorithm; and (3) proposes a novel supporting accelerator architecture (DWP) that requires no non-trivial additional hardware. In addition, we (4) propose a variant of DWP that supports bit-level parallel multiplication and can predict a tight worst-case latency for the parallel processing. Experiments on several representative models using the ImageNet dataset show that the proposed approach reduces the number of essential bits by 69% on AlexNet, 74% on VGG-16, and 68% on ResNet-152, and that our accelerator thereby reduces inference computation time by up to 3.57x compared with conventional bit-level weight pruning.
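To illustrate why signed-digit encoding can lower the number of essential (non-zero) bits, the following minimal sketch converts an integer weight into its non-adjacent form (NAF), a standard signed-digit representation with the fewest non-zero digits. This is an assumption-laden illustration only: the dissertation derives a set of minimal representations per weight and then selects among them, which this sketch does not attempt. The function names and the 8-bit width are hypothetical.

```python
def naf(n: int) -> list[int]:
    """Non-adjacent form of n, least-significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1; n mod 4 == 3 -> digit -1
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def nonzero_bits_twos_complement(n: int, width: int = 8) -> int:
    """Count the 1 bits in the width-bit two's complement encoding of n."""
    return bin(n & ((1 << width) - 1)).count("1")


def nonzero_digits(digits: list[int]) -> int:
    """Count the non-zero digits of a signed-digit representation."""
    return sum(1 for d in digits if d != 0)


if __name__ == "__main__":
    for w in (119, -1, 85):
        print(w, nonzero_bits_twos_complement(w), nonzero_digits(naf(w)))
```

For example, the 8-bit weight 119 (binary 01110111) has six non-zero bits in two's complement but only three non-zero digits in NAF, since 119 = 128 - 8 - 1. Some weights, such as 85 (binary 01010101), gain nothing because their binary form is already non-adjacent.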
Second, we present a new algorithm for extracting common kernels and convolutions that maximally eliminates redundant operations among the convolutions in binary- and ternary-weight convolutional neural networks. Specifically, we propose (1) a new common kernel extraction algorithm that overcomes the local and limited exploration of common kernel candidates in the existing method, and subsequently apply (2) a new concept, common convolution extraction, to maximally eliminate the redundancy in the convolution operations. In addition, our algorithm can (3) be tuned to minimize the number of resulting kernels for the convolutions, thereby reducing the total memory access latency for kernels. Experimental results on ternary-weight VGG-16 show that our convolution optimization algorithm is very effective, reducing the total number of operations for all convolutions by 25.8-26.3% and thereby the total number of execution cycles on a hardware platform by 22.4%, while using 2.7-3.8% fewer kernels, relative to convolution that uses the common kernels extracted by the state-of-the-art algorithm.
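The idea of a common kernel can be sketched for a single pair of ternary kernels: the positions where both kernels hold the same non-zero weight form a shared kernel whose partial sum is computed once and reused. The code below is a simplified pairwise illustration under that assumption, not the extraction algorithm proposed in the dissertation, which explores candidates more globally. All names are hypothetical, and kernels are flattened to lists for brevity.

```python
def common_kernel(k1: list[int], k2: list[int]) -> list[int]:
    """Keep weights on which both ternary kernels agree; zero elsewhere."""
    return [a if a == b else 0 for a, b in zip(k1, k2)]


def residual(k: list[int], common: list[int]) -> list[int]:
    """Part of kernel k not covered by the common kernel (still ternary)."""
    return [a - c for a, c in zip(k, common)]


def dot(k: list[int], patch: list[int]) -> int:
    """Kernel response on one input patch."""
    return sum(w * x for w, x in zip(k, patch))


def nnz(k: list[int]) -> int:
    """Number of non-zero weights, i.e. terms accumulated per patch."""
    return sum(1 for w in k if w != 0)


if __name__ == "__main__":
    k1 = [1, -1, 0, 1, 1, 0, -1, 1, 0]
    k2 = [1, -1, 1, 1, 0, 0, -1, -1, 0]
    patch = [3, 1, 4, 1, 5, 9, 2, 6, 5]

    c = common_kernel(k1, k2)
    r1, r2 = residual(k1, c), residual(k2, c)

    shared = dot(c, patch)           # computed once, reused for both outputs
    out1 = shared + dot(r1, patch)
    out2 = shared + dot(r2, patch)
    assert out1 == dot(k1, patch) and out2 == dot(k2, patch)

    print("terms without sharing:", nnz(k1) + nnz(k2))
    print("terms with sharing:   ", nnz(c) + nnz(r1) + nnz(r2))
```

In this toy pair, the two kernels accumulate 12 weighted terms when evaluated independently and 8 when the common part is shared, not counting the two additions needed to combine the shared and residual sums.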
Finally, we propose solutions for DNNs that use unfitted compression to maintain accuracy, where the distinct weights of the compressed DNN cannot all be held in on-chip memory. Specifically, given an access sequence of weights, (1) the first problem is to arrange the weights in off-chip memory so that the number of off-chip memory accesses (equivalently, the energy consumed by those accesses) is minimized, and (2) the second problem is to devise a strategy for selecting which weight block in on-chip memory to replace when a block miss occurs, with the objective of minimizing the total energy consumed by off-chip memory accesses plus the overhead of scanning indexes for block replacement. Experiments with a compressed AlexNet model show that our solutions reduce the total energy consumption of off-chip memory accesses, including the scanning overhead, by 34.2% on average compared with an unoptimized memory layout and the LRU replacement scheme.
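The interaction between weight layout and block replacement can be made concrete with a small simulation. The sketch below models only the LRU baseline named above and compares two hand-picked layouts; it does not implement the layout or replacement strategies proposed in the dissertation, and it counts block misses as a stand-in for off-chip access energy. The function name, parameters, and the toy access sequence are all hypothetical.

```python
from collections import OrderedDict


def count_block_misses(access_seq, layout, block_size, capacity_blocks):
    """Count off-chip block fetches under LRU replacement.

    layout lists weight ids in off-chip order; each run of block_size
    consecutive entries forms one block. capacity_blocks is the number
    of blocks that fit in on-chip memory.
    """
    block_of = {w: i // block_size for i, w in enumerate(layout)}
    on_chip = OrderedDict()  # block id -> True, oldest use first
    misses = 0
    for w in access_seq:
        b = block_of[w]
        if b in on_chip:
            on_chip.move_to_end(b)           # mark as most recently used
        else:
            misses += 1                      # fetch block from off-chip
            if len(on_chip) >= capacity_blocks:
                on_chip.popitem(last=False)  # evict least recently used
            on_chip[b] = True
    return misses


if __name__ == "__main__":
    seq = [0, 2, 0, 2, 1, 3, 1, 3]
    naive = count_block_misses(seq, [0, 1, 2, 3], block_size=2, capacity_blocks=1)
    grouped = count_block_misses(seq, [0, 2, 1, 3], block_size=2, capacity_blocks=1)
    print("misses, naive layout:  ", naive)
    print("misses, grouped layout:", grouped)
```

With room for a single on-chip block, placing weights that are accessed together in the same block cuts the miss count in this toy sequence from 8 to 2, which is the kind of effect the layout problem targets.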

Language: eng

URI: https://hdl.handle.net/10371/175288

https://dcollection.snu.ac.kr/common/orgView/000000163737

Files in This Item:

000000163737.pdf 14.19 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
