Analysis of the Impact of Off-Chip Memory Access on NPU System Performance

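Abstract: NPUs have been optimized to efficiently accelerate compute-intensive convolutional neural network (CNN) operations and are advancing rapidly alongside a wide range of research. One of the factors that strongly affects NPU performance is off-chip DRAM access latency. The common way to reduce it is to place a large SRAM on chip and maximize data reuse within the chip. However, while many NPU design studies improve performance through on-chip data optimization, analysis of the off-chip DRAM access latency itself is comparatively scarce.

This thesis examines the effect of off-chip memory access latency on NPU performance, with a detailed analysis based on simulation results. For the experiments, a cycle-accurate simulation environment was built using MIDAP as the NPU simulator, Ramulator as the DRAM simulator, an AMBA AXI4 TLM 2.0 simulator as the bus simulator, and a custom SRAM simulator implemented on top of the CACTI tool.

Experiments with various CNN models confirmed that off-chip DRAM access time accounts for a very large share of NPU execution time, and examined how a large off-chip SRAM placed between the NPU and the off-chip DRAM affects NPU performance. A cache and a scratchpad memory (SPM) were each used as the off-chip SRAM, and the SPM was found to be better suited to the NPU system than the cache. For ResNet50, the SPM reduced off-chip memory access latency by about 70.7% and total simulation cycles by about 26.5%.

Finally, the existing simulation environment was extended so that the effect of off-chip memory access in a multi-NPU system can be compared against a single-NPU system. A bus structure that can connect multiple NPUs and SPM banks enables design space exploration (DSE) across system configurations under various conditions.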

Numerous CNN accelerators, called neural processing units (NPUs), have recently been proposed and developed to accelerate CNN computation with a customized chip. To minimize the DRAM access volume, NPUs commonly have a large on-chip memory and try to maximize reuse of the data fetched from the off-chip DRAM. While extensive research has been conducted on minimizing the effect of off-chip DRAM access on performance in NPU design, little attention has been paid to a detailed analysis of the DRAM access overhead or to the use of a memory hierarchy to reduce it. In this paper, I analyze the effects of DRAM access latency and of a large SRAM on NPU performance using a cycle-accurate system simulation environment.

The cycle-accurate system simulation environment consists of an NPU simulator, Ramulator for DRAM simulation, a custom SRAM simulator based on CACTI, and an AMBA AXI4 TLM 2.0 simulator. Extensive simulations with various CNN models show that using the SRAM as an SPM is more effective than using it as a cache. As an example, for the NPU simulator used in the experiments, adding an SPM reduces the memory access overhead by about 70.7% for ResNet50 and improves the end-to-end performance by 26.5%. Finally, I extended the existing NPU system to a multi-NPU system in which multiple NPUs share the SPM, enabling design space exploration (DSE) across system settings under various conditions.
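
The two ResNet50 figures above can be related with a small back-of-the-envelope calculation. The sketch below is not taken from the thesis. It assumes a simple additive model in which total cycles are compute cycles plus exposed memory access overhead, that the SPM changes only the memory term, and that the 26.5% figure refers to the reduction in total simulation cycles. The function name `implied_memory_share` is hypothetical.

```python
def implied_memory_share(overhead_reduction: float, total_reduction: float) -> float:
    """Estimate the baseline fraction of cycles spent on memory access overhead.

    Assumed model (an illustration, not the thesis's method):
        T_base = C + M
        T_spm  = C + (1 - overhead_reduction) * M
    so (T_base - T_spm) / T_base = overhead_reduction * M / T_base,
    which gives M / T_base = total_reduction / overhead_reduction.
    """
    if not 0.0 < overhead_reduction <= 1.0:
        raise ValueError("overhead_reduction must be in (0, 1]")
    if not 0.0 <= total_reduction <= overhead_reduction:
        raise ValueError("total_reduction must be in [0, overhead_reduction]")
    return total_reduction / overhead_reduction


# Figures reported in the abstract for ResNet50.
share = implied_memory_share(overhead_reduction=0.707, total_reduction=0.265)
print(f"Implied baseline memory overhead share: {share:.1%}")
```

Under these assumptions, roughly 37.5% of the baseline ResNet50 cycles would be exposed memory access overhead. If compute and memory access overlap, or if the SPM also changes compute behavior, the actual share will differ.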

Language: kor

URI: https://hdl.handle.net/10371/183385

https://dcollection.snu.ac.kr/common/orgView/000000169750

Files in This Item:

000000169750.pdf 4.34 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
