TrainBox: An Extreme-Scale Neural Network Training Server Architecture by Systematically Balancing Operations

Park, Pyeongsu; Jeong, Heetaek; Kim, Jangwoo

doi:10.1109/MICRO50266.2020.00072

Authors: Park, Pyeongsu; Jeong, Heetaek; Kim, Jangwoo

Issue Date: 2020-10

Publisher: IEEE Computer Society

Citation: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2020), pp. 825-838

Abstract: Neural networks are a major driving force behind another golden age of computing; computer architects have proposed specialized accelerators (e.g., TPU), high-speed interconnects (e.g., NVLink), and algorithms (e.g., ring-based reduction) to efficiently support neural network applications. As a result, these platforms achieve orders of magnitude higher efficiency for neural network computation and inter-accelerator communication than traditional computing platforms. In this paper, we show that the emerging platforms have shifted the performance bottleneck of neural network training from model computation and inter-accelerator communication to data preparation. Although overlapping data preparation with the other operations has so far hidden the preparation overhead, the higher input processing demands of emerging platforms are starting to reverse the situation; at scale, data preparation requires an infeasible amount of host-side CPU, memory, and PCIe resources. Our detailed analysis reveals that this heavy resource consumption comes from data transformation into neural-network-specific formats and from buffering for communication among devices. Therefore, we propose a scalable neural network server architecture that balances data preparation against the other operations. To achieve extreme scalability, our design relies on a scalable device array, rather than on the limited host resources, with three key ideas. First, we offload CPU-intensive operations to customized data preparation accelerators so that training performance scales regardless of host-side CPU performance. Second, we apply direct inter-device communication to eliminate unnecessary data copies and reduce the pressure on host memory. Lastly, we cluster the underlying devices according to the unique communication patterns of neural network processing and the interconnect characteristics, so that the aggregated interconnect bandwidth is used efficiently. Our evaluation shows that the proposed architecture achieves 44.4x higher training throughput on average than a naively extended server architecture with 256 neural network accelerators.
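
The balancing argument in the abstract can be illustrated with a minimal sketch. It is not the paper's model: it only assumes that data preparation and training are fully overlapped, so the slower stage bounds the sustained throughput. All function names, parameters, and rates below are hypothetical and chosen purely for illustration.

```python
def host_bound_throughput(num_accels, accel_rate, host_prep_rate):
    """Samples/s when a fixed host prepares data for all accelerators.

    Assumption: preparation and training are fully overlapped, so the
    slower of the two stages limits the sustained rate.
    """
    compute_rate = num_accels * accel_rate
    return min(compute_rate, host_prep_rate)


def device_array_throughput(num_accels, accel_rate, num_prep_devices, prep_device_rate):
    """Samples/s when preparation runs on an array of devices that can
    grow together with the number of training accelerators."""
    compute_rate = num_accels * accel_rate
    prep_rate = num_prep_devices * prep_device_rate
    return min(compute_rate, prep_rate)


# Hypothetical rates (samples/s), not measurements from the paper.
ACCEL_RATE = 1_000        # one training accelerator
HOST_PREP_RATE = 10_000   # what a single host can prepare
PREP_DEVICE_RATE = 4_000  # one preparation device

for n in (4, 16, 256):
    host = host_bound_throughput(n, ACCEL_RATE, HOST_PREP_RATE)
    # Assumed ratio: one preparation device per four accelerators.
    array = device_array_throughput(n, ACCEL_RATE, n // 4, PREP_DEVICE_RATE)
    print(f"{n:4d} accelerators: host-bound {host:7d}, device array {array:7d}")
```

Under these assumed numbers, the host-bound configuration stops improving once the accelerators consume more than the host can prepare, whereas the device-array configuration keeps both stages balanced as the accelerator count grows.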

URI: https://hdl.handle.net/10371/186296

DOI: https://doi.org/10.1109/MICRO50266.2020.00072

Files in This Item: There are no files associated with this item.

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Journal Papers (저널논문_전기·정보공학부)
