FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks

Bae, Jonghyun; Lee, Jongsung; Jin, Yunho; Son, Sam; Kim, Shine; Jang, Hakbeom; Ham, Tae Jun; Lee, Jae Wook

Cited 26 times in Web of Science; cited 31 times in Scopus

Authors: Bae, Jonghyun; Lee, Jongsung; Jin, Yunho; Son, Sam; Kim, Shine; Jang, Hakbeom; Ham, Tae Jun; Lee, Jae Wook

Issue Date: 2021-02

Publisher: USENIX Association

Citation: Proceedings of the 19th USENIX Conference on File and Storage Technologies, FAST 2021, pp. 387-401

Abstract: © 2021 by The USENIX Association. Deep neural networks (DNNs) are widely used in various AI application domains such as computer vision, natural language processing, autonomous driving, and bioinformatics. As DNNs continue to get wider and deeper to improve accuracy, the limited DRAM capacity of a training platform such as a GPU often becomes the limiting factor on the size of DNNs and the batch size, a problem known as the memory capacity wall. Since increasing the batch size is a popular technique to improve hardware utilization, this limit can yield suboptimal training throughput. Recent proposals address this problem by offloading some of the intermediate data (e.g., feature maps) to the host memory. However, they fail to provide robust performance because the training process on a GPU contends with applications running on a CPU for memory bandwidth and capacity. Thus, we propose FlashNeuron, the first DNN training system using an NVMe SSD as a backing store. To fully utilize the limited SSD write bandwidth, FlashNeuron introduces an offloading scheduler, which selectively offloads a set of intermediate data to the SSD in a compressed format without increasing DNN evaluation time. FlashNeuron causes minimal interference to CPU processes because the GPU and the SSD communicate directly for data transfers. Our evaluation of FlashNeuron with four state-of-the-art DNNs shows that FlashNeuron can increase the batch size by a factor of 12.4× to 14.0× over the maximum allowable batch size on an NVIDIA Tesla V100 GPU with 16 GB of DRAM. By employing a larger batch size, FlashNeuron also improves the training throughput by up to 37.8% (with an average of 30.3%) over the baseline using GPU memory only, while minimally disturbing applications running on the CPU.

URI: https://hdl.handle.net/10371/183763

Files in This Item: There are no files associated with this item.

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Journal Papers (저널논문_컴퓨터공학부)
