Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling

Oh, Young H.; Kim, Seonghak; Jin, Yunho; Son, Sam; Bae, Jonghyun; Lee, Jongsung; Park, Yeonhong; Kim, Dong Uk; Ham, Tae Jun; Lee, Jae Wook

doi:10.1109/HPCA51647.2021.00056



Cited 22 times in Web of Science; cited 27 times in Scopus


Authors: Oh, Young H.; Kim, Seonghak; Jin, Yunho; Son, Sam; Bae, Jonghyun; Lee, Jongsung; Park, Yeonhong; Kim, Dong Uk; Ham, Tae Jun; Lee, Jae Wook

Issue Date: 2021-02

Publisher: IEEE

Citation: IEEE High-Performance Computer Architecture Symposium Proceedings, Vol. 2021-February, pp. 584-597

Abstract: © 2021 IEEE. To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators called neural processing units (NPUs). One important challenge for the effective use of NPUs is achieving high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving the layer execution of multiple models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the time the computation units sit idle waiting for off-chip data transfers. For a two-model serving scenario at batch size 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, and increases system throughput by 60.1% on average over a baseline that executes one model at a time.
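The overlap idea in the abstract can be illustrated with a small toy model. The sketch below is not the paper's scheduler: it assumes a hypothetical NPU with one memory channel and one compute unit, where each layer must finish loading before it computes and the next load may proceed while the current layer computes. The layer times, the strict A/B alternation, and the function names `makespan` and `interleave` are all illustrative assumptions.

```python
from itertools import chain


def makespan(layers):
    """Finish time of a two-stage (load, then compute) pipeline.

    `layers` is a list of (mem_time, compute_time) pairs executed in order.
    Assumption: loads run back to back on one memory channel, and a layer
    starts computing once its load is done and the compute unit is free.
    """
    mem_done = 0
    comp_done = 0
    for mem_time, compute_time in layers:
        mem_done += mem_time
        comp_done = max(comp_done, mem_done) + compute_time
    return comp_done


def interleave(model_a, model_b):
    """Alternate layers of two models (assumes equal layer counts)."""
    return list(chain.from_iterable(zip(model_a, model_b)))


# Hypothetical per-layer (mem_time, compute_time) in arbitrary time units.
compute_heavy = [(1, 4)] * 3   # short loads, long computation
memory_heavy = [(4, 1)] * 3    # long loads, short computation

# Baseline: run one model to completion, then the other.
baseline = makespan(compute_heavy) + makespan(memory_heavy)

# Interleaved: loads of the memory-heavy model hide behind the
# computation of the compute-heavy model.
weaved = makespan(interleave(compute_heavy, memory_heavy))

total_compute = sum(c for _, c in compute_heavy + memory_heavy)
print("baseline makespan:", baseline)
print("interleaved makespan:", weaved)
print("compute utilization (baseline):", total_compute / baseline)
print("compute utilization (interleaved):", total_compute / weaved)
```

In this toy setting the interleaved order finishes sooner because the compute unit rarely waits on a load. The numbers are made up for illustration and say nothing about the measured results reported in the abstract.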

ISSN: 1530-0897

URI: https://hdl.handle.net/10371/183761

DOI: https://doi.org/10.1109/HPCA51647.2021.00056


Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Journal Papers (저널논문_컴퓨터공학부)
