Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling

Cited 22 times in Web of Science; cited 27 times in Scopus
Authors

Oh, Young H.; Kim, Seonghak; Jin, Yunho; Son, Sam; Bae, Jonghyun; Lee, Jongsung; Park, Yeonhong; Kim, Dong Uk; Ham, Tae Jun; Lee, Jae Wook

Issue Date
2021-02
Publisher
IEEE
Citation
IEEE High-Performance Computer Architecture Symposium Proceedings, Vol. 2021-February, pp. 584-597
Abstract
© 2021 IEEE. To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators, called neural processing units (NPUs). One important challenge for effective use of NPUs is to achieve high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving layer execution of multiple different models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the idle time of the computation units waiting for off-chip data transfers. For a two-model serving scenario of batch 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, to increase the system throughput by 60.1% on average, over the baseline executing one model at a time.
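The interweaving idea in the abstract can be illustrated with a toy timing model. The sketch below is not the Layerweaver scheduler; it only assumes, for illustration, one memory channel, one compute unit, and two on-chip buffers, so that a layer's data can be loaded while an earlier layer computes. The function names (`makespan`, `interleave`) and the per-layer times are hypothetical and are not taken from the paper, so the resulting numbers do not correspond to the reported 44.0%, 28.7%, or 60.1% figures.

```python
from itertools import zip_longest

def makespan(layers, buffers=2):
    """Total time to run `layers`, a list of (mem_time, compute_time) pairs.

    Assumed toy model: loads are serialized on one memory channel, compute
    is serialized on one unit, a layer computes only after its load ends,
    and a load needs a free buffer (freed when an earlier layer finishes).
    """
    mem_done = 0.0
    comp_done = []  # compute finish time of each layer scheduled so far
    for i, (mem, comp) in enumerate(layers):
        # Wait for the buffer used by layer i - buffers to be released.
        buffer_free = comp_done[i - buffers] if i >= buffers else 0.0
        mem_done = max(mem_done, buffer_free) + mem
        prev_comp = comp_done[-1] if comp_done else 0.0
        comp_done.append(max(prev_comp, mem_done) + comp)
    return comp_done[-1] if comp_done else 0.0

def interleave(a, b):
    """Alternate layers of two models: a[0], b[0], a[1], b[1], ..."""
    return [x for pair in zip_longest(a, b) for x in pair if x is not None]

# Hypothetical workloads: four layers each, times in arbitrary units.
compute_heavy = [(1, 4)] * 4   # short load, long compute
memory_heavy = [(4, 1)] * 4    # long load, short compute

baseline = makespan(compute_heavy + memory_heavy)  # one model at a time
woven = makespan(interleave(compute_heavy, memory_heavy))
busy = sum(comp for _, comp in compute_heavy + memory_heavy)

print(f"baseline: {baseline}, compute utilization {busy / baseline:.2f}")
print(f"woven:    {woven}, compute utilization {busy / woven:.2f}")
```

In this toy setting the interleaved order lets each long load of the memory-intensive model run underneath a long compute phase of the compute-intensive model, which is the effect the abstract describes. Choosing a good order for real models with uneven layers is the harder scheduling problem that the paper addresses.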
ISSN
1530-0897
URI
https://hdl.handle.net/10371/183761
DOI
https://doi.org/10.1109/HPCA51647.2021.00056
Files in This Item:
There are no files associated with this item.

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.