Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Oh, Young H. | - |
dc.contributor.author | Kim, Seonghak | - |
dc.contributor.author | Jin, Yunho | - |
dc.contributor.author | Son, Sam | - |
dc.contributor.author | Bae, Jonghyun | - |
dc.contributor.author | Lee, Jongsung | - |
dc.contributor.author | Park, Yeonhong | - |
dc.contributor.author | Kim, Dong Uk | - |
dc.contributor.author | Ham, Tae Jun | - |
dc.contributor.author | Lee, Jae Wook | - |
dc.date.accessioned | 2022-06-24T00:26:16Z | - |
dc.date.available | 2022-06-24T00:26:16Z | - |
dc.date.created | 2022-05-09 | - |
dc.date.issued | 2021-02 | - |
dc.identifier.citation | IEEE High-Performance Computer Architecture Symposium Proceedings, Vol.2021-February, pp.584-597 | - |
dc.identifier.issn | 1530-0897 | - |
dc.identifier.uri | https://hdl.handle.net/10371/183761 | - |
dc.description.abstract | © 2021 IEEE. To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators, called neural processing units (NPUs). One important challenge for effective use of NPUs is to achieve high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving layer execution of multiple different models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the idle time of the computation units waiting for off-chip data transfers. For a two-model serving scenario of batch 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, to increase the system throughput by 60.1% on average, over the baseline executing one model at a time. | - |
dc.language | English | - |
dc.publisher | IEEE | - |
dc.title | Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/HPCA51647.2021.00056 | - |
dc.citation.journaltitle | IEEE High-Performance Computer Architecture Symposium Proceedings | - |
dc.identifier.wosid | 000671076000044 | - |
dc.identifier.scopusid | 2-s2.0-85102981446 | - |
dc.citation.endpage | 597 | - |
dc.citation.startpage | 584 | - |
dc.citation.volume | 2021-February | - |
dc.description.isOpenAccess | N | - |
dc.contributor.affiliatedAuthor | Lee, Jae Wook | - |
dc.type.docType | Conference Paper | - |
dc.description.journalClass | 1 | - |
- Files in This Item:
- There are no files associated with this item.
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.