Detailed Information

Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling

DC Field | Value | Language
dc.contributor.author | Oh, Young H. | -
dc.contributor.author | Kim, Seonghak | -
dc.contributor.author | Jin, Yunho | -
dc.contributor.author | Son, Sam | -
dc.contributor.author | Bae, Jonghyun | -
dc.contributor.author | Lee, Jongsung | -
dc.contributor.author | Park, Yeonhong | -
dc.contributor.author | Kim, Dong Uk | -
dc.contributor.author | Ham, Tae Jun | -
dc.contributor.author | Lee, Jae Wook | -
dc.date.accessioned | 2022-06-24T00:26:16Z | -
dc.date.available | 2022-06-24T00:26:16Z | -
dc.date.created | 2022-05-09 | -
dc.date.issued | 2021-02 | -
dc.identifier.citation | IEEE High-Performance Computer Architecture Symposium Proceedings, Vol.2021-February, pp.584-597 | -
dc.identifier.issn | 1530-0897 | -
dc.identifier.uri | https://hdl.handle.net/10371/183761 | -
dc.description.abstract | © 2021 IEEE. To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators, called neural processing units (NPUs). One important challenge for effective use of NPUs is to achieve high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving layer execution of multiple different models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the idle time of the computation units waiting for off-chip data transfers. For a two-model serving scenario of batch 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, to increase the system throughput by 60.1% on average, over the baseline executing one model at a time. | -
dc.language | English | -
dc.publisher | IEEE | -
dc.title | Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling | -
dc.type | Article | -
dc.identifier.doi | 10.1109/HPCA51647.2021.00056 | -
dc.citation.journaltitle | IEEE High-Performance Computer Architecture Symposium Proceedings | -
dc.identifier.wosid | 000671076000044 | -
dc.identifier.scopusid | 2-s2.0-85102981446 | -
dc.citation.endpage | 597 | -
dc.citation.startpage | 584 | -
dc.citation.volume | 2021-February | -
dc.description.isOpenAccess | N | -
dc.contributor.affiliatedAuthor | Lee, Jae Wook | -
dc.type.docType | Conference Paper | -
dc.description.journalClass | 1 | -
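
The interleaving idea described in the abstract can be illustrated with a toy timing model. The sketch below is not the Layerweaver scheduler: the two-buffer (double-buffering) assumption, the function names `makespan` and `interleave`, the strict alternation policy, and all layer timings are hypothetical and chosen only to show why mixing a compute-intensive model with a memory-intensive one can shorten total execution time.

```python
from itertools import zip_longest

def makespan(layers):
    """Finish time of a layer sequence on one compute unit and one memory channel.

    layers: list of (compute_time, memory_time) tuples in execution order.
    Assumed model (hypothetical): a layer's data must be fully loaded before
    it computes, loads run one at a time, and there are two on-chip buffers,
    so the load for layer i cannot start until layer i-2 has finished computing.
    """
    load_done, comp_done = [], []
    for i, (compute, memory) in enumerate(layers):
        prev_load = load_done[i - 1] if i >= 1 else 0
        buffer_free = comp_done[i - 2] if i >= 2 else 0
        load_end = max(prev_load, buffer_free) + memory
        prev_comp = comp_done[i - 1] if i >= 1 else 0
        comp_end = max(load_end, prev_comp) + compute
        load_done.append(load_end)
        comp_done.append(comp_end)
    return comp_done[-1] if comp_done else 0

def interleave(model_a, model_b):
    """Alternate layers of two models (a naive policy, used only for illustration)."""
    return [layer
            for pair in zip_longest(model_a, model_b)
            for layer in pair if layer is not None]

# Made-up timings: (compute_time, memory_time) per layer.
compute_heavy = [(4, 1)] * 3   # long compute, short load
memory_heavy = [(1, 4)] * 3    # short compute, long load

one_at_a_time = makespan(compute_heavy + memory_heavy)
interleaved = makespan(interleave(compute_heavy, memory_heavy))
print(one_at_a_time, interleaved)  # 22 16
```

In this toy setting, each long load of the memory-heavy model overlaps with a long compute phase of the compute-heavy model, so the compute unit is rarely idle. The numbers are not related to the results reported in the abstract.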

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
