Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Oh, Young H. | - |
dc.contributor.author | Kim, Seonghak | - |
dc.contributor.author | Jin, Yunho | - |
dc.contributor.author | Son, Sam | - |
dc.contributor.author | Bae, Jonghyun | - |
dc.contributor.author | Lee, Jongsung | - |
dc.contributor.author | Park, Yeonhong | - |
dc.contributor.author | Kim, Dong Uk | - |
dc.contributor.author | Ham, Tae Jun | - |
dc.contributor.author | Lee, Jae Wook | - |
dc.date.accessioned | 2022-06-24T00:26:16Z | - |
dc.date.available | 2022-06-24T00:26:16Z | - |
dc.date.created | 2022-05-09 | - |
dc.date.issued | 2021-02 | - |
dc.identifier.citation | IEEE High-Performance Computer Architecture Symposium Proceedings, Vol.2021-February, pp.584-597 | - |
dc.identifier.issn | 1530-0897 | - |
dc.identifier.uri | https://hdl.handle.net/10371/183761 | - |
dc.description.abstract | © 2021 IEEE. To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators, called neural processing units (NPUs). One important challenge for effective use of NPUs is to achieve high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving layer execution of multiple different models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the idle time of the computation units waiting for off-chip data transfers. For a two-model serving scenario of batch 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, to increase the system throughput by 60.1% on average, over the baseline executing one model at a time. | - |
dc.language | English | - |
dc.publisher | IEEE | - |
dc.title | Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/HPCA51647.2021.00056 | - |
dc.citation.journaltitle | IEEE High-Performance Computer Architecture Symposium Proceedings | - |
dc.identifier.wosid | 000671076000044 | - |
dc.identifier.scopusid | 2-s2.0-85102981446 | - |
dc.citation.endpage | 597 | - |
dc.citation.startpage | 584 | - |
dc.citation.volume | 2021-February | - |
dc.description.isOpenAccess | N | - |
dc.contributor.affiliatedAuthor | Lee, Jae Wook | - |
dc.type.docType | Conference Paper | - |
dc.description.journalClass | 1 | - |
- Files in This Item:
- There are no files associated with this item.
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.