ShortcutFusion: From Tensorflow to FPGA-Based Accelerator With a Reuse-Aware Memory Allocation for Shortcut Data

Nguyen, Duy Thanh; Je, Hyeonseung; Nguyen, Tuan Nghia; Ryu, Soojung; Lee, Kyujoong; Lee, Hyuk-Jae

doi:10.1109/TCSI.2022.3153288

DC Field	Value	Language
dc.contributor.author	Nguyen, Duy Thanh	-
dc.contributor.author	Je, Hyeonseung	-
dc.contributor.author	Nguyen, Tuan Nghia	-
dc.contributor.author	Ryu, Soojung	-
dc.contributor.author	Lee, Kyujoong	-
dc.contributor.author	Lee, Hyuk-Jae	-
dc.date.accessioned	2022-08-25T01:15:27Z	-
dc.date.available	2022-08-25T01:15:27Z	-
dc.date.created	2022-05-10	-
dc.date.issued	2022-06	-
dc.identifier.citation	IEEE Transactions on Circuits and Systems I: Regular Papers, Vol.69 No.6, pp.2477-2489	-
dc.identifier.issn	1549-8328	-
dc.identifier.uri	https://hdl.handle.net/10371/184432	-
dc.description.abstract	Residual blocks are a very common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152. Most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with a reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for a group of nodes, using an optimized data reuse for each residual block. The accelerator design implemented on the Xilinx KCU1500 FPGA card is 2.8x faster and 9.9x more power efficient than the NVIDIA RTX 2080 Ti for a 256x256 input size. Compared to the baseline, in which the weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining, which also "mines" the shortcut data in hardware, the proposed work reduces off-chip access for feature maps by 5.27x while accessing weights from off-chip memory exactly once.	-
dc.language	English	-
dc.publisher	Institute of Electrical and Electronics Engineers	-
dc.title	ShortcutFusion: From Tensorflow to FPGA-Based Accelerator With a Reuse-Aware Memory Allocation for Shortcut Data	-
dc.type	Article	-
dc.identifier.doi	10.1109/TCSI.2022.3153288	-
dc.citation.journaltitle	IEEE Transactions on Circuits and Systems I: Regular Papers	-
dc.identifier.wosid	000767821900001	-
dc.identifier.scopusid	2-s2.0-85125712808	-
dc.citation.endpage	2489	-
dc.citation.number	6	-
dc.citation.startpage	2477	-
dc.citation.volume	69	-
dc.description.isOpenAccess	N	-
dc.contributor.affiliatedAuthor	Lee, Hyuk-Jae	-
dc.type.docType	Article	-
dc.description.journalClass	1	-

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Journal Papers (저널논문_전기·정보공학부)

Files in This Item: There are no files associated with this item.
