Publications

Detailed Information

Refine and Recycle: A Method to Increase Decompression Parallelism

DC Field / Value
dc.contributor.author: Fang, Jian
dc.contributor.author: Chen, Jianyu
dc.contributor.author: Lee, Jinho
dc.contributor.author: Al-Ars, Zaid
dc.contributor.author: Hofstee, H. Peter
dc.date.accessioned: 2024-05-02T06:01:28Z
dc.date.available: 2024-05-02T06:01:28Z
dc.date.created: 2024-04-23
dc.date.issued: 2019
dc.identifier.citation: 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019), pp. 272-280
dc.identifier.issn: 2160-0511
dc.identifier.uri: https://hdl.handle.net/10371/200539
dc.description.abstract: Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.
dc.language: English
dc.publisher: IEEE COMPUTER SOC
dc.title: Refine and Recycle: A Method to Increase Decompression Parallelism
dc.type: Article
dc.identifier.doi: 10.1109/ASAP.2019.00017
dc.citation.journaltitle: 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)
dc.identifier.wosid: 000574772800053
dc.identifier.scopusid: 2-s2.0-85072601520
dc.citation.startpage: 272
dc.citation.endpage: 280
dc.description.isOpenAccess: N
dc.contributor.affiliatedAuthor: Lee, Jinho
dc.type.docType: Proceedings Paper
dc.description.journalClass: 1
dc.subject.keywordAuthor: Snappy
dc.subject.keywordAuthor: decompression
dc.subject.keywordAuthor: FPGA
dc.subject.keywordAuthor: Acceleration
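The refine-and-recycle scheme described in the abstract can be sketched in software. The following is a toy byte-granular model, not the paper's actual hardware design: the bank count, byte interleaving across banks, and command format are illustrative assumptions. It shows the two stages: refining a copy token into single-bank commands, and a relaxed execution model that issues every command immediately and recycles copies whose source byte is not yet written, instead of stalling on the read-after-write dependency.

```python
from collections import deque

NUM_BANKS = 4  # hypothetical bank count; the paper keeps one history copy spread across BRAMs

def refine(copy_token):
    """Stage 1 (sketch): refine a (dst, src, length) copy token into
    single-bank commands.

    Bytes are assumed interleaved across banks (bank = address % NUM_BANKS),
    so each per-byte command touches exactly one bank and banks can run
    independently. A real design would emit wider, word-granular commands.
    """
    dst, src, length = copy_token
    return [('cp', dst + i, src + i) for i in range(length)]

def execute(commands, history_size=16):
    """Stage 2 (sketch): relaxed execution with recycling.

    Every command is issued immediately; a copy whose source byte has not
    been written yet (a read-after-write hazard) is recycled to the back of
    its bank's queue rather than stalling the pipeline. A valid compressed
    stream guarantees every source byte is eventually produced, so the
    loop terminates.
    """
    history = [None] * history_size               # None marks not-yet-written bytes
    queues = [deque() for _ in range(NUM_BANKS)]
    for cmd in commands:
        queues[cmd[1] % NUM_BANKS].append(cmd)    # steer by destination bank
    while any(queues):
        for q in queues:                          # one command per bank per "cycle"
            if not q:
                continue
            op, dst, src_or_byte = q.popleft()
            if op == 'lit':                       # literal: write the byte directly
                history[dst] = src_or_byte
            elif history[src_or_byte] is None:    # source data invalid (not ready)
                q.append((op, dst, src_or_byte))  # recycle instead of stalling
            else:
                history[dst] = history[src_or_byte]
    return history

# Literals write positions 1..3; the copy token re-reads them into 4..6.
# The copy command for position 4 is issued before its source exists, so it
# gets recycled once and succeeds on the next pass.
cmds = [('lit', 1, 'b'), ('lit', 2, 'c'), ('lit', 3, 'd')] + refine((4, 1, 3))
out = execute(cmds)
```

The point of recycling is that only the hazarded command is retried; the other banks keep executing in the meantime, which is what allows multiple tokens' worth of commands to proceed per cycle.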
Files in This Item:
There are no files associated with this item.

Related Researcher

  • College of Engineering
  • Department of Electrical and Computer Engineering
Research Area: AI Accelerators, Distributed Deep Learning, Neural Architecture Search


Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
