Publications

Detailed Information

Refine and Recycle: A Method to Increase Decompression Parallelism

DC Field / Value
dc.contributor.author: Fang, Jian
dc.contributor.author: Chen, Jianyu
dc.contributor.author: Lee, Jinho
dc.contributor.author: Al-Ars, Zaid
dc.contributor.author: Hofstee, H. Peter
dc.date.accessioned: 2024-05-02T06:01:28Z
dc.date.available: 2024-05-02T06:01:28Z
dc.date.created: 2024-04-23
dc.date.issued: 2019
dc.identifier.citation: 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019), pp. 272-280
dc.identifier.issn: 2160-0511
dc.identifier.uri: https://hdl.handle.net/10371/200539
dc.description.abstract: Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.
dc.language: English
dc.publisher: IEEE COMPUTER SOC
dc.title: Refine and Recycle: A Method to Increase Decompression Parallelism
dc.type: Article
dc.identifier.doi: 10.1109/ASAP.2019.00017
dc.citation.journaltitle: 2019 IEEE 30TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP 2019)
dc.identifier.wosid: 000574772800053
dc.identifier.scopusid: 2-s2.0-85072601520
dc.citation.startpage: 272
dc.citation.endpage: 280
dc.description.isOpenAccess: N
dc.contributor.affiliatedAuthor: Lee, Jinho
dc.type.docType: Proceedings Paper
dc.description.journalClass: 1
dc.subject.keywordAuthor: Snappy
dc.subject.keywordAuthor: decompression
dc.subject.keywordAuthor: FPGA
dc.subject.keywordAuthor: Acceleration
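The refine-and-recycle scheme described in the abstract can be sketched in software. The following is a toy byte-granular model, not the paper's actual hardware design: the bank count, byte interleaving across banks, and command format are illustrative assumptions. It shows the two stages: refining a copy token into single-bank commands, and a relaxed execution model that issues every command immediately and recycles copies whose source byte is not yet written, instead of stalling on the read-after-write dependency.

```python
from collections import deque

NUM_BANKS = 4  # hypothetical bank count; the paper keeps one history copy spread across BRAMs

def refine(copy_token):
    """Stage 1 (sketch): refine a (dst, src, length) copy token into
    single-bank commands.

    Bytes are assumed interleaved across banks (bank = address % NUM_BANKS),
    so each per-byte command touches exactly one bank and banks can run
    independently. A real design would emit wider, word-granular commands.
    """
    dst, src, length = copy_token
    return [('cp', dst + i, src + i) for i in range(length)]

def execute(commands, history_size=16):
    """Stage 2 (sketch): relaxed execution with recycling.

    Every command is issued immediately; a copy whose source byte has not
    been written yet (a read-after-write hazard) is recycled to the back of
    its bank's queue rather than stalling the pipeline. A valid compressed
    stream guarantees every source byte is eventually produced, so the
    loop terminates.
    """
    history = [None] * history_size               # None marks not-yet-written bytes
    queues = [deque() for _ in range(NUM_BANKS)]
    for cmd in commands:
        queues[cmd[1] % NUM_BANKS].append(cmd)    # steer by destination bank
    while any(queues):
        for q in queues:                          # one command per bank per "cycle"
            if not q:
                continue
            op, dst, src_or_byte = q.popleft()
            if op == 'lit':                       # literal: write the byte directly
                history[dst] = src_or_byte
            elif history[src_or_byte] is None:    # source data invalid (not ready)
                q.append((op, dst, src_or_byte))  # recycle instead of stalling
            else:
                history[dst] = history[src_or_byte]
    return history

# Literals write positions 1..3; the copy token re-reads them into 4..6.
# The copy command for position 4 is issued before its source exists, so it
# gets recycled once and succeeds on the next pass.
cmds = [('lit', 1, 'b'), ('lit', 2, 'c'), ('lit', 3, 'd')] + refine((4, 1, 3))
out = execute(cmds)
```

The point of recycling is that only the hazarded command is retried; the other banks keep executing in the meantime, which is what allows multiple tokens' worth of commands to proceed per cycle.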
Files in This Item:
There are no files associated with this item.

Related Researcher

  • College of Engineering
  • Department of Electrical and Computer Engineering
Research Area: AI Accelerators, Distributed Deep Learning, Neural Architecture Search


Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
