research-article

Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS

Authors:

Deming Chen

2021 58th ACM/IEEE Design Automation Conference (DAC)

Pages 937 - 942

https://doi.org/10.1109/DAC18074.2021.9586184

Published: 05 December 2021

Abstract

FPGAs are emerging as computing infrastructure for accelerating applications in datacenters, and high-level synthesis (HLS) tools have been proposed to ease their programming. Even with HLS, irregular data-intensive applications require explicit optimizations. A common one is to instantiate multiple processing elements (PEs), each owning a private BRAM-based buffer, so that multiple data items are processed per cycle. Data routing, which dynamically dispatches data items to their designated PEs, avoids the data replication in buffers that statically assigning data to PEs entails, and hence saves BRAM. However, workload imbalance among PEs severely degrades performance on skewed datasets. In this paper, we propose a skew-oblivious data routing architecture that allocates secondary PEs and schedules them at run-time to share the workload of overloaded PEs. In addition, we integrate the proposed architecture into a framework called Ditto to minimize the development effort for applications that require skew handling. We evaluate Ditto on five commonly used applications: histogram building, data partitioning, PageRank, heavy hitter detection, and HyperLogLog. The results demonstrate that the generated implementations are robust to skewed datasets and outperform state-of-the-art designs in both throughput and BRAM usage efficiency.
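The effect of skew on routed PEs, and the benefit of letting secondary PEs share an overloaded PE's work, can be illustrated with a small software model. The sketch below is not the Ditto architecture or its HLS code: the function name `build_histogram`, the modulo routing rule, the round-robin sharing, and the offline greedy helper assignment (the paper schedules secondary PEs at run-time) are all assumptions made for illustration. Each worker keeps a private counter, standing in for a private BRAM buffer, and the partial histograms are merged at the end.

```python
from collections import Counter


def build_histogram(keys, num_primary, num_secondary=0):
    """Toy model: route keys to PEs and return (histogram, per-worker load).

    Assumptions (hypothetical, not taken from the paper):
      * a key is routed to primary PE `key % num_primary`;
      * secondary PEs are assigned greedily, offline, to the primary PE
        with the highest load per worker;
      * tuples for a shared PE are spread round-robin over its workers.
    """
    owner_load = Counter(k % num_primary for k in keys)

    # group[p] lists the worker ids that serve tuples routed to primary p.
    group = {p: [p] for p in range(num_primary)}
    for s in range(num_secondary):
        hot = max(group, key=lambda p: owner_load[p] / len(group[p]))
        group[hot].append(num_primary + s)

    num_workers = num_primary + num_secondary
    buffers = [Counter() for _ in range(num_workers)]  # private buffers
    load = [0] * num_workers
    seen = Counter()  # tuples seen so far per primary, for round-robin

    for k in keys:
        p = k % num_primary
        workers = group[p]
        w = workers[seen[p] % len(workers)]
        seen[p] += 1
        buffers[w][k] += 1
        load[w] += 1

    # Merge the partial histograms held in the private buffers.
    merged = Counter()
    for b in buffers:
        merged.update(b)
    return merged, load


# A skewed input: key 0 accounts for 90 of 100 tuples.
keys = [0] * 90 + list(range(1, 11))
_, load_plain = build_histogram(keys, num_primary=4)
_, load_shared = build_histogram(keys, num_primary=4, num_secondary=2)
print(max(load_plain), max(load_shared))  # 92 without helpers, 31 with two
```

In this model the busiest worker bounds the completion time, so lowering the maximum load from 92 to 31 tuples is what recovers throughput, at the cost of two extra buffers and a final merge.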

Cited By

Chen X, Cheng F, Tan H, Chen Y, He B, Wong W, Chen D (2022). ThunderGP: Resource-Efficient Graph Processing Framework on FPGAs with HLS. ACM Transactions on Reconfigurable Technology and Systems 15:4 (1-31). Online publication date: 9-Dec-2022.
https://dl.acm.org/doi/10.1145/3517141

Index Terms

Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS
1. Computer systems organization
2. Hardware
  1. Electronic design automation

Index terms have been assigned to the content through auto-classification.

Published In

2021 58th ACM/IEEE Design Automation Conference (DAC)

Dec 2021

1380 pages

Copyright © 2021.

Publisher

IEEE Press
