research-article

Just Accepted

SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators

Authors:

Giovanni Gozzi,

Michele Fiorito,

Claudio Barone,

Vito Giovanni Castellana,

Marco Minutoli,

Antonino Tumeo,

Fabrizio Ferrandi

ACM Transactions on Reconfigurable Technology and Systems

Accepted on 03 July 2024

https://doi.org/10.1145/3677035

Online AM: 12 July 2024

Abstract

This paper presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29x faster than state-of-the-art HLS methodologies.
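To make "OpenMP annotated C/C++ specification" concrete, the following is a minimal sketch of the kind of irregular graph kernel the abstract refers to. It is an illustration only and is not taken from the paper: the function name `count_in_degrees`, the CSR arrays `row_ptr` and `col_idx`, and the choice of the `parallel for` and `atomic` constructs are all assumptions, and this page does not state which OpenMP constructs SPARTA accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel (not from the paper): count the in-degree of
 * every vertex of a directed graph stored in CSR form.
 *   row_ptr[v] .. row_ptr[v + 1] - 1 index the outgoing edges of vertex v
 *   col_idx[e]                     is the destination vertex of edge e
 * The caller must zero-initialize in_degree before the call. */
void count_in_degrees(int num_vertices,
                      const int *row_ptr,
                      const int *col_idx,
                      unsigned *in_degree)
{
    assert(row_ptr != NULL && in_degree != NULL);

    /* Each iteration handles one source vertex. Iterations are independent
     * except for the updates to the shared in_degree counters. */
    #pragma omp parallel for
    for (int v = 0; v < num_vertices; ++v) {
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            /* Data-dependent, hard-to-predict address: this is the
             * "irregular memory access" pattern. */
            int dst = col_idx[e];

            /* Synchronization construct: make the increment atomic because
             * several threads may reach the same destination vertex. */
            #pragma omp atomic
            in_degree[dst] += 1;
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result. The indirect update through `col_idx` is the access pattern for which the abstract claims latency tolerance through multithreading.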

Index Terms

SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Published In

ACM Transactions on Reconfigurable Technology and Systems Just Accepted

ISSN:1936-7406

EISSN:1936-7414

Copyright © 2024 Copyright held by the owner/author(s).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 12 July 2024

Accepted: 03 July 2024

Revised: 28 May 2024

Received: 31 December 2023
