
SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators

Online AM: 12 July 2024

    Abstract

    This paper presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29x faster than state-of-the-art HLS methodologies.
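    As an illustration of the kind of input the abstract describes, the following is a minimal sketch of an OpenMP-annotated C kernel with irregular memory accesses. It is not taken from the paper: the function name `neighbor_sum`, the CSR array names, and the choice of the `parallel for` directive are assumptions made for this example, and the abstract does not state which OpenMP directives SPARTA accepts. The inner load `value[col_idx[e]]` is data-dependent, which is the access pattern that multithreading is meant to tolerate.

```c
#include <assert.h>

/* Hypothetical example kernel (not from the paper).
 * For every vertex v, sum a per-vertex value over v's neighbors.
 * The graph is stored in CSR form:
 *   row_ptr has n + 1 entries; the neighbors of v are
 *   col_idx[row_ptr[v]] .. col_idx[row_ptr[v + 1] - 1]. */
void neighbor_sum(int n, const int *row_ptr, const int *col_idx,
                  const int *value, int *out)
{
    assert(n >= 0);

    /* Standard OpenMP worksharing loop: iterations over v are independent,
     * so each thread can process a disjoint subset of the vertices. */
    #pragma omp parallel for
    for (int v = 0; v < n; v++) {
        int acc = 0;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            /* Irregular access: the address depends on col_idx[e]. */
            acc += value[col_idx[e]];
        }
        out[v] = acc; /* each v writes its own slot, so no synchronization is needed */
    }
}
```

    Compiled without OpenMP support, the pragma is ignored and the loop runs sequentially with the same result, which makes a kernel written this way easy to check in software before synthesis.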


    Published In

    ACM Transactions on Reconfigurable Technology and Systems Just Accepted
    ISSN:1936-7406
    EISSN:1936-7414
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Online AM: 12 July 2024
    Accepted: 03 July 2024
    Revised: 28 May 2024
    Received: 31 December 2023

    Author Tags

    1. Design automation
    2. FPGA architecture
    3. Graph algorithms
    4. Parallelism

    Qualifiers

    • Research-article
