research-article

Open access

Just Accepted

PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration

Authors:

Jingling Xue

ACM Transactions on Architecture and Code Optimization

Accepted on 26 August 2024

https://doi.org/10.1145/3701998

Online AM: 26 November 2024

Abstract

Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy inefficiency when handling graph data. In response, modern graph accelerators have embraced an intra-edge-parallel model to enhance parallelization, significantly outperforming conventional processors. However, the indiscriminate processing of edges in existing systems results in substantial computational redundancy, negatively impacting overall efficiency.

This paper introduces PRAGA, a graph accelerator that improves efficiency by selectively processing the edges that contribute significantly to the final results while preserving high computational parallelism. PRAGA uses an intra-edge-sequential model that prioritizes edge processing to exploit coarse-grained vertex-level parallelism and minimize unnecessary computation. It incorporates a hot-value manager to alleviate network-on-chip congestion and a memory-aware coalescer to minimize redundant data accesses. Our experimental results, obtained on a Xilinx Alveo U280 FPGA accelerator card, show that PRAGA achieves speedups of 17.88× and 5.86× over the state-of-the-art accelerators ScalaGraph and GraphDynS, respectively, and outperforms the advanced GPU-based system Gunrock by 22.52× on average. This substantial improvement underscores PRAGA’s potential to redefine performance benchmarks in graph processing.
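The redundancy argument in the abstract can be illustrated with a small software analogy. The sketch below is not PRAGA's design: it assumes single-source shortest paths as the workload and the tentative distance as the priority, and the names `sssp_fifo`, `sssp_priority`, and `GRAPH` are hypothetical. It only counts how many edge relaxations an order-agnostic worklist performs compared with a priority-ordered one; the hot-value manager and the memory-aware coalescer are not modeled.

```python
import heapq
from collections import deque

# Toy directed graph (hypothetical): vertex -> list of (neighbor, weight).
GRAPH = {
    0: [(1, 10), (2, 1)],
    1: [(3, 1)],
    2: [(1, 1)],
    3: [(4, 1)],
    4: [],
}


def sssp_fifo(graph, source):
    """Worklist SSSP that processes active vertices in arrival order."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    work = deque([source])
    relaxations = 0
    while work:
        u = work.popleft()
        for v, w in graph[u]:
            relaxations += 1  # every outgoing edge is examined
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                work.append(v)  # v may be re-activated several times
    return dist, relaxations


def sssp_priority(graph, source):
    """Same relaxation rule, but the smallest tentative distance goes first."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    relaxations = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry: a better distance was already processed
        for v, w in graph[u]:
            relaxations += 1
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist, relaxations


if __name__ == "__main__":
    fifo_dist, fifo_count = sssp_fifo(GRAPH, 0)
    prio_dist, prio_count = sssp_priority(GRAPH, 0)
    print("same result:", fifo_dist == prio_dist)
    print("relaxations (FIFO vs priority):", fifo_count, prio_count)
```

Both functions converge to the same distances on this toy graph, but the priority-ordered version examines fewer edges because vertices are rarely expanded with a distance that is later overwritten. How PRAGA defines and enforces priority in hardware is described in the paper itself, not here.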

Index Terms

1. Computer systems organization
  1. Architectures
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs


Published In

ACM Transactions on Architecture and Code Optimization Just Accepted

EISSN:1544-3973

Copyright © 2024 Copyright held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 26 November 2024

Accepted: 26 August 2024

Revised: 10 July 2024

Received: 08 March 2024

Author Tags

Graph processing; accelerators; priority-aware; redundancy; parallelism

Qualifiers

Research-article
