research-article
Open access
Just Accepted

PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration

Online AM: 26 November 2024

Abstract

Graph processing is pivotal in deriving insights from complex data structures but faces performance limitations due to the irregular nature of graphs. Traditional general-purpose processors often struggle with low instruction-level parallelism and energy inefficiency when handling graph data. In response, modern graph accelerators have embraced an intra-edge-parallel model to enhance parallelization, significantly outperforming conventional processors. However, the indiscriminate processing of edges in existing systems results in substantial computational redundancy, negatively impacting overall efficiency.
This paper introduces PRAGA, an innovative graph accelerator designed to optimize efficiency by selectively processing edges that significantly contribute to final results while preserving high computational parallelism. PRAGA utilizes an intra-edge-sequential model, prioritizing edge processing to capitalize on coarse-grained vertex-level parallelism and minimize unnecessary computations. It incorporates a hot-value manager to alleviate network-on-chip congestion and a memory-aware coalescer to minimize redundant data accesses. Our experimental results, obtained using a Xilinx Alveo U280 FPGA accelerator card, demonstrate that PRAGA achieves speedups of 17.88× and 5.86× over the state-of-the-art accelerators ScalaGraph and GraphDynS, respectively, and outperforms the advanced GPU-based system Gunrock by 22.52× on average. This substantial improvement underscores PRAGA’s potential to redefine performance benchmarks in graph processing.
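
As a rough software illustration of why processing order affects redundancy, the sketch below runs single-source shortest paths twice on a toy graph: once with a first-in-first-out worklist and once with a priority queue. It is not PRAGA's mechanism, which the abstract does not detail; the graph, the function names, and the choice of shortest paths as the workload are all assumptions made for this example.

```python
import heapq
from collections import deque

# Hypothetical toy graph: vertex -> list of (neighbor, edge weight).
graph = {
    "a": [("b", 10), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 1)],
    "d": [],
}

def sssp_fifo(graph, source):
    """Shortest paths with a FIFO worklist: vertices are processed in arrival order."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    work = deque([source])
    edges_processed = 0
    while work:
        u = work.popleft()
        for v, w in graph[u]:
            edges_processed += 1
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                work.append(v)  # v may be re-processed later with a better value
    return dist, edges_processed

def sssp_priority(graph, source):
    """Shortest paths with a priority queue: the closest pending vertex goes first."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    edges_processed = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry: a better distance was already settled
        for v, w in graph[u]:
            edges_processed += 1
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist, edges_processed

fifo_dist, fifo_edges = sssp_fifo(graph, "a")
prio_dist, prio_edges = sssp_priority(graph, "a")
print(fifo_edges, prio_edges)
```

Both versions compute the same distances, but on this toy graph the FIFO version processes 5 edges and the prioritized version 4, because the FIFO order propagates a provisional distance for `b` that is later overwritten. The gap is tiny here and depends on the graph; the example only shows the kind of redundant edge work that prioritization can avoid.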

      Published In

      ACM Transactions on Architecture and Code Optimization, Just Accepted
      EISSN: 1544-3973
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Online AM: 26 November 2024
      Accepted: 26 August 2024
      Revised: 10 July 2024
      Received: 08 March 2024

      Author Tags

      Graph processing; accelerators; priority-aware; redundancy; parallelism
