research-article

CoroGraph: Bridging Cache Efficiency and Work Efficiency for Graph Algorithm Execution

Authors:

Minqi ZhouAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 17, Issue 4

Pages 891 - 903

https://doi.org/10.14778/3636218.3636240

Published: 05 March 2024 Publication History

Abstract

Many systems are designed to run graph algorithms efficiently in memory but they achieve only cache efficiency or work efficiency. We tackle this fundamental trade-off in existing systems by designing CoroGraph, a system that attains both cache efficiency and work efficiency for in-memory graph processing. CoroGraph adopts a novel hybrid execution model, which generates update messages at vertex granularity to prioritize promising vertices for work efficiency, and commits updates at partition granularity to share data access for cache efficiency. To overlap the random memory access of graph algorithms with computation, CoroGraph extensively uses coroutine, i.e., a lightweight function in C++ that can yield and resume with low overhead, to prefetch the required data. A suite of designs are incorporated to reap the full benefits of coroutine, which include prefetch pipeline, cache-friendly graph format, and stop-free synchronization. We compare CoroGraph with five state-of-the-art graph algorithm systems via extensive experiments. The results show that CoroGraph yields shorter algorithm execution time than all baselines in 18 out of 20 cases, and its speedup over the best-performing baseline can be over 2x. Detailed profiling suggests that CoroGraph achieves both cache efficiency and work efficiency with a low memory stall and a small number of processed edges.

References

[1]

Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: a survey. Data mining and knowledge discovery 29, 3 (2015), 626--688.

Digital Library

[2]

J Alvarez-Hamelin, Luca Dall'Asta, Alain Barrat, and Alessandro Vespignani. 2005. Large scale networks fingerprinting and visualization using the k-core decomposition. Advances in neural information processing systems 18 (2005).

[3]

Grant Ayers, Heiner Litz, Christos Kozyrakis, and Parthasarathy Ranganathan. 2020. Classifying memory access patterns for prefetching. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 513--526.

Digital Library

[4]

Alex Beutel, Leman Akoglu, and Christos Faloutsos. 2015. Fraud detection through graph-based user behavior modeling. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1696--1697.

Digital Library

[5]

Paolo Boldi and Sebastiano Vigna. 2004. The webgraph framework I: compression techniques. In Proceedings of the 13th international conference on World Wide Web. 595--602.

Digital Library

[6]

Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, et al. 2013. TAO:Facebook's Distributed Data Store for the Social Graph. In 2013 USENIX Annual Technical Conference (USENIX ATC 13). 49--60.

[7]

Yukuo Cen, Jing Zhang, Gaofei Wang, Yujie Qian, Chuizheng Meng, Zonghong Dai, Hongxia Yang, and Jie Tang. 2019. Trust relationship prediction in alibaba E-commerce platform. IEEE Transactions on Knowledge and Data Engineering 32, 5 (2019), 1024--1035.

[8]

Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 442--446.

[9]

Shimin Chen, Anastassia Ailamaki, Phillip B Gibbons, and Todd C Mowry. 2007. Improving hash join performance through prefetching. ACM Transactions on Database Systems (TODS) 32, 3 (2007), 17--es.

Digital Library

[10]

Intel Coorporation. 2016. Intel 64 and IA-32 architectures optimization reference manual.

[11]

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to algorithms. MIT press.

[12]

Laxman Dhulipala, Guy Blelloch, and Julian Shun. 2017. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. 293--304.

Digital Library

[13]

David Easley and Jon Kleinberg. 2010. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge university press.

Digital Library

[14]

Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. 2018. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 world wide web conference. 1775--1784.

Digital Library

[15]

Stephen Eubank, Hasan Guclu, VS Anil Kumar, Madhav V Marathe, Aravind Srinivasan, Zoltan Toroczkai, and Nan Wang. 2004. Modelling disease outbreaks in realistic urban social networks. Nature 429, 6988 (2004), 180--184.

[16]

Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel computation on natural graphs. In 10th USENIX symposium on operating systems design and implementation (OSDI 12). 17--30.

[17]

Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX symposium on operating systems design and implementation (OSDI 14). 599--613.

Digital Library

[18]

Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making pull-based graph processing performant. ACM SIGPLAN Notices 53, 1 (2018), 246--260.

Digital Library

[19]

Yongjun He, Jiacheng Lu, and Tianzheng Wang. 2021. Coroutine-Oriented MainMemory Database Engine. Proceedings of the VLDB Endowment 14, 3 (2021), 431--444.

Digital Library

[20]

Trey Ideker, Owen Ozier, Benno Schwikowski, and Andrew F Siegel. 2002. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18, suppl_1 (2002), S233--S240.

[21]

ISO/IEC. 2017. Technical Specifcation --- C++ Extensions for Coroutines. https://www.iso.org/standard/73008.html

[22]

Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltan N Oltvai, and A-L Barabási. 2000. The large-scale organization of metabolic networks. Nature 407, 6804 (2000), 651--654.

[23]

Christopher Jonathan, Umar Farooq Minhas, James Hunter, Justin Levandoski, and Gor Nishanov. 2018. Exploiting coroutines to attack the" killer nanoseconds". Proceedings of the VLDB Endowment 11, 11 (2018), 1702--1714.

Digital Library

[24]

Juno Kim and Steven Swanson. 2022. Blaze: fast graph processing on fast SSDs. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--15.

[25]

Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous memory access chaining. Proceedings of the VLDB Endowment 9, 4 (2015), 252--263.

Digital Library

[26]

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30--37.

Digital Library

[27]

Jérôme Kunegis. 2013. Konect: the koblenz network collection. In Proceedings of the 22nd international conference on world wide web. 1343--1350.

Digital Library

[28]

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web. 591--600.

Digital Library

[29]

Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi:Large-Scale Graph Computation on Just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 31--46.

[30]

Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2019. GPOP: A cache and memory-efficient framework for graph processing over partitions. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 393--394.

Digital Library

[31]

Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2020. Gpop: A scalable cache-and memory-efficient framework for graph processing over parts. ACM Transactions on Parallel Computing (TOPC) 7, 1 (2020), 1--24.

Digital Library

[32]

Kartik Lakhotia, Rajgopal Kannan, and Viktor Prasanna. 2018. Accelerating PageRank using Partition-Centric Processing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 427--440.

Digital Library

[33]

Shengliang Lu, Shixuan Sun, Johns Paul, Yuchen Li, and Bingsheng He. 2021. Cache-efficient fork-processing patterns on large graphs. In Proceedings of the 2021 International Conference on Management of Data. 1208--1221.

Digital Library

[34]

Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 135--146.

Digital Library

[35]

Prashanth Menon, Todd C Mowry, and Andrew Pavlo. 2017. Relaxed operator fusion for in-memory databases: Making compilation, vectorization, and prefetching work together at last. Proceedings of the VLDB Endowment 11, 1 (2017), 1--13.

Digital Library

[36]

Ulrich Meyer and Peter Sanders. 2003. Δ-stepping: a parallelizable shortest path algorithm. Journal of Algorithms 49, 1 (2003), 114--152.

Digital Library

[37]

Michael Mitzenmacher and Eli Upfal. 2017. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press.

[38]

Jan Mühlig and Jens Teubner. 2021. MxTasks: How to Make Efficient Synchronization and Prefetching Easy. In Proceedings of the 2021 International Conference on Management of Data. 1331--1344.

Digital Library

[39]

Vikram Narayanan, David Detweiler, Tianjiao Huang, and Anton Burtsev. 2023. DRAMHiT: A Hash Table Architected for the Speed of DRAM. In Proceedings of the Eighteenth European Conference on Computer Systems. 817--834.

Digital Library

[40]

Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A lightweight infrastructure for graph analytics. In Proceedings of the twenty-fourth ACM symposium on operating systems principles. 456--471.

Digital Library

[41]

Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2017. Interleaving with coroutines: a practical approach for robust index joins. Proceedings of the VLDB Endowment 11, CONF (2017), 230--242.

Digital Library

[42]

Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 472--488.

Digital Library

[43]

Stephen B Seidman. 1983. Network structure and minimum degree. Social networks 5, 3 (1983), 269--287.

[44]

Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity:A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 505--516.

Digital Library

[45]

Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. 2016. Corescope: Graph mining using k-core analysis---patterns, anomalies and algorithms. In 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 469--478.

[46]

Julian Shun and Guy E Blelloch. 2013. Ligra: a lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming. 135--146.

Digital Library

[47]

Shixuan Sun, Yuhang Chen, Shengliang Lu, Bingsheng He, and Yuchen Li. 2021. ThunderRW: an in-memory graph random walk engine. (2021).

[48]

Narayanan Sundaram, Nadathur Rajagopalan Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. Graphmat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (2015), 1214--1225.

Digital Library

[49]

Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. 2018. Graphit: A high-performance graph dsl. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1--30.

Digital Library

[50]

Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A Computation-Centric Distributed Graph Processing System. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 301--316.

Digital Library

[51]

Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph:Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 375--386.

Cited By

Kühn RMühlig JTeubner J(2024)How to Be Fast and Not Furious: Looking Under the Hood of CPU Cache PrefetchingProceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663451(1-10)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663451
Deng YChen SHong ZTang B(2024)How Does Software Prefetching Work on GPU Query Processing?Proceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663445(1-9)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663445

Index Terms

CoroGraph: Bridging Cache Efficiency and Work Efficiency for Graph Algorithm Execution

Index terms have been assigned to the content through auto-classification.

Recommendations

Hardware techniques to improve cache efficiency
Improving First Level Cache Efficiency for GPUs Using Dynamic Line Protection
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

A modern Graphics Processing Unit (GPU) utilizes L1 Data (L1D) caches to reduce memory bandwidth requirements and latencies. However, the L1D cache can easily be overwhelmed by many memory requests from GPU function units, which can bottleneck GPU ...
Partitioned instruction cache architecture for energy efficiency

The demand for high-performance architectures and powerful battery-operated mobile devices has accentuated the need for low-power systems. In many media and embedded applications, the memory system can consume more than 50% of the overall system energy, ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 17, Issue 4

December 2023

309 pages

ISSN:2150-8097

Editors:
Meihui Zhang
Beijing Institute of Technology
,
Cyrus Shahabi
University of Southern California

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 05 March 2024

Published in PVLDB Volume 17, Issue 4

Check for updates

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
86
Total Downloads

Downloads (Last 12 months)86
Downloads (Last 6 weeks)19

Reflects downloads up to 27 Jul 2024

Other Metrics

View Author Metrics

Citations

Cited By

Kühn RMühlig JTeubner J(2024)How to Be Fast and Not Furious: Looking Under the Hood of CPU Cache PrefetchingProceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663451(1-10)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663451
Deng YChen SHong ZTang B(2024)How Does Software Prefetching Work on GPU Query Processing?Proceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663445(1-9)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663445

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents