research-article

TriCore: parallel triangle counting on GPUs

Authors:

H. Howie Huang

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 14, Pages 1 - 12

Published: 11 November 2018

Abstract

An exact triangle counting algorithm enumerates the triangles in a graph by identifying the common neighbors of the two vertices of each edge. In this work, we present TriCore, a scalable GPU-based triangle counting system that consists of three major techniques. First, we design a binary-search-based algorithm that increases both thread parallelism and memory performance on Graphics Processing Units (GPUs), both of which are absent from prior work. Second, in contrast to prior attempts, which require multiple graph representations, i.e., compressed sparse row (CSR), edge list, and bitmap, to be present in GPU memory, TriCore evenly partitions the CSR data, distributes the partitions across all the GPUs, and uses a streaming buffer to load the edge list from CPU memory on the fly. This design enables TriCore to process graphs that are orders of magnitude larger than the GPU memory. Third, we develop a dynamic workload management technique to balance the workload across GPUs. Our evaluation demonstrates that TriCore on a single GPU can count the triangles in the billion-edge Twitter graph within 24 seconds, that is, 22X faster than the state-of-the-art CPU project, which uses CPUs that are 8X more expensive. When processing big graphs (up to 33.4 billion edges) that are ~22X larger than the memory size of a single GPU, TriCore achieves a 24X speedup when scaling from 1 to 32 GPUs.
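
To make the first technique concrete, the following is a minimal sequential sketch of edge-centric triangle counting that intersects two sorted CSR neighbor lists by binary search. It is not TriCore's GPU kernel: the function names (`build_oriented_csr`, `contains`, `count_triangles`), the low-to-high edge orientation, and the choice to take search keys from the shorter list are assumptions made here for illustration only.

```python
from bisect import bisect_left


def build_oriented_csr(num_vertices, edges):
    """Build a CSR graph that keeps each undirected edge once, as low -> high.

    Assumption for this sketch: orienting edges by vertex id lets every
    triangle be counted exactly once, at its lowest-numbered edge.
    """
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = (u, v) if u < v else (v, u)
        adj[lo].append(hi)
    row_ptr, col_idx = [0], []
    for nbrs in adj:
        col_idx.extend(sorted(set(nbrs)))  # sorted and de-duplicated
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def contains(col_idx, lo, hi, target):
    """Binary-search the sorted slice col_idx[lo:hi] for target."""
    pos = bisect_left(col_idx, target, lo, hi)
    return pos < hi and col_idx[pos] == target


def count_triangles(row_ptr, col_idx):
    """Count triangles by intersecting the neighbor lists of each edge."""
    total = 0
    for u in range(len(row_ptr) - 1):
        for e in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[e]
            a_lo, a_hi = row_ptr[u], row_ptr[u + 1]
            b_lo, b_hi = row_ptr[v], row_ptr[v + 1]
            # Take search keys from the shorter list, search the longer one.
            if a_hi - a_lo > b_hi - b_lo:
                a_lo, a_hi, b_lo, b_hi = b_lo, b_hi, a_lo, a_hi
            for k in range(a_lo, a_hi):
                if contains(col_idx, b_lo, b_hi, col_idx[k]):
                    total += 1
    return total


# Example: the complete graph on 4 vertices.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
row_ptr, col_idx = build_oriented_csr(4, k4_edges)
triangles_in_k4 = count_triangles(row_ptr, col_idx)
```

Each lookup in the innermost loop is independent of the others, which is the property that would allow a GPU version to assign individual search keys to separate threads; how TriCore actually maps this work to threads and memory is described in the paper itself, not in this sketch.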


Publication Information

Proceedings: November 2018, 932 pages

Sponsor: SIGHPC (ACM Special Interest Group on High Performance Computing)

In cooperation with: IEEE CS

Publisher: IEEE Press

Conference: SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis, November 11 - 16, 2018, Dallas, Texas
