Morph algorithms on GPUs

Authors: Martin Burtscher, Keshav Pingali

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 147 - 156

https://doi.org/10.1145/2442516.2442531

Published: 23 February 2013

Abstract

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs.

In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges.
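To make the notion of a morph algorithm concrete, here is a minimal serial sketch of Boruvka's algorithm in Python. It is not the paper's GPU implementation: the function name `boruvka_mst`, the edge-list input format, and the use of a union-find array to stand in for component contraction are all assumptions made for illustration. Each round picks the lightest edge leaving every component and then merges the components those edges connect, so the effective graph shrinks as the algorithm runs.

```python
def boruvka_mst(num_nodes, edges):
    """Serial Boruvka sketch (hypothetical helper, not the paper's code).

    edges is a list of (u, v, weight) tuples over nodes 0..num_nodes-1.
    Returns (total_weight, chosen_edges). For a disconnected graph the
    result is a minimum spanning forest.
    """
    parent = list(range(num_nodes))  # union-find: models merged components

    def find(x):
        # Follow parent links to the component representative,
        # halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    total = 0
    components = num_nodes
    while components > 1:
        # Step 1: for every component, find the lightest outgoing edge.
        # Ties are broken by edge index so the choice is a strict order.
        cheapest = {}
        for i, (u, v, w) in enumerate(edges):
            cu, cv = find(u), find(v)
            if cu == cv:
                continue  # edge is internal to a component
            for c in (cu, cv):
                best = cheapest.get(c)
                if best is None or (w, i) < (edges[best][2], best):
                    cheapest[c] = i
        if not cheapest:
            break  # no edges leave any component: graph is disconnected
        # Step 2: contract the selected edges (this is the "morph" step).
        for i in set(cheapest.values()):
            u, v, w = edges[i]
            cu, cv = find(u), find(v)
            if cu != cv:
                parent[cu] = cv
                chosen.append((u, v, w))
                total += w
                components -= 1
    return total, chosen
```

Calling `boruvka_mst(4, [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 4), (0, 2, 5)])` selects the three lightest edges in a single round. A parallel version would have to perform the per-component minimum search and the merges concurrently, which is where the challenges mentioned above arise.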

We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs.
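The abstract does not say how conflicts are detected, so the following is only a generic sketch of one common approach, written in serial Python. The function name `select_non_conflicting`, the dictionary input, and the highest-id-wins priority rule are assumptions for illustration. Each work item names the graph elements it would modify; items first mark those elements, then check whether they still own all of them, and only the owners of a complete neighborhood proceed in the current round.

```python
def select_non_conflicting(neighborhoods):
    """Two-phase mark-and-check sketch (hypothetical, not the paper's scheme).

    neighborhoods maps a non-negative integer work-item id to the set of
    graph element ids that item would modify. Returns the sorted ids of
    the items that may run together in this round without overlapping.
    """
    owner = {}
    # Phase 1 (marking): every item tries to claim each of its elements.
    # The highest item id wins. On a GPU this could be an atomic max per
    # element; here it is a plain sequential loop.
    for item, elems in neighborhoods.items():
        for e in elems:
            if owner.get(e, -1) < item:
                owner[e] = item
    # Phase 2 (checking): an item proceeds only if it owns its whole
    # neighborhood; the others back off and retry in a later round.
    winners = [
        item
        for item, elems in neighborhoods.items()
        if all(owner[e] == item for e in elems)
    ]
    return sorted(winners)
```

With the input `{0: {1, 2}, 1: {2, 3}, 2: {4, 5}, 3: {5, 6}}`, items 1 and 3 win because their element sets are disjoint, while items 0 and 2 each lose one element to a higher id. Because the item with the highest id always wins its entire neighborhood, at least one item makes progress in every round.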

This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.

Index Terms

Morph algorithms on GPUs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2013

332 pages

ISBN:9781450319225

DOI:10.1145/2442516

General Chairs: Alex Nicolau (University of California, Irvine, USA), Xiaowei Shen (IBM Research, China)

Program Chairs: Saman Amarasinghe (Massachusetts Institute of Technology, USA), Richard Vuduc (Georgia Institute of Technology, USA)

ACM SIGPLAN Notices Volume 48, Issue 8
PPoPP '13
August 2013
309 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2517327

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PPoPP '13

Sponsor:

SIGPLAN

PPoPP '13: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 23 - 27, 2013

Shenzhen, China

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
