Research article, DOI: 10.1145/2442516.2442531

Morph algorithms on GPUs

Published: 23 February 2013

Abstract

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, PageRank computation, and shortest-path finding. However, these algorithms do not modify the graph structure, so they are relatively easy to implement compared with general graph algorithms such as mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. Relatively little is known about how to implement morph algorithms efficiently on GPUs.
In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges.
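To make the notion of a morph algorithm concrete, the following is a minimal sequential sketch of Boruvka's algorithm in Python. It is not the GPU implementation described in the paper; the function name `boruvka_mst`, the helper `find`, and the edge-list representation are assumptions chosen for illustration. The graph "morphs" because every round contracts components: a union-find array stands in for the merged nodes.

```python
def boruvka_mst(num_nodes, edges):
    """Return (total_weight, chosen_edges) of a minimum spanning forest.

    edges is a list of (u, v, weight) tuples describing an undirected graph.
    Illustrative sequential sketch only; names are hypothetical.
    """
    parent = list(range(num_nodes))  # union-find: parent[x] == x marks a root

    def find(x):
        # Follow parent links to the component root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    total = 0
    while True:
        # Phase 1: every component selects its lightest outgoing edge.
        # Ties are broken by edge index so that the selection cannot form a cycle.
        cheapest = {}
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # both endpoints already lie in the same component
            for root in (ru, rv):
                best = cheapest.get(root)
                if best is None or (w, idx) < (edges[best][2], best):
                    cheapest[root] = idx
        if not cheapest:
            break  # no edge connects two different components any more

        # Phase 2: contract along the selected edges (the "morph" step).
        for idx in sorted(set(cheapest.values())):
            u, v, w = edges[idx]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                chosen.append((u, v, w))
                total += w
    return total, chosen
```

In this sketch the number of components at least halves in every round, which is the property that makes the algorithm attractive for parallel execution: phase 1 is independent per component, and only phase 2 changes the graph structure.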
We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs.
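The abstract does not spell out how conflicts are detected, so the following Python sketch shows one generic marking scheme under stated assumptions rather than the paper's method. Each work item (for example, a cavity in mesh refinement) claims the graph elements it wants to modify, the lowest item id wins each element, and only items that own all of their elements proceed in the current round. The function name `select_non_conflicting` and the set-based input are hypothetical; on a GPU the ownership write would typically be an atomic minimum on a per-element marker.

```python
def select_non_conflicting(neighborhoods):
    """Pick work items whose element sets can be modified concurrently.

    neighborhoods[i] is the set of element ids that item i wants to modify.
    Returns the ids of the items that own every element they touch.
    Illustrative sketch only; names are hypothetical.
    """
    # Marking phase: each element records the smallest item id that claims it.
    owner = {}
    for item_id, elems in enumerate(neighborhoods):
        for e in elems:
            if e not in owner or item_id < owner[e]:
                owner[e] = item_id

    # Checking phase: an item wins only if it owns its whole neighborhood.
    return [
        item_id
        for item_id, elems in enumerate(neighborhoods)
        if all(owner[e] == item_id for e in elems)
    ]
```

The winners are pairwise disjoint by construction, because two items that share an element cannot both own it. The scheme is conservative: an item can lose to a competitor that itself loses, so the losers are simply retried in a later round. The item with the lowest id always wins, which guarantees progress.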
This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.

Published In

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN: 9781450319225
DOI: 10.1145/2442516
  • ACM SIGPLAN Notices, Volume 48, Issue 8
    PPoPP '13
    August 2013
    309 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2517327
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. boruvka
  2. cuda
  3. delaunay mesh refinement
  4. gpu
  5. graph algorithms
  6. irregular programs
  7. minimum spanning tree
  8. morph algorithms
  9. points-to analysis
  10. survey propagation

Qualifiers

  • Research-article

Conference

PPoPP '13

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

  • (2024) PASCI: A Scalable Framework for Heterogeneous Parallel Calculation of Dynamical Electron Correlation. Proceedings of the 53rd International Conference on Parallel Processing, pages 1103-1113. DOI: 10.1145/3673038.3673098. Online publication date: 12-Aug-2024
  • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access, 12, pages 34354-34377. DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024
  • (2023) Choosing the Best Parallelization and Implementation Styles for Graph Analytics Codes: Lessons Learned from 1106 Programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. DOI: 10.1145/3581784.3607038. Online publication date: 12-Nov-2023
  • (2022) Exploiting Reuse for GPU Subgraph Enumeration. IEEE Transactions on Knowledge and Data Engineering, 34(9), pages 4231-4244. DOI: 10.1109/TKDE.2020.3035564. Online publication date: 1-Sep-2022
  • (2020) Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations. Proceedings of the 49th International Conference on Parallel Processing, pages 1-11. DOI: 10.1145/3404397.3404406. Online publication date: 17-Aug-2020
  • (2020) Groute. ACM Transactions on Parallel Computing, 7(3), pages 1-27. DOI: 10.1145/3399730. Online publication date: 21-Jun-2020
  • (2020) Thread-Level Locking for SIMT Architectures. IEEE Transactions on Parallel and Distributed Systems, 31(5), pages 1121-1136. DOI: 10.1109/TPDS.2019.2955705. Online publication date: 1-May-2020
  • (2019) New BSP/CGM algorithms for spanning trees. International Journal of High Performance Computing Applications, 33(3), pages 444-461. DOI: 10.1177/1094342018803672. Online publication date: 1-May-2019
  • (2019) Architectural Implications in Graph Processing of Accelerator with Gardenia Benchmark Suite. 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pages 1329-1339. DOI: 10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00191. Online publication date: Dec-2019
  • (2019) Optimizing Hyperplane Sweep Operations Using Asynchronous Multi-grain GPU Tasks. 2019 IEEE International Symposium on Workload Characterization (IISWC), pages 59-69. DOI: 10.1109/IISWC47752.2019.9042134. Online publication date: Dec-2019
