Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2145816.2145831acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

A GPU implementation of inclusion-based points-to analysis

Published: 25 February 2012 Publication History

Abstract

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.
In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.
Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.

References

[1]
NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2010.
[2]
CUDA C Programming Guide 4.0. NVIDIA, 2011.
[3]
L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).
[4]
David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society.
[5]
J. Barnat, P. Bauch, L. Brim, and M. Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011.
[6]
Marc Berndl, Ondrej Lhoták, Feng Qian, Laurie Hendren, and Navindra Umanee. Points-to analysis using BDDs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 103--114, New York, NY, USA, 2003. ACM.
[7]
Ulrik Brandes and Thomas Erlebach, editors. Network Analysis: Methodological Foundations. Springer-Verlag, 2005.
[8]
Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35:677--691, 1986.
[9]
Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.
[10]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using cuda. J. Parallel Distrib. Comput., 68:1370--1380, October 2008.
[11]
L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993.
[12]
Manuel Fahndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 85--96, New York, NY, USA, 1998. ACM.
[13]
Ben Hardekopf and Calvin Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In Proc. Conf. on Programming Language Design and Implementation (PLDI), 2007.
[14]
Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag.
[15]
Nevin Heintze and Olivier Tardieu. Ultra-fast aliasing analysis using cla: a million lines of c code in a second. SIGPLAN Not., 36(5):254--263, 2001.
[16]
Fritz Henglein. Type inference and semi-unification. In Proceedings of the 1988 ACM conference on LISP and functional programming, LFP '88, pages 184--197, New York, NY, USA, 1988. ACM.
[17]
Michael Hind. Pointer analysis: haven't we solved this problem yet? In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 54--61, New York, NY, USA, 2001. ACM.
[18]
Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM.
[19]
Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011.
[20]
Song Huang, Shucai Xiao, and Wu chun Feng. On the energy efficiency of graphics processing units for scientific computing. In IPDPS, pages 1--8, 2009.
[21]
Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007.
[22]
Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 451--460, New York, NY, USA, 2010. ACM.
[23]
Ondrej Lhoták and Laurie Hendren. Scaling Java points-to analysis using Spark. In G. Hedin, editor, Compiler Construction, 12th International Conference, volume 2622 of LNCS, pages 153--169, Warsaw, Poland, April 2003. Springer.
[24]
Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM.
[25]
Mario Méndez-Lojo, Augustine Mathew, and Keshav Pingali. Parallel inclusion-based points-to analysis. In Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'10), October 2010.
[26]
Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012.
[27]
Donald Nguyen and Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. In ASPLOS '11: Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.
[28]
NVIDIA. Thrust library version 1.4.0. http://code.google.com/p/thrust/.
[29]
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM.
[30]
Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM.
[31]
Thomas W. Reps. Program analysis via graph reachability. Technical Report Technical Report Number 1386, University of Wisconsin, 1998.
[32]
Atanas Rountev and Satish Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 47--56, New York, NY, USA, 2000. ACM.
[33]
Bjarne Steensgaard. Points-to analysis in almost linear time. In POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 32--41, New York, NY, USA, 1996. ACM.
[34]
Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. On the limits of gpu acceleration. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar'10, pages 13--13, Berkeley, CA, USA, 2010. USENIX Association.
[35]
John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 131--144, New York, NY, USA, 2004. ACM.
[36]
Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society.

Cited By

View all
  • (2024) Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path ConditionsACM Transactions on Software Engineering and Methodology10.1145/363274333:3(1-33)Online publication date: 24-Jan-2024
  • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
  • (2024)MIMD Programs Execution Support on SIMD Machines: A Holistic SurveyIEEE Access10.1109/ACCESS.2024.337299012(34354-34377)Online publication date: 2024
  • Show More Cited By

Recommendations

Reviews

Henk Sips

Modern computing devices like graphical processing units (GPUs) require high-performance algorithms to adhere to a few principles like regular computations on aligned data structures and the careful use of synchronization constructs. This is challenging for many current algorithms and especially for graph-based algorithms. Graphs generally exhibit an irregular connection structure that may even change during computation. This paper deals with the implementation on GPUs of a graph algorithm necessary for inclusion-based points-to analysis, an algorithm that is used within compilers to determine which variables each pointer in a program points to. It is a challenging algorithm because it is highly irregular, requires relatively little computation, and makes extensive modification of the underlying graph during the processing of that graph. The solution described in the paper is to modify the representation of the graph, modify the algorithm, and add GPU-based optimizations. Classical representation of points-to graphs may lead to excessive memory space since these graphs are very sparse. Alternative compressed sparse schemes add new edges that cannot be statically predicted, leading to efficiency problems on GPUs. The authors use wide sparse bit vectors, chosen because a good alignment exists with the resources of the GPU (such as bus and memory banks). As such, data is aligned, avoiding shared memory bank conflicts and thread divergence. The price, however, is the need for more memory to accommodate the representation, which seems to be a fair trade-off in this case. The modification of the graph algorithm relates to the parallel rule application in this algorithm. Parallel execution of the graph rewrite rules requires many fine-grained synchronizations in the graph data structure, which may lead to severe performance problems when done on a GPU. To solve these problems, the authors propose modified rewrite rules that require lower synchronization overhead by storing the reversed copy edge instead of the original copy edge in the graph. Finally, the authors describe a number of GPU-specific optimizations for the processing of these types of points-to graphs. In the section on experimental validation, the authors compare their GPU approach to an earlier derived multicore central processing unit (CPU) solution, and to a classical sequential solution. The results are that, on average, the GPU solution is slightly faster (seven times) than the multicore solution (six times), but requires 35 percent more programming effort, according to the authors (assuming that the intellectual work of rethinking the algorithm is not included). The main contribution of this paper is not so much the absolute speed that can be obtained from using a GPU with this specific algorithm, but rather that seemingly GPU-unfriendly algorithms like graph algorithms can be recast into GPU-friendly ones by careful reconsideration and redesign of the shape and representation of the data structures and the associated algorithms. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 47, Issue 8
    PPOPP '12
    August 2012
    334 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2370036
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 February 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. CUDA
  2. GPU
  3. graph algorithms
  4. inclusion-based points-to analysis
  5. irregular programs

Qualifiers

  • Research-article

Conference

PPoPP '12
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)37
  • Downloads (Last 6 weeks)1
Reflects downloads up to 11 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024) Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path ConditionsACM Transactions on Software Engineering and Methodology10.1145/363274333:3(1-33)Online publication date: 24-Jan-2024
  • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
  • (2024)MIMD Programs Execution Support on SIMD Machines: A Holistic SurveyIEEE Access10.1109/ACCESS.2024.337299012(34354-34377)Online publication date: 2024
  • (2023)AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)10.23919/DATE56975.2023.10137243(1-6)Online publication date: Apr-2023
  • (2023)A Parallel Memory Defect Detection Method based on Sparse-Value-Flow Graph2023 IEEE International Conference on Joint Cloud Computing (JCC)10.1109/JCC59055.2023.00014(57-65)Online publication date: Jul-2023
  • (2023)Accelerating High-Precision Vulnerability Detection in C Programs with Parallel Graph Summarization2023 6th International Conference on Software Engineering and Computer Science (CSECS)10.1109/CSECS60003.2023.10428132(1-6)Online publication date: 22-Dec-2023
  • (2023)P-DATA: A Task-Level Parallel Framework for Dependency-Aware Value Flow Taint Analysis2023 30th Asia-Pacific Software Engineering Conference (APSEC)10.1109/APSEC60848.2023.00035(249-258)Online publication date: 4-Dec-2023
  • (2022)SHARP: fast incremental context-sensitive pointer analysis for JavaProceedings of the ACM on Programming Languages10.1145/35273326:OOPSLA1(1-28)Online publication date: 29-Apr-2022
  • (2021)Context-Sensitive Data Race Detection for Concurrent ProgramsIEEE Access10.1109/ACCESS.2021.30558319(20861-20867)Online publication date: 2021
  • (2021)A parallel worklist algorithm and its exploration heuristics for static modular analysesJournal of Systems and Software10.1016/j.jss.2021.111042181:COnline publication date: 1-Nov-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media