research-article

A GPU implementation of inclusion-based points-to analysis

Authors:

Mario Mendez-Lojo,

Martin Burtscher,

Keshav PingaliAuthors Info & Claims

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

Pages 107 - 116

https://doi.org/10.1145/2145816.2145831

Published: 25 February 2012 Publication History

Get Access

Abstract

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.

In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.

References

[1]

NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2010.

Google Scholar

[2]

CUDA C Programming Guide 4.0. NVIDIA, 2011.

Google Scholar

[3]

L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).

Google Scholar

[4]

David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society.

Digital Library

Google Scholar

[5]

J. Barnat, P. Bauch, L. Brim, and M. Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011.

Digital Library

Google Scholar

[6]

Marc Berndl, Ondrej Lhoták, Feng Qian, Laurie Hendren, and Navindra Umanee. Points-to analysis using BDDs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 103--114, New York, NY, USA, 2003. ACM.

Digital Library

Google Scholar

[7]

Ulrik Brandes and Thomas Erlebach, editors. Network Analysis: Methodological Foundations. Springer-Verlag, 2005.

Digital Library

Google Scholar

[8]

Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35:677--691, 1986.

Digital Library

Google Scholar

[9]

Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.

Crossref

Google Scholar

[10]

Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using cuda. J. Parallel Distrib. Comput., 68:1370--1380, October 2008.

Digital Library

Google Scholar

[11]

L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993.

Digital Library

Google Scholar

[12]

Manuel Fahndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 85--96, New York, NY, USA, 1998. ACM.

Digital Library

Google Scholar

[13]

Ben Hardekopf and Calvin Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In Proc. Conf. on Programming Language Design and Implementation (PLDI), 2007.

Digital Library

Google Scholar

[14]

Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag.

Digital Library

Google Scholar

[15]

Nevin Heintze and Olivier Tardieu. Ultra-fast aliasing analysis using cla: a million lines of c code in a second. SIGPLAN Not., 36(5):254--263, 2001.

Digital Library

Google Scholar

[16]

Fritz Henglein. Type inference and semi-unification. In Proceedings of the 1988 ACM conference on LISP and functional programming, LFP '88, pages 184--197, New York, NY, USA, 1988. ACM.

Digital Library

Google Scholar

[17]

Michael Hind. Pointer analysis: haven't we solved this problem yet? In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 54--61, New York, NY, USA, 2001. ACM.

Digital Library

Google Scholar

[18]

Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM.

Digital Library

Google Scholar

[19]

Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011.

Digital Library

Google Scholar

[20]

Song Huang, Shucai Xiao, and Wu chun Feng. On the energy efficiency of graphics processing units for scientific computing. In IPDPS, pages 1--8, 2009.

Digital Library

Google Scholar

[21]

Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007.

Digital Library

Google Scholar

[22]

Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 451--460, New York, NY, USA, 2010. ACM.

Digital Library

Google Scholar

[23]

Ondrej Lhoták and Laurie Hendren. Scaling Java points-to analysis using Spark. In G. Hedin, editor, Compiler Construction, 12th International Conference, volume 2622 of LNCS, pages 153--169, Warsaw, Poland, April 2003. Springer.

Digital Library

Google Scholar

[24]

Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM.

Digital Library

Google Scholar

[25]

Mario Méndez-Lojo, Augustine Mathew, and Keshav Pingali. Parallel inclusion-based points-to analysis. In Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'10), October 2010.

Digital Library

Google Scholar

[26]

Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012.

Digital Library

Google Scholar

[27]

Donald Nguyen and Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. In ASPLOS '11: Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.

Digital Library

Google Scholar

[28]

NVIDIA. Thrust library version 1.4.0. http://code.google.com/p/thrust/.

Google Scholar

[29]

Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM.

Digital Library

Google Scholar

[30]

Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM.

Digital Library

Google Scholar

[31]

Thomas W. Reps. Program analysis via graph reachability. Technical Report Technical Report Number 1386, University of Wisconsin, 1998.

Crossref

Google Scholar

[32]

Atanas Rountev and Satish Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 47--56, New York, NY, USA, 2000. ACM.

Digital Library

Google Scholar

[33]

Bjarne Steensgaard. Points-to analysis in almost linear time. In POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 32--41, New York, NY, USA, 1996. ACM.

Digital Library

Google Scholar

[34]

Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. On the limits of gpu acceleration. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar'10, pages 13--13, Berkeley, CA, USA, 2010. USENIX Association.

Digital Library

Google Scholar

[35]

John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 131--144, New York, NY, USA, 2004. ACM.

Digital Library

Google Scholar

[36]

Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society.

Digital Library

Google Scholar

Cited By

View all

Tang WDong DLi SWang CYao PZhou JZhang C(2024) Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path ConditionsACM Transactions on Software Engineering and Methodology10.1145/363274333:3(1-33)Online publication date: 24-Jan-2024
https://dl.acm.org/doi/10.1145/3632743
Shobaki GMuyan-Özçelik PHutton JLinck BMalyshenko VKerbow ARamirez-Ortega RGordon V(2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444869
Mustafa DAlkhasawneh RObeidat FShatnawi A(2024)MIMD Programs Execution Support on SIMD Machines: A Holistic SurveyIEEE Access10.1109/ACCESS.2024.337299012(34354-34377)Online publication date: 2024
https://doi.org/10.1109/ACCESS.2024.3372990
Show More Cited By

Index Terms

A GPU implementation of inclusion-based points-to analysis
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

A GPU implementation of inclusion-based points-to analysis
PPOPP '12

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
Out-of-core implementation for accelerator kernels on heterogeneous clouds

Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to ...
Architecture-Aware Mapping and Optimization on a 1600-Core GPU
ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems

The graphics processing unit (GPU) continues to make in-roads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task, it is a multi-...

Reviews

Reviewer: Henk Sips

Modern computing devices like graphical processing units (GPUs) require high-performance algorithms to adhere to a few principles like regular computations on aligned data structures and the careful use of synchronization constructs. This is challenging for many current algorithms and especially for graph-based algorithms. Graphs generally exhibit an irregular connection structure that may even change during computation. This paper deals with the implementation on GPUs of a graph algorithm necessary for inclusion-based points-to analysis, an algorithm that is used within compilers to determine which variables each pointer in a program points to. It is a challenging algorithm because it is highly irregular, requires relatively little computation, and makes extensive modification of the underlying graph during the processing of that graph. The solution described in the paper is to modify the representation of the graph, modify the algorithm, and add GPU-based optimizations. Classical representation of points-to graphs may lead to excessive memory space since these graphs are very sparse. Alternative compressed sparse schemes add new edges that cannot be statically predicted, leading to efficiency problems on GPUs. The authors use wide sparse bit vectors, chosen because a good alignment exists with the resources of the GPU (such as bus and memory banks). As such, data is aligned, avoiding shared memory bank conflicts and thread divergence. The price, however, is the need for more memory to accommodate the representation, which seems to be a fair trade-off in this case. The modification of the graph algorithm relates to the parallel rule application in this algorithm. Parallel execution of the graph rewrite rules requires many fine-grained synchronizations in the graph data structure, which may lead to severe performance problems when done on a GPU. To solve these problems, the authors propose modified rewrite rules that require lower synchronization overhead by storing the reversed copy edge instead of the original copy edge in the graph. Finally, the authors describe a number of GPU-specific optimizations for the processing of these types of points-to graphs. In the section on experimental validation, the authors compare their GPU approach to an earlier derived multicore central processing unit (CPU) solution, and to a classical sequential solution. The results are that, on average, the GPU solution is slightly faster (seven times) than the multicore solution (six times), but requires 35 percent more programming effort, according to the authors (assuming that the intellectual work of rethinking the algorithm is not included). The main contribution of this paper is not so much the absolute speed that can be obtained from using a GPU with this specific algorithm, but rather that seemingly GPU-unfriendly algorithms like graph algorithms can be recast into GPU-friendly ones by careful reconsideration and redesign of the shape and representation of the data structures and the associated algorithms. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

February 2012

352 pages

ISBN:9781450311601

DOI:10.1145/2145816

General Chair:
J. Ramanujam
Louisiana State University, USA
,
Program Chair:
P. Sadayappan
The Ohio State University, USA

ACM SIGPLAN Notices Volume 47, Issue 8
PPOPP '12
August 2012
334 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2370036
Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 February 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PPoPP '12

Sponsor:

SIGPLAN

PPoPP '12: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 25 - 29, 2012

Louisiana, New Orleans, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

98
Total Citations
View Citations
1,005
Total Downloads

Downloads (Last 12 months)37
Downloads (Last 6 weeks)1

Reflects downloads up to 11 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Tang WDong DLi SWang CYao PZhou JZhang C(2024) Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path ConditionsACM Transactions on Software Engineering and Methodology10.1145/363274333:3(1-33)Online publication date: 24-Jan-2024
https://dl.acm.org/doi/10.1145/3632743
Shobaki GMuyan-Özçelik PHutton JLinck BMalyshenko VKerbow ARamirez-Ortega RGordon V(2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444869
Mustafa DAlkhasawneh RObeidat FShatnawi A(2024)MIMD Programs Execution Support on SIMD Machines: A Holistic SurveyIEEE Access10.1109/ACCESS.2024.337299012(34354-34377)Online publication date: 2024
https://doi.org/10.1109/ACCESS.2024.3372990
Zhang CScheffler PBenz TPerotti MBenini L(2023)AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)10.23919/DATE56975.2023.10137243(1-6)Online publication date: Apr-2023
https://doi.org/10.23919/DATE56975.2023.10137243
Xu RMao XChen LYu Y(2023)A Parallel Memory Defect Detection Method based on Sparse-Value-Flow Graph2023 IEEE International Conference on Joint Cloud Computing (JCC)10.1109/JCC59055.2023.00014(57-65)Online publication date: Jul-2023
https://doi.org/10.1109/JCC59055.2023.00014
Xu RMao XXiao W(2023)Accelerating High-Precision Vulnerability Detection in C Programs with Parallel Graph Summarization2023 6th International Conference on Software Engineering and Computer Science (CSECS)10.1109/CSECS60003.2023.10428132(1-6)Online publication date: 22-Dec-2023
https://doi.org/10.1109/CSECS60003.2023.10428132
Xu RMao XChen L(2023)P-DATA: A Task-Level Parallel Framework for Dependency-Aware Value Flow Taint Analysis2023 30th Asia-Pacific Software Engineering Conference (APSEC)10.1109/APSEC60848.2023.00035(249-258)Online publication date: 4-Dec-2023
https://doi.org/10.1109/APSEC60848.2023.00035
Liu BHuang J(2022)SHARP: fast incremental context-sensitive pointer analysis for JavaProceedings of the ACM on Programming Languages10.1145/35273326:OOPSLA1(1-28)Online publication date: 29-Apr-2022
https://dl.acm.org/doi/10.1145/3527332
Zhang YLiu HQiao L(2021)Context-Sensitive Data Race Detection for Concurrent ProgramsIEEE Access10.1109/ACCESS.2021.30558319(20861-20867)Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3055831
Stiévenart QVan Es NVan der Plas JDe Roover C(2021)A parallel worklist algorithm and its exploration heuristics for static modular analysesJournal of Systems and Software10.1016/j.jss.2021.111042181:COnline publication date: 1-Nov-2021
https://dl.acm.org/doi/10.1016/j.jss.2021.111042
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Cited By

Index Terms

Recommendations