research-article
Open access

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Published: 22 December 2015

Abstract

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
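
To make the term "local computation algorithm" concrete, the following is a minimal sketch in plain Python, not Falcon syntax, of a topology-driven single-source shortest path (SSSP) computation. It only updates per-vertex distances and never adds or removes vertices or edges. The function name `sssp_relax`, the edge-list representation, and the sample graph are assumptions made for illustration and do not come from the paper. Non-negative edge weights are assumed.

```python
from math import inf


def sssp_relax(num_vertices, edges, source):
    """Relax every edge repeatedly until no distance changes.

    edges is a list of (u, v, weight) tuples with non-negative weights.
    The graph structure is read-only here, which is what makes this a
    local computation rather than a morph algorithm.
    """
    dist = [inf] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        # On a GPU or multicore CPU, each edge (or vertex) in this sweep
        # could be handled by a separate thread using an atomic minimum.
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return dist


# Hypothetical 4-vertex example graph.
edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
distances = sssp_relax(4, edges, 0)
print(distances)
```

A morph algorithm such as Delaunay mesh refinement differs in that the loop body would also insert or delete graph elements, so parallel iterations can conflict on the structure itself and not only on vertex attributes.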

Supplementary Material

TACO1204-54 (taco1204-54.pdf)
Slide deck associated with this paper

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 4
    January 2016
    848 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2836331
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 December 2015
    Accepted: 01 October 2015
    Revised: 01 October 2015
    Received: 01 June 2015
    Published in TACO Volume 12, Issue 4

    Author Tags

    1. CUDA
    2. GPU
    3. Graph manipulation languages
    4. OpenMP
    5. domain specific languages
    6. local computation algorithms
    7. morph algorithms
    8. multi-core CPU

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • IMPECS project of DST (Government of India)
    • MPI-SWS (Germany)

    Cited By

    • (2022) A Multi-target, Multi-paradigm DSL Compiler for Algorithmic Graph Processing. Proceedings of the 15th ACM SIGPLAN International Conference on Software Language Engineering, 2-15. DOI: 10.1145/3567512.3567513. Online publication date: 29-Nov-2022
    • (2021) iTurboGraph. Proceedings of the 2021 International Conference on Management of Data, 977-990. DOI: 10.1145/3448016.3457243. Online publication date: 9-Jun-2021
    • (2021) Efficient execution of graph algorithms on CPU with SIMD extensions. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 262-276. DOI: 10.1109/CGO51591.2021.9370326. Online publication date: 27-Feb-2021
    • (2020) Pangolin. Proceedings of the VLDB Endowment 13:8, 1190-1205. DOI: 10.14778/3389133.3389137. Online publication date: 3-May-2020
    • (2020) Custom code generation for a graph DSL. Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 51-60. DOI: 10.1145/3366428.3380772. Online publication date: 23-Feb-2020
    • (2020) Distributed Graph Analytics. Distributed Computing and Internet Technology, 3-20. DOI: 10.1007/978-3-030-36987-3_1. Online publication date: 9-Jan-2020
    • (2019) HyPar: A divide-and-conquer model for hybrid CPU–GPU graph processing. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2019.05.014. Online publication date: Jun-2019
    • (2019) Optimization in the parallelism extraction algorithm with spanning tree on a multi-GPU environment. IEEJ Transactions on Electrical and Electronic Engineering 14:6, 862-869. DOI: 10.1002/tee.22875. Online publication date: 13-Mar-2019
    • (2018) Probabilistic programming with programmable inference. ACM SIGPLAN Notices 53:4, 603-616. DOI: 10.1145/3296979.3192409. Online publication date: 11-Jun-2018
    • (2018) PMAF: an algebraic framework for static analysis of probabilistic programs. ACM SIGPLAN Notices 53:4, 513-528. DOI: 10.1145/3296979.3192408. Online publication date: 11-Jun-2018
