DisGCo: A Compiler for Distributed Graph Analytics

Authors: Anchu Rajendran, V. Krishna Nandivada

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 4

Article No.: 28, Pages 1 - 26

https://doi.org/10.1145/3414469

Published: 30 September 2020

Abstract

Graph algorithms are used in a wide range of applications, and their programmability and performance have attracted considerable interest from researchers. Being able to run graph analytics programs on distributed systems is an important requirement. Green-Marl is a popular Domain Specific Language (DSL) for coding graph algorithms and is known for its simplicity. However, the existing Green-Marl compiler for distributed systems (Green-Marl to Pregel) can compile only a limited class of Green-Marl programs (those in Pregel canonical form). This severely restricts the types of parallel Green-Marl programs that can be executed on distributed systems. We present DisGCo, the first compiler to translate any general Green-Marl program into an equivalent MPI program that can run on distributed systems.

Beyond the differences in syntax, translating Green-Marl programs to MPI (SPMD/MPMD style of computation, distributed memory) presents many interesting challenges, because Green-Marl gives the programmer a unified view of the whole memory and allows parallel and serial code to be intermixed. We first present the set of challenges involved in translating Green-Marl programs to MPI and then present a systematic approach to the translation. We also present a few optimization techniques that improve the performance of the generated programs. DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that runs on distributed systems. Our preliminary evaluation of DisGCo shows that the generated programs are scalable. Further, compared to the state-of-the-art DH-Falcon compiler, which translates a subset of Falcon programs to MPI, our generated codes exhibit a geomean speedup of 17.32×.
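
The abstract does not say how DisGCo lays out graph data across processes. Purely as an illustration of the gap between a unified memory view and distributed memory, the sketch below assumes a simple block distribution of a per-vertex property over ranks and shows how a global vertex id might be mapped to an owning rank and a local index. All names here (`block_size`, `owner_of`, `local_index`, `partition`, `read_property`) are hypothetical, the ranks are simulated with plain Python lists rather than MPI, and none of this is taken from the DisGCo implementation.

```python
# Illustrative sketch only: simulates a block distribution of a per-vertex
# property across ranks. Every name below is hypothetical and is not taken
# from DisGCo; real generated code would use MPI rather than Python lists.

def block_size(num_vertices: int, num_ranks: int) -> int:
    """Number of vertices per rank, rounded up so every vertex has an owner."""
    return (num_vertices + num_ranks - 1) // num_ranks


def owner_of(v: int, num_vertices: int, num_ranks: int) -> int:
    """Rank that stores the property of global vertex id v."""
    return v // block_size(num_vertices, num_ranks)


def local_index(v: int, num_vertices: int, num_ranks: int) -> int:
    """Position of vertex v inside its owner's local array."""
    return v % block_size(num_vertices, num_ranks)


def partition(values: list, num_ranks: int) -> list:
    """Split a global property array into one local array per rank."""
    size = block_size(len(values), num_ranks)
    return [values[r * size:(r + 1) * size] for r in range(num_ranks)]


def read_property(parts: list, v: int, my_rank: int, num_vertices: int):
    """Read vertex v's property as seen from my_rank.

    Returns (value, is_remote). In a distributed-memory program a remote
    read would require communication; here it is only flagged.
    """
    num_ranks = len(parts)
    owner = owner_of(v, num_vertices, num_ranks)
    idx = local_index(v, num_vertices, num_ranks)
    return parts[owner][idx], owner != my_rank


if __name__ == "__main__":
    dist = [10, 20, 30, 40, 50, 60, 70]   # global view: one value per vertex
    parts = partition(dist, num_ranks=3)  # distributed view: one list per rank
    print(parts)
    print(read_property(parts, 4, my_rank=0, num_vertices=len(dist)))
```

In a program written against the global view, reading the property of vertex 4 is a plain array access. Under the assumed distribution, the same read from rank 0 first has to resolve the owner and is flagged as remote, which is the kind of access a translator to MPI would have to turn into explicit communication.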

Index Terms

Software and its engineering → Software notations and tools


Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 4

December 2020

430 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3427420

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2020

Accepted: 01 July 2020

Revised: 01 June 2020

Received: 01 December 2019

Published in TACO Volume 17, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

SERB CRG
NSM research
