DOI: 10.1145/1122971.1122990
Optimizing irregular shared-memory applications for distributed-memory systems

Published: 29 March 2006

Abstract

In prior work, we proposed techniques to extend the ease of shared-memory parallel programming to distributed-memory platforms through automatic translation of OpenMP programs to MPI. For irregular applications, the performance of this translation scheme is limited by the fact that accesses to shared data cannot be accurately resolved at compile time. Moreover, irregular applications with high communication-to-computation ratios pose challenges even for direct implementation on message-passing systems. In this paper, we present combined compile-time/run-time techniques for optimizing irregular shared-memory applications on message-passing systems in the context of automatic translation from OpenMP to MPI. Our transformations enable computation-communication overlap by restructuring irregular parallel loops. The compiler creates inspectors that analyze the actual data access patterns of irregular accesses at run time. This analysis is combined with compile-time analysis of regular data accesses to determine which iterations of irregular loops access non-local data. These iterations are then reordered to enable computation-communication overlap. Where the irregular access occurs inside nested loops, the loop nest is restructured. We evaluate our techniques by translating OpenMP versions of three benchmarks from two important classes of irregular applications: sparse matrix computations and molecular dynamics. We find that for these applications, on sixteen nodes, versions employing computation-communication overlap are almost twice as fast as baseline OpenMP-to-MPI versions, almost 30% faster than inspector-only versions, almost 25% faster than hand-coded versions on two of the applications, and about 9% slower on the third.


Published In

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
March 2006
258 pages
ISBN:1595931899
DOI:10.1145/1122971

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. MPI
  2. OpenMP
  3. compiler techniques
  4. computation-communication overlap
  5. iteration reordering
  6. performance

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
