DOI: 10.1145/1787275.1787323

Hybrid parallel programming with MPI and Unified Parallel C

Published: 17 May 2010

Abstract

The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications.
In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access (RA) benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance: with hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33, and with groups that span four cluster nodes, Barnes-Hut achieves a twofold speedup at the cost of a 2% increase in code size.
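
To make the model concrete, the following minimal sketch illustrates the general shape of a hybrid MPI+UPC program in the spirit of the abstract: UPC threads within a group cooperate through a shared array, and one designated thread per group exchanges results with the other groups over MPI. It assumes a launch configuration in which thread 0 of each UPC group holds that group's MPI rank (roughly a "funneled" arrangement); the array names, sizes, and this rank mapping are illustrative assumptions, not the paper's actual API or launch mechanics.

    #include <upc.h>   /* UPC runtime library; shared, MYTHREAD, THREADS,
                          upc_barrier, and upc_forall are part of the language */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    /* One shared array per UPC group, distributed cyclically over the group's
     * threads, plus one partial-sum slot per thread (names are illustrative). */
    shared double data[N * THREADS];
    shared double partial[THREADS];

    int main(int argc, char **argv)
    {
        int i, t;
        double local_sum = 0.0, group_sum = 0.0, global_sum = 0.0;

        /* Assumption: only thread 0 of each UPC group participates in MPI,
         * so MPI_COMM_WORLD contains one rank per group. */
        if (MYTHREAD == 0)
            MPI_Init(&argc, &argv);
        upc_barrier;

        /* Each UPC thread initializes and sums the elements it has affinity to. */
        upc_forall (i = 0; i < N * THREADS; i++; &data[i]) {
            data[i] = (double) i;
            local_sum += data[i];
        }

        partial[MYTHREAD] = local_sum;
        upc_barrier;

        if (MYTHREAD == 0) {
            /* Reduce within the UPC group by reading every thread's slot ... */
            for (t = 0; t < THREADS; t++)
                group_sum += partial[t];

            /* ... then reduce across groups over MPI. */
            MPI_Allreduce(&group_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            printf("group sum = %g, global sum = %g\n", group_sum, global_sum);
            MPI_Finalize();
        }
        upc_barrier;
        return 0;
    }

The appeal of this arrangement, as the abstract notes, is that an existing MPI code can keep its communication structure while individual ranks are widened into UPC groups that share a larger address space, giving memory-constrained kernels more room without a full rewrite.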

Published In

CF '10: Proceedings of the 7th ACM international conference on Computing frontiers
May 2010
370 pages
ISBN:9781450300445
DOI:10.1145/1787275

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 May 2010


Author Tags

  1. hybrid parallel programming
  2. MPI
  3. PGAS
  4. UPC

Qualifiers

  • Research-article

Conference

CF'10: Computing Frontiers Conference
May 17-19, 2010
Bertinoro, Italy

Acceptance Rates

CF '10 Paper Acceptance Rate: 30 of 113 submissions, 27%
Overall Acceptance Rate: 273 of 785 submissions, 35%

Cited By

  • (2021) DiPOSH: A portable OpenSHMEM implementation for short API-to-network path. Concurrency and Computation: Practice and Experience, 33(11). DOI: 10.1002/cpe.6179. Online publication date: 4-Feb-2021.
  • (2020) A Hybrid MPI+PGAS Approach to Improve Strong Scalability Limits of Finite Element Solvers. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 303-313. DOI: 10.1109/CLUSTER49012.2020.00041. Online publication date: Sep-2020.
  • (2020) On the parallelization and performance analysis of Barnes-Hut algorithm using Java parallel platforms. SN Applied Sciences, 2(4). DOI: 10.1007/s42452-020-2386-z. Online publication date: 10-Mar-2020.
  • (2019) Optimized Execution of Parallel Loops via User-Defined Scheduling Policies. Proceedings of the 48th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3337821.3337913. Online publication date: 5-Aug-2019.
  • (2019) Evaluation of Compilers Effects on OpenMP Soft Error Resiliency. 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 259-264. DOI: 10.1109/ISVLSI.2019.00055. Online publication date: Jul-2019.
  • (2018) Multi-level load balancing with an integrated runtime approach. Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 31-40. DOI: 10.1109/CCGRID.2018.00018. Online publication date: 1-May-2018.
  • (2018) Erfahrungen beim Aufbau von großen Clustern aus Einplatinencomputern für Forschung und Lehre [Experiences building large clusters of single-board computers for research and teaching]. Informatik-Spektrum, 41(3), 189-199. DOI: 10.1007/s00287-017-1083-9. Online publication date: 5-Jan-2018.
  • (2016) Mobile clusters of single board computers: an option for providing resources to student projects and researchers. SpringerPlus, 5(1). DOI: 10.1186/s40064-016-1981-3. Online publication date: 22-Mar-2016.
  • (2015) Exascale Machines Require New Programming Paradigms and Runtimes. Supercomputing Frontiers and Innovations: an International Journal, 2(2), 6-27. DOI: 10.14529/jsfi150201. Online publication date: 6-Apr-2015.
  • (2015) Design of a Multithreaded Barnes-Hut Algorithm for Multicore Clusters. IEEE Transactions on Parallel and Distributed Systems, 26(7), 1861-1873. DOI: 10.1109/TPDS.2014.2331243. Online publication date: 1-Jul-2015.
