Process Mapping for MPI Collective Communications

Authors:

Weimin Zheng

Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing

Pages 81 - 92

https://doi.org/10.1007/978-3-642-03869-3_11

Published: 23 August 2009

Abstract

Because communication cost is non-uniform in modern parallel computers, mapping virtual parallel processes to physical processors (or cores) in an optimized way is important for achieving scalable performance. Existing work uses profile-guided approaches to automatically optimize mapping schemes so as to minimize the cost of point-to-point communications. However, these approaches cannot handle collective communications and may produce sub-optimal mappings for applications that use them.
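To make the mapping problem concrete, the following minimal sketch scores a placement as the sum, over all process pairs, of the traffic between them multiplied by the distance between the cores they are placed on. The cost model, the `traffic` and `distance` matrices, and the function name `mapping_cost` are illustrative assumptions, not taken from the paper, and the exhaustive search is only feasible for a handful of processes.

```python
from itertools import permutations

def mapping_cost(traffic, distance, placement):
    """Sum of traffic[i][j] * distance between the cores hosting i and j.

    placement[i] is the core that process i runs on (assumed cost model).
    """
    n = len(traffic)
    return sum(
        traffic[i][j] * distance[placement[i]][placement[j]]
        for i in range(n)
        for j in range(n)
        if i != j
    )

# Hypothetical point-to-point profile (bytes): pairs (0, 2) and (1, 3) talk heavily.
traffic = [
    [0, 1, 100, 1],
    [1, 0, 1, 100],
    [100, 1, 0, 1],
    [1, 100, 1, 0],
]

# Hypothetical machine: cores 0 and 1 share one node, cores 2 and 3 share another.
# Intra-node distance is 1, inter-node distance is 10.
distance = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]

identity = (0, 1, 2, 3)  # process i on core i
# Brute-force search over all placements; practical only for tiny examples.
best = min(permutations(range(4)), key=lambda p: mapping_cost(traffic, distance, p))

print(mapping_cost(traffic, distance, identity))
print(best, mapping_cost(traffic, distance, best))
```

In this toy setting the identity placement splits both heavy pairs across nodes, while the best placement co-locates each heavy pair on one node. Practical tools replace the brute-force search with heuristics.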

In this paper, we propose an approach called OPP (Optimized Process Placement) to handle collective communications. OPP transforms each collective communication into a series of point-to-point communication operations, according to how the collective is implemented in the communication library. Existing approaches can then be used to find mapping schemes that are optimized for both point-to-point and collective communications.
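The transformation described above can be sketched for a single collective. The example below assumes the library implements broadcast as a binomial tree, a common choice for short messages; the actual algorithm depends on the library, message size, and process count, and the abstract does not say which decompositions OPP uses. The function names `binomial_bcast_pairs` and `add_collective` are hypothetical.

```python
def binomial_bcast_pairs(nprocs, root=0):
    """Return the (sender, receiver) pairs of an assumed binomial-tree broadcast.

    Ranks are taken relative to the root. Each non-root relative rank receives
    from the rank obtained by clearing its lowest set bit.
    """
    pairs = []
    for rel in range(1, nprocs):
        parent = rel & (rel - 1)  # clear the lowest set bit
        pairs.append(((parent + root) % nprocs, (rel + root) % nprocs))
    return pairs

def add_collective(traffic, pairs, nbytes):
    """Accumulate the point-to-point messages of one collective into a traffic matrix."""
    for src, dst in pairs:
        traffic[src][dst] += nbytes
    return traffic

# Broadcast of 1024 bytes from rank 2 among 4 processes (illustrative values).
nprocs = 4
traffic = [[0] * nprocs for _ in range(nprocs)]
add_collective(traffic, binomial_bcast_pairs(nprocs, root=2), 1024)

for row in traffic:
    print(row)
```

Once every collective call has been added to the matrix this way, the result has the same form as a point-to-point profile, so it can be handed to an existing point-to-point mapping optimizer.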

We evaluated our approach with micro-benchmarks that include all MPI collective communications, the NAS Parallel Benchmark suite, and three other applications. Experimental results show that the optimized process placement generated by our approach can achieve significant speedup.

Published In

Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing

August 2009, 1082 pages

ISBN: 9783642038686

Editors: Henk Sips, Dick Epema, Hai-Xiang Lin (Department of Software Technology, Delft University of Technology, Delft, The Netherlands 2628)

Publisher: Springer-Verlag, Berlin, Heidelberg