Abstract
The memory hierarchy of multi-core clusters has a twofold character: a vertical memory hierarchy and a horizontal memory hierarchy. This paper proposes a new parallel computation model that abstracts the memory hierarchy of multi-core clusters at both the vertical and horizontal levels in a unified way. Experimental results show that the new model predicts communication costs for message passing on multi-core clusters more accurately than previous models, which incorporated only the vertical memory hierarchy. The new model thus provides a theoretical underpinning for the optimal design of MPI collective operations. Targeting the horizontal memory hierarchy, our methodology for optimizing collective operations on multi-core clusters centers on a hierarchical virtual topology and cache-aware intra-node communication, both incorporated into the existing collective algorithms in MPICH2. As a case study, a multi-core aware broadcast algorithm has been implemented and evaluated; the performance results show that this optimization methodology is effective.
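To make the hierarchical-virtual-topology idea concrete, the following C sketch shows one common way to structure a two-level, multi-core aware broadcast on top of standard MPI calls. It is not the authors' MPICH2-internal implementation; it is a minimal illustration that assumes the broadcast root is rank 0 and that MPI_Comm_split_type (an MPI-3 routine, newer than MPICH2 as studied in the paper) is available for building the intra-node communicator.

```c
/* Illustrative two-level broadcast: inter-node stage among node leaders,
 * then intra-node stage over shared memory.  Assumes root == 0. */
#include <mpi.h>

int hierarchical_bcast0(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    MPI_Comm node_comm = MPI_COMM_NULL, leader_comm = MPI_COMM_NULL;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Horizontal level 1: ranks that share a node (intra-node communicator). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Horizontal level 2: one leader (local rank 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Stage 1: broadcast across nodes among the leaders. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Stage 2: each leader forwards to its local peers within the node. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

A production implementation would additionally route the message from an arbitrary root to its node leader and choose the intra-node algorithm with cache behavior in mind, as the paper's multi-core aware design does.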
Cite this article
Tu, B., Fan, J., Zhan, J. et al. Performance analysis and optimization of MPI collective operations on multi-core clusters. J Supercomput 60, 141–162 (2012). https://doi.org/10.1007/s11227-009-0296-3