Abstract
The memory hierarchy of multi-core clusters has a twofold character: a vertical memory hierarchy and a horizontal memory hierarchy. This paper proposes a new parallel computation model that abstracts the memory hierarchy of multi-core clusters at both the vertical and horizontal levels in a unified way. Experimental results show that the new model predicts communication costs for message passing on multi-core clusters more accurately than previous models, which incorporated only the vertical memory hierarchy. The new model thus provides a theoretical underpinning for the optimal design of MPI collective operations. Targeting the horizontal memory hierarchy, our methodology for optimizing collective operations on multi-core clusters centers on a hierarchical virtual topology and cache-aware intra-node communication, both incorporated into the existing collective algorithms in MPICH2. As a case study, a multi-core aware broadcast algorithm has been implemented and evaluated; the performance results show that this optimization methodology is effective.
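To make the hierarchical-virtual-topology idea concrete, the following C sketch shows one common way to structure a two-level, multi-core aware broadcast on top of standard MPI calls. It is not the authors' MPICH2-internal implementation; it is a minimal illustration that assumes the broadcast root is rank 0 and that MPI_Comm_split_type (an MPI-3 routine, newer than MPICH2 as studied in the paper) is available for building the intra-node communicator.

```c
/* Illustrative two-level broadcast: inter-node stage among node leaders,
 * then intra-node stage over shared memory.  Assumes root == 0. */
#include <mpi.h>

int hierarchical_bcast0(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    MPI_Comm node_comm = MPI_COMM_NULL, leader_comm = MPI_COMM_NULL;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Horizontal level 1: ranks that share a node (intra-node communicator). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Horizontal level 2: one leader (local rank 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Stage 1: broadcast across nodes among the leaders. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Stage 2: each leader forwards to its local peers within the node. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

A production implementation would additionally route the message from an arbitrary root to its node leader and choose the intra-node algorithm with cache behavior in mind, as the paper's multi-core aware design does.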
Cite this article
Tu, B., Fan, J., Zhan, J. et al. Performance analysis and optimization of MPI collective operations on multi-core clusters. J Supercomput 60, 141–162 (2012). https://doi.org/10.1007/s11227-009-0296-3