DOI: 10.5555/2388996.2389129

Optimization principles for collective neighborhood communications

Published: 10 November 2012

Abstract

    Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at run time and enable optimizations similar to those for traditional collective calls. We show a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that provide additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations yield a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules that are comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, with optimization methods designed for collective communications. We expect that our methods will influence the design of future MPI libraries and provide a significant performance benefit on large-scale systems.
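    The paper's schedule-generation heuristics are not reproduced here, but the core idea — turning a run-time neighborhood specification into a conflict-free communication schedule — can be sketched. The following is an illustrative greedy round-packing (all names are hypothetical, not the paper's implementation): messages of a neighborhood collective are placed into rounds so that no rank sends or receives more than one message per round.

```python
# Illustrative sketch only: greedily pack the point-to-point messages of a
# neighborhood collective into rounds where each rank participates at most
# once per round (a greedy edge coloring of the communication graph).
def schedule_rounds(neighbors):
    """neighbors: dict mapping each rank to its list of destination ranks."""
    edges = [(u, v) for u, dsts in neighbors.items() for v in dsts]
    rounds = []  # each round is a list of (src, dst) messages
    for u, v in edges:
        for rnd in rounds:
            # A message fits a round if neither endpoint is busy in it.
            if all(u not in e and v not in e for e in rnd):
                rnd.append((u, v))
                break
        else:
            rounds.append([(u, v)])  # open a new round
    return rounds

# Example: a 1-D periodic stencil on 4 ranks; each rank sends to both
# of its ring neighbors, giving 8 directed messages in total.
stencil = {r: [(r - 1) % 4, (r + 1) % 4] for r in range(4)}
rounds = schedule_rounds(stencil)
```

    For this ring, each rank is an endpoint of four messages, so four rounds are optimal under the one-message-per-rank constraint, and the greedy pass achieves that bound; for irregular graphs, a greedy pass of this kind gives a feasible but not necessarily minimal schedule.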




    Published In

    SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2012
    1161 pages
    ISBN:9781467308045


    Publisher

    IEEE Computer Society Press

    Washington, DC, United States


    Qualifiers

    • Research-article


    Acceptance Rates

    SC '12 paper acceptance rate: 100 of 461 submissions (22%)
    Overall acceptance rate: 1,516 of 6,373 submissions (24%)


    Cited By

    • Regularizing Sparse and Imbalanced Communications for Voxel-based Brain Simulations on Supercomputers. Proceedings of the 51st International Conference on Parallel Processing (2022), pp. 1-11. DOI: 10.1145/3545008.3545019
    • Cartesian Collective Communication. Proceedings of the 48th International Conference on Parallel Processing (2019), pp. 1-11. DOI: 10.1145/3337821.3337848
    • Regularizing irregularly sparse point-to-point communications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2019), pp. 1-14. DOI: 10.1145/3295500.3356187
    • Planning for performance. Proceedings of the 24th European MPI Users' Group Meeting (2017), pp. 1-11. DOI: 10.1145/3127024.3127028
    • A PCIe congestion-aware performance model for densely populated accelerator servers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2016), pp. 1-11. DOI: 10.5555/3014904.3014989
    • Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations. Proceedings of the 22nd European MPI Users' Group Meeting (2015), pp. 1-10. DOI: 10.1145/2802658.2802663
    • Remote Memory Access Programming in MPI-3. ACM Transactions on Parallel Computing 2(2) (2015), pp. 1-26. DOI: 10.1145/2780584
    • Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations. Supercomputing Frontiers and Innovations 1(2) (2014), pp. 58-75. DOI: 10.14529/jsfi140204
    • "Silicon Galaxy" system area network for collective communication in supercomputers. Proceedings of the 15th International Conference on Computer Systems and Technologies (2014), pp. 86-93. DOI: 10.1145/2659532.2659650
    • Modeling communication in cache-coherent SMP systems. Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (2013), pp. 97-108. DOI: 10.1145/2493123.2462916
