DOI: 10.5555/2388996.2389129

Optimization principles for collective neighborhood communications

Published: 10 November 2012

Abstract

    Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at run time and enable optimizations similar to those for traditional collective calls. We show a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that provide additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations yield a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules that are comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, with optimization methods designed for collective communications. We expect that our methods will influence the design of future MPI libraries and provide a significant performance benefit on large-scale systems.
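    The paper's schedule-generation heuristics are not reproduced here, but the core idea — turning a run-time neighborhood specification into a conflict-free communication schedule — can be sketched. The following is an illustrative greedy round-packing (all names are hypothetical, not the paper's implementation): messages of a neighborhood collective are placed into rounds so that no rank sends or receives more than one message per round.

```python
# Illustrative sketch only: greedily pack the point-to-point messages of a
# neighborhood collective into rounds where each rank participates at most
# once per round (a greedy edge coloring of the communication graph).
def schedule_rounds(neighbors):
    """neighbors: dict mapping each rank to its list of destination ranks."""
    edges = [(u, v) for u, dsts in neighbors.items() for v in dsts]
    rounds = []  # each round is a list of (src, dst) messages
    for u, v in edges:
        for rnd in rounds:
            # A message fits a round if neither endpoint is busy in it.
            if all(u not in e and v not in e for e in rnd):
                rnd.append((u, v))
                break
        else:
            rounds.append([(u, v)])  # open a new round
    return rounds

# Example: a 1-D periodic stencil on 4 ranks; each rank sends to both
# of its ring neighbors, giving 8 directed messages in total.
stencil = {r: [(r - 1) % 4, (r + 1) % 4] for r in range(4)}
rounds = schedule_rounds(stencil)
```

    For this ring, each rank is an endpoint of four messages, so four rounds are optimal under the one-message-per-rank constraint, and the greedy pass achieves that bound; for irregular graphs, a greedy pass of this kind gives a feasible but not necessarily minimal schedule.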




    Published In

    SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2012
    1161 pages
    ISBN:9781467308045


    Publisher

    IEEE Computer Society Press

    Washington, DC, United States


    Qualifiers

    • Research-article


    Acceptance Rates

    SC '12 paper acceptance rate: 100 of 461 submissions (22%)
    Overall acceptance rate: 1,516 of 6,373 submissions (24%)


    Cited By

    • Regularizing Sparse and Imbalanced Communications for Voxel-based Brain Simulations on Supercomputers. Proceedings of the 51st International Conference on Parallel Processing (2022), pp. 1-11. DOI: 10.1145/3545008.3545019
    • Cartesian Collective Communication. Proceedings of the 48th International Conference on Parallel Processing (2019), pp. 1-11. DOI: 10.1145/3337821.3337848
    • Regularizing irregularly sparse point-to-point communications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2019), pp. 1-14. DOI: 10.1145/3295500.3356187
    • Planning for performance. Proceedings of the 24th European MPI Users' Group Meeting (2017), pp. 1-11. DOI: 10.1145/3127024.3127028
    • A PCIe congestion-aware performance model for densely populated accelerator servers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2016), pp. 1-11. DOI: 10.5555/3014904.3014989
    • Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations. Proceedings of the 22nd European MPI Users' Group Meeting (2015), pp. 1-10. DOI: 10.1145/2802658.2802663
    • Remote Memory Access Programming in MPI-3. ACM Transactions on Parallel Computing 2(2) (2015), pp. 1-26. DOI: 10.1145/2780584
    • Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations. Supercomputing Frontiers and Innovations 1(2) (2014), pp. 58-75. DOI: 10.14529/jsfi140204
    • "Silicon Galaxy" system area network for collective communication in supercomputers. Proceedings of the 15th International Conference on Computer Systems and Technologies (2014), pp. 86-93. DOI: 10.1145/2659532.2659650
    • Modeling communication in cache-coherent SMP systems. Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (2013), pp. 97-108. DOI: 10.1145/2493123.2462916
