Open access

Accelerating Intercommunication in Highly Parallel Systems

Published: 02 December 2016
    Abstract

    Every HPC system consists of numerous processing nodes interconnected using a number of different inter-process communication protocols, such as the Message Passing Interface (MPI) and Global Arrays (GA). Traditionally, research has focused on optimizing these protocols and identifying the most suitable ones for each system and/or application. Recently, there has been a proposal to unify the primitive operations of the different inter-processor communication protocols through the Portals library. Portals offers a set of low-level communication routines that can be composed to implement the functionality of different intercommunication protocols. However, Portals' modularity comes at a performance cost, since it adds one more layer to the actual protocol implementation. This work aims at closing the performance gap between a generic and reusable intercommunication layer, such as Portals, and the various monolithic and highly optimized intercommunication protocols. This is achieved through the development of a novel hardware offload engine that efficiently implements the basic Portals modules. Our system is up to two orders of magnitude faster than the conventional software implementation of Portals, while the speedup achieved over the conventional monolithic software implementations of MPI and GA is more than an order of magnitude. The power consumption of our hardware system is less than 1/100th of what a low-power CPU consumes when executing the Portals software, while its silicon cost is less than 1/10th of that of a very simple RISC CPU. Moreover, our design process is itself novel: we first modeled the hardware within an untimed virtual prototype, which allowed for rapid design-space exploration; we then applied a novel methodology to transform the untimed description into an efficient timed hardware description, which was in turn converted into a hardware netlist through a High-Level Synthesis (HLS) tool.
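
    To make the composition idea concrete, the following is a minimal sketch (not taken from the article) of how an MPI-style tagged send/receive pair could be expressed on top of the Portals 4 C interface. It assumes a matching, logically addressed network interface whose handles (ni, the portal-table index pt_index with an event queue eq attached via PtlPTAlloc, and a bound memory descriptor md) have been set up elsewhere; the helper names tagged_recv and tagged_send are illustrative only, and error handling is omitted.

    #include <portals4.h>
    #include <string.h>

    /* Receiver side: post a match-list entry that accepts one message whose
     * match bits equal `tag` from any rank, then block until the PUT lands.
     * Events for the ME are delivered on the event queue `eq` that was
     * attached to `pt_index` when the portal table entry was allocated. */
    static void tagged_recv(ptl_handle_ni_t ni, ptl_pt_index_t pt_index,
                            ptl_handle_eq_t eq, void *buf, size_t len,
                            ptl_match_bits_t tag)
    {
        ptl_me_t        me;
        ptl_handle_me_t me_h;
        ptl_event_t     ev;

        memset(&me, 0, sizeof(me));
        me.start         = buf;
        me.length        = len;
        me.ct_handle     = PTL_CT_NONE;
        me.uid           = PTL_UID_ANY;
        me.match_id.rank = PTL_RANK_ANY;   /* assumes a logically addressed NI */
        me.match_bits    = tag;            /* the MPI tag maps onto match bits */
        me.ignore_bits   = 0;
        me.options       = PTL_ME_OP_PUT | PTL_ME_USE_ONCE;

        PtlMEAppend(ni, pt_index, &me, PTL_PRIORITY_LIST, NULL, &me_h);
        do {                               /* wait for the matching PUT event */
            PtlEQWait(eq, &ev);
        } while (ev.type != PTL_EVENT_PUT);
    }

    /* Sender side: a tagged send is simply a PUT that targets the peer's
     * portal table entry and carries the tag in the match bits. */
    static void tagged_send(ptl_handle_md_t md, size_t offset, size_t len,
                            ptl_process_t peer, ptl_pt_index_t pt_index,
                            ptl_match_bits_t tag)
    {
        PtlPut(md, offset, len, PTL_NO_ACK_REQ, peer, pt_index,
               tag, 0 /* remote offset */, NULL, 0 /* hdr_data */);
    }

    An MPI or GA implementation built over Portals composes many such primitive calls; it is the overhead of this extra layer that the hardware offload engine described above is designed to remove.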



    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4
    December 2016
    648 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/3012405
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 December 2016
    Accepted: 01 October 2016
    Revised: 01 September 2016
    Received: 01 May 2016
    Published in TACO Volume 13, Issue 4


    Author Tags

    1. GA
    2. HPC
    3. MPI
    4. Portals
    5. embedded systems
    6. intercommunication

    Qualifiers

    • Research-article
    • Research
    • Refereed
