DOI: 10.1145/3295500.3356153
research-article

INCA: in-network compute assistance

Published: 17 November 2019
    Abstract

    Current proposals for in-network data processing operate on data as it streams through a network switch or endpoint. Because compute resources must be available when data arrives, these approaches provide deadline-based models of execution. This paper introduces a deadline-free general compute model for network endpoints called INCA: In-Network Compute Assistance. INCA builds upon contemporary NIC offload capabilities to provide on-NIC, deadline-free, general-purpose compute capacities that can be utilized when the network is inactive. We demonstrate that INCA is Turing complete and provide a detailed design for extending existing hardware to support this model. We evaluate runtimes for a selection of kernels, including several optimizations, and show that INCA can provide up to an 11% speedup for applications with minimal code modifications and between 25% and 37% when applications are optimized for INCA.
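
    The idea of using NIC compute capacity only while the network is inactive can be illustrated with a toy simulation. The sketch below is not the paper's design: the function name, the one-unit-per-tick timing, and the FIFO kernel queue are all assumptions made for illustration. At each tick the simulated NIC serves an arriving packet if there is one; otherwise it advances the oldest offloaded kernel by one unit of work. Kernels have no deadline, so they simply finish whenever enough idle ticks have accumulated.

```python
from collections import deque


def simulate_idle_time_compute(packet_ticks, kernel_work, horizon):
    """Toy model of deadline-free, idle-time compute on a NIC.

    packet_ticks: ticks at which a packet arrives (hypothetical workload).
    kernel_work:  positive work units needed by each offloaded kernel (FIFO).
    horizon:      number of ticks to simulate.
    Returns (packets_served, finished_at), where finished_at lists the tick
    at which each kernel completed, in completion order.
    """
    arrivals = set(packet_ticks)
    pending = deque(kernel_work)  # remaining work per kernel
    finished_at = []
    packets_served = 0

    for tick in range(horizon):
        if tick in arrivals:
            # Network traffic always takes priority over offloaded compute.
            packets_served += 1
            continue
        if pending:
            # Network is idle: spend this tick on the oldest kernel.
            pending[0] -= 1
            if pending[0] == 0:
                pending.popleft()
                finished_at.append(tick)

    return packets_served, finished_at


if __name__ == "__main__":
    served, done = simulate_idle_time_compute(
        packet_ticks=[0, 1, 4], kernel_work=[2, 3], horizon=10
    )
    print(served, done)  # prints: 3 [3, 7]
```

    In this toy run the packets at ticks 0, 1 and 4 are never delayed by compute, while the two kernels complete at ticks 3 and 7 using only idle ticks. A deadline-based model would instead have to guarantee compute availability at the moment each packet arrives.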

    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Funding Sources

    • United States Department of Energy

    Conference

    SC '19
    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 62
    • Downloads (Last 6 weeks): 10

    Citations

    Cited By

    • (2024) Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-12. DOI: 10.23919/ISC.2024.10528935. Online publication date: May-2024
    • (2024) SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications. Proceedings of the 38th ACM International Conference on Supercomputing, 413-425. DOI: 10.1145/3650200.3656616. Online publication date: 30-May-2024
    • (2024) Smart Network Traffic Prediction for Scientific Applications. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 108-115. DOI: 10.1109/PDP62718.2024.00022. Online publication date: 20-Mar-2024
    • (2024) Canary. Future Generation Computer Systems 152:C, 70-82. DOI: 10.1016/j.future.2023.10.010. Online publication date: 1-Mar-2024
    • (2023) HEAR: Homomorphically Encrypted Allreduce. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-17. DOI: 10.1145/3581784.3607099. Online publication date: 12-Nov-2023
    • (2023) Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training. Proceedings of the 37th International Conference on Supercomputing, 336-347. DOI: 10.1145/3577193.3593724. Online publication date: 21-Jun-2023
    • (2023) Comprex: In-Network Compression for Accelerating IoT Analytics at Scale. IEEE Micro, 1-10. DOI: 10.1109/MM.2023.3343498. Online publication date: 2023
    • (2023) Exploring Challenges Associated with Employing SmartNICs as General-Purpose HPC Accelerators. 2023 IEEE High Performance Extreme Computing Conference (HPEC), 1-7. DOI: 10.1109/HPEC58863.2023.10363618. Online publication date: 25-Sep-2023
    • (2023) In-Network Compression for Accelerating IoT Analytics at Scale. 2023 IEEE Symposium on High-Performance Interconnects (HOTI), 15-24. DOI: 10.1109/HOTI59126.2023.00017. Online publication date: Aug-2023
    • (2022) “Smarter” NICs for faster molecular dynamics: a case study. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 583-594. DOI: 10.1109/IPDPS53621.2022.00063. Online publication date: May-2022
