research-article

INCA: in-network compute assistance

Authors:

Whit Schonbein,

Matthew G. F. Dosanjh, and

Dorian Arnold

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

Article No.: 54, Pages 1 - 13

https://doi.org/10.1145/3295500.3356153

Published: 17 November 2019

Abstract

Current proposals for in-network data processing operate on data as it streams through a network switch or endpoint. Because compute resources must be available when data arrives, these approaches provide deadline-based models of execution. This paper introduces INCA (In-Network Compute Assistance), a deadline-free, general-purpose compute model for network endpoints. INCA builds on contemporary NIC offload capabilities to provide on-NIC, deadline-free, general-purpose compute capacity that can be used when the network is inactive. We demonstrate that INCA is Turing complete and provide a detailed design for extending existing hardware to support this model. We evaluate runtimes for a selection of kernels, including several optimizations, and show that INCA can provide up to an 11% speedup for applications with minimal code modifications and between 25% and 37% when applications are optimized for INCA.

Index Terms

INCA: in-network compute assistance
1. Networks
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Software infrastructure
        Middleware
        Message oriented middleware

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

1921 pages

ISBN:9781450362290

DOI:10.1145/3295500

General Chair:
Michela Taufer,
Program Chairs:
Pavan Balaji,
Antonio J. Peña

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

United States Department of Energy

Conference

SC '19

Sponsor:

SIGHPC

SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis

November 17 - 19, 2019

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
