DOI: 10.1145/3617232.3624868
research-article
Open access

CC-NIC: a Cache-Coherent Interface to the NIC

Published: 17 April 2024

Abstract

Emerging interconnects make peripherals, such as the network interface controller (NIC), accessible through the processor's cache hierarchy, allowing these devices to participate in the CPU cache coherence protocol. This is a fundamental change from the separate I/O data paths and read-write transaction primitives of today's PCIe NICs. Our experiments show that the I/O data path characteristics cause NICs to prioritize CPU efficiency at the expense of inflated latency, an issue that can be mitigated by the emerging low-latency coherent interconnects. However, the coherence abstraction is not suited to current host-NIC access patterns. Applying existing signaling mechanisms and data structure layouts in a cache-coherent setting results in extraneous communication and cache retention, limiting performance. Redesigning the interface is necessary to minimize overheads and benefit from the new interactions that coherence enables. This work contributes CC-NIC, a host-NIC interface design for coherent interconnects. We model CC-NIC using Intel's Ice Lake and Sapphire Rapids UPI interconnects, demonstrating the potential of optimizing for coherence. Our results show a maximum packet rate of 1.5 Gpps and a packet throughput of 980 Gbps. Compared with today's PCIe NICs, CC-NIC has 77% lower minimum latency and 88% lower latency at 80% load. We also demonstrate application-level core savings. Finally, we show that CC-NIC's benefits hold across a range of interconnect performance characteristics.

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1
April 2024
494 pages
ISBN:9798400703720
DOI:10.1145/3617232
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
