Manycore network interfaces for in-memory rack-scale computing

Authors: Alexandros Daglis, Stanko Novaković, Edouard Bugnion, Boris Grot

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 567 - 579

https://doi.org/10.1145/2749469.2750415

Published: 13 June 2015

Abstract

Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth dictate not only eliminating communication bottlenecks in the software protocols and off-chip fabrics, but also carefully integrating network interfaces on chip. The latter is a key challenge, especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.

Index Terms

Hardware

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015, 768 pages

ISBN: 9781450334020

DOI: 10.1145/2749469

General Chair: Debbie Marr (Intel)

Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA'15)

June 2015, 745 pages

ISSN: 0163-5964

DOI: 10.1145/2872887

Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 28
Total Downloads: 534
Downloads (Last 12 months): 24
Downloads (Last 6 weeks): 3

Reflects downloads up to 12 Aug 2024
Cited By

Sutherland M, Falsafi B, Daglis A, Aamodt T, Jerger N, Swift M (2023). Cooperative Concurrency Control for Write-Intensive Key-Value Workloads. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 30-46. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3567955.3567957
Yuan Y, Huang J, Sun Y, Wang T, Nelson J, Ports D, Wang Y, Wang R, Tai C, Kim N (2023). Rambda: RDMA-driven Acceleration Framework for Memory-intensive µs-scale Datacenter Applications. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 499-515. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10071127
Pourhabibi A, Sutherland M, Daglis A, Falsafi B (2021). Cerebros: Evading the RPC Tax in Datacenters. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 407-420. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480055
Seemakhupt K, Liu S, Senevirathne Y, Shahbaz M, Khan S (2021). PMNet: In-Network Data Persistence. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 804-817. Online publication date: Jun-2021.
https://doi.org/10.1109/ISCA52012.2021.00068
Falsafi B (2021). Post-Moore Datacenter Server Architecture. Multi-Processor System-on-Chip 2, pp. 123-134. Online publication date: 28-Apr-2021.
https://doi.org/10.1002/9781119818410.ch6
Sutherland M, Gupta S, Falsafi B, Marathe V, Pnevmatikatos D, Daglis A, Martínez J, Duato J, Eeckhout L (2020). The NeBuLa RPC-optimized architecture. Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, pp. 199-212. Online publication date: 30-May-2020.
https://dl.acm.org/doi/10.1109/ISCA45697.2020.00027
Novakovic S, Daglis A, Ustiugov D, Bugnion E, Falsafi B, Grot B (2019). Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling. ACM Transactions on Computer Systems 36:2, pp. 1-37. Online publication date: 9-Apr-2019.
https://dl.acm.org/doi/10.1145/3309986
Daglis A, Sutherland M, Falsafi B, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). RPCValet. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 35-48. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304070
Koh K, Kim K, Jeon S, Huh J (2019). Disaggregated Cloud Memory with Elastic Block Management. IEEE Transactions on Computers 68:1, pp. 39-52. Online publication date: 1-Jan-2019.
https://doi.org/10.1109/TC.2018.2851565
Markussen J, Kristiansen L, Stensland H, Seifert F, Griwodz C, Halvorsen P (2018). Flexible Device Sharing in PCIe Clusters using Device Lending. Workshop Proceedings of the 47th International Conference on Parallel Processing, pp. 1-10. Online publication date: 13-Aug-2018.
https://dl.acm.org/doi/10.1145/3229710.3229759
