DOI: 10.1145/2749469.2750416
research-article
Open access

Architecting to achieve a billion requests per second throughput on a single key-value store server platform

Published: 13 June 2015

    Abstract

    Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused upon improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts too showed orders of magnitude improvement over stock memcached.
    We aim at architecting high performance and efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems, but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
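
The headline figures in the abstract can be related with some simple arithmetic. The sketch below is illustrative only: it uses just the three numbers quoted above (120 MRPS, 9.2X throughput, 2.8X RPS/watt) and derives what they imply about the FPGA baseline and about the gap to a billion RPS. The function names are hypothetical, and the derived values are back-of-the-envelope inferences, not figures reported by the authors.

```python
# Figures quoted in the abstract.
MEASURED_MRPS = 120.0     # measured throughput on one commodity server (MRPS)
PERF_RATIO = 9.2          # throughput relative to the best-published FPGA claim
EFFICIENCY_RATIO = 2.8    # RPS/watt relative to the same FPGA claim
TARGET_MRPS = 1000.0      # one billion requests per second, in MRPS


def implied_baseline_mrps(measured_mrps: float, perf_ratio: float) -> float:
    """Throughput of the FPGA baseline implied by the quoted ratio."""
    return measured_mrps / perf_ratio


def implied_power_ratio(perf_ratio: float, efficiency_ratio: float) -> float:
    """Relative system power, since efficiency = throughput / power."""
    return perf_ratio / efficiency_ratio


def remaining_speedup(target_mrps: float, measured_mrps: float) -> float:
    """Factor still needed to go from the measured result to the target."""
    return target_mrps / measured_mrps


if __name__ == "__main__":
    base = implied_baseline_mrps(MEASURED_MRPS, PERF_RATIO)
    power = implied_power_ratio(PERF_RATIO, EFFICIENCY_RATIO)
    gap = remaining_speedup(TARGET_MRPS, MEASURED_MRPS)
    print(f"Implied FPGA baseline: {base:.1f} MRPS")
    print(f"Implied power ratio (server vs. FPGA system): {power:.2f}X")
    print(f"Speedup still needed for 1 billion RPS: {gap:.2f}X")
```

Read this way, the commodity server appears to draw roughly three times the power of the FPGA system while serving about nine times the requests, and the simulated future platform would need a further speedup of a little over 8X beyond the measured result.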

    Published In

    ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
    June 2015
    768 pages
    ISBN: 9781450334020
    DOI: 10.1145/2749469
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 June 2015

    Qualifiers

    • Research-article

    Funding Sources

    • Korea government

    Conference

    ISCA '15

    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Upcoming Conference

    ISCA '25

    Article Metrics

    • Downloads (Last 12 months): 341
    • Downloads (Last 6 weeks): 44

    Cited By

    • (2024) DistMind: Efficient Resource Disaggregation for Deep Learning Workloads. IEEE/ACM Transactions on Networking 32:3 (2422-2437). DOI: 10.1109/TNET.2024.3355010. Online publication date: Jun-2024
    • (2023) Reconfigurable Virtual Memory for FPGA-Driven I/O. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (556-571). DOI: 10.1145/3582016.3582048. Online publication date: 25-Mar-2023
    • (2023) Cooperative Concurrency Control for Write-Intensive Key-Value Workloads. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (30-46). DOI: 10.1145/3567955.3567957. Online publication date: 25-Mar-2023
    • (2023) Memory-Efficient Hashed Page Tables. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (1221-1235). DOI: 10.1109/HPCA56546.2023.10071061. Online publication date: Feb-2023
    • (2022) RACE: One-sided RDMA-conscious Extendible Hashing. ACM Transactions on Storage 18:2 (1-29). DOI: 10.1145/3511895. Online publication date: 28-Apr-2022
    • (2022) The benefits of general-purpose on-NIC memory. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (1130-1147). DOI: 10.1145/3503222.3507711. Online publication date: 28-Feb-2022
    • (2022) Hydra: A Decentralized File System for Persistent Memory and RDMA Networks. IEEE Transactions on Parallel and Distributed Systems 33:12 (4192-4206). DOI: 10.1109/TPDS.2022.3180369. Online publication date: 1-Dec-2022
    • (2022) iBalancer: Load-Aware in-Server Flow Scheduling for Sub-Millisecond Tail Latency. IEEE Transactions on Parallel and Distributed Systems 33:8 (1761-1774). DOI: 10.1109/TPDS.2021.3120021. Online publication date: 1-Aug-2022
    • (2022) memwalkd: Accelerating Key-value stores using Page Table Walkers. 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) (69-74). DOI: 10.1109/HiPC56025.2022.00021. Online publication date: Dec-2022
    • (2022) A Multi-hashing Index for hybrid DRAM-NVM memory systems. Journal of Systems Architecture: the EUROMICRO Journal 128:C. DOI: 10.1016/j.sysarc.2022.102547. Online publication date: 1-Jul-2022
