
Mega-KV: a case for GPUs to maximize the throughput of in-memory key-value stores

Published: 01 July 2015

Abstract

In-memory key-value stores play a critical role in data processing by providing high-throughput, low-latency data accesses. They have several unique properties: (1) data-intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a large working set. As data volume continues to increase, our experiments show that conventional general-purpose multicore systems are increasingly mismatched to these properties: they do not provide massive data parallelism and high memory bandwidth; their powerful but few computing cores do not satisfy the demands of the data-processing tasks; and the cache hierarchy offers little benefit to the large working set.
In this paper, we make a strong case for GPUs to serve as special-purpose devices that greatly accelerate the operations of in-memory key-value stores. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves both high performance and high throughput. By effectively utilizing the high memory bandwidth and latency-hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance. Running on a commodity PC with two CPUs and two GPUs, Mega-KV can process up to 160+ million key-value operations per second, which is 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.

References

[1]
Intel DPDK. http://dpdk.org/.
[2]
Memcached. http://memcached.org/.
[3]
Redis. http://redis.io/.
[4]
S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck. Rhythm: Harnessing data parallel hardware for server workloads. In ASPLOS, pages 19--34, 2014.
[5]
D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the gpu. In SIGGRAPH Asia, pages 154:1--154:9, 2009.
[6]
B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In SIGMETRICS, pages 53--64, 2012.
[7]
B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In SoCC, pages 143--154, 2010.
[8]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 2009.
[9]
J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. Zdonik. Anti-caching: A new approach to database management system architecture. In PVLDB, 2013.
[10]
Ú. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In WDAS, pages 1--6, 2006.
[11]
B. Fan, D. G. Andersen, and M. Kaminsky. Memc3: Compact and concurrent memcache with dumber caching and smarter hashing. In NSDI, pages 371--384, 2013.
[12]
M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. Popescu, A. Ailamaki, and B. Falsafi. A case for specialized processors for scale-out workloads. In Micro, pages 31--42, 2014.
[13]
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD, pages 243--252, 1994.
[14]
S. Han, K. Jang, K. Park, and S. Moon. Packetshader: A gpu-accelerated software router. In SIGCOMM, pages 195--206, 2010.
[15]
B. He and J. X. Yu. High-throughput transaction executions on graphics processors. In PVLDB, 2011.
[16]
M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. In PVLDB, pages 709--720, 2013.
[17]
T. Hetherington, T. Rogers, L. Hsu, M. O'Connor, and T. Aamodt. Characterizing and evaluating a key-value store application on heterogeneous cpu-gpu systems. In ISPASS, pages 88--98, 2012.
[18]
E. Y. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mtcp: A highly scalable user-level tcp stack for multicore systems. In NSDI, 2014.
[19]
R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat. Chronos: Predictable low latency for data center applications. In SoCC, pages 9:1--9:14, 2012.
[20]
T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution, 2002.
[21]
H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. Mica: A holistic approach to fast in-memory key-value storage. In NSDI, pages 429--444, 2014.
[22]
Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In EuroSys, pages 183--196, 2012.
[23]
Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. Cphash: A cache-partitioned hash table. In PPoPP, pages 319--320, 2012.
[24]
C. Mitchell, Y. Geng, and J. Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In USENIX ATC, pages 103--114, 2013.
[25]
R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling memcache at facebook. In NSDI, pages 385--398, 2013.
[26]
J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The case for ramclouds: Scalable high-performance storage entirely in dram. SIGOPS Oper. Syst. Rev., pages 92--105, 2010.
[27]
R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, pages 122--144, 2003.
[28]
H. Pirk, S. Manegold, and M. Kersten. Waste not... efficient co-processing of relational data. In ICDE, pages 508--519, 2014.
[29]
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In PPoPP, pages 73--82, 2008.
[30]
S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP, 2013.
[31]
K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang. Gdm: Device memory management for gpgpu computing. In SIGMETRICS, pages 533--545, 2014.
[32]
K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on cpu-gpu hybrid systems. In PVLDB, pages 1543--1554, 2012.
[33]
K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang. Concurrent analytical query processing with gpus. In PVLDB, pages 1011--1022, 2014.
[34]
H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel weaver: Automatically fusing database primitives for efficient gpu computation. In MICRO, pages 107--118, 2012.
[35]
Y. Yuan, R. Lee, and X. Zhang. The yin and yang of processing data warehousing queries on gpu devices. In PVLDB, pages 817--828, 2013.
[36]
H. Zhang, B. M. Tudor, G. Chen, and B. C. Ooi. Efficient in-memory data management: An analysis. In PVLDB, 2014.


Reviews

Steve Carson

When a central processing unit (CPU) is augmented by a graphics processing unit (GPU), compute-intensive functions can be offloaded to the GPU. While a CPU typically contains a few cores, current-generation GPUs contain over 2,000 cores that can operate in parallel. GPUs have been widely applied to accelerate computations in engineering, genetics, and many other disciplines, but have not been applied extensively to database applications. This paper documents a successful open-source application, Mega-KV (http://kay21s.github.io/megakv/), that uses commodity personal computers (PCs) and GPUs to accelerate a very important application: in-memory key-value (IMKV) stores.

The authors compare Mega-KV against the open-source IMKV store MICA [1], the CPU-based IMKV store with the highest documented throughput. Running on two off-the-shelf CPUs and two GPUs, Mega-KV was 1.4 to 2.8 times as fast as CPU-based MICA. The major challenges of using GPUs in this setting are: (1) limited GPU memory and slow transfers between the CPU and GPU; and (2) finding a design point that balances latency (larger transfers mean higher latency) against throughput (smaller transfers mean lower utilization and lower throughput).

Rather than directly porting [that is, re-coding in compute unified device architecture (CUDA)] a known IMKV store like memcached to a GPU, the authors developed a custom optimized solution that carefully considers the capabilities of GPU architectures. They studied candidate techniques separately in multiple testbeds and chose design points (such as transfer size) before combining the techniques into an overall approach. Their study identified two main issues with previous IMKV approaches: the poor match of index operations to GPU architectures, and the unpredictability of operation scheduling that does not distinguish among different types of operations.

The techniques they employed to overcome these issues include cuckoo hashing, selecting the best number of threads in each of the processing units they defined within the GPU, and careful scheduling of batches so that, as desired, GETs execute faster than SETs. The paper documents the authors' design techniques well and is of value to anyone wanting to create custom nongraphical CUDA software for a GPU. Online Computing Reviews Service
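The cuckoo hashing mentioned above is what bounds each lookup to a constant number of probes, which maps well onto data-parallel hardware: every GPU thread does uniform, predictable work per key. The following is a minimal CPU-only sketch of the general technique (a single table with two hash functions), intended as an illustration of the idea rather than Mega-KV's actual index structure.

```python
class CuckooHash:
    """Single-table cuckoo hash: every key has exactly two candidate
    slots, so a lookup probes at most two locations -- no chains."""

    def __init__(self, capacity=64, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.table = [None] * capacity  # each slot: (key, value) or None

    def _slots(self, key):
        h1 = hash(key) % self.capacity
        h2 = hash((key, 0x9E3779B9)) % self.capacity  # second, independent hash
        return h1, h2

    def get(self, key):
        # Constant work per lookup: at most two probes.
        for i in self._slots(key):
            e = self.table[i]
            if e is not None and e[0] == key:
                return e[1]
        return None

    def set(self, key, value):
        # Update in place if the key is already stored.
        for i in self._slots(key):
            e = self.table[i]
            if e is not None and e[0] == key:
                self.table[i] = (key, value)
                return True
        # Otherwise insert, displacing ("kicking") occupants between
        # their two candidate slots until a free slot is found.
        entry = (key, value)
        for _ in range(self.max_kicks):
            for which in (0, 1):
                i = self._slots(entry[0])[which]
                self.table[i], entry = entry, self.table[i]
                if entry is None:
                    return True
        return False  # too many kicks; a real store would resize/rehash
```

Because a GET touches a fixed, small number of memory locations, thousands of GPU threads can each resolve one key of a batch with no divergent chain-walking, which is the property that makes this index layout attractive for the GPU.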


Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 11
July 2015
264 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 July 2015
Published in PVLDB Volume 8, Issue 11

Qualifiers

  • Research-article


Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)11
Reflects downloads up to 11 Feb 2025


Cited By

  • (2025) H-Rocks: CPU-GPU accelerated Heterogeneous RocksDB on Persistent Memory. Proceedings of the ACM on Management of Data, 3(1):1-28, doi:10.1145/3709694. Online: 11-Feb-2025
  • (2025) Dhcache: a dual-hash cache for optimizing the read performance in key-value store. The Journal of Supercomputing, 81(2), doi:10.1007/s11227-024-06828-w. Online: 19-Jan-2025
  • (2024) Massively parallel multi-versioned transaction processing. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 765-781, doi:10.5555/3691938.3691979. Online: 10-Jul-2024
  • (2024) High-Performance Spatial Data Analytics: Systematic R&D for Scale-Out and Scale-Up Solutions from the Past to Now. Proceedings of the VLDB Endowment, 17(12):4507-4520, doi:10.14778/3685800.3685912. Online: 8-Nov-2024
  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment, 17(7):1683-1696, doi:10.14778/3654621.3654634. Online: 30-May-2024
  • (2024) ZipCache: A DRAM/SSD Cache with Built-in Transparent Compression. Proceedings of the International Symposium on Memory Systems, 116-128, doi:10.1145/3695794.3695805. Online: 11-Dec-2024
  • (2024) RayJoin: Fast and Precise Spatial Join. Proceedings of the 38th ACM International Conference on Supercomputing, 124-136, doi:10.1145/3650200.3656610. Online: 30-May-2024
  • (2024) DLHT: A Non-blocking Resizable Hashtable with Fast Deletes and Memory-awareness. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 186-199, doi:10.1145/3625549.3658682. Online: 3-Jun-2024
  • (2023) Catalyst: Optimizing Cache Management for Large In-memory Key-value Systems. Proceedings of the VLDB Endowment, 16(13):4339-4352, doi:10.14778/3625054.3625068. Online: 1-Sep-2023
  • (2023) RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing. Proceedings of the VLDB Endowment, 16(13):4268-4281, doi:10.14778/3625054.3625063. Online: 1-Sep-2023
