
Mega-KV: a case for GPUs to maximize the throughput of in-memory key-value stores

Published: 01 July 2015

Abstract

In-memory key-value stores play a critical role in data processing by providing high-throughput, low-latency data accesses. They have several unique properties: (1) data-intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a large working set. As data volume continues to increase, our experiments show that conventional general-purpose multicore systems are increasingly mismatched to these properties: they do not provide massive data parallelism and high memory bandwidth; their powerful but few computing cores do not satisfy the demands of the data-processing tasks; and the cache hierarchy offers little benefit to the large working set.
In this paper, we make a strong case for GPUs to serve as special-purpose devices that greatly accelerate the operations of in-memory key-value stores. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves both high performance and high throughput. By effectively utilizing the high memory bandwidth and latency-hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance. Running on a commodity PC with two CPUs and two GPUs, Mega-KV can process up to 160+ million key-value operations per second, which is 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.

References

[1]
Intel DPDK. http://dpdk.org/.
[2]
Memcached. http://memcached.org/.
[3]
Redis. http://redis.io/.
[4]
S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck. Rhythm: Harnessing data parallel hardware for server workloads. In ASPLOS, pages 19--34, 2014.
[5]
D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the gpu. In SIGGRAPH Asia, pages 154:1--154:9, 2009.
[6]
B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In SIGMETRICS, pages 53--64, 2012.
[7]
B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with ycsb. In SoCC, pages 143--154, 2010.
[8]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 2009.
[9]
J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. Zdonik. Anti-caching: A new approach to database management system architecture. In PVLDB, 2013.
[10]
Ú. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In WDAS, pages 1--6, 2006.
[11]
B. Fan, D. G. Andersen, and M. Kaminsky. Memc3: Compact and concurrent memcache with dumber caching and smarter hashing. In NSDI, pages 371--384, 2013.
[12]
M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. Popescu, A. Ailamaki, and B. Falsafi. A case for specialized processors for scale-out workloads. In Micro, pages 31--42, 2014.
[13]
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD, pages 243--252, 1994.
[14]
S. Han, K. Jang, K. Park, and S. Moon. Packetshader: A gpu-accelerated software router. In SIGCOMM, pages 195--206, 2010.
[15]
B. He and J. X. Yu. High-throughput transaction executions on graphics processors. In PVLDB, 2011.
[16]
M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. In PVLDB, pages 709--720, 2013.
[17]
T. Hetherington, T. Rogers, L. Hsu, M. O'Connor, and T. Aamodt. Characterizing and evaluating a key-value store application on heterogeneous cpu-gpu systems. In ISPASS, pages 88--98, 2012.
[18]
E. Y. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mtcp: A highly scalable user-level tcp stack for multicore systems. In NSDI, 2014.
[19]
R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat. Chronos: Predictable low latency for data center applications. In SoCC, pages 9:1--9:14, 2012.
[20]
T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution, 2002.
[21]
H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. Mica: A holistic approach to fast in-memory key-value storage. In NSDI, pages 429--444, 2014.
[22]
Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In EuroSys, pages 183--196, 2012.
[23]
Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. Cphash: A cache-partitioned hash table. In PPoPP, pages 319--320, 2012.
[24]
C. Mitchell, Y. Geng, and J. Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In USENIX ATC, pages 103--114, 2013.
[25]
R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling memcache at facebook. In NSDI, pages 385--398, 2013.
[26]
J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The case for ramclouds: Scalable high-performance storage entirely in dram. SIGOPS Oper. Syst. Rev., pages 92--105, 2010.
[27]
R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, pages 122--144, 2003.
[28]
H. Pirk, S. Manegold, and M. Kersten. Waste not... efficient co-processing of relational data. In ICDE, pages 508--519, 2014.
[29]
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In PPoPP, pages 73--82, 2008.
[30]
S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP, 2013.
[31]
K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang. Gdm: Device memory management for gpgpu computing. In SIGMETRICS, pages 533--545, 2014.
[32]
K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on cpu-gpu hybrid systems. In PVLDB, pages 1543--1554, 2012.
[33]
K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang. Concurrent analytical query processing with gpus. In PVLDB, pages 1011--1022, 2014.
[34]
H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel weaver: Automatically fusing database primitives for efficient gpu computation. In MICRO, pages 107--118, 2012.
[35]
Y. Yuan, R. Lee, and X. Zhang. The yin and yang of processing data warehousing queries on gpu devices. In PVLDB, pages 817--828, 2013.
[36]
H. Zhang, B. M. Tudor, G. Chen, and B. C. Ooi. Efficient in-memory data management: An analysis. In PVLDB, 2014.


Reviews

Steve Carson

When a central processing unit (CPU) is augmented by a graphics processing unit (GPU), compute-intensive functions can be offloaded to the GPU. While a CPU typically contains a few cores, current-generation GPUs contain over 2,000 cores that can operate in parallel. GPUs have been widely applied to accelerate computations in engineering, genetics, and many other disciplines, but have not been applied extensively to database applications. This paper documents a successful open-source application, Mega-KV (http://kay21s.github.io/megakv/), that uses commodity personal computers (PCs) and GPUs to accelerate a very important application: in-memory key-value (IMKV) stores.

The authors compare Mega-KV against the open-source IMKV store MICA [1], the CPU-based IMKV store with the highest documented throughput. Running on two off-the-shelf CPUs and two GPUs, Mega-KV was 1.4 to 2.8 times as fast as CPU-based MICA. The major challenges of using GPUs in this setting are: (1) limited GPU memory and slow transfers between the CPU and GPU; and (2) finding a design point that balances latency (larger transfers mean higher latency) against throughput (smaller transfers mean lower utilization and lower throughput).

Rather than directly porting [that is, re-coding in compute unified device architecture (CUDA)] a known IMKV store like memcached to a GPU, the authors developed a custom optimized solution that carefully considers the capabilities of GPU architectures. They studied candidate techniques separately in multiple testbeds and chose design points (such as transfer size) before combining the techniques into an overall approach. Their study identified two main issues with previous IMKV approaches: the poor match of index operations to GPU architectures, and the unpredictability of operation scheduling that does not distinguish among different types of operations.

The techniques they employed to overcome these issues include cuckoo hashing, selecting the best number of threads in each of the processing units they defined within the GPU, and careful scheduling of batches so that, as desired, GETs execute faster than SETs. The paper documents the authors' design techniques well and is of value to anyone wanting to create custom nongraphical CUDA software for a GPU. Online Computing Reviews Service
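The cuckoo hashing mentioned above is what bounds each lookup to a constant number of probes, which maps well onto data-parallel hardware: every GPU thread does uniform, predictable work per key. The following is a minimal CPU-only sketch of the general technique (a single table with two hash functions), intended as an illustration of the idea rather than Mega-KV's actual index structure.

```python
class CuckooHash:
    """Single-table cuckoo hash: every key has exactly two candidate
    slots, so a lookup probes at most two locations -- no chains."""

    def __init__(self, capacity=64, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.table = [None] * capacity  # each slot: (key, value) or None

    def _slots(self, key):
        h1 = hash(key) % self.capacity
        h2 = hash((key, 0x9E3779B9)) % self.capacity  # second, independent hash
        return h1, h2

    def get(self, key):
        # Constant work per lookup: at most two probes.
        for i in self._slots(key):
            e = self.table[i]
            if e is not None and e[0] == key:
                return e[1]
        return None

    def set(self, key, value):
        # Update in place if the key is already stored.
        for i in self._slots(key):
            e = self.table[i]
            if e is not None and e[0] == key:
                self.table[i] = (key, value)
                return True
        # Otherwise insert, displacing ("kicking") occupants between
        # their two candidate slots until a free slot is found.
        entry = (key, value)
        for _ in range(self.max_kicks):
            for which in (0, 1):
                i = self._slots(entry[0])[which]
                self.table[i], entry = entry, self.table[i]
                if entry is None:
                    return True
        return False  # too many kicks; a real store would resize/rehash
```

Because a GET touches a fixed, small number of memory locations, thousands of GPU threads can each resolve one key of a batch with no divergent chain-walking, which is the property that makes this index layout attractive for the GPU.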


Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 11
July 2015
264 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 July 2015
Published in PVLDB Volume 8, Issue 11

Qualifiers

  • Research-article


Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)11
Reflects downloads up to 11 Feb 2025


Cited By

  • (2025) H-Rocks: CPU-GPU accelerated Heterogeneous RocksDB on Persistent Memory. Proceedings of the ACM on Management of Data, 3(1):1-28, doi:10.1145/3709694. Online: 11-Feb-2025
  • (2025) Dhcache: a dual-hash cache for optimizing the read performance in key-value store. The Journal of Supercomputing, 81(2), doi:10.1007/s11227-024-06828-w. Online: 19-Jan-2025
  • (2024) Massively parallel multi-versioned transaction processing. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 765-781, doi:10.5555/3691938.3691979. Online: 10-Jul-2024
  • (2024) High-Performance Spatial Data Analytics: Systematic R&D for Scale-Out and Scale-Up Solutions from the Past to Now. Proceedings of the VLDB Endowment, 17(12):4507-4520, doi:10.14778/3685800.3685912. Online: 8-Nov-2024
  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment, 17(7):1683-1696, doi:10.14778/3654621.3654634. Online: 30-May-2024
  • (2024) ZipCache: A DRAM/SSD Cache with Built-in Transparent Compression. Proceedings of the International Symposium on Memory Systems, 116-128, doi:10.1145/3695794.3695805. Online: 11-Dec-2024
  • (2024) RayJoin: Fast and Precise Spatial Join. Proceedings of the 38th ACM International Conference on Supercomputing, 124-136, doi:10.1145/3650200.3656610. Online: 30-May-2024
  • (2024) DLHT: A Non-blocking Resizable Hashtable with Fast Deletes and Memory-awareness. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 186-199, doi:10.1145/3625549.3658682. Online: 3-Jun-2024
  • (2023) Catalyst: Optimizing Cache Management for Large In-memory Key-value Systems. Proceedings of the VLDB Endowment, 16(13):4339-4352, doi:10.14778/3625054.3625068. Online: 1-Sep-2023
  • (2023) RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing. Proceedings of the VLDB Endowment, 16(13):4268-4281, doi:10.14778/3625054.3625063. Online: 1-Sep-2023
