research-article

SDC: a software defined cache for efficient data indexing

Authors:

Xingbo Wu

ICS '19: Proceedings of the ACM International Conference on Supercomputing

Pages 82 - 93

https://doi.org/10.1145/3330345.3330353

Published: 26 June 2019

Abstract

CPU caches bridge the processor-memory performance gap and so enable high-performance computing. Because cache capacity is limited, a cache is most effective when it (1) avoids caching data that are unlikely to be accessed and (2) identifies and caches data that would otherwise cost a program multiple memory accesses to reach. Unfortunately, existing cache architectures fall short on both counts. First, to exploit spatial locality cost-effectively, they adopt a relatively large, fixed-size cache line as the caching unit, so much of the space in a cache line can be wasted when data locality is weak. Second, for ease of use, the cache is designed to be transparent to programs, which prevents programs from fully exploiting its performance potential.

To address these problems, we propose a high-performance Software Defined Cache (SDC) architecture that provides a simple and generic key-value abstraction. The abstraction (1) allows data to be cached at a granularity smaller than a cache line and (2) enables programs to explicitly insert, retrieve, and invalidate data in the cache with new instructions. By giving a program the ability to explicitly use the cache as a lookaside key-value buffer, SDC enables a much more efficient cache without disruptively changing the existing cache organization and without substantially increasing hardware cost. We have prototyped SDC on the gem5 simulator and evaluated it with various data index structures and workloads. Experimental results show that SDC can improve cache performance for the workloads by up to 5.3× over the current cache design.
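The lookaside usage pattern described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the SDC hardware design or its instruction set: the class and function names (`LookasideBuffer`, `SortedIndex`, `cached_lookup`) are hypothetical, the LRU eviction policy is an assumption the abstract does not state, and binary-search probes stand in for the memory accesses an index traversal would cost.

```python
from collections import OrderedDict
from typing import Dict, Optional


class LookasideBuffer:
    """Hypothetical software model of a small key-value lookaside buffer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()

    def insert(self, key: int, value: int) -> None:
        # Stands in for an explicit "insert" operation.
        # LRU eviction is an assumption made for this sketch.
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def retrieve(self, key: int) -> Optional[int]:
        # Stands in for an explicit "retrieve" operation; None means a miss.
        value = self.entries.get(key)
        if value is not None:
            self.entries.move_to_end(key)
        return value

    def invalidate(self, key: int) -> None:
        # Stands in for an explicit "invalidate" operation.
        self.entries.pop(key, None)


class SortedIndex:
    """Sorted array searched by binary search; counts probes per lookup."""

    def __init__(self, items: Dict[int, int]) -> None:
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]
        self.probes = 0  # proxy for memory accesses spent in the index

    def lookup(self, key: int) -> Optional[int]:
        lo, hi = 0, len(self.keys) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            self.probes += 1
            if self.keys[mid] == key:
                return self.values[mid]
            if self.keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return None


def cached_lookup(index: SortedIndex, buf: LookasideBuffer, key: int) -> Optional[int]:
    """Check the lookaside buffer first; fall back to the index on a miss."""
    value = buf.retrieve(key)
    if value is None:
        value = index.lookup(key)
        if value is not None:
            buf.insert(key, value)  # remember the result of the costly traversal
    return value
```

In this model, a repeated `cached_lookup` for the same key is served from the buffer without adding any probes, and a program that modifies an indexed item would call `invalidate` on its key so that the next lookup goes back to the index.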

Cited By

Yao Y, Wang X, Zhou D, Li L, Wu J, Zhu L, Wang Z, Luo Y (2024). EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis. Euro-Par 2024: Parallel Processing, pp. 166-179. DOI: 10.1007/978-3-031-69577-3_12. Online publication date: 26-Aug-2024.
https://dl.acm.org/doi/10.1007/978-3-031-69577-3_12

Ye C, Xu Y, Shen X, Liao X, Jin H, Solihin Y (2021). Hardware-Based Address-Centric Acceleration of Key-Value Store. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 736-748. DOI: 10.1109/HPCA51647.2021.00067. Online publication date: Feb-2021.
https://doi.org/10.1109/HPCA51647.2021.00067

Wang K, Liu J, Chen F (2020). Put an elephant into a fridge. Proceedings of the VLDB Endowment 13:9, pp. 1540-1554. DOI: 10.14778/3397230.3397247. Online publication date: 26-Jun-2020.
https://dl.acm.org/doi/10.14778/3397230.3397247

Index Terms

SDC: a software defined cache for efficient data indexing
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing

June 2019

533 pages

ISBN:9781450360791

DOI:10.1145/3330345

General Chair: Rudolf Eigenmann (University of Delaware)

Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS '19

Sponsor:

SIGARCH

ICS '19: 2019 International Conference on Supercomputing

June 26 - 28, 2019

Phoenix, Arizona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
