METAL: Caching Multi-level Indexes in Domain-Specific Architectures

Authors:

Anagha Molakalmur Anil Kumar,

Aditya Prasanna,

Jonathan Balkind,

Arrvindh Shriraman

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

Pages 715 - 729

https://doi.org/10.1145/3620665.3640402

Published: 27 April 2024

Abstract

State-of-the-art domain-specific architectures (DSAs) work with sparse data and need hardware support for index data structures [31, 43, 57, 61]. Indexes are more space-efficient for sparse data and reduce DRAM bandwidth, provided data reuse can be managed. However, indexes exhibit dynamic accesses, chase pointers, and need to walk and search. This inflates the working set and thrashes the cache. We observe that the cache organization itself is responsible for this behavior.

We develop METAL, a portable caching idiom that enables DSAs to employ index data structures. METAL decouples reuse of the index metadata from data reuse and optimizes it independently. We propose two ideas: i) IX-Cache: a cache that leverages range tags to short-circuit index walks and reduce the working set. IX-Cache helps capture the trade-off between wider index nodes that maximize reach and those that are closer to the leaf and minimize walk latency. ii) Reuse Patterns: an interface to explicitly manage the cache. Patterns orchestrate cache insertions and bypasses as we dynamically traverse different index regions. METAL improves performance vs. streaming DSAs by 7.8×, address-caches by 4.1×, and the state-of-the-art DSA-cache [50] by 2.4×. We reduce DRAM energy by 1.6× vs. the prior state-of-the-art.
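To make the range-tag idea concrete, the following is a minimal software sketch, not the hardware design from the paper. It assumes a toy bulk-built multi-level index over integer keys, a small cache whose tags are key ranges rather than addresses, LRU replacement, and a fixed "insert nodes at one level" policy standing in for a reuse pattern. All names (`Node`, `build`, `RangeTagCache`, `lookup`, `insert_level`) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    lo: int                                       # smallest key under this node (inclusive)
    hi: int                                       # upper bound of keys under this node (exclusive)
    keys: list = field(default_factory=list)      # separators (internal) or stored keys (leaf)
    children: list = field(default_factory=list)  # empty for a leaf
    level: int = 0                                # 0 = leaf, grows toward the root


def build(sorted_keys, fanout=4):
    """Bulk-build a toy index over a non-empty list of sorted, unique integers."""
    chunks = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    nodes = [Node(c[0], c[-1] + 1, keys=list(c)) for c in chunks]
    level = 0
    while len(nodes) > 1:
        level += 1
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [Node(g[0].lo, g[-1].hi, keys=[c.lo for c in g[1:]],
                      children=g, level=level) for g in groups]
    return nodes[0]


def walk(node, key):
    """Walk from `node` down to a leaf; return (found, list of nodes visited)."""
    path = [node]
    while node.children:
        node = node.children[bisect_right(node.keys, key)]
        path.append(node)
    return key in node.keys, path


class RangeTagCache:
    """Toy cache tagged by key range [lo, hi); LRU replacement is an assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # cached nodes, least recently used first

    def probe(self, key):
        hits = [n for n in self.entries if n.lo <= key < n.hi]
        if not hits:
            return None
        best = min(hits, key=lambda n: n.level)  # prefer the node closest to the leaf
        self.entries.remove(best)
        self.entries.append(best)                # mark as most recently used
        return best

    def insert(self, node):
        if node in self.entries:
            return
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)                  # evict the least recently used entry
        self.entries.append(node)


def lookup(root, cache, key, insert_level=1):
    """Return (found, nodes_visited); a range-tag hit skips the upper levels."""
    start = cache.probe(key)
    if start is None:
        start = root                             # miss: full walk from the root
    found, path = walk(start, key)
    for n in path:
        if n.level == insert_level:              # hypothetical fixed insertion policy
            cache.insert(n)
    return found, len(path)


if __name__ == "__main__":
    root = build(list(range(64)), fanout=4)      # three levels: root, 4 inner nodes, 16 leaves
    cache = RangeTagCache(capacity=2)
    print(lookup(root, cache, 10))               # cold: walks all three levels
    print(lookup(root, cache, 13))               # range-tag hit on [0, 16): starts one level lower
```

In this sketch, choosing `insert_level` exposes the trade-off the abstract describes: caching a node higher in the index covers a wider key range per entry, while caching a node nearer the leaf shortens the remaining walk after a hit.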

References

[1]

Michael Adler, Kermin E. Fleming, Angshuman Parashar, Michael Pellauer, and Joel Emer. 2011. Leap Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic. In Proc. of the 19th FPGA (FPGA '11). New York, NY, USA.

[2]

Md Nur Ahmed. 2024. https://dev.to/mdnurahmed/simple-scalable-search-autocomplete-systems-1j18.

[3]

Tutu Ajayi, Vidya A Chhabria, Mateus Fogaça, Soheil Hashemi, Abdelrahman Hosny, Andrew B Kahng, Minsoo Kim, Jeongsup Lee, Uday Mallappa, Marina Neseem, et al. 2019. Toward an open-source digital flow: First learnings from the openroad project. In Proc. of the 56th Annual Design Automation Conference 2019. 1--4.

[4]

Daehyeon Baek, Soojin Hwang, Taekyung Heo, Daehoon Kim, and Jaehyuk Huh. 2021. InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-Aware Inner Product Processing. In 30th Int'l. Conf. on Parallel Architectures and Compilation.

[5]

Thomas W Barr, Alan L Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proc. of the 37th ISCA.

[6]

Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-Dimensional Page Walks for Virtualized Systems. SIGOPS Oper. Syst. Rev. 42, 2 (mar 2008), 26--35.

[7]

N. V. Vijaya Krishna Boppana and Saiyu Ren. 2016. A Low-Power and Area-Efficient 64-Bit Digital Comparator. J. Circuits Syst. Comput. 25, 12 (2016).

[8]

J. Carter, W. Hsieh, L. Stoller, M. Swanson, Lixin Zhang, E. Brunvand, A. Davis, Chen-Chi Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. 1999. Impulse: building a smarter memory controller. In Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[9]

Tao Chen and G Edward Suh. 2016. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. In Proc. of the 49th MICRO. 1--12.

[10]

Stephen Chou and Saman Amarasinghe. 2022. Compilation of dynamic sparse tensor algebra. Proc. of the ACM on Programming Languages 6, OOPSLA2, 1408--1437.

[11]

Chang Chua and R.B.N. Kumar. 2017. An Improved Design and Simulation of Low-Power and Area Efficient Parallel Binary Comparator. Microelectron. J. (aug 2017), 84--88.

[12]

Eric S Chung, James C Hoe, and Ken Mai. 2011. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proc. of the 19th FPGA.

[13]

Jason Clemons, Chih-Chi Cheng, Iuri Frosio, Daniel R Johnson, and Stephen W Keckler. 2016. A Patch Memory System for Image Processing and Computer Vision. In Proc. of the 49th MICRO. 1--13.

[14]

Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki. 2019. Towards general purpose acceleration by exploiting common data-dependence forms. In Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 924--939.

[15]

Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki. 2019. Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms. In Proc. of the 52nd MICRO. 924--939.

[16]

William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-Specific Hardware Accelerators. Commun. ACM 63, 7 (June 2020), 48--57.

[17]

E Ebrahimi, O Mutlu, and Y N Patt. 2009. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In Proc. of the 15th HPCA.

[18]

Edward A Fox, Qi Fan Chen, Amjad M Daoud, and Lenwood S Heath. 1991. Order-preserving minimal perfect hash functions and information retrieval. ACM Transactions on Information Systems (TOIS) 9, 3 (1991), 281--308.

[19]

Fabio Frustaci, Stefania Perri, Marco Lanuzza, and Pasquale Corsonello. 2012. Energy-efficient single-clock-cycle binary comparator. Int. J. Circuit Theory Appl. 40, 3 (2012), 237--246.

[20]

Daichi Fujiki, Niladrish Chatterjee, Donghyuk Lee, and Mike O'Connor. 2019. In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA, Article 55.

[21]

A González-Beltrán, Peter Milligan, and Paul Sage. 2008. Range queries over skip tree graphs. Computer Communications 31, 2 (2008), 358--374.

[22]

Goetz Graefe et al. 2011. Modern B-tree techniques. Foundations and Trends® in Databases 3, 4 (2011), 203--402.

[23]

Bingsheng He, Naga K. Govindaraju, Qiong Luo, and Burton Smith. 2007. Efficient gather and scatter operations on graphics processors. In SC '07: Proc. of the 2007 ACM/IEEE Conference on Supercomputing. 1--12.

[24]

John L. Hennessy and David A. Patterson. 2019. A New Golden Age for Computer Architecture. Commun. ACM 62, 2 (Jan. 2019), 48--60.

[25]

Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient execution of memory access phases using dataflow specialization. In Proc. of the 42nd ISCA. 118--130.

[26]

Ibrahim Kamel and Christos Faloutsos. 1992. Parallel R-trees. ACM SIGMOD Record 21, 2 (1992), 195--204.

[27]

Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, and Onur Mutlu. 2019. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations. In Proc. of the 52nd MICRO.

[28]

Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany, Jung Ho Ahn, Peter Mattson, and John D. Owens. 2003. Programmable Stream Processors. Computer 36, 8 (aug 2003), 54--62.

[29]

M Karlsson, F Dahlgren, and P Stenstrom. 2000. A Prefetching Technique for Irregular Accesses to Linked Data Structures. In Proc. of the 6th HPCA.

[30]

Michael S Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access path selection in main-memory optimized data systems: Should I scan or should I probe? In Proc. of the 2017 ACM International Conference on Management of Data. 715--730.

[31]

Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin T Lim, and Parthasarathy Ranganathan. 2013. Meet the walkers: accelerating index traversals for in-memory databases. In Proc. of the 46th MICRO. 468--479.

[32]

Rakesh Komuravelli, Matthew D Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V Adve, and Vikram S Adve. 2015. Stash: have your scratchpad and cache it too. In Proc. of the 42nd ISCA. 707--719.

[33]

Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware accelerator for software data structures. Proc. of the International Conference on Supercomputing 2015-June (2015), 361--371.

[34]

Jochen Liedtke and Kevin Elphinstone. 1996. Guarded Page Tables on Mips R4600 or an Exercise in Architecture-Dependent Micro Optimization. SIGOPS Oper. Syst. Rev. 30, 1 (jan 1996).

[35]

Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proc. of the 29th ACM on International Conference on Supercomputing. 339--350.

[36]

Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Prasant Singh Rawat, Sriram Krishnamoorthy, and P. Sadayappan. 2019. An Efficient Mixed-Mode Representation of Sparse Tensors. In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis.

[37]

Tayo Oguntebi and Kunle Olukotun. 2016. Graphops: A dataflow library for graph analytics acceleration. In Proc. of the FPGA.

[38]

Oracle. [n. d.]. Scans. https://databaseinternalmechanism.com/oracle-database-internals/index-lookup-unique-scanrange-scan-full-scan-fast-full-scan-skip-scan/.

[39]

Prashant Pandey, Brian Wheatman, Helen Xu, and Aydin Buluc. 2021. Terrace: A hierarchical graph container for skewed dynamic graphs. In Proc. of the 2021 International Conference on Management of Data. 1372--1385.

[40]

Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Clayton Crago, Kartik Hegde, Rangharajan Venkatesan, Stephen W Keckler, Christopher W Fletcher, and Joel S Emer. 2019. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration. In Proc. of the 24th ASPLOS. 137--151.

[41]

Stefania Perri and Pasquale Corsonello. 2008. Fast Low-Cost Implementation of Single-Clock-Cycle Binary Comparator. IEEE Transactions on Circuits and Systems II: Express Briefs 55, 12 (2008), 1239--1243.

[42]

Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matthew Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Patterns. In Proc. of the 44th ISCA. 389--402.

[43]

Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In IEEE International Symposium on High Performance Computer Architecture (HPCA).

[44]

Sriram Ramabhadran, Sylvia Ratnasamy, Joseph M Hellerstein, and Scott Shenker. 2004. Prefix hash tree: An indexing data structure over distributed hash tables. In Proc. of the 23rd ACM symposium on principles of distributed computing, Vol. 37. St. John's Newfoundland, Canada.

[45]

Redis. 2024. https://github.com/redis/redis/blob/unstable/src/t_zset.c.

[46]

Redis. 2024. https://redis.com/glossary/redis-sorted-sets/.

[47]

Samuel Rogers, Joshua Slycord, Mohammadreza Baharani, and Hamed Tabkhi. 2020. Gem5-SALAM: A System Architecture for LLVM-Based Accelerator Modeling. In Proc. of the 53rd MICRO. 471--482.

[48]

Amir Roth, Andreas Moshovos, and Gurindar S Sohi. 1998. Dependence Based Prefetching for Linked Data Structures. In Proc. of the 8th ASPLOS.

[49]

Alexander Rucker, Matthew Vilim, Tian Zhao, Yaqi Zhang, Raghu Prabhakar, and Kunle Olukotun. 2021. Capstan: A Vector RDA for Sparsity. In Proc. of the 54th MICRO. 1022--1035.

[50]

Ali Sedaghati, Milad Hakimi, Reza Hojabr, and Arrvindh Shriraman. 2022. X-cache: a modular architecture for domain-specific caches. In Proc. of the 49th Annual International Symposium on Computer Architecture. 396--409.

[51]

Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2015. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses. In Proc. of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 267--280.

[52]

Smrchy. [n. d.]. https://github.com/smrchy/redis-tagging.

[53]

Po-An Tsai, Nathan Beckmann, and Daniel Sanchez. 2017. Jenga: Software-Defined Cache Hierarchies. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 652--665.

[54]

Po An Tsai, Yee Ling Gan, and Daniel Sanchez. 2018. Rethinking the memory hierarchy for modern languages. Proc. of the Annual International Symposium on Microarchitecture, MICRO 2018-Octob (2018), 203--216.

[55]

Piyush Tyagi and Rishikesh Pandey. 2020. High-Speed and Area-Efficient Scalable N-bit Digital Comparator. IET Circuits, Devices & Systems 14, 4 (2020), 450--458.

[56]

Matthew Vilim, Alexander Rucker, Yaqi Zhang, Sophia Liu, and Kunle Olukotun. 2020. Gorgon: Accelerating Machine Learning from Relational Data. In Proc. of the 47th ISCA.

[57]

Matthew Vilim, Alexander Rucker, and Kunle Olukotun. 2021. Aurochs: An Architecture for Dataflow Threads. In Proc. of the 48th ISCA.

[58]

Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, and Tony Nowatzki. 2020. DSAGEN: Synthesizing Programmable Spatial Accelerators. In Proc. of the 47th ISCA. 268--281.

[59]

Idan Yaniv and Dan Tsafrir. 2016. Hash, Don't Cache (the Page Table). SIGMETRICS Perform. Eval. Rev. 44, 1 (jun 2016), 337--350.

[60]

Guowei Zhang, Nithya Attaluri, Joel S. Emer, and Daniel Sánchez. 2021. Gamma: leveraging Gustavson's algorithm to accelerate sparse matrix multiplication. In Proc. of the 26th ASPLOS.

[61]

Zhekai Zhang, Hanrui Wang, Song Han, and William J Dally. 2020. SpArch: Efficient architecture for sparse matrix multiplication. In Proc. of the 26th HPCA.

[62]

ZhangYunHao. 2024. https://github.com/zhangyunhao116/skipset.

Index Terms

METAL: Caching Multi-level Indexes in Domain-Specific Architectures
1. Computer systems organization
  1. Architectures
2. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
      1. Hardware-software codesign

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

April 2024

1299 pages

ISBN: 9798400703850

DOI: 10.1145/3620665

General Chairs:
Nael Abu-Ghazaleh,
Rajiv Gupta,
Program Chairs:
Madan Musuvathi,
Dan Tsafrir

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSERC CRD

Conference

ASPLOS '24

ASPLOS '24: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

April 27 - May 1, 2024

La Jolla, CA, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
