DOI: 10.1145/3620665.3640402

METAL: Caching Multi-level Indexes in Domain-Specific Architectures

Published: 27 April 2024

Abstract

State-of-the-art domain-specific architectures (DSAs) work with sparse data and need hardware support for index data structures [31, 43, 57, 61]. Indexes are more space-efficient for sparse data and reduce DRAM bandwidth, provided data reuse can be managed. However, indexes exhibit dynamic accesses, chase pointers, and need to walk and search. This inflates the working set and thrashes the cache. We observe that the cache organization itself is responsible for this behavior.
We develop METAL, a portable caching idiom that enables DSAs to employ index data structures. METAL decouples reuse of the index metadata from data reuse, and optimizes it independently. We propose two ideas: i) IX-Cache: A cache that leverages range tags to short-circuit index walks and reduce the working set. IX-Cache helps capture the trade-off between wider index nodes that maximize reach vs. those that are closer to the leaves and minimize walk latency. ii) Reuse Patterns: An interface to explicitly manage the cache. Patterns orchestrate cache insertions and bypasses as we dynamically traverse different index regions. METAL improves performance vs. streaming DSAs by 7.8×, address-caches by 4.1×, and state-of-the-art DSA-cache [50] by 2.4×. We reduce DRAM energy by 1.6× vs. prior state-of-the-art.
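
The idea of tagging cache entries with key ranges, so that a lookup can resume an index walk partway down the tree, can be illustrated with a small software model. The sketch below is not the hardware design from the paper: the toy static tree, the names `Node`, `build`, `walk`, `RangeTagCache`, and `search`, the LRU eviction, and the `pick` policy hook are all assumptions made for illustration. It only shows how a range-tag hit skips the upper levels of a walk, and how the choice of which node to insert trades reach against remaining walk length.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)  # identity-based equality: each node is its own cache entry
class Node:
    lo: int  # smallest key covered by this subtree
    hi: int  # largest key covered by this subtree (inclusive)
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def build(lo: int, hi: int, fanout: int) -> Node:
    """Build a toy static index over keys lo..hi (hypothetical layout)."""
    node = Node(lo, hi)
    span = hi - lo + 1
    if span > fanout:
        step = -(-span // fanout)  # ceiling division
        for start in range(lo, hi + 1, step):
            node.children.append(build(start, min(start + step - 1, hi), fanout))
    return node


def walk(start: Node, key: int) -> Tuple[Node, List[Node]]:
    """Descend from start to the leaf covering key; return (leaf, visited path)."""
    node, path = start, [start]
    while node.children:
        node = next(c for c in node.children if c.lo <= key <= c.hi)
        path.append(node)
    return node, path


class RangeTagCache:
    """Toy cache whose tags are key ranges instead of addresses (LRU assumed)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Node] = []  # most recently used entry is last

    def lookup(self, key: int) -> Optional[Node]:
        hits = [n for n in self.entries if n.lo <= key <= n.hi]
        if not hits:
            return None
        best = min(hits, key=lambda n: n.hi - n.lo)  # narrowest range is deepest
        self.entries.remove(best)
        self.entries.append(best)
        return best

    def insert(self, node: Node) -> None:
        if node in self.entries:
            return
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # evict the least recently used entry
        self.entries.append(node)


def search(
    root: Node,
    cache: RangeTagCache,
    key: int,
    pick: Callable[[List[Node]], Optional[Node]] = lambda path: path[-1],
) -> Tuple[Node, int]:
    """Look up key; return (leaf, number of nodes fetched past the cache)."""
    start = cache.lookup(key)
    hit = start is not None
    leaf, path = walk(start if hit else root, key)
    fetched = len(path) - 1 if hit else len(path)  # a cached start node is free
    chosen = pick(path)  # returning None models a bypass (no insertion)
    if chosen is not None:
        cache.insert(chosen)
    return leaf, fetched
```

With `build(0, 63, 4)` the toy tree has three levels. Under the default policy, which inserts the leaf, a repeated lookup in the same leaf range fetches no further nodes, but a neighbouring key outside that leaf misses and walks from the root again. Passing a `pick` that returns the leaf's parent instead caches a wider range: the neighbouring key then hits and needs one more fetch. This is only a software analogy for the reach versus walk-latency trade-off the abstract describes.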

References

[1]
Michael Adler, Kermin E. Fleming, Angshuman Parashar, Michael Pellauer, and Joel Emer. 2011. Leap Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic. In Proc. of the 19th FPGA (FPGA '11). New York, NY, USA.
[2]
Md Nur Ahmed. 2024. https://dev.to/mdnurahmed/simple-scalable-search-autocomplete-systems-1j18.
[3]
Tutu Ajayi, Vidya A Chhabria, Mateus Fogaça, Soheil Hashemi, Abdelrahman Hosny, Andrew B Kahng, Minsoo Kim, Jeongsup Lee, Uday Mallappa, Marina Neseem, et al. 2019. Toward an open-source digital flow: First learnings from the openroad project. In Proc. of the 56th Annual Design Automation Conference 2019. 1--4.
[4]
Daehyeon Baek, Soojin Hwang, Taekyung Heo, Daehoon Kim, and Jaehyuk Huh. 2021. InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-Aware Inner Product Processing. In 30th Int'l. Conf. on Parallel Architectures and Compilation.
[5]
Thomas W Barr, Alan L Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). In Proc. of the 37th ISCA.
[6]
Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-Dimensional Page Walks for Virtualized Systems. SIGOPS Oper. Syst. Rev. 42, 2 (mar 2008), 26--35.
[7]
N. V. Vijaya Krishna Boppana and Saiyu Ren. 2016. A Low-Power and Area-Efficient 64-Bit Digital Comparator. J. Circuits Syst. Comput. 25, 12 (2016).
[8]
J. Carter, W. Hsieh, L. Stoller, M. Swanson, Lixin Zhang, E. Brunvand, A. Davis, Chen-Chi Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. 1999. Impulse: building a smarter memory controller. In Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[9]
Tao Chen and G Edward Suh. 2016. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. In Proc. of the 49th MICRO. 1--12.
[10]
Stephen Chou and Saman Amarasinghe. 2022. Compilation of dynamic sparse tensor algebra. Proc. of the ACM on Programming Languages 6, OOPSLA2, 1408--1437.
[11]
Chang Chua and R.B.N. Kumar. 2017. An Improved Design and Simulation of Low-Power and Area Efficient Parallel Binary Comparator. Microelectron. J. (aug 2017), 84--88.
[12]
Eric S Chung, James C Hoe, and Ken Mai. 2011. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proc. of the 19th FPGA.
[13]
Jason Clemons, Chih-Chi Cheng, Iuri Frosio, Daniel R Johnson, and Stephen W Keckler. 2016. A Patch Memory System for Image Processing and Computer Vision. In Proc. of the 49th MICRO. 1--13.
[14]
Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki. 2019. Towards general purpose acceleration by exploiting common data-dependence forms. In Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 924--939.
[15]
Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki. 2019. Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms. In Proc. of the 52nd MICRO. 924--939.
[16]
William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-Specific Hardware Accelerators. Commun. ACM 63, 7 (June 2020), 48--57.
[17]
E Ebrahimi, O Mutlu, and Y N Patt. 2009. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In Proc. of the 15th HPCA.
[18]
Edward A Fox, Qi Fan Chen, Amjad M Daoud, and Lenwood S Heath. 1991. Order-preserving minimal perfect hash functions and information retrieval. ACM Transactions on Information Systems (TOIS) 9, 3 (1991), 281--308.
[19]
Fabio Frustaci, Stefania Perri, Marco Lanuzza, and Pasquale Corsonello. 2012. Energy-efficient single-clock-cycle binary comparator. Int. J. Circuit Theory Appl. 40, 3 (2012), 237--246.
[20]
Daichi Fujiki, Niladrish Chatterjee, Donghyuk Lee, and Mike O'Connor. 2019. In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA, Article 55.
[21]
A González-Beltrán, Peter Milligan, and Paul Sage. 2008. Range queries over skip tree graphs. Computer Communications 31, 2 (2008), 358--374.
[22]
Goetz Graefe et al. 2011. Modern B-tree techniques. Foundations and Trends® in Databases 3, 4 (2011), 203--402.
[23]
Bingsheng He, Naga K. Govindaraju, Qiong Luo, and Burton Smith. 2007. Efficient gather and scatter operations on graphics processors. In SC '07: Proc. of the 2007 ACM/IEEE Conference on Supercomputing. 1--12.
[24]
John L. Hennessy and David A. Patterson. 2019. A New Golden Age for Computer Architecture. Commun. ACM 62, 2 (Jan. 2019), 48--60.
[25]
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient execution of memory access phases using dataflow specialization. In Proc. of the 42nd ISCA. 118--130.
[26]
Ibrahim Kamel and Christos Faloutsos. 1992. Parallel R-trees. ACM SIGMOD Record 21, 2 (1992), 195--204.
[27]
Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, and Onur Mutlu. 2019. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations. In Proc. of the 52nd MICRO.
[28]
Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany, Jung Ho Ahn, Peter Mattson, and John D. Owens. 2003. Programmable Stream Processors. Computer 36, 8 (aug 2003), 54--62.
[29]
M Karlsson, F Dahlgren, and P Stenstrom. 2000. A Prefetching Technique for Irregular Accesses to Linked Data Structures. In Proc. of the 6th HPCA.
[30]
Michael S Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access path selection in main-memory optimized data systems: Should I scan or should I probe?. In Proc. of the 2017 ACM International Conference on Management of Data. 715--730.
[31]
Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin T Lim, and Parthasarathy Ranganathan. 2013. Meet the walkers: accelerating index traversals for in-memory databases. In Proc. of the 46th MICRO. 468--479.
[32]
Rakesh Komuravelli, Matthew D Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V Adve, and Vikram S Adve. 2015. Stash: have your scratchpad and cache it too. In Proc. of the 42nd ISCA. 707--719.
[33]
Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware accelerator for software data structures. Proc. of the International Conference on Supercomputing 2015-June (2015), 361--371.
[34]
Jochen Liedtke and Kevin Elphinstone. 1996. Guarded Page Tables on Mips R4600 or an Exercise in Architecture-Dependent Micro Optimization. SIGOPS Oper. Syst. Rev. 30, 1 (jan 1996).
[35]
Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proc. of the 29th ACM on International Conference on Supercomputing. 339--350.
[36]
Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Prasant Singh Rawat, Sriram Krishnamoorthy, and P. Sadayappan. 2019. An Efficient Mixed-Mode Representation of Sparse Tensors. In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis.
[37]
Tayo Oguntebi and Kunle Olukotun. 2016. Graphops: A dataflow library for graph analytics acceleration. In Proc. of the FPGA.
[38]
Oracle. [n. d.]. Scans. https://databaseinternalmechanism.com/oracle-database-internals/index-lookup-unique-scanrange-scan-full-scan-fast-full-scan-skip-scan/.
[39]
Prashant Pandey, Brian Wheatman, Helen Xu, and Aydin Buluc. 2021. Terrace: A hierarchical graph container for skewed dynamic graphs. In Proc. of the 2021 International Conference on Management of Data. 1372--1385.
[40]
Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Clayton Crago, Kartik Hegde, Rangharajan Venkatesan, Stephen W Keckler, Christopher W Fletcher, and Joel S Emer. 2019. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration. In Proc. of the 24th ASPLOS. 137--151.
[41]
Stefania Perri and Pasquale Corsonello. 2008. Fast Low-Cost Implementation of Single-Clock-Cycle Binary Comparator. IEEE Transactions on Circuits and Systems II: Express Briefs 55, 12 (2008), 1239--1243.
[42]
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matthew Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Patterns. In Proc. of the 44th ISCA. 389--402.
[43]
Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In IEEE International Symposium on High Performance Computer Architecture (HPCA).
[44]
Sriram Ramabhadran, Sylvia Ratnasamy, Joseph M Hellerstein, and Scott Shenker. 2004. Prefix hash tree: An indexing data structure over distributed hash tables. In Proc. of the 23rd ACM symposium on principles of distributed computing, Vol. 37. St. John's Newfoundland, Canada.
[45]
Redis. 2024. https://github.com/redis/redis/blob/unstable/src/t_zset.c.
[46]
Redis. 2024. https://redis.com/glossary/redis-sorted-sets/.
[47]
Samuel Rogers, Joshua Slycord, Mohammadreza Baharani, and Hamed Tabkhi. 2020. Gem5-SALAM: A System Architecture for LLVM-Based Accelerator Modeling. In Proc. of the 53rd MICRO. 471--482.
[48]
Amir Roth, Andreas Moshovos, and Gurindar S Sohi. 1998. Dependence Based Prefetching for Linked Data Structures. In Proc. of the 8th ASPLOS.
[49]
Alexander Rucker, Matthew Vilim, Tian Zhao, Yaqi Zhang, Raghu Prabhakar, and Kunle Olukotun. 2021. Capstan: A Vector RDA for Sparsity. In Proc. of the 54th MICRO. 1022--1035.
[50]
Ali Sedaghati, Milad Hakimi, Reza Hojabr, and Arrvindh Shriraman. 2022. X-cache: a modular architecture for domain-specific caches. In Proc. of the 49th Annual International Symposium on Computer Architecture. 396--409.
[51]
Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2015. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses. In Proc. of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 267--280.
[52]
Smrchy. [n. d.]. https://github.com/smrchy/redis-tagging.
[53]
Po-An Tsai, Nathan Beckmann, and Daniel Sanchez. 2017. Jenga: Software-Defined Cache Hierarchies. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 652--665.
[54]
Po An Tsai, Yee Ling Gan, and Daniel Sanchez. 2018. Rethinking the memory hierarchy for modern languages. Proc. of the Annual International Symposium on Microarchitecture, MICRO 2018-Octob (2018), 203--216.
[55]
Piyush Tyagi and Rishikesh Pandey. 2020. High-Speed and Area-Efficient Scalable N-bit Digital Comparator. IET Circuits, Devices & Systems 14, 4 (2020), 450--458.
[56]
Matthew Vilim, Alexander Rucker, Yaqi Zhang, Sophia Liu, and Kunle Olukotun. 2020. Gorgon: Accelerating Machine Learning from Relational Data. In Proc. of the 47th ISCA.
[57]
Matthew Vilim, Alexander Rucker, and Kunle Olukotun. 2021. Aurochs: An Architecture for Dataflow Threads. In Proc. of the 48th ISCA.
[58]
Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, and Tony Nowatzki. 2020. DSAGEN: Synthesizing Programmable Spatial Accelerators. In Proc. of the 47th ISCA. 268--281.
[59]
Idan Yaniv and Dan Tsafrir. 2016. Hash, Don't Cache (the Page Table). SIGMETRICS Perform. Eval. Rev. 44, 1 (jun 2016), 337--350.
[60]
Guowei Zhang, Nithya Attaluri, Joel S. Emer, and Daniel Sánchez. 2021. Gamma: leveraging Gustavson's algorithm to accelerate sparse matrix multiplication. In Proc. of the 26th ASPLOS.
[61]
Zhekai Zhang, Hanrui Wang, Song Han, and William J Dally. 2020. SpArch: Efficient architecture for sparse matrix multiplication. In Proc. of the 26th HPCA.
[62]
ZhangYunHao. 2024. https://github.com/zhangyunhao116/skipset.

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
April 2024
1299 pages
ISBN:9798400703850
DOI:10.1145/3620665
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. domain-specific architectures
  2. caches
  3. dataflow architectures
  4. indexes

Qualifiers

  • Research-article

Funding Sources

  • NSERC CRD

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 325
  • Downloads (Last 12 months): 325
  • Downloads (Last 6 weeks): 29

Reflects downloads up to 27 Dec 2024
