DOI: 10.1145/3613424.3614284

A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors

Published: 08 December 2023

Abstract

This paper proposes the Tensor Marshaling Unit (TMU), a near-core programmable dataflow engine for multicore architectures that accelerates tensor traversals and merging, the most critical operations of sparse tensor workloads running on today's computing infrastructures. The TMU leverages a novel multi-lane design that enables parallel tensor loading and merging, which naturally produces vector operands that are marshaled into the core for efficient SIMD computation. The TMU supports all the primitives necessary to be tensor-format and tensor-algebra complete. We evaluate the TMU on a simulated multicore system using a broad set of tensor algebra workloads, achieving 3.6×, 2.8×, and 4.9× speedups over memory-intensive, compute-intensive, and merge-intensive vectorized software implementations, respectively.
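
The "tensor merging" the abstract refers to can be illustrated in software. The sketch below is not the TMU itself (the TMU is a hardware unit that performs this in parallel lanes); it is a minimal Python analogue of the merge primitive at the heart of sparse tensor algebra: a two-pointer intersection of two sorted coordinate streams, as used in a sparse-sparse dot product. All names here (`merge_intersect` and its arguments) are illustrative, not from the paper.

```python
def merge_intersect(idx_a, val_a, idx_b, val_b):
    """Two-pointer merge of two sorted (index, value) coordinate streams.

    Emits (index, a*b) pairs where both operands are nonzero. This is the
    kind of traversal-and-merge loop a marshaling unit would perform before
    handing dense vector operands to the core's SIMD units.
    """
    out_idx, out_val = [], []
    i = j = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:      # coordinates match: emit a pair
            out_idx.append(idx_a[i])
            out_val.append(val_a[i] * val_b[j])
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:     # advance the lagging stream
            i += 1
        else:
            j += 1
    return out_idx, out_val

# Example: sparse dot product of two sparse vectors.
idx, val = merge_intersect([0, 2, 5, 7], [1.0, 2.0, 3.0, 4.0],
                           [2, 3, 5, 8], [10.0, 20.0, 30.0, 40.0])
print(idx, sum(val))  # indices [2, 5]; dot product 2*10 + 3*30 = 110.0
```

In scalar software this loop is branch-heavy and resists vectorization, which is why the abstract reports the largest speedup (4.9×) on merge-intensive workloads.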



    Published In

    MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
    October 2023
    1528 pages
    ISBN:9798400703294
    DOI:10.1145/3613424

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Dataflow accelerator
    2. parallel tensor traversal
    3. sparse tensor algebra
    4. tensor merging
    5. vectorization

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MICRO '23
    Acceptance Rates

Overall Acceptance Rate: 484 of 2,242 submissions, 22%
