
SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

Published: 28 February 2022

Abstract

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, making them a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV is one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies for SpMV on a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads, and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels that perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels that strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
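As context for the kernels the library provides, the sketch below shows a minimal, sequential C implementation of SpMV over the CSR format, together with a nonzero-balanced 1D row partitioner in the spirit of the load-balancing schemes the paper evaluates across PIM cores. This is an illustrative sketch only: the names csr_matrix_t, spmv_csr, and partition_rows_by_nnz are hypothetical and do not reflect SparseP's actual API.

```c
#include <stdio.h>

/* CSR (Compressed Sparse Row): the nonzeros of row i live in
 * values[row_ptr[i] .. row_ptr[i+1]-1]; col_idx holds their column
 * indices at the same positions. row_ptr has n_rows + 1 entries.
 * (Hypothetical names for illustration, not SparseP's API.)      */
typedef struct {
    int          n_rows;
    const int   *row_ptr;
    const int   *col_idx;
    const float *values;
} csr_matrix_t;

/* y = A * x: one dot product per row. The irregular, input-dependent
 * gathers from x are what make SpMV memory-bound. */
void spmv_csr(const csr_matrix_t *A, const float *x, float *y)
{
    for (int i = 0; i < A->n_rows; i++) {
        float acc = 0.0f;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            acc += A->values[k] * x[A->col_idx[k]];
        y[i] = acc;
    }
}

/* Split the rows into n_cores contiguous ranges holding roughly
 * nnz / n_cores nonzeros each; core c gets rows
 * row_start[c] .. row_start[c+1]-1 (row_start has n_cores + 1 entries). */
void partition_rows_by_nnz(const csr_matrix_t *A, int n_cores, int *row_start)
{
    int nnz    = A->row_ptr[A->n_rows];
    int target = (nnz + n_cores - 1) / n_cores;  /* ceil(nnz / n_cores) */
    int core   = 1;

    row_start[0] = 0;
    for (int i = 0; i < A->n_rows && core < n_cores; i++)
        if (A->row_ptr[i + 1] >= core * target)
            row_start[core++] = i + 1;
    while (core <= n_cores)                      /* close remaining ranges */
        row_start[core++] = A->n_rows;
}

int main(void)
{
    /* A = [10  0  0]   five nonzeros, three rows
     *     [ 0 20 30]
     *     [40  0 50]                              */
    const int   row_ptr[] = {0, 1, 3, 5};
    const int   col_idx[] = {0, 1, 2, 0, 2};
    const float values[]  = {10, 20, 30, 40, 50};
    const csr_matrix_t A  = {3, row_ptr, col_idx, values};

    const float x[] = {1, 2, 3};
    float       y[3];
    spmv_csr(&A, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* y = [10 130 190] */

    int row_start[3];                               /* 2 cores */
    partition_rows_by_nnz(&A, 2, row_start);
    printf("core 0: rows [%d, %d), core 1: rows [%d, %d)\n",
           row_start[0], row_start[1], row_start[1], row_start[2]);
    return 0;
}
```

Balancing nonzeros rather than row counts matters because row lengths in sparse matrices are often highly skewed; an equal-rows split can leave a few cores with most of the work.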




Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 6, Issue 1
March 2022, 695 pages
EISSN: 2476-1249
DOI: 10.1145/3522731

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 February 2022
Published in POMACS Volume 6, Issue 1


Author Tags

  1. benchmarking
  2. data movement bottleneck
  3. dram
  4. high-performance computing
  5. hpc
  6. memory systems
  7. multicore
  8. near-data processing
  9. processing-in-memory
  10. real-system characterization
  11. sparse matrix-vector multiplication
  12. spmv
  13. spmv library
  14. workload characterization

Qualifiers

  • Research-article


Cited By

  • (2025) HyperMR: Efficient Hypergraph-enhanced Matrix Storage on Compute-in-Memory Architecture. Proceedings of the ACM on Management of Data 3(1), 1-27. DOI: 10.1145/3709695. Online publication date: 11-Feb-2025.
  • (2025) Privacy-Preserving Outsourced Computation of Collaborative Operational Decisions Among Microgrids in an Active Distribution Network. IEEE Transactions on Power Systems 40(1), 850-865. DOI: 10.1109/TPWRS.2024.3407970. Online publication date: Jan-2025.
  • (2024) Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product. ACM Transactions on Architecture and Code Optimization 21(4), 1-25. DOI: 10.1145/3688612. Online publication date: 20-Nov-2024.
  • (2024) ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT), 246-257. DOI: 10.1145/3656019.3689905. Online publication date: 14-Oct-2024.
  • (2024) PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT), 201-218. DOI: 10.1145/3656019.3676947. Online publication date: 14-Oct-2024.
  • (2024) Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23(3), 1-24. DOI: 10.1145/3650729. Online publication date: 25-Apr-2024.
  • (2024) Scalability Limitations of Processing-in-Memory using Real System Evaluations. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8(1), 1-28. DOI: 10.1145/3639046. Online publication date: 21-Feb-2024.
  • (2024) Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS), 1200-1217. DOI: 10.1145/3620665.3640427. Online publication date: 27-Apr-2024.
  • (2024) PhD Forum: Efficient Privacy-Preserving Processing via Memory-Centric Computing. In 2024 43rd International Symposium on Reliable Distributed Systems (SRDS), 322-325. DOI: 10.1109/SRDS64841.2024.00039. Online publication date: 30-Sep-2024.
  • (2024) Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption: Practical Experience Report. In 2024 43rd International Symposium on Reliable Distributed Systems (SRDS), 92-103. DOI: 10.1109/SRDS64841.2024.00019. Online publication date: 30-Sep-2024.
