Efficient SpMV Operation for Large and Highly Sparse Matrices using Scalable Multi-way Merge Parallelization

Authors:

Franz Franchetti

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 347 - 358

https://doi.org/10.1145/3352460.3358330

Published: 12 October 2019

Abstract

The importance of the sparse matrix dense vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to the development of custom accelerators intended to overcome the difficulties of sparse data operations on general-purpose architectures. However, efficient SpMV on large problems (i.e., where the working set exceeds on-chip storage) is severely constrained, because scaling depends strongly on a limited amount of fast random-access memory. Additionally, unstructured matrices with high sparsity pose difficulties, as most solutions rely on exploiting data locality. This work presents an algorithm co-optimized, scalable hardware architecture that can efficiently operate on very large (~billion nodes) and/or highly sparse (avg. degree <10) graphs with significantly less on-chip fast memory than existing solutions. A novel parallelization methodology for implementing large, high-throughput multi-way merge networks is the key enabler of this high-performance SpMV accelerator. Additionally, a data compression scheme to reduce off-chip traffic and a special computation for nodes with an exceptionally large number of edges, commonly found in power-law graphs, are presented. The accelerator is demonstrated on a 16-nm fabricated ASIC and on Stratix® 10 FPGA platforms. Experimental results show more than an order of magnitude improvement over current custom hardware solutions and more than two orders of magnitude improvement over commercial off-the-shelf (COTS) architectures in both performance and energy efficiency.
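The abstract names multi-way merge as the key enabler but does not spell out the algorithm. As a rough software illustration only, and not the authors' hardware design, the sketch below shows one common way a merge can drive SpMV: partial products are generated per column block as runs sorted by row index, then a multi-way merge streams them out in row order while accumulating. The function name `spmv_by_merge`, the `(row, col, value)` input layout, and the `block_width` parameter are assumptions made for this example.

```python
import heapq
from collections import defaultdict


def spmv_by_merge(entries, x, n_rows, block_width):
    """Compute y = A @ x, where `entries` holds A's nonzeros as (row, col, value).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Step 1: group nonzeros into column blocks so each block reads only a
    # small slice of x, and emit partial products keyed by row index.
    blocks = defaultdict(list)
    for row, col, val in entries:
        blocks[col // block_width].append((row, val * x[col]))

    # Sort each block's partial products by row, giving one sorted run per block.
    runs = [sorted(run, key=lambda p: p[0]) for run in blocks.values()]

    # Step 2: multi-way merge of all runs by row index, accumulating the
    # partial products that land on the same output row.
    y = [0.0] * n_rows
    for row, partial in heapq.merge(*runs, key=lambda p: p[0]):
        y[row] += partial
    return y


# Small 4x4 example with two column blocks of width 2.
entries = [(0, 0, 2.0), (0, 3, 1.0), (1, 2, 3.0),
           (2, 1, 4.0), (3, 0, 5.0), (3, 3, 6.0)]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_by_merge(entries, x, n_rows=4, block_width=2)
```

Because the merged stream is ordered by row, entries for the same row arrive consecutively, so a streaming implementation could accumulate adjacent equal keys instead of indexing into a full output vector as this sketch does for brevity.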

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 2019

1104 pages

ISBN: 9781450369381

DOI: 10.1145/3352460

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '52

Sponsor:

SIGMICRO

MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 12 - 16, 2019

Columbus, OH, USA

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
