DOI: 10.1145/3352460.3358330
research-article

Efficient SpMV Operation for Large and Highly Sparse Matrices using Scalable Multi-way Merge Parallelization

Published: 12 October 2019

    Abstract

    The importance of the Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to the development of custom accelerators intended to overcome the difficulties of sparse data operations on general-purpose architectures. However, efficient SpMV operation on large problems (i.e., when the working set exceeds on-chip storage) is severely constrained, because scaling depends strongly on a limited amount of fast random-access memory. Additionally, unstructured matrices with high sparsity pose difficulties, as most solutions rely on exploiting data locality. This work presents an algorithm co-optimized, scalable hardware architecture that can efficiently operate on very large (~billion nodes) and/or highly sparse (avg. degree <10) graphs with significantly less on-chip fast memory than existing solutions. A novel parallelization methodology for implementing large, high-throughput multi-way merge networks is the key enabler of this high-performance SpMV accelerator. Additionally, a data compression scheme to reduce off-chip traffic and a special computation for nodes with an exceptionally large number of edges, commonly found in power-law graphs, are presented. The accelerator is demonstrated with a fabricated 16-nm ASIC and Stratix® 10 FPGA platforms. Experimental results show more than an order of magnitude improvement over current custom hardware solutions and more than two orders of magnitude improvement over commercial off-the-shelf (COTS) architectures in both performance and energy efficiency.
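The abstract names multi-way merge as the key enabler but does not describe how a merge produces an SpMV result. As a rough software illustration only, and not the paper's hardware design, the sketch below assumes one common formulation: each matrix column is scaled by its vector element to form a stream of (row, product) pairs sorted by row, and a multi-way merge then brings equal row indices together so they can be summed in a single streaming pass without random access to the output. The function name `spmv_by_merge` and the `columns` layout are hypothetical choices made for this example.

```python
import heapq


def spmv_by_merge(columns, x, n_rows):
    """Compute y = A @ x by merging per-column partial products.

    Assumed layout (hypothetical, for illustration): columns[j] holds the
    nonzeros of column j of A as (row, value) pairs sorted by row.
    """
    # Step 1: scale column j by x[j]; each result is a stream sorted by row.
    streams = [
        [(row, val * x[j]) for row, val in col]
        for j, col in enumerate(columns)
        if x[j] != 0
    ]

    # Step 2: multi-way merge by row index. Equal rows arrive back to back,
    # so a single running accumulator is enough.
    y = [0.0] * n_rows
    current_row, acc = None, 0.0
    for row, prod in heapq.merge(*streams, key=lambda pair: pair[0]):
        if row != current_row:
            if current_row is not None:
                y[current_row] = acc  # flush the finished row
            current_row, acc = row, 0.0
        acc += prod
    if current_row is not None:
        y[current_row] = acc  # flush the last row
    return y


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]] stored column by column.
cols = [[(0, 2.0), (2, 4.0)], [(1, 3.0)], [(0, 1.0), (2, 5.0)]]
print(spmv_by_merge(cols, [1.0, 2.0, 3.0], 3))
```

In this sketch the merge is the only step that combines data from different columns, which is why the number of merge inputs and the merge throughput dominate the cost as the matrix grows.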

    Published In

    MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
    October 2019
    1104 pages
    ISBN: 9781450369381
    DOI: 10.1145/3352460
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. SpMV
    2. custom hardware
    3. merge parallelization
    4. sparse matrices

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MICRO '52

    Acceptance Rates

    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Article Metrics

    • Downloads (Last 12 months): 176
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 26 Jul 2024.

    Citations

    Cited By

    • (2024) Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Transactions on Architecture and Code Optimization 21:2 (1-26). DOI: 10.1145/3640542. Online publication date: 17-Jan-2024
    • (2024) FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (349-366). DOI: 10.1145/3620666.3651336. Online publication date: 27-Apr-2024
    • (2024) Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4 (1177-1190). DOI: 10.1109/TCAD.2023.3337208. Online publication date: Apr-2024
    • (2024) SparseACC: A Generalized Linear Model Accelerator for Sparse Datasets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:3 (840-853). DOI: 10.1109/TCAD.2023.3324276. Online publication date: Mar-2024
    • (2024) Optimal Scheduling for the Performance Optimization of SpMV Computation using Machine Learning Techniques. 2024 7th International Conference on Information and Computer Technologies (ICICT) (99-104). DOI: 10.1109/ICICT62343.2024.00022. Online publication date: 15-Mar-2024
    • (2024) SpChar: Characterizing the sparse puzzle via decision trees. Journal of Parallel and Distributed Computing (104941). DOI: 10.1016/j.jpdc.2024.104941. Online publication date: Jun-2024
    • (2023) Streaming Sparse Data on Architectures with Vector Extensions using Near Data Processing. Proceedings of the International Symposium on Memory Systems (1-12). DOI: 10.1145/3631882.3631898. Online publication date: 2-Oct-2023
    • (2023) Spatula: A Hardware Accelerator for Sparse Matrix Factorization. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (91-104). DOI: 10.1145/3613424.3623783. Online publication date: 28-Oct-2023
    • (2023) Pipestitch: An energy-minimal dataflow architecture with lightweight threads. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (1409-1422). DOI: 10.1145/3613424.3614283. Online publication date: 28-Oct-2023
    • (2023) A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56:1 (1-38). DOI: 10.1145/3604606. Online publication date: 28-Aug-2023
