DOI: 10.1145/3229710.3229720 (ICPP conference proceedings, research article)

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures

Published: 13 August 2018

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks in memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Unlike previous studies, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table-based and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill our in-depth evaluation results into a recipe for selecting the best SpGEMM algorithm for a target scenario. A critical finding is that hash-table-based SpGEMM gains a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
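The hash-table-based approach the abstract describes follows the classic row-wise (Gustavson-style) formulation of SpGEMM: each output row is accumulated in a hash table keyed by column index, and sorting that row is an optional final step. The sketch below illustrates the idea under that assumption; the CSR layout, function name, and `sort_rows` flag are illustrative, not the paper's actual implementation.

```python
# Row-wise (Gustavson-style) SpGEMM with a hash-table accumulator.
# A and B are given in CSR form: row-pointer, column-index, and value arrays.

def spgemm_hash(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, sort_rows=True):
    """Compute C = A * B for CSR matrices A and B; returns CSR arrays of C."""
    c_ptr, c_idx, c_val = [0], [], []
    n_rows = len(a_ptr) - 1
    for i in range(n_rows):
        acc = {}  # hash table: output column index -> accumulated value
        # For each nonzero A[i, k], scale row k of B and merge into the table.
        for k in range(a_ptr[i], a_ptr[i + 1]):
            col_a, val_a = a_idx[k], a_val[k]
            for j in range(b_ptr[col_a], b_ptr[col_a + 1]):
                acc[b_idx[j]] = acc.get(b_idx[j], 0.0) + val_a * b_val[j]
        # Sorting each row is optional; skipping it corresponds to the
        # unsorted-output mode the abstract reports as a significant boost.
        cols = sorted(acc) if sort_rows else list(acc)
        c_idx.extend(cols)
        c_val.extend(acc[c] for c in cols)
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

For example, multiplying A = [[1, 2], [0, 3]] (CSR: ptr [0, 2, 3], idx [0, 1, 1], val [1, 2, 3]) by the 2x2 identity returns A itself in CSR form.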




    Published In

    cover image ACM Other conferences
    ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing
    August 2018
    409 pages
    ISBN:9781450365239
    DOI:10.1145/3229710
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    In-Cooperation

• University of Oregon

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Intel KNL
    2. SpGEMM
    3. Sparse matrix

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPP '18 Comp

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%


    Article Metrics

    • Downloads (Last 12 months)51
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 11 Jan 2025


    Cited By

    • (2024) POSTER: Optimizing Sparse Tensor Contraction with Revisiting Hash Table Design. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 457-459. DOI: 10.1145/3627535.3638500. Online publication date: 2-Mar-2024.
    • (2024) SPMSD: An Partitioning-Strategy for Parallel General Sparse Matrix-Matrix Multiplication on GPU. Parallel Processing Letters 34:02. DOI: 10.1142/S012962642450004X. Online publication date: 27-May-2024.
    • (2024) Secure and efficient general matrix multiplication on cloud using homomorphic encryption. The Journal of Supercomputing 80:18, 26394-26434. DOI: 10.1007/s11227-024-06428-8. Online publication date: 26-Aug-2024.
    • (2023) A New Sparse GEneral Matrix-matrix Multiplication Method for Long Vector Architecture by Hierarchical Row Merging. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 756-759. DOI: 10.1145/3624062.3625131. Online publication date: 12-Nov-2023.
    • (2023) A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 1332-1346. DOI: 10.1145/3613424.3614284. Online publication date: 28-Oct-2023.
    • (2023) SAGE: A Storage-Based Approach for Scalable and Efficient Sparse Generalized Matrix-Matrix Multiplication. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 923-933. DOI: 10.1145/3583780.3615044. Online publication date: 21-Oct-2023.
    • (2023) Algorithm 1037: SuiteSparse:GraphBLAS: Parallel Graph Algorithms in the Language of Sparse Linear Algebra. ACM Transactions on Mathematical Software 49:3, 1-30. DOI: 10.1145/3577195. Online publication date: 19-Sep-2023.
    • (2023) Fast Sparse GPU Kernels for Accelerated Training of Graph Neural Networks. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 501-511. DOI: 10.1109/IPDPS54959.2023.00057. Online publication date: May-2023.
    • (2023) DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems. 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 194-202. DOI: 10.1109/HiPC58850.2023.00037. Online publication date: 18-Dec-2023.
    • (2023) Optimizing massively parallel sparse matrix computing on ARM many-core processor. Parallel Computing 117, 103035. DOI: 10.1016/j.parco.2023.103035. Online publication date: Sep-2023.
