DOI: 10.1145/3210259.3210263
Research article · Public Access

Regularizing irregularity: bitmap-based and portable sparse matrix multiplication for graph data on GPUs

Published: 10 June 2018

Abstract

Graphs can be naturally represented as sparse matrices. The relationship between graph algorithms and linear algebra is well understood, and many graph problems can be abstracted as Sparse General Matrix-Matrix Multiplication (SpGEMM) operations. Although many matrix storage formats, including bitmap-based ones, have been proposed for sparse matrices, they have mostly been evaluated on the simpler Sparse Matrix-Vector Multiplication (SpMV) problem. In this study, we develop data parallel algorithms that pair up bitmap-indexed sparse matrix blocks for SpGEMM, built on data parallel primitives for portability. In experiments on the WebBase-1M dataset, which has more than one million rows and columns and three million non-zero values, our technique squares the large sparse matrix in about 300 milliseconds on a 2013 GTX Titan GPU. This runtime is 2.4X faster than CUSP and 3.5X faster than bhSPARSE, the two leading open-source SpGEMM packages. Furthermore, our bitmap-indexed sparse matrix blocks can be efficiently converted to small regular dense matrices, which can subsequently exploit new hardware accelerations, such as the tensor cores in Nvidia Volta GPUs and Google Tensor Processing Units (TPUs), for more efficient implementations.
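To make the idea concrete, the following is a minimal sketch of a bitmap-indexed block format and block pairing for SpGEMM. It is an illustration of the general technique, not the paper's exact format or its GPU implementation: the tile size, the dictionary-based block storage, and the function names (`to_bitmap_tiles`, `tile_to_dense`, `spgemm`) are all assumptions made for this example, and the data parallel primitives of the actual GPU algorithm are replaced by plain loops.

```python
# Illustrative sketch: a sparse matrix is cut into 4x4 tiles; each non-empty
# tile is stored as a 16-bit occupancy bitmap plus its non-zero values in
# row-major order. SpGEMM then reduces to pairing tiles A[i,k] with tiles
# B[k,j] on the shared index k and multiplying small dense blocks.
import numpy as np

TILE = 4  # tile edge length; an 8x8 tile would give a 64-bit bitmap

def to_bitmap_tiles(m):
    """Convert a dense numpy matrix into {(tile_row, tile_col): (bitmap, values)}."""
    tiles = {}
    for ti in range(0, m.shape[0], TILE):
        for tj in range(0, m.shape[1], TILE):
            block = m[ti:ti+TILE, tj:tj+TILE]
            if not block.any():
                continue  # empty tiles are simply absent
            bitmap, values = 0, []
            for r in range(TILE):
                for c in range(TILE):
                    if block[r, c] != 0:
                        bitmap |= 1 << (r * TILE + c)
                        values.append(block[r, c])
            tiles[(ti // TILE, tj // TILE)] = (bitmap, values)
    return tiles

def tile_to_dense(bitmap, values):
    """Expand one bitmap-indexed tile back into a small dense block, as one
    would before handing it to a dense accelerator such as a tensor core."""
    block = np.zeros((TILE, TILE))
    it = iter(values)
    for pos in range(TILE * TILE):
        if bitmap >> pos & 1:
            block[pos // TILE, pos % TILE] = next(it)
    return block

def spgemm(tiles_a, tiles_b, n_tiles):
    """Multiply two tiled sparse matrices by pairing A[i,k] with B[k,j]."""
    out = {}
    for (i, k), (bm_a, v_a) in tiles_a.items():
        for j in range(n_tiles):
            if (k, j) in tiles_b:
                bm_b, v_b = tiles_b[(k, j)]
                prod = tile_to_dense(bm_a, v_a) @ tile_to_dense(bm_b, v_b)
                out[(i, j)] = out.get((i, j), np.zeros((TILE, TILE))) + prod
    return out
```

The expansion step in `tile_to_dense` mirrors the abstract's closing observation: because each tile is a fixed-size dense block once expanded, the per-tile multiplications map naturally onto dense hardware units.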




Published In

GRADES-NDA '18: Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA)
June 2018
94 pages
ISBN: 9781450356954
DOI: 10.1145/3210259
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU
  2. bitmap-indexing
  3. data parallel design
  4. graph operations
  5. sparse matrix multiplication

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18

Acceptance Rates

GRADES-NDA '18 paper acceptance rate: 10 of 26 submissions, 38%
Overall acceptance rate: 29 of 61 submissions, 48%

Article Metrics

  • Downloads (last 12 months): 250
  • Downloads (last 6 weeks): 36
Reflects downloads up to 16 Feb 2025

Cited By

  • (2024) BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration. IEEE Transactions on Parallel and Distributed Systems 35(12):2435-2448, Dec 2024. DOI: 10.1109/TPDS.2024.3477746
  • (2024) Improve the Utility of Tensor Cores by Compacting Sparse Matrix Technique. 2024 14th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 14-19, 19 Nov 2024. DOI: 10.1109/ICCKE65377.2024.10874581
  • (2024) Enhancing the Sparse Matrix Storage Using Reordering Techniques. High Performance Computing, pp. 66-76, 28 Jan 2024. DOI: 10.1007/978-3-031-52186-7_5
  • (2023) Trajectory-based Metaheuristics for Improving Sparse Matrix Storage. 2023 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1-6, 29 Oct 2023. DOI: 10.1109/LA-CCI58595.2023.10409303
  • (2023) SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1-14, Feb 2023. DOI: 10.1109/HPCA56546.2023.10071102
  • (2023) Sparse Matrix-Vector Product for the bmSparse Matrix Format in GPUs. Euro-Par 2023: Parallel Processing Workshops, pp. 246-256, 28 Aug 2023. DOI: 10.1007/978-3-031-50684-0_19
  • (2022) A Pattern-Based SpGEMM Library for Multi-Core and Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems 33(1):159-175, 1 Jan 2022. DOI: 10.1109/TPDS.2021.3090328
  • (2022) Towards an Efficient Sparse Storage Format for the SpMM Kernel in GPUs. Euro-Par 2021: Parallel Processing Workshops, pp. 104-115, 9 Jun 2022. DOI: 10.1007/978-3-031-06156-1_9
  • (2022) Advancing on an efficient sparse matrix multiplication kernel for modern GPUs. Concurrency and Computation: Practice and Experience 35(20), 19 Aug 2022. DOI: 10.1002/cpe.7271
  • (2021) Unleashing the performance of bmSparse for the sparse matrix multiplication in GPUs. 2021 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), pp. 19-26, Nov 2021. DOI: 10.1109/ScalA54577.2021.00008
