
TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Published: 01 December 2022

Abstract

Binary Sparse Matrix-Vector Multiplication (SpMV) is a computationally heavy kernel in weblink analysis, integer factorization, compressed sensing, spectral graph theory, and other domains. Testing several popular GPU-based SpMV implementations on 400 sparse matrices, we observed that data transfer to GPU memory accounts for a large part of the total computation time. For binary sparse matrices, the transfer of the constant value "1"s is easily eliminated; compressing the index arrays, however, remains a major challenge. This article proposes a new compression format, TaiChi, that further reduces index data transfer and improves SpMV performance, especially for diagonally dominant binary sparse matrices. Input matrices are first partitioned into relatively dense and ultra-sparse areas. The dense areas are then encoded inversely by marking their "0"s, while the ultra-sparse areas are encoded by marking their "1"s. Based on this partitioning and encoding, we also designed a new SpMV algorithm for binary matrices that uses only addition and subtraction. Evaluation results on real-world binary sparse matrices show that our hybrid encoding significantly reduces data transfer and speeds up kernel execution, achieving peak transfer and kernel-execution speedups of 5.63x and 3.84x on a GTX 1080 Ti, and 3.39x and 3.91x on a Tesla V100.
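The abstract describes the scheme only at a high level; a compact example helps make the inverse encoding concrete. The following is a minimal CUDA sketch of the idea, not the authors' implementation: it assumes one dense column stripe covering columns [c0, c1) whose "0"s are stored CSR-style, plus an ultra-sparse remainder whose "1"s are stored CSR-style, and every kernel and array name (spmvDenseInverse, zeroPtr, onePtr, and so on) is hypothetical.

```cuda
// Illustrative sketch only: a single dense column stripe [c0, c1) is stored by
// the positions of its 0s, and the ultra-sparse remainder is stored in CSR by
// the positions of its 1s. Names and layout are hypothetical, not TaiChi's API.
#include <cstdio>
#include <cuda_runtime.h>

// Dense area, inverse encoding: y[i] += (sum of x over the stripe) - x at each
// stored 0. Per-row work is proportional to the number of 0s, not 1s.
__global__ void spmvDenseInverse(int rows, float stripeSum,
                                 const int *zeroPtr, const int *zeroCol,
                                 const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float sum = stripeSum;
    for (int k = zeroPtr[i]; k < zeroPtr[i + 1]; ++k)
        sum -= x[zeroCol[k]];                      // subtract the marked 0s
    y[i] += sum;
}

// Ultra-sparse area, direct encoding: y[i] += x at each stored 1. The value
// array of ordinary CSR is elided because every nonzero equals 1.
__global__ void spmvSparseOnes(int rows, const int *onePtr, const int *oneCol,
                               const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float sum = 0.0f;
    for (int k = onePtr[i]; k < onePtr[i + 1]; ++k)
        sum += x[oneCol[k]];                       // add the marked 1s
    y[i] += sum;
}

int main() {
    // Toy 2x6 binary matrix: columns [0, 4) are all 1s except one 0 per row;
    // row 0 also has a lone 1 in column 5.
    const int rows = 2, c0 = 0, c1 = 4, cols = 6;
    float hx[cols] = {1, 2, 3, 4, 5, 6};
    int hZeroPtr[] = {0, 1, 2}, hZeroCol[] = {1, 2};   // 0s at (0,1) and (1,2)
    int hOnePtr[]  = {0, 1, 1}, hOneCol[]  = {5};      // 1 at (0,5)

    float stripeSum = 0.0f;               // computed once, reused by every row
    for (int j = c0; j < c1; ++j) stripeSum += hx[j];

    float *x, *y; int *zp, *zc, *op, *oc;
    cudaMalloc(&x, sizeof hx);            cudaMemcpy(x, hx, sizeof hx, cudaMemcpyHostToDevice);
    cudaMalloc(&y, rows * sizeof(float)); cudaMemset(y, 0, rows * sizeof(float));
    cudaMalloc(&zp, sizeof hZeroPtr);     cudaMemcpy(zp, hZeroPtr, sizeof hZeroPtr, cudaMemcpyHostToDevice);
    cudaMalloc(&zc, sizeof hZeroCol);     cudaMemcpy(zc, hZeroCol, sizeof hZeroCol, cudaMemcpyHostToDevice);
    cudaMalloc(&op, sizeof hOnePtr);      cudaMemcpy(op, hOnePtr, sizeof hOnePtr, cudaMemcpyHostToDevice);
    cudaMalloc(&oc, sizeof hOneCol);      cudaMemcpy(oc, hOneCol, sizeof hOneCol, cudaMemcpyHostToDevice);

    spmvDenseInverse<<<1, 32>>>(rows, stripeSum, zp, zc, x, y);
    spmvSparseOnes<<<1, 32>>>(rows, op, oc, x, y);

    float hy[rows];
    cudaMemcpy(hy, y, sizeof hy, cudaMemcpyDeviceToHost);
    printf("y = [%g, %g]\n", hy[0], hy[1]);            // expected: y = [14, 7]
    return 0;
}
```

The demo works out to y = [14, 7]: row 0 gets the stripe sum 10 minus x[1] = 2, plus the lone x[5] = 6, and row 1 gets 10 minus x[2] = 3. The payoff of the inverse encoding is visible in the index arrays: the two rows ship three column indices (two marked "0"s and one marked "1") where a plain CSR of the same matrix would carry seven, and the multiply itself uses only additions and subtractions, matching the abstract's description.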


Cited By

  • (2024) Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs. The Journal of Supercomputing, vol. 80, no. 10, pp. 13681–13713, Jul. 2024. DOI: 10.1007/s11227-024-05949-6
  • (2023) Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–14, Nov. 2023. DOI: 10.1145/3581784.3607082

Information

Publisher: IEEE Press
Qualifiers: Research-article
