
TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Published: 01 December 2022

Abstract

Binary Sparse Matrix-Vector Multiplication (SpMV) is a computationally heavy kernel in weblink analysis, integer factorization, compressed sensing, spectral graph theory, and other domains. Testing several popular GPU-based SpMV implementations on 400 sparse matrices, we observed that data transfer to GPU memory accounts for a large part of the total computation time. For binary sparse matrices, the transfer of the constant value "1"s is easily eliminated; compressing the index arrays, however, remains a major challenge. This article proposes a new compression format, TaiChi, that further reduces index data transfer and improves SpMV performance, especially for diagonally dominant binary sparse matrices. Input matrices are first partitioned into relatively dense and ultra-sparse areas. The dense areas are then encoded inversely by marking their "0"s, while the ultra-sparse areas are encoded by marking their "1"s. Based on this partitioning and encoding, we also designed a new SpMV algorithm for binary matrices that uses only addition and subtraction. Evaluation results on real-world binary sparse matrices show that our hybrid encoding significantly reduces data transfer and speeds up kernel execution, achieving peak transfer and kernel-execution speedups of 5.63x and 3.84x on a GTX 1080 Ti, and 3.39x and 3.91x on a Tesla V100.
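The abstract describes the scheme only at a high level; a compact example helps make the inverse encoding concrete. The following is a minimal CUDA sketch of the idea, not the authors' implementation: it assumes one dense column stripe covering columns [c0, c1) whose "0"s are stored CSR-style, plus an ultra-sparse remainder whose "1"s are stored CSR-style, and every kernel and array name (spmvDenseInverse, zeroPtr, onePtr, and so on) is hypothetical.

```cuda
// Illustrative sketch only: a single dense column stripe [c0, c1) is stored by
// the positions of its 0s, and the ultra-sparse remainder is stored in CSR by
// the positions of its 1s. Names and layout are hypothetical, not TaiChi's API.
#include <cstdio>
#include <cuda_runtime.h>

// Dense area, inverse encoding: y[i] += (sum of x over the stripe) - x at each
// stored 0. Per-row work is proportional to the number of 0s, not 1s.
__global__ void spmvDenseInverse(int rows, float stripeSum,
                                 const int *zeroPtr, const int *zeroCol,
                                 const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float sum = stripeSum;
    for (int k = zeroPtr[i]; k < zeroPtr[i + 1]; ++k)
        sum -= x[zeroCol[k]];                      // subtract the marked 0s
    y[i] += sum;
}

// Ultra-sparse area, direct encoding: y[i] += x at each stored 1. The value
// array of ordinary CSR is elided because every nonzero equals 1.
__global__ void spmvSparseOnes(int rows, const int *onePtr, const int *oneCol,
                               const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float sum = 0.0f;
    for (int k = onePtr[i]; k < onePtr[i + 1]; ++k)
        sum += x[oneCol[k]];                       // add the marked 1s
    y[i] += sum;
}

int main() {
    // Toy 2x6 binary matrix: columns [0, 4) are all 1s except one 0 per row;
    // row 0 also has a lone 1 in column 5.
    const int rows = 2, c0 = 0, c1 = 4, cols = 6;
    float hx[cols] = {1, 2, 3, 4, 5, 6};
    int hZeroPtr[] = {0, 1, 2}, hZeroCol[] = {1, 2};   // 0s at (0,1) and (1,2)
    int hOnePtr[]  = {0, 1, 1}, hOneCol[]  = {5};      // 1 at (0,5)

    float stripeSum = 0.0f;               // computed once, reused by every row
    for (int j = c0; j < c1; ++j) stripeSum += hx[j];

    float *x, *y; int *zp, *zc, *op, *oc;
    cudaMalloc(&x, sizeof hx);            cudaMemcpy(x, hx, sizeof hx, cudaMemcpyHostToDevice);
    cudaMalloc(&y, rows * sizeof(float)); cudaMemset(y, 0, rows * sizeof(float));
    cudaMalloc(&zp, sizeof hZeroPtr);     cudaMemcpy(zp, hZeroPtr, sizeof hZeroPtr, cudaMemcpyHostToDevice);
    cudaMalloc(&zc, sizeof hZeroCol);     cudaMemcpy(zc, hZeroCol, sizeof hZeroCol, cudaMemcpyHostToDevice);
    cudaMalloc(&op, sizeof hOnePtr);      cudaMemcpy(op, hOnePtr, sizeof hOnePtr, cudaMemcpyHostToDevice);
    cudaMalloc(&oc, sizeof hOneCol);      cudaMemcpy(oc, hOneCol, sizeof hOneCol, cudaMemcpyHostToDevice);

    spmvDenseInverse<<<1, 32>>>(rows, stripeSum, zp, zc, x, y);
    spmvSparseOnes<<<1, 32>>>(rows, op, oc, x, y);

    float hy[rows];
    cudaMemcpy(hy, y, sizeof hy, cudaMemcpyDeviceToHost);
    printf("y = [%g, %g]\n", hy[0], hy[1]);            // expected: y = [14, 7]
    return 0;
}
```

The demo works out to y = [14, 7]: row 0 gets the stripe sum 10 minus x[1] = 2, plus the lone x[5] = 6, and row 1 gets 10 minus x[2] = 3. The payoff of the inverse encoding is visible in the index arrays: the two rows ship three column indices (two marked "0"s and one marked "1") where a plain CSR of the same matrix would carry seven, and the multiply itself uses only additions and subtractions, matching the abstract's description.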


Cited By

  • (2024) Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs. The Journal of Supercomputing, vol. 80, no. 10, pp. 13681–13713, Jul. 2024. DOI: 10.1007/s11227-024-05949-6
  • (2023) Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–14, Nov. 2023. DOI: 10.1145/3581784.3607082

Information

Publisher: IEEE Press
Qualifiers: Research-article
