DOI: 10.1145/3503221.3508431

TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

Published: 28 March 2022

Abstract

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. Existing parallel approaches for shared-memory SpGEMM mostly use the row-row style, which can offer good parallelism. However, because of the irregularity of sparsity structures, row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) the difficulty of selecting a sparse accumulator.
In this paper we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM) and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row, so the load imbalance issue is naturally alleviated. Second, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Third, because the computations are restricted to a single tile, it is relatively easy to select a fast sparse accumulator for a sparse tile. Our experimental results on two of the latest NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94, respectively, of the 142 square matrices whose SpGEMM operation requires no fewer than one billion flops, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups.
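
To make the tiled scheme concrete, below is a minimal sketch of the idea in Python/SciPy. It is not the authors' CUDA implementation: the tile size, the helper names (to_tiles, tile_spgemm) and the use of small dense per-tile accumulators are all illustrative assumptions; TileSpGEMM stores each non-empty tile in a compressed sparse form and accumulates in on-chip scratchpad memory on the GPU.

# Minimal sketch of tile-based SpGEMM, assuming a fixed tile size and dense
# per-tile accumulators. All names here (TILE, to_tiles, tile_spgemm) are
# hypothetical; this is NOT the paper's CUDA implementation.
from collections import defaultdict

import numpy as np
from scipy.sparse import random as sparse_random

TILE = 4  # fixed tile size; the working unit is a tile, not a matrix row


def to_tiles(m, tile=TILE):
    """Split a sparse matrix into {(tile_row, tile_col): block}, keeping only
    non-empty tiles. Each block is a small dense array here for simplicity;
    TileSpGEMM instead keeps every non-empty tile in a sparse form."""
    dense = m.toarray()
    tiles = {}
    for i in range(0, dense.shape[0], tile):
        for j in range(0, dense.shape[1], tile):
            blk = dense[i:i + tile, j:j + tile]
            if np.any(blk):
                tiles[(i // tile, j // tile)] = blk
    return tiles


def tile_spgemm(a_tiles, b_tiles):
    """C(i,j) = sum over k of A(i,k) @ B(k,j), with one small accumulator per
    output tile. On a GPU, each accumulator fits in on-chip scratchpad memory,
    so no global buffer for intermediate products is needed."""
    b_by_row = defaultdict(list)  # index B's tiles by their tile-row k
    for (k, j), blk in b_tiles.items():
        b_by_row[k].append((j, blk))

    c_tiles = {}
    for (i, k), a_blk in a_tiles.items():
        for j, b_blk in b_by_row[k]:
            acc = c_tiles.setdefault((i, j), np.zeros((TILE, TILE)))
            acc += a_blk @ b_blk  # per-tile multiply-accumulate
    return c_tiles


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sparse_random(8, 8, density=0.3, format="csr", random_state=rng)
    B = sparse_random(8, 8, density=0.3, format="csr", random_state=rng)

    c_tiles = tile_spgemm(to_tiles(A), to_tiles(B))

    # Reassemble the output tiles and check against SciPy's own SpGEMM.
    C = np.zeros((8, 8))
    for (i, j), blk in c_tiles.items():
        C[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE] = blk
    assert np.allclose(C, (A @ B).toarray())

Because every output tile is produced and accumulated independently, work is balanced at tile granularity regardless of how nonzeros are distributed across rows, which is the load-balancing argument made in the abstract.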


Published In

PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
April 2022
495 pages
ISBN:9781450392044
DOI:10.1145/3503221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2022

Author Tags

  1. GPU
  2. SpGEMM
  3. sparse matrix
  4. tiled algorithm

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

PPoPP '22

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)547
  • Downloads (Last 6 weeks)38
Reflects downloads up to 02 Feb 2025

Cited By

  • (2024) Predicting optimal sparse general matrix-matrix multiplication algorithm on GPUs. The International Journal of High Performance Computing Applications, 38(3):245-259. DOI: 10.1177/10943420241231928. Online publication date: 5-Feb-2024
  • (2024) FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 17(4):1-37. DOI: 10.1145/3687480. Online publication date: 28-Aug-2024
  • (2024) SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors. Proceedings of the 53rd International Conference on Parallel Processing, 1166-1175. DOI: 10.1145/3673038.3673054. Online publication date: 12-Aug-2024
  • (2024) Compilation of Modular and General Sparse Workspaces. Proceedings of the ACM on Programming Languages, 8(PLDI):1213-1238. DOI: 10.1145/3656426. Online publication date: 20-Jun-2024
  • (2024) On Efficient Large Sparse Matrix Chain Multiplication. Proceedings of the ACM on Management of Data, 2(3):1-27. DOI: 10.1145/3654959. Online publication date: 30-May-2024
  • (2024) SpaHet: A Software/Hardware Co-design for Accelerating Heterogeneous-Sparsity based Sparse Matrix Multiplication. Proceedings of the 61st ACM/IEEE Design Automation Conference, 1-6. DOI: 10.1145/3649329.3655944. Online publication date: 23-Jun-2024
  • (2024) STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically. Proceedings of the ACM on Management of Data, 2(1):1-26. DOI: 10.1145/3639323. Online publication date: 26-Mar-2024
  • (2024) FSpGEMM: A Framework for Accelerating Sparse General Matrix-Matrix Multiplication Using Gustavson's Algorithm on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32(4):633-644. DOI: 10.1109/TVLSI.2024.3355499. Online publication date: Apr-2024
  • (2024) Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC41406.2024.00064. Online publication date: 17-Nov-2024
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024
