Abstract
Sparse matrix-vector multiplication (SpMV) plays a pivotal role in large-scale scientific computing. Despite the increasing use of low-power multi-core digital signal processors (DSPs) in high-performance computing (HPC) systems, optimizing SpMV on these platforms has been largely overlooked. This paper targets the FT-M7032, a CPU-DSP heterogeneous multi-core processor for high-performance computing. The FT-M7032 provides programmable memory units at multiple levels, but using these units effectively is challenging. To address this, we measure the data-transfer capability between the different memory levels and use the results to map matrix elements onto appropriate storage units. Building on this evaluation, we propose an efficient parallel implementation, SpMV_Band, designed specifically for banded matrices. We further devise a computation pipeline that reduces memory-access overhead by overlapping data transfers with computation. We evaluate our approach against a baseline executed on the general-purpose CPU cores of the FT-M7032 heterogeneous platform. Experimental results demonstrate that our techniques achieve a significant 2.0× speedup over this baseline.
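For readers unfamiliar with banded SpMV, the sketch below illustrates the kind of diagonal-format (DIA-style) kernel that an implementation like SpMV_Band builds on. It is a minimal sketch under our own assumptions: the function name spmv_band and the argument layout are hypothetical, and plain array accesses stand in for the FT-M7032's DMA-staged, double-buffered transfers between memory levels; it is not the paper's actual implementation.

```c
#include <stddef.h>

/* DIA-style banded SpMV: y = A * x.
 * offsets[d] is the column offset of diagonal d relative to the main
 * diagonal; diags stores the n values of each diagonal contiguously
 * (zero-padded where the diagonal leaves the matrix).
 * On a multi-level memory hierarchy such as the FT-M7032's, each
 * diagonal strip would be staged into on-chip scratchpad by DMA and
 * double-buffered against the inner loop; here ordinary loads stand
 * in for those transfers. */
void spmv_band(size_t n, size_t ndiags,
               const int *offsets,   /* length ndiags */
               const double *diags,  /* ndiags * n, diagonal-major */
               const double *x,      /* length n */
               double *y)            /* length n, overwritten */
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t d = 0; d < ndiags; ++d) {
        int off = offsets[d];
        /* Restrict to rows i where column i + off stays inside [0, n). */
        size_t lo = off < 0 ? (size_t)(-off) : 0;
        size_t hi = off > 0 ? n - (size_t)off : n;
        const double *diag = diags + d * n;
        for (size_t i = lo; i < hi; ++i)
            y[i] += diag[i] * x[i + (size_t)off];
    }
}
```

Because each diagonal is a contiguous, regularly strided stream, this access pattern is what makes overlapping DMA transfers with computation attractive on a banded matrix: the next diagonal strip can be fetched while the current one is being multiplied.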
Acknowledgement
This work was supported in part by the National Key R&D Program of China under grant agreement 2021YFB0300101, the National Science Foundation of China (NSFC) under grant agreements 61902411, 62032023, 12002382, 11275269, 42104078, and 62073333, and the Excellent Youth Foundation of Hunan Province under grant agreement 2021JJ10050.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bi, D., Li, S., Zhang, Y., Yang, X., Dong, D. (2024). Efficiently Running SpMV on Multi-core DSPs for Banded Matrix. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14491. Springer, Singapore. https://doi.org/10.1007/978-981-97-0808-6_12
DOI: https://doi.org/10.1007/978-981-97-0808-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0807-9
Online ISBN: 978-981-97-0808-6