Abstract
Sparse matrix-vector multiplication (SpMV) plays a pivotal role in large-scale scientific computing. Despite the increasing use of low-power multi-core digital signal processors (DSPs) in high-performance computing (HPC) systems, optimizing SpMV on these platforms has been largely overlooked. This paper targets the FT-M7032, a CPU-DSP heterogeneous multi-core processor for high-performance computing. The FT-M7032 provides programmable memory units at multiple levels, but using these units effectively is challenging. To address this, we measure the data-transfer capability between the different memory levels and use the results to map matrix elements onto appropriate storage units. Building on this evaluation, we propose an efficient parallel implementation, SpMV_Band, designed specifically for banded matrices. We further devise a computation pipeline that reduces memory-access overhead by overlapping data transfers with computation. We evaluate our approach against a baseline executed on the general-purpose CPU cores of the FT-M7032 heterogeneous platform. Experimental results demonstrate that our techniques achieve a significant 2.0× speedup over this baseline.
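For readers unfamiliar with banded SpMV, the sketch below illustrates the kind of diagonal-format (DIA-style) kernel that an implementation like SpMV_Band builds on. It is a minimal sketch under our own assumptions: the function name spmv_band and the argument layout are hypothetical, and plain array accesses stand in for the FT-M7032's DMA-staged, double-buffered transfers between memory levels; it is not the paper's actual implementation.

```c
#include <stddef.h>

/* DIA-style banded SpMV: y = A * x.
 * offsets[d] is the column offset of diagonal d relative to the main
 * diagonal; diags stores the n values of each diagonal contiguously
 * (zero-padded where the diagonal leaves the matrix).
 * On a multi-level memory hierarchy such as the FT-M7032's, each
 * diagonal strip would be staged into on-chip scratchpad by DMA and
 * double-buffered against the inner loop; here ordinary loads stand
 * in for those transfers. */
void spmv_band(size_t n, size_t ndiags,
               const int *offsets,   /* length ndiags */
               const double *diags,  /* ndiags * n, diagonal-major */
               const double *x,      /* length n */
               double *y)            /* length n, overwritten */
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t d = 0; d < ndiags; ++d) {
        int off = offsets[d];
        /* Restrict to rows i where column i + off stays inside [0, n). */
        size_t lo = off < 0 ? (size_t)(-off) : 0;
        size_t hi = off > 0 ? n - (size_t)off : n;
        const double *diag = diags + d * n;
        for (size_t i = lo; i < hi; ++i)
            y[i] += diag[i] * x[i + (size_t)off];
    }
}
```

Because each diagonal is a contiguous, regularly strided stream, this access pattern is what makes overlapping DMA transfers with computation attractive on a banded matrix: the next diagonal strip can be fetched while the current one is being multiplied.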
Acknowledgement
This work was supported in part by the National Key R&D Program of China under grant agreement 2021YFB0300101, the National Science Foundation of China (NSFC) under grant agreements 61902411, 62032023, 12002382, 11275269, 42104078, and 62073333, and the Excellent Youth Foundation of Hunan Province under grant agreement 2021JJ10050.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bi, D., Li, S., Zhang, Y., Yang, X., Dong, D. (2024). Efficiently Running SpMV on Multi-core DSPs for Banded Matrix. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14491. Springer, Singapore. https://doi.org/10.1007/978-981-97-0808-6_12
DOI: https://doi.org/10.1007/978-981-97-0808-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0807-9
Online ISBN: 978-981-97-0808-6