Abstract
In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core Level 3 BLAS kernel for matrix-matrix multiplication. The performance of parallel xGEMM (PxGEMM) is governed by two main factors: the flop rate achieved in the local block computations and the communication cost of broadcasting submatrices among processes. In this study, an approach is proposed to improve and tune the parallel double-precision general matrix-matrix multiplication (PDGEMM) routine for modern Intel processors such as Knights Landing (KNL) and Xeon Scalable Processors (SKL). The proposed approach consists of two methods, one for each factor. First, the computational part of PDGEMM is improved with a blocked GEMM algorithm whose block sizes are chosen to fit the architectures of KNL and SKL. Second, the communication routine is adjusted by replacing calls to the basic linear algebra communication subprograms (BLACS) with direct message passing interface (MPI) calls, improving time-wise cost efficiency. Consequently, performance improvements are demonstrated for smaller matrix multiplications on SKL clusters.
Acknowledgements
This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7065662), and partially supported by the Supercomputer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (No. 2020M3H6A1084853). This work was also supported by the National Supercomputing Center with supercomputing resources including technical support (No. KSC-2020-CRE-0195).
Cite this article
Park, Y., Kim, R., Nguyen, T.M.T. et al. Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable Processors. Cluster Comput 26, 2539–2549 (2023). https://doi.org/10.1007/s10586-021-03274-8