
HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs

Published: 01 December 2020

Abstract

We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent the matrix multiplication work as a set of addition and multiplication tasks among sub-matrices. We then distribute these tasks to CPUs and GPUs, considering the characteristics of both the tasks and the computing resources, to minimize data communication overhead and fully utilize the available computing power. To handle large matrices efficiently with limited GPU memory, we also propose a block-based work decomposition method, and we further improve performance by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally outperforms prior GPU-based matrix multiplication methods. Compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), it achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). Moreover, performance continues to improve as more computing resources, such as additional CPU cores and GPUs, are added. We achieve such high performance because our approach fully utilizes the capacities of the given heterogeneous parallel computing system while employing Strassen's method, which has a lower asymptotic complexity. These results demonstrate the efficiency and robustness of our algorithm.
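For context on the decomposition the abstract describes: Strassen's method replaces the eight sub-matrix multiplications of the classical blocked algorithm with seven, yielding a set of sub-matrix addition and multiplication tasks like those the paper distributes across CPUs and GPUs. The sketch below is a minimal single-threaded NumPy illustration of that recursion, not the paper's implementation; the function name `strassen` and the `leaf` cutoff parameter are illustrative assumptions, and it handles only square matrices whose size is a power of two.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Recursive Strassen multiplication of square power-of-two matrices.
    Falls back to the standard product below the `leaf` cutoff size."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    # Quadrant views of the two operands.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven half-size products (each preceded by sub-matrix additions),
    # instead of the eight products of the classical blocked algorithm.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the products into the quadrants of the result.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In a heterogeneous setting, each of the seven products and the surrounding additions becomes a schedulable task, which is what makes a CPU/GPU work distribution like the one described above possible.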


Cited By

  • (2024) POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments. The Journal of Supercomputing 80(10), 14666–14693. https://doi.org/10.1007/s11227-024-06008-w
  • (2022) GPGPU-Based Parallel Computing of Viola and Jones Eyes Detection Algorithm to Drive an Intelligent Wheelchair. Journal of Signal Processing Systems 94(12), 1365–1379. https://doi.org/10.1007/s11265-022-01783-2
  • (2021) FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance. Computing 103(10), 2171–2202. https://doi.org/10.1007/s00607-021-00958-2
  • (2021) Optimized Hybrid Execution of Dense Matrix-Matrix Multiplication on Clusters of Heterogeneous Multicore and Many-Core Platforms. Parallel Computing Technologies, 178–195. https://doi.org/10.1007/978-3-030-86359-3_14


Published In

Computing, Volume 102, Issue 12
Dec 2020
129 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 December 2020
Accepted: 25 September 2020
Received: 06 April 2020

Author Tags

  1. Matrix multiplication
  2. Parallel algorithm
  3. Heterogeneous
  4. GPU
  5. Strassen

Author Tags (MSC codes)

  1. 68W10
  2. 65Y05
Qualifiers

  • Research-article
