In-place transposition of rectangular matrices on accelerators

Published: 06 February 2014

Abstract

Matrix transposition is an important algorithmic building block for many numerical algorithms such as the FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, a direct application of CPU in-place transposition algorithms lacks the parallelism and locality required by GPUs to achieve good performance. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm in which each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows the transposition overhead to be hidden by overlapping it with the PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves a 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded implementations of in-place transposition on the CPU.
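
As background for the elementary tile-wise transposition the abstract refers to, the following is a minimal illustrative CUDA sketch of a shared-memory tiled transpose kernel of the kind such GPU approaches build on. It is not the paper's 3-stage in-place algorithm; the kernel name (transposeTile), the tile size (TILE_DIM = 32), and BLOCK_ROWS are assumptions made only for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM   32   // tile width/height (assumed; illustrative only)
#define BLOCK_ROWS 8    // rows each thread handles per loop iteration

// Illustrative tile-wise transpose kernel (out-of-place), not the paper's
// 3-stage in-place algorithm. Each block stages a TILE_DIM x TILE_DIM tile
// in shared memory so that both the global-memory reads and writes are
// coalesced; the +1 column of padding avoids shared-memory bank conflicts.
__global__ void transposeTile(float *out, const float *in,
                              int width, int height)  // in: height x width, row-major
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read of one input tile into shared memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(size_t)(y + j) * width + x];

    __syncthreads();

    // Coordinates of the transposed tile; coalesced write to the output.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(size_t)(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

A typical launch would use dim3 block(TILE_DIM, BLOCK_ROWS) and a grid of ceil(width/TILE_DIM) by ceil(height/TILE_DIM) blocks. An in-place scheme such as the one described in the paper additionally has to cycle tiles through the permutation defined by the transposition rather than writing to a separate output buffer.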

    Published In

    ACM SIGPLAN Notices  Volume 49, Issue 8
    PPoPP '14
    August 2014
    390 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2692916
    • PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
      February 2014
      412 pages
      ISBN:9781450326568
      DOI:10.1145/2555243
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 February 2014
    Published in SIGPLAN Volume 49, Issue 8

    Author Tags

    1. GPU
    2. in-place
    3. transposition

    Qualifiers

    • Research-article

    Cited By

    • (2022) Optimized Computation for Determinant of Multivariate Polynomial Matrices on GPGPU. In: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 82-91. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00045. Online publication date: Dec-2022.
    • (2022) AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor. The Journal of Supercomputing 78(7), 9456-9474. DOI: 10.1007/s11227-021-04282-6. Online publication date: 1-May-2022.
    • (2021) Polygeist: Raising C to Polyhedral MLIR. In: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, 45-59. DOI: 10.1109/PACT52795.2021.00011. Online publication date: 26-Sep-2021.
    • (2020) A Pipeline-Friendly In-place Linear Transformation. In: 2020 7th International Conference on Information Science and Control Engineering (ICISCE), 1604-1608. DOI: 10.1109/ICISCE50968.2020.00317. Online publication date: Dec-2020.
    • (2019) A Class of In-Place Linear Transformations Possessing the Cache-Oblivious Property. IEEE Access 7, 23068-23075. DOI: 10.1109/ACCESS.2019.2898994. Online publication date: 2019.
    • (2019) Highly efficient GPU eigensolver for three-dimensional photonic crystal band structures with any Bravais lattice. Computer Physics Communications 245, 106841. DOI: 10.1016/j.cpc.2019.07.007. Online publication date: Dec-2019.
    • (2018) Efficient Processing of Large Data Structures on GPUs. International Journal of Parallel Programming 46(6), 1063-1093. DOI: 10.1007/s10766-017-0515-0. Online publication date: 1-Dec-2018.
    • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55(11), 1-81. DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
    • (2022) Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 10, 52565-52608. DOI: 10.1109/ACCESS.2022.3174101. Online publication date: 2022.
    • (2020) Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 55-69. DOI: 10.1145/3410463.3414657. Online publication date: 30-Sep-2020.
