In-place transposition of rectangular matrices on accelerators

Published: 06 February 2014

Abstract

Matrix transposition is an important algorithmic building block for many numerical algorithms such as the FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, a direct application of CPU in-place transposition algorithms lacks the parallelism and locality required by GPUs to achieve good performance. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm in which each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows the transposition overhead to be hidden by overlapping it with the PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves a 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded implementations of in-place transposition on the CPU.
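
As background for the elementary tile-wise transposition the abstract refers to, the following is a minimal illustrative CUDA sketch of a shared-memory tiled transpose kernel of the kind such GPU approaches build on. It is not the paper's 3-stage in-place algorithm; the kernel name (transposeTile), the tile size (TILE_DIM = 32), and BLOCK_ROWS are assumptions made only for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM   32   // tile width/height (assumed; illustrative only)
#define BLOCK_ROWS 8    // rows each thread handles per loop iteration

// Illustrative tile-wise transpose kernel (out-of-place), not the paper's
// 3-stage in-place algorithm. Each block stages a TILE_DIM x TILE_DIM tile
// in shared memory so that both the global-memory reads and writes are
// coalesced; the +1 column of padding avoids shared-memory bank conflicts.
__global__ void transposeTile(float *out, const float *in,
                              int width, int height)  // in: height x width, row-major
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read of one input tile into shared memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(size_t)(y + j) * width + x];

    __syncthreads();

    // Coordinates of the transposed tile; coalesced write to the output.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(size_t)(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

A typical launch would use dim3 block(TILE_DIM, BLOCK_ROWS) and a grid of ceil(width/TILE_DIM) by ceil(height/TILE_DIM) blocks. An in-place scheme such as the one described in the paper additionally has to cycle tiles through the permutation defined by the transposition rather than writing to a separate output buffer.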

    Published In

    ACM SIGPLAN Notices  Volume 49, Issue 8
    PPoPP '14
    August 2014
    390 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2692916
    • PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
      February 2014
      412 pages
      ISBN:9781450326568
      DOI:10.1145/2555243
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 February 2014
    Published in SIGPLAN Volume 49, Issue 8

    Author Tags

    1. GPU
    2. in-place
    3. transposition

    Qualifiers

    • Research-article

    Cited By

    • (2022) Optimized Computation for Determinant of Multivariate Polynomial Matrices on GPGPU. In: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 82-91. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00045. Online publication date: Dec-2022.
    • (2022) AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor. The Journal of Supercomputing 78(7), 9456-9474. DOI: 10.1007/s11227-021-04282-6. Online publication date: 1-May-2022.
    • (2021) Polygeist: Raising C to Polyhedral MLIR. In: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, 45-59. DOI: 10.1109/PACT52795.2021.00011. Online publication date: 26-Sep-2021.
    • (2020) A Pipeline-Friendly In-place Linear Transformation. In: 2020 7th International Conference on Information Science and Control Engineering (ICISCE), 1604-1608. DOI: 10.1109/ICISCE50968.2020.00317. Online publication date: Dec-2020.
    • (2019) A Class of In-Place Linear Transformations Possessing the Cache-Oblivious Property. IEEE Access 7, 23068-23075. DOI: 10.1109/ACCESS.2019.2898994. Online publication date: 2019.
    • (2019) Highly efficient GPU eigensolver for three-dimensional photonic crystal band structures with any Bravais lattice. Computer Physics Communications 245, 106841. DOI: 10.1016/j.cpc.2019.07.007. Online publication date: Dec-2019.
    • (2018) Efficient Processing of Large Data Structures on GPUs. International Journal of Parallel Programming 46(6), 1063-1093. DOI: 10.1007/s10766-017-0515-0. Online publication date: 1-Dec-2018.
    • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55(11), 1-81. DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
    • (2022) Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 10, 52565-52608. DOI: 10.1109/ACCESS.2022.3174101. Online publication date: 2022.
    • (2020) Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 55-69. DOI: 10.1145/3410463.3414657. Online publication date: 30-Sep-2020.
