Auto-tuning 3-D FFT library for CUDA GPUs

DOI: 10.1145/1654059.1654090
Published: 14 November 2009

Abstract

Existing implementations of FFTs on GPUs are optimized for specific transform sizes such as powers of two, and exhibit unstable, peaky performance, i.e., they do not perform as well for other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high-performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance in which it has been applied comprehensively to bandwidth-intensive and complex kernels such as 3-D FFTs. Bandwidth-intensive optimizations, such as selecting the number of threads and inserting padding to avoid bank conflicts in shared memory, are applied systematically. Our resulting auto-tuner is fast and yields performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover remains stable irrespective of problem size or the underlying GPU hardware.
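
The abstract names two of the knobs the auto-tuner searches over: the number of threads and the shared-memory padding inserted to avoid bank conflicts. As a rough illustration of those two parameters only (this is not the paper's FFT kernel or tuning framework), the CUDA sketch below pads the inner dimension of a shared-memory tile in a simple transpose and times the padded and unpadded variants on the host, standing in for the empirical search an auto-tuner performs. The kernel transpose_tiled, the helper time_variant, the 32x32 tile, and the problem size are all assumptions made for this example.

    // Illustrative sketch only -- not the paper's FFT kernels or auto-tuner.
    // It demonstrates shared-memory padding against bank conflicts on a
    // simple tiled transpose, with host-side timing standing in for the
    // auto-tuner's empirical search over candidate kernel configurations.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define TILE 32  // 32x32 tile = 1024 threads per block (an assumption)

    template <int PAD>
    __global__ void transpose_tiled(const float* in, float* out, int n)
    {
        // With PAD == 1, consecutive rows start in different shared-memory
        // banks, so the column-wise reads below are conflict-free.
        __shared__ float tile[TILE][TILE + PAD];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
    }

    // Time one kernel variant; a real auto-tuner would sweep more parameters
    // (e.g. thread counts and data layouts) and keep the fastest candidate.
    template <int PAD>
    static float time_variant(const float* d_in, float* d_out, int n)
    {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        transpose_tiled<PAD><<<grid, block>>>(d_in, d_out, n);  // warm-up
        cudaEventRecord(start);
        transpose_tiled<PAD><<<grid, block>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const int n = 2048;  // arbitrary problem size for the demonstration
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * n * sizeof(float));
        cudaMalloc(&d_out, n * n * sizeof(float));
        cudaMemset(d_in, 0, n * n * sizeof(float));

        // Empirical selection between the two candidate configurations.
        float t_nopad = time_variant<0>(d_in, d_out, n);
        float t_pad   = time_variant<1>(d_in, d_out, n);
        printf("no padding: %.3f ms, padded: %.3f ms -> keep the %s variant\n",
               t_nopad, t_pad, t_pad < t_nopad ? "padded" : "unpadded");

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

On most NVIDIA GPUs the padded variant should win, because the unpadded column-wise shared-memory reads serialize across banks; the point of the auto-tuning approach is that this is measured on the target hardware rather than assumed.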





Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
November 2009
778 pages
ISBN:9781605587448
DOI:10.1145/1654059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '09

Acceptance Rates

SC '09 Paper Acceptance Rate: 59 of 261 submissions, 23%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 5
Reflects downloads up to 10 Oct 2024

Cited By

  • (2024) Bringing Auto-Tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs. Euro-Par 2024: Parallel Processing, pp. 91-106. DOI: 10.1007/978-3-031-69577-3_7. Online publication date: 26-Aug-2024.
  • (2023) MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework. ACM Transactions on Architecture and Code Optimization, 20(3):1-23. DOI: 10.1145/3605148. Online publication date: 22-Jul-2023.
  • (2023) Benchmarking Optimization Algorithms for Auto-Tuning GPU Kernels. IEEE Transactions on Evolutionary Computation, 27(3):550-564. DOI: 10.1109/TEVC.2022.3210654. Online publication date: Jun-2023.
  • (2023) Performance Tuning for GPU-Embedded Systems: Machine-Learning-Based and Analytical Model-Driven Tuning Methodologies. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 129-140. DOI: 10.1109/SBAC-PAD59825.2023.00022. Online publication date: 17-Oct-2023.
  • (2023) An Auto-Tuning Method for High-Bandwidth Low-Latency Approximate Interconnection Networks. 2023 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16. DOI: 10.1109/PDP59025.2023.00011. Online publication date: Mar-2023.
  • (2022) ML-based Performance Portability for Time-Dependent Density Functional Theory in HPC Environments. 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 1-12. DOI: 10.1109/PMBS56514.2022.00006. Online publication date: Nov-2022.
  • (2021) Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 139-152. DOI: 10.1109/ISCA52012.2021.00020. Online publication date: 14-Jun-2021.
  • (2021) Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 46-57. DOI: 10.1109/IPDPS49936.2021.00014. Online publication date: May-2021.
  • (2021) Performance portability through machine learning guided kernel selection in SYCL libraries. Parallel Computing, 107(C). DOI: 10.1016/j.parco.2021.102813. Online publication date: 1-Oct-2021.
  • (2020) Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems, 31(8):1878-1896. DOI: 10.1109/TPDS.2020.2978045. Online publication date: 1-Aug-2020.
