Auto-tuning 3-D FFT library for CUDA GPUs

DOI: 10.1145/1654059.1654090
Published: 14 November 2009

Abstract

Existing implementations of FFTs on GPUs are optimized for specific transform sizes such as powers of two, and exhibit unstable, peaky performance, i.e., they do not perform as well for other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high-performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance in which it has been applied comprehensively to bandwidth-intensive and complex kernels such as 3-D FFTs. Bandwidth-intensive optimizations, such as selecting the number of threads and inserting padding to avoid bank conflicts in shared memory, are applied systematically. Our resulting auto-tuner is fast and yields performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover remains stable irrespective of problem size or the underlying GPU hardware.
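
The abstract names two of the knobs the auto-tuner searches over: the number of threads and the shared-memory padding inserted to avoid bank conflicts. As a rough illustration of those two parameters only (this is not the paper's FFT kernel or tuning framework), the CUDA sketch below pads the inner dimension of a shared-memory tile in a simple transpose and times the padded and unpadded variants on the host, standing in for the empirical search an auto-tuner performs. The kernel transpose_tiled, the helper time_variant, the 32x32 tile, and the problem size are all assumptions made for this example.

    // Illustrative sketch only -- not the paper's FFT kernels or auto-tuner.
    // It demonstrates shared-memory padding against bank conflicts on a
    // simple tiled transpose, with host-side timing standing in for the
    // auto-tuner's empirical search over candidate kernel configurations.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define TILE 32  // 32x32 tile = 1024 threads per block (an assumption)

    template <int PAD>
    __global__ void transpose_tiled(const float* in, float* out, int n)
    {
        // With PAD == 1, consecutive rows start in different shared-memory
        // banks, so the column-wise reads below are conflict-free.
        __shared__ float tile[TILE][TILE + PAD];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
    }

    // Time one kernel variant; a real auto-tuner would sweep more parameters
    // (e.g. thread counts and data layouts) and keep the fastest candidate.
    template <int PAD>
    static float time_variant(const float* d_in, float* d_out, int n)
    {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        transpose_tiled<PAD><<<grid, block>>>(d_in, d_out, n);  // warm-up
        cudaEventRecord(start);
        transpose_tiled<PAD><<<grid, block>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const int n = 2048;  // arbitrary problem size for the demonstration
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * n * sizeof(float));
        cudaMalloc(&d_out, n * n * sizeof(float));
        cudaMemset(d_in, 0, n * n * sizeof(float));

        // Empirical selection between the two candidate configurations.
        float t_nopad = time_variant<0>(d_in, d_out, n);
        float t_pad   = time_variant<1>(d_in, d_out, n);
        printf("no padding: %.3f ms, padded: %.3f ms -> keep the %s variant\n",
               t_nopad, t_pad, t_pad < t_nopad ? "padded" : "unpadded");

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

On most NVIDIA GPUs the padded variant should win, because the unpadded column-wise shared-memory reads serialize across banks; the point of the auto-tuning approach is that this is measured on the target hardware rather than assumed.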





Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
November 2009
778 pages
ISBN:9781605587448
DOI:10.1145/1654059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '09

Acceptance Rates

SC '09 Paper Acceptance Rate: 59 of 261 submissions, 23%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 5
Reflects downloads up to 10 Oct 2024

Cited By

  • (2024) Bringing Auto-Tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs. Euro-Par 2024: Parallel Processing, pp. 91-106. DOI: 10.1007/978-3-031-69577-3_7. Online publication date: 26-Aug-2024.
  • (2023) MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework. ACM Transactions on Architecture and Code Optimization, 20(3):1-23. DOI: 10.1145/3605148. Online publication date: 22-Jul-2023.
  • (2023) Benchmarking Optimization Algorithms for Auto-Tuning GPU Kernels. IEEE Transactions on Evolutionary Computation, 27(3):550-564. DOI: 10.1109/TEVC.2022.3210654. Online publication date: Jun-2023.
  • (2023) Performance Tuning for GPU-Embedded Systems: Machine-Learning-Based and Analytical Model-Driven Tuning Methodologies. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 129-140. DOI: 10.1109/SBAC-PAD59825.2023.00022. Online publication date: 17-Oct-2023.
  • (2023) An Auto-Tuning Method for High-Bandwidth Low-Latency Approximate Interconnection Networks. 2023 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16. DOI: 10.1109/PDP59025.2023.00011. Online publication date: Mar-2023.
  • (2022) ML-based Performance Portability for Time-Dependent Density Functional Theory in HPC Environments. 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 1-12. DOI: 10.1109/PMBS56514.2022.00006. Online publication date: Nov-2022.
  • (2021) Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 139-152. DOI: 10.1109/ISCA52012.2021.00020. Online publication date: 14-Jun-2021.
  • (2021) Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 46-57. DOI: 10.1109/IPDPS49936.2021.00014. Online publication date: May-2021.
  • (2021) Performance portability through machine learning guided kernel selection in SYCL libraries. Parallel Computing, 107(C). DOI: 10.1016/j.parco.2021.102813. Online publication date: 1-Oct-2021.
  • (2020) Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems, 31(8):1878-1896. DOI: 10.1109/TPDS.2020.2978045. Online publication date: 1-Aug-2020.
