Scalable Multi-node Fast Fourier Transform on GPUs

M Verma, S Chatterjee, G Garg, B Sharma, N Arya… - SN Computer …, 2023 - Springer
Abstract
In this paper, we present the details of our multi-node GPU-FFT library, as well as its scaling on the Selene HPC system. It is one of the first attempts to develop an object-oriented, open-source, multi-node multi-GPU FFT library by combining cuFFT, CUDA, and MPI. Our library employs slab decomposition for data division and CUDA-aware MPI for communication among GPUs. To minimize communication overheads, we employ a combination of asynchronous MPI_Isend and MPI_Irecv, along with MPI_Waitall and cudaMemcpy, instead of using MPI_Alltoall. We conducted a scaling analysis of our GPU-FFT library for grid sizes of , , and , utilizing up to 512 A100 GPUs. We achieved linear scaling for the grid when using 64 to 512 GPUs. We report that the timings of the multicore FFT of the grid with 196608 cores of a Cray XC40 are comparable to those of GPU-FFT of the grid with 128 GPUs. The efficiency of GPU-FFT is due to the fast computation capabilities of the A100 cards and efficient communication via NVLink.
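To illustrate the communication scheme described in the abstract, the following C++/CUDA sketch replaces a single MPI_Alltoall with per-peer MPI_Isend/MPI_Irecv pairs followed by MPI_Waitall, passing device pointers directly to a CUDA-aware MPI implementation and using cudaMemcpy for the local block. The function name exchange_slabs, the buffer layout, and the use of MPI_FLOAT counts are illustrative assumptions; this is a minimal sketch of the pattern, not code taken from the GPU-FFT library itself.

#include <mpi.h>
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical pairwise exchange of slab blocks among GPUs with CUDA-aware
// MPI: each rank posts non-blocking receives and sends for every peer and
// then waits on all requests, instead of calling one MPI_Alltoall.
void exchange_slabs(cufftComplex* d_send,  // device send buffer, nprocs contiguous blocks
                    cufftComplex* d_recv,  // device receive buffer, nprocs contiguous blocks
                    size_t block_elems,    // complex elements per block
                    MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * (nprocs - 1));

    for (int p = 0; p < nprocs; ++p) {
        if (p == rank) {
            // Local block: device-to-device copy, no MPI message needed.
            cudaMemcpy(d_recv + p * block_elems, d_send + p * block_elems,
                       block_elems * sizeof(cufftComplex),
                       cudaMemcpyDeviceToDevice);
            continue;
        }
        MPI_Request r;
        // CUDA-aware MPI accepts device pointers directly; each complex
        // element is sent as two floats.
        MPI_Irecv(d_recv + p * block_elems, static_cast<int>(2 * block_elems),
                  MPI_FLOAT, p, 0, comm, &r);
        reqs.push_back(r);
        MPI_Isend(d_send + p * block_elems, static_cast<int>(2 * block_elems),
                  MPI_FLOAT, p, 0, comm, &r);
        reqs.push_back(r);
    }
    // All transfers proceed concurrently; block until every one completes.
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}

In a slab-decomposed FFT, such an exchange would sit between the local cuFFT transforms along the resident dimensions and the transforms along the distributed dimension, which is where the all-to-all communication cost arises.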