Abstract
Parallel algorithms that rely on synchronous parallelization libraries often suffer degraded performance due to global synchronization barriers. Asynchronous many-task runtimes offer task futurization capabilities that minimize or remove the need for global synchronization barriers. This paper conducts a case study of the multidimensional Fast Fourier Transform (FFT) to identify which applications will benefit from the asynchronous many-task model. Our basis is the popular FFTW library [7]. We use the asynchronous many-task runtime HPX and a one-dimensional FFTW backend to implement multiple versions using different HPX features, and we highlight overheads and pitfalls encountered during migration. Furthermore, we add an HPX threading backend to FFTW. The case study analyzes the shared-memory scaling properties of our HPX-based parallelization and of FFTW with its pthreads, OpenMP, and HPX backends. It also compares FFTW’s MPI+X backend to a purely HPX-based distributed implementation. The FFT application does not profit from asynchronous task execution. On the contrary, enforcing task synchronization results in better cache performance and thus better runtime. Nonetheless, the HPX backend for FFTW is competitive with existing backends. Our distributed HPX implementation, based on HPX collectives using the MPI parcelport, performs similarly to FFTW’s MPI+OpenMP. However, the LCI parcelport of HPX accelerated communication by up to a factor of 5.
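To illustrate the general pattern the abstract refers to, the following minimal sketch builds a two-dimensional transform from one-dimensional transforms, with one task per row and an explicit synchronization point before the transpose. It is not the paper's implementation: Python's `concurrent.futures` stands in for HPX futures, a naive O(n²) DFT stands in for a one-dimensional FFTW plan, and the names `dft_1d`, `transpose`, and `fft_2d` are hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft_1d(row):
    """Naive O(n^2) DFT of one row (assumed stand-in for a 1D FFTW plan)."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def fft_2d(matrix, workers=4):
    """2D DFT: row transforms, wait, transpose, row transforms, transpose."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 1: submit one task (future) per row.
        futures = [pool.submit(dft_1d, row) for row in matrix]
        # Synchronization point: all row tasks must finish before the transpose.
        rows_done = [f.result() for f in futures]
        # Step 2: transpose so that the former columns become rows.
        cols = transpose(rows_done)
        # Step 3: submit one task per former column, then wait again.
        futures = [pool.submit(dft_1d, col) for col in cols]
        cols_done = [f.result() for f in futures]
    # Step 4: transpose back to the original layout.
    return transpose(cols_done)
```

Calling `fft_2d([[1, 0], [0, 0]])` returns a 2×2 matrix whose entries all equal 1 up to floating-point rounding, as expected for a unit impulse. Python threads are used here only to show the task and synchronization structure; because of the global interpreter lock, this sketch should not be expected to show the scaling behavior discussed in the paper.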
Notes
1. https://www.amd.com/de/products/cpu/amd-epyc-7352 (visited on 01/15/2024).
2. https://www.amd.com/de/products/cpu/amd-epyc-7742 (visited on 01/15/2024).
3. https://github.com/FFTW/fftw3/pull/341 (visited on 02/04/2024).
4. https://doi.org/10.18419/darus-4094 (visited on 03/13/2024).
References
Ayala, A., et al.: FFT benchmark performance experiments on systems targeting exascale. Technical report, University of Tennessee (2022)
Burrus, C.S., Parks, T.W.: DFT/FFT and Convolution Algorithms: Theory and Implementation, 1st edn. Wiley, USA (1991)
Chandra, R., Dagum, L., Kohr, D., Menon, R., Maydan, D., McDonald, J.: Parallel Programming in OpenMP. Morgan Kaufmann (2001)
Cooley, J., Tukey, J.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Daiß, G., et al.: Stellar mergers with HPX-Kokkos and SYCL: methods of using an asynchronous many-task runtime system with SYCL. In: IWOCL 2023. ACM, New York (2023)
Deserno, M., Holm, C.: How to mesh up Ewald sums. I. A theoretical and numerical comparison of various particle mesh routines. J. Chem. Phys. 109(18), 7678–7693 (1998)
Frigo, M., Johnson, S.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)
Gabriel, E., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30218-6_19
Gholami, A., et al.: AccFFT: a library for distributed-memory FFT on CPU and GPU architectures. CoRR (2015)
Kaiser, H., et al.: HPX - the C++ standard library for parallelism and concurrency. J. Open Sour. Softw. 5(53), 2352 (2020)
Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: CVPR, pp. 4013–4021 (June 2016)
Marcello, D.C., et al.: Octo-tiger: a new, 3D hydrodynamic code for stellar mergers that uses HPX parallelization. MNRAS 504(4), 5345–5382 (2021)
Nichols, B., Buttlar, D., Farrell, J.P.: Pthreads Programming. O’Reilly & Associates Inc., USA (1996)
Pekurovsky, D.: P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SISC 34(4), C192–C209 (2012)
Thoman, P., et al.: A taxonomy of task-based parallel programming technologies for high-performance computing. J. Supercomput. 74(4), 1422–1434 (2018)
Wallace, G.K.: The JPEG still picture compression standard. Commun. ACM 34(4), 30–44 (1991)
Wu, N., et al.: Quantifying overheads in charm++ and HPX using task bench. In: Singer, J., Elkhatib, Y., Blanco Heras, D., Diehl, P., Brown, N., Ilic, A. (eds.) Euro-Par 2022. LNCS, vol. 13835, pp. 5–16. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-31209-0_1
Yan, J., Kaiser, H., Snir, M.: Design and analysis of the network software stack of an asynchronous many-task system – the LCI parcelport of HPX. In: Proceedings of the SC 2023 Workshops, pp. 1151–1161. ACM, New York (2023)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Strack, A., Taylor, C., Diehl, P., Pflüger, D. (2024). Experiences Porting Shared and Distributed Applications to Asynchronous Tasks: A Multidimensional FFT Case-Study. In: Diehl, P., Schuchart, J., Valero-Lara, P., Bosilca, G. (eds) Asynchronous Many-Task Systems and Applications. WAMTA 2024. Lecture Notes in Computer Science, vol 14626. Springer, Cham. https://doi.org/10.1007/978-3-031-61763-8_11
DOI: https://doi.org/10.1007/978-3-031-61763-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-61762-1
Online ISBN: 978-3-031-61763-8
eBook Packages: Computer Science (R0)