DOI: 10.1145/2442516.2442518

A peta-scalable CPU-GPU algorithm for global atmospheric simulations

Published: 23 February 2013

Abstract

Developing highly scalable algorithms for global atmospheric modeling is becoming increasingly important as scientists seek to understand the behavior of the global atmosphere at extreme scales. Heterogeneous architectures that combine processors with accelerators have become an important platform for large-scale computing. However, large-scale simulation of the global atmosphere poses a severe challenge for the development of highly scalable algorithms that map well onto state-of-the-art heterogeneous systems. Although GPU-accelerated computing has succeeded in several leading applications, studies that fully exploit heterogeneous architectures for global atmospheric modeling remain rare, due in large part to both the computational difficulty of the mathematical models and the high accuracy required for long-term simulations.
In this paper, we propose a peta-scalable hybrid algorithm and apply it successfully to a cubed-sphere shallow-water model for global atmospheric simulations. We employ an adjustable partition between CPUs and GPUs to balance utilization of the entire hybrid system, and present a pipe-flow scheme that performs conflict-free inter-node communication on the cubed-sphere geometry and maximizes communication-computation overlap. Systematic multithreading optimizations on both the GPU and CPU sides further enhance computing throughput and memory efficiency. Our experiments demonstrate nearly ideal strong and weak scalability on up to 3,750 nodes of Tianhe-1A. The largest run sustains 0.8 Pflops in double precision (32% of the peak performance), using 45,000 CPU cores and 3,750 GPUs.
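The adjustable CPU-GPU partition and the communication-computation overlap described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: `update_gpu`, `update_cpu`, and the `gpu_fraction` knob are illustrative stand-ins (a 1-D three-point stencil instead of the model's cubed-sphere flux computation), and the overlap is shown with a single CUDA stream plus OpenMP threads on the host.

```cpp
// Minimal sketch of an adjustable CPU-GPU domain split with overlap.
// Assumption: a 1-D stencil stands in for the shallow-water flux kernel.
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>
#include <vector>

// 1-D three-point stencil used as a stand-in for the real kernel.
__global__ void update_gpu(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

static void update_cpu(const double* in, double* out, int lo, int hi, int n) {
    #pragma omp parallel for
    for (int i = lo; i < hi; ++i)
        if (i > 0 && i < n - 1)
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

int main() {
    const int n = 1 << 20;
    const double gpu_fraction = 0.7;          // adjustable CPU-GPU partition
    const int split = static_cast<int>(n * gpu_fraction);

    std::vector<double> h_in(n, 1.0), h_out(n, 0.0);
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemset(d_out, 0, n * sizeof(double));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // GPU updates indices [1, split); CPU updates [split, n-1).
    // The kernel launch is asynchronous, so the CPU loop runs concurrently.
    const int threads = 256;
    const int blocks = (split + threads - 1) / threads;
    update_gpu<<<blocks, threads, 0, stream>>>(d_in, d_out, split + 1);
    update_cpu(h_in.data(), h_out.data(), split, n, n);

    // Copy back only the GPU-owned points; a full model would also exchange
    // halo cells at the split and across nodes (the paper's pipe-flow scheme
    // orders those exchanges to avoid conflicts on the cubed-sphere geometry).
    cudaMemcpyAsync(h_out.data(), d_out, split * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("h_out[1]=%.3f h_out[n-2]=%.3f\n", h_out[1], h_out[n - 2]);
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In this sketch, `gpu_fraction` plays the role of the adjustable partition: tuning it so that the CPU and GPU portions finish at roughly the same time is what balances the hybrid node.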


Published In

PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2013
332 pages
ISBN: 9781450319225
DOI: 10.1145/2442516
  • ACM SIGPLAN Notices, Volume 48, Issue 8 (PPoPP '13)
    August 2013
    309 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2517327


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. atmospheric modeling
  2. communication-computation overlap
  3. gpu
  4. heterogeneous system
  5. parallel algorithm
  6. scalability

Qualifiers

  • Research-article

Conference

PPoPP '13

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

