DOI: 10.1145/3394277.3401860
research-article
Open access

Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation

Published: 29 June 2020

Abstract

This study proposes a fast low-order finite element solver for crustal deformation computations by applying Tensor Core, AI-specific hardware on a Volta GPU. Tensor Core can compute large matrix-matrix multiplications rapidly in half precision. We redesign a state-of-the-art solver algorithm so that lower-precision data types can be used and memory access costs can be reduced even when we use small matrices. With the proposed solver, we solved two-layered problems with 13 billion degrees of freedom that mimicked the Earth's crust and mantle using 36 compute nodes of Summit. In the matrix-vector kernel, we obtained a 4.1-fold speedup over a standard single-precision kernel. Our proposed solver increased the FLOP count of the entire solver; however, it reduced the time-to-solution by a factor of 1.7 because Tensor Core provided high effective performance.
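
The numerical idea behind half-precision matrix multiplication can be illustrated with a minimal sketch. It is not the authors' kernel and does not run on Tensor Core. It only emulates, in pure Python, a small matrix product whose inputs are rounded to half precision (FP16) while the sum is accumulated in single precision (FP32). The helper names `to_half`, `to_single`, `matmul_mixed`, and `matmul_reference` are hypothetical, and the accumulation order is an assumption that real hardware may not follow.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 binary16 (half-precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round x to the nearest IEEE 754 binary32 (single-precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matmul_mixed(a, b):
    """Multiply small matrices with FP16 inputs and FP32 accumulation.

    Idealized emulation: inputs are rounded to FP16 once, and each
    partial sum is rounded to FP32 after every addition.
    """
    n, k, m = len(a), len(b), len(b[0])
    a16 = [[to_half(v) for v in row] for row in a]
    b16 = [[to_half(v) for v in row] for row in b]
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # The product of two FP16 values fits exactly in FP32,
                # so only the running sum needs rounding.
                acc = to_single(acc + a16[i][p] * b16[p][j])
            c[i][j] = acc
    return c


def matmul_reference(a, b):
    """Plain double-precision product used as the comparison baseline."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def max_abs_error(x, y):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


if __name__ == "__main__":
    a = [[0.1, 0.2], [0.3, 0.4]]
    err = max_abs_error(matmul_mixed(a, a), matmul_reference(a, a))
    print("max abs error from FP16 input rounding:", err)
```

In this emulation the error comes almost entirely from rounding the inputs to FP16, which suggests why a solver has to be restructured before lower-precision data types can be used safely.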


Cited By

  • (2024) Low-Ordered Orthogonal Voxel Finite Element with INT8 Tensor Cores for GPU-Based Explicit Elastic Wave Propagation Analysis. Computational Science – ICCS 2024, 257-271. DOI: 10.1007/978-3-031-63759-9_31. Online publication date: 29-Jun-2024
  • (2023) Exploiting the Potential of Flexible Processing Units. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 34-45. DOI: 10.1109/SBAC-PAD59825.2023.00013. Online publication date: 17-Oct-2023
  • (2022) Extreme Scale Earthquake Simulation with Uncertainty Quantification. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-11. DOI: 10.1109/SC41404.2022.00009. Online publication date: Nov-2022

      Published In

      PASC '20: Proceedings of the Platform for Advanced Scientific Computing Conference
      June 2020
      169 pages
      ISBN: 9781450379939
      DOI: 10.1145/3394277
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Conjugate gradient method
      2. Finite element analysis
      3. GPU computation
      4. Transprecision computing

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      PASC '20

      Acceptance Rates

      PASC '20 Paper Acceptance Rate 16 of 36 submissions, 44%;
      Overall Acceptance Rate 109 of 221 submissions, 49%

      Article Metrics

      • Downloads (last 12 months): 332
      • Downloads (last 6 weeks): 23
      Reflects downloads up to 30 Aug 2024
