research-article

Open access

Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation

Authors:

Takuma Yamaguchi,

Tsuyoshi Ichimura,

Christopher J. Zimmer,

Tjerk P. Straatsma,

Lalith Maddegedara,

Naonori Ueda

PASC '20: Proceedings of the Platform for Advanced Scientific Computing Conference

Article No. 16, pages 1–11

https://doi.org/10.1145/3394277.3401860

Published: 29 June 2020

Abstract

This study proposes a fast low-order finite element solver for crustal deformation computations that uses Tensor Cores, the AI-specific hardware on NVIDIA Volta GPUs. Tensor Cores compute large matrix-matrix multiplications rapidly in half precision. We redesign a state-of-the-art solver algorithm so that it can use lower-precision data types and reduce memory access costs even when the matrices involved are small. With the proposed solver, we solved two-layered problems with 13 billion degrees of freedom that mimic the Earth's crust and mantle, using 36 compute nodes of Summit. In the matrix-vector product kernel, we obtained a 4.1-fold speedup over a standard single-precision kernel. Although our approach increases the total FLOP count of the solver, it reduces the time-to-solution by 1.7-fold because Tensor Cores provide high effective performance.
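
The abstract's central idea is to recast memory-bound per-element matrix-vector work as small matrix-matrix products that Tensor Cores can execute in half precision. The pure-Python sketch below illustrates that idea under stated assumptions rather than reproducing the paper's kernel: it assumes one small element matrix `K_e` is applied to several vectors in a single batch, and it emulates Volta Tensor Core arithmetic as FP16 inputs with FP32 accumulation (the hardware's rounding and summation order are not reproduced bit for bit). The helper names (`matmul_fp16_fp32acc`, `arithmetic_intensity`, and so on), the matrix sizes, and the values are hypothetical.

```python
import struct

# Hypothetical illustration: emulates Tensor Core style arithmetic
# (FP16 inputs, FP32 accumulation) in pure Python. Not bit-exact with hardware.


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 (half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 (single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matmul_fp16_fp32acc(a, b):
    """Return C = A @ B with A and B rounded to FP16 and sums kept in FP32."""
    rows, inner, cols = len(a), len(b), len(b[0])
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                # A product of two FP16 values is exact in FP32; round each sum to FP32.
                acc = to_fp32(acc + a16[i][p] * b16[p][j])
            c[i][j] = acc
    return c


def matmul_ref(a, b):
    """Reference C = A @ B in Python's double precision."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_rel_error(c, ref):
    """Largest entry-wise error relative to the largest reference magnitude."""
    scale = max(abs(v) for row in ref for v in row)
    worst = max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))
    return worst / scale


def arithmetic_intensity(n, m, bytes_per_value=2):
    """FLOPs per byte when an n x n matrix is applied to m vectors at once.

    Simplified traffic model (an assumption): the matrix, the m input vectors
    and the m output vectors each move through memory once, and every value
    takes bytes_per_value bytes.
    """
    flops = 2 * n * n * m  # one multiply and one add per matrix entry per vector
    bytes_moved = bytes_per_value * (n * n + 2 * n * m)
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical 4 x 4 "element matrix" applied to a batch of 3 vectors.
    k_e = [[4.0, -1.1, 0.3, 0.0],
           [-1.1, 3.7, -0.9, 0.2],
           [0.3, -0.9, 5.1, -1.3],
           [0.0, 0.2, -1.3, 2.9]]
    u = [[0.1, 1.0, -0.5],
         [0.7, -0.2, 0.3],
         [-0.4, 0.6, 0.9],
         [0.25, 0.05, -0.8]]
    err = max_rel_error(matmul_fp16_fp32acc(k_e, u), matmul_ref(k_e, u))
    print(f"max relative error with FP16 inputs: {err:.2e}")
    # Batching more vectors raises the FLOPs per byte of the small product.
    for m in (1, 4, 16):
        print(m, round(arithmetic_intensity(30, m), 3))
```

Under the simple traffic model in `arithmetic_intensity`, batching 16 vectors raises the FLOPs per byte of a hypothetical 30 × 30 matrix from about 0.94 to about 7.7. Higher intensity means less memory traffic per FLOP, which is one way extra arithmetic can still shorten time-to-solution on hardware with much higher half-precision throughput.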

Cited By

Ichimura T, Fujita K, Hori M, Lalith M. 2024. Low-Ordered Orthogonal Voxel Finite Element with INT8 Tensor Cores for GPU-Based Explicit Elastic Wave Propagation Analysis. In Computational Science – ICCS 2024, 257–271. Online publication date: 29 June 2024. https://doi.org/10.1007/978-3-031-63759-9_31

Vázquez M, Azhar M, Trancoso P. 2023. Exploiting the Potential of Flexible Processing Units. In 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 34–45. Online publication date: 17 October 2023. https://doi.org/10.1109/SBAC-PAD59825.2023.00013

Ichimura T, Fujita K, Kusakabe R, Koyama K, Murakami S, Kikuchi Y, Hori T, Hori M, Inoue H, Nose T, Kawashima T, Lalith M. 2022. Extreme Scale Earthquake Simulation with Uncertainty Quantification. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–11. Online publication date: November 2022. https://doi.org/10.1109/SC41404.2022.00009

Index Terms

Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation
1. Applied computing
  1. Physical sciences and engineering
    1. Earth and atmospheric sciences
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms

Information

Published In

ACM Conferences

PASC '20: Proceedings of the Platform for Advanced Scientific Computing Conference

June 2020

169 pages

ISBN:9781450379939

DOI:10.1145/3394277

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing
CSCS: Swiss National Supercomputing Centre
ETH Zurich: Swiss Federal Institute of Technology in Zurich

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Japan Society for the Promotion of Science

Conference

PASC '20

PASC '20: Platform for Advanced Scientific Computing Conference

June 29 - July 1, 2020

Geneva, Switzerland

Acceptance Rates

PASC '20 paper acceptance rate: 16 of 36 submissions (44%)

Overall acceptance rate: 109 of 221 submissions (49%)

Article Metrics

Total citations: 3
Total downloads: 733
Downloads (last 12 months): 332
Downloads (last 6 weeks): 23

Reflects downloads up to 30 Aug 2024
