2008: Tesla 10-series, based on GT200, 240 cores; CUDA 2.0, 2.3
2009: Tesla 20-series, code-named Fermi, up to 512 cores; CUDA SDK 3.0
Tesla 20-series (Fermi) configurations (GPUs | Single Precision | Double Precision | Memory | Memory B/W):
Tesla C2050 / C2070 Workstation Board: 1 Tesla GPU | 1030 Gigaflops | 515 Gigaflops | 3 GB / 6 GB | 144 GB/s
4 Tesla GPUs | 4120 Gigaflops | 2060 Gigaflops | 12 GB (3 GB / GPU) | 148 GB/s
1 Tesla GPU | 1030 Gigaflops | 515 Gigaflops | 3 GB / 6 GB | 148 GB/s
Computational Structural Mechanics (CSM), explicit: impact loads and structural failure. Examples: short-duration impact, contacts, crashworthiness, jet-engine blade failure, bird strike. Codes: LS-DYNA, ABAQUS/Explicit, PAM-CRASH, RADIOSS.
Computational Fluid Dynamics (CFD): flow of liquids (e.g., water) and gases (e.g., air). Examples: aerodynamics, propulsion, reacting flows, multiphase flow, cooling/heat transfer. Codes: ANSYS FLUENT, STAR-CD, STAR-CCM+, CFD++, ANSYS CFX, AcuSolve, PowerFLOW.
Structural Mechanics: ANSYS Mechanical, AFEA, Abaqus/Standard (beta), LS-DYNA implicit, Marc
Fluid Dynamics: AcuSolve, Moldflow, Culises (OpenFOAM), Particleworks, CFD++, LS-DYNA CFD
Electromagnetics: Nexxim, EMPro, CST MS, XFdtd, SEMCAD X, Xpatch
Research Evaluation: FLUENT/CFX, STAR-CCM+, HFSS
All codes are parallel and scale across multiple CPU cores.
Fair GPU vs. CPU comparisons should be made CPU socket to GPU socket; the comparisons presented here are against a 4-core Nehalem socket.
Applications: ANSYS CFD (FLUENT and CFX); ANSYS Mechanical; HFSS; Abaqus/Standard; Abaqus/Explicit; LS-DYNA; MD Nastran; Marc; Adams; STAR-CD; STAR-CCM+; RADIOSS; NX Nastran; PAM-CRASH; PAM-STAMP; CFD++; AcuSolve; Moldflow
DGEMM performance for matrix dimensions in multiples of 64
cuBLAS 3.1: NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL 10.2.4.32: quad-core Intel Xeon 5550, 2.67 GHz
cuBLAS 3.2: NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL 10.2.4.32: quad-core Intel Xeon 5550, 2.67 GHz
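To make the benchmarked call concrete, here is a minimal cuBLAS DGEMM sketch in CUDA C++; the matrix size (4096, a multiple of 64), the fill values, and the omission of error checking are assumptions for illustration, not part of the benchmark setup above.

```cpp
// Minimal cuBLAS DGEMM sketch: C = alpha*A*B + beta*C for square matrices
// whose dimension is a multiple of 64, the regime highlighted in the charts.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 4096;                           // multiple of 64 (illustrative)
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> hA((size_t)n * n, 1.0), hB((size_t)n * n, 2.0), hC((size_t)n * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Column-major DGEMM, no transposition; leading dimensions equal n.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Sizes that are multiples of 64 line up with cuBLAS's internal blocking, which is the regime the charts call out.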
Upper threshold: fronts too large for a single GPU's memory need multiple GPUs. Lower threshold: fronts too small to overcome PCIe data-transfer costs stay on the CPU cores.
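A hypothetical sketch of that placement decision is shown below; the function, enum, and threshold names are illustrative assumptions, not the solver's actual API.

```cpp
#include <cstddef>

// Illustrative decision for a multifrontal direct solver: where to factor one front.
// Names and thresholds are assumptions for the sketch, not an actual solver interface.
enum class FrontTarget { CpuCores, SingleGpu, MultiGpu };

FrontTarget choose_front_target(std::size_t front_bytes,
                                std::size_t gpu_free_bytes,
                                std::size_t lower_threshold_bytes) {
    if (front_bytes < lower_threshold_bytes)
        return FrontTarget::CpuCores;   // too small: PCIe transfer cost would dominate
    if (front_bytes > gpu_free_bytes)
        return FrontTarget::MultiGpu;   // too large for one GPU's memory
    return FrontTarget::SingleGpu;      // large enough to amortize the transfer
}
```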
HP Z800 Workstation + NVIDIA Tesla C2050 GPU
V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 finite elements, static nonlinear analysis, one load step, direct sparse solver.
[Chart: V12sp-5 solution time in seconds (lower is better) for 1, 2, 4, and 6 cores on one socket and 12 cores on two sockets; labeled times include 1616, 1062, 668, and 521 s, with speedup annotations of 1.6x, 1.5x, 1.6x, and 1.3x.]
[Chart: V12sp-5 solution time (lower is better), CPU vs. CPU + GPU, for 1, 2, 4, and 6 cores on one socket and 12 cores on two sockets; labeled times include 1616, 1062, 668, 606, 521, 494, 448, 438, and 398 s, with GPU speedup annotations of 4.4x, 3.3x, 2.4x, 1.5x, and 1.3x.]
[Chart: V12sp-5 solution time (lower is better) for 1, 2, 4, and 6 cores on one socket and 8 cores on two sockets; labeled times include 1412, 830, 690, and 593 s, with speedup annotations of 1.8x, 1.7x, 1.2x, and 1.1x.]
[Chart: V12sp-5 solution time (lower is better), CPU vs. CPU + GPU, for 1, 2, 4, and 6 cores on one socket and 8 cores on two sockets; labeled times include 1412, 690, 593, 471, 426, 411, and 390 s, with GPU speedup annotations of 4.6x, 3.0x, 1.7x, and 1.5x.]
Xeon X5560 2.8 GHz Nehalem, 4 cores (dual socket) vs. Xeon X5560 2.8 GHz Nehalem, 4 cores + Tesla C2050
Results from HP Z800 Workstation, 2 x Xeon X5560 2.8 GHz CPUs, 48 GB memory, MKL 10.25; Tesla M2050, CUDA 3.1
[Chart: V12sp-5 solution time vs. host memory; the 24 GB and 32 GB configurations run out-of-memory, the 48 GB configuration runs in-memory (34 GB required for the in-memory solution). CPU times: 1524, 1155, and 830 s; CPU + GPU times: 1214, 682, and 426 s, for speedups of 1.3x, 1.7x, and 2.0x.]
[Chart: the same V12sp-5 memory study expressed as percentage gains, with annotations of 32%, 78%, 39%, and 60% on the CPU and CPU + GPU runs as host memory grows from 24 GB to 32 GB and to 48 GB (in-memory; 34 GB required).]
Historically, hardware was very expensive relative to ISV software and people; software budgets are now roughly 4x hardware budgets.
It is increasingly important that hardware choices drive cost efficiency in people and software.
[Chart: GPU speedup factors (higher is better) of 3.4x, 3.0x, and 3.7x for models whose CPU profiles are 75%, 71%, and 80% time in the solver, respectively. Model S4b: 5 MM DOFs, 1.03E+13 FP ops.]
[Chart: Engine Model total time (lower is better). CPU: 5825 s (858 s outside the solver + 4967 s in the solver; CPU profile 85% in solver). CPU + GPU: 2659 s (850 s + 1809 s), a 2.8x speedup in the solver.]
Xeon 5550 2.67 GHz Nehalem (dual socket) vs. Xeon 5550 2.67 GHz Nehalem + Tesla C2050
Results from HP Z800 Workstation, 2 x Xeon X5550 2.67 GHz CPUs, 48 GB memory, MKL 10.25; Tesla C2050 with CUDA 3.1
[Chart: Engine Model total time (lower is better) at 4 and 8 cores. 4 cores: 5825 s CPU vs. 2659 s CPU + GPU (2.2x); 8 cores: 3224 s vs. 1881 s (1.7x); the 8-core GPU run is annotated as 41% faster than the 4-core GPU run.]
[Chart: OUTER3 model total time (lower is better) on 1, 2, 4, and 8 CPU cores: 2030, 1085, 605, and 350 s, with step speedups of 1.9x, 1.8x, and 1.7x. Note: the CPU scales to 8 cores for a 5.8x benefit over 1 core.]
[Chart: OUTER3 model total time with a GPU added: speedups of 4.8x, 3.3x, 2.4x, and 1.6x over the matching 1-, 2-, 4-, and 8-core CPU-only runs (e.g., 240 s on 4 cores + GPU and 215 s on 8 cores + GPU). Note: more cores still speed up the total time.]
Model geometry is decomposed, and partitions are sent to independent compute nodes on a cluster.
Compute nodes operate in distributed parallel fashion, using MPI communication to complete a solution for each time step (nodes N1, N2, N3, N4).
With GPU acceleration, the decomposition is the same, but each compute node is paired with a GPU (N1+G1, N2+G2, N3+G3, N4+G4); the nodes still cooperate through MPI each time step, and a global solution is assembled once the full time duration is completed.
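A minimal sketch of this pattern, assuming one MPI rank per partition and one GPU per rank; the halo size, step count, and final reduction are illustrative placeholders, and the physics kernel itself is omitted.

```cpp
// Sketch (not any ISV's actual code): one MPI rank per partition, one GPU per
// rank, halo exchange with neighboring partitions at every explicit time step.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Map rank Ni to GPU Gi on its node.
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0) cudaSetDevice(rank % ngpus);

    const int halo = 1024;                        // illustrative halo size
    std::vector<double> send(halo, 0.0), recv(halo, 0.0);
    const int left  = (rank + nranks - 1) % nranks;
    const int right = (rank + 1) % nranks;

    for (int step = 0; step < 100; ++step) {
        // ... advance this partition's elements on the GPU (kernels omitted) ...

        // Exchange interface/halo data with neighboring partitions.
        MPI_Sendrecv(send.data(), halo, MPI_DOUBLE, right, 0,
                     recv.data(), halo, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // A global quantity (e.g., an energy balance) can be reduced at the end.
    double local = 0.0, global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```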
Each partition would be mapped to a GPU, with shared-memory OpenMP parallelism providing a second level of parallelism in a hybrid model.
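A minimal sketch of that second level, assuming OpenMP threads inside a rank handle element-level assembly while the partition's heavy solve is offloaded to the rank's GPU; all function bodies here are illustrative placeholders, not an ISV implementation.

```cpp
// Hybrid-model sketch: within one MPI rank, OpenMP threads work on the
// partition in shared memory while the partition's solve goes to the GPU.
#include <omp.h>
#include <cuda_runtime.h>
#include <vector>
#include <cmath>

// Placeholder for per-element work done by CPU threads (stand-in for real assembly).
static void assemble_element(int e, std::vector<double>& rhs) {
    rhs[e] = std::sin(static_cast<double>(e));
}

// Placeholder for the partition solve that would launch CUDA kernels.
static void gpu_solve_partition(std::vector<double>& rhs) {
    double* d = nullptr;
    cudaMalloc(&d, rhs.size() * sizeof(double));
    cudaMemcpy(d, rhs.data(), rhs.size() * sizeof(double), cudaMemcpyHostToDevice);
    // ... solver kernels would run here ...
    cudaFree(d);
}

void advance_partition_one_step(int num_elements) {
    std::vector<double> rhs(num_elements, 0.0);

    // Second level of parallelism: shared-memory OpenMP threads within the rank.
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < num_elements; ++e)
        assemble_element(e, rhs);

    // The partition's heavy solve is offloaded to the rank's GPU; the first
    // level of parallelism (MPI across partitions) surrounds this routine.
    gpu_solve_partition(rhs);
    cudaDeviceSynchronize();
}
```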
[Chart: CFD wall-clock comparisons for FAST3D and OVERFLOW; bar values include 549, 279, and 165.]
FAST3D: Rotor Wake Modeling with a Coupled Eulerian and Vortex Particle Method [AIAA-2010-0312], Chris Stone, Ph.D., Intelligent Light
Parallel CFD 2009 | May 2009 | NASA Ames, Moffett Field, CA, USA
OVERFLOW: Acceleration of a CFD Code with a GPU, Dennis Jespersen, NASA Ames Research Center
[Diagram: CFD code landscape. Explicit solvers (usually compressible): TurboStream, Veloxi, S3D. Implicit solvers (usually incompressible): FEFLO and ISV codes such as AcuSolve and Moldflow, including unstructured grids. Annotated GPU speedups of roughly ~15x, ~8x, ~4x, and ~2x; example applications include aircraft aerodynamics, a chemical mixer, and automotive climate control.]
USC Information Sciences Institute: Dr. Bob Lucas, Director of Numerical Methods
ACUSIM (now a division of Altair Engineering): Dr. Farzin Shakib, Founder and President