2008: Tesla 10-series, based on GT200, 240 cores; CUDA 2.0, 2.3
2009: Tesla 20-series, code-named Fermi, up to 512 cores; CUDA SDK 3.0
Tesla 20-series (Fermi) configurations (GPUs | Single Precision | Double Precision | Memory | Memory B/W):
Tesla C2050 / C2070 Workstation Board: 1 Tesla GPU | 1030 Gigaflops | 515 Gigaflops | 3 GB / 6 GB | 144 GB/s
4 Tesla GPUs | 4120 Gigaflops | 2060 Gigaflops | 12 GB (3 GB / GPU) | 148 GB/s
1 Tesla GPU | 1030 Gigaflops | 515 Gigaflops | 3 GB / 6 GB | 148 GB/s
Computational Structural Mechanics (CSM), explicit: impact loads and structural failure. Examples: short-duration impact, contacts, crashworthiness, jet-engine blade failure, bird strike. Codes: LS-DYNA, ABAQUS/Explicit, PAM-CRASH, RADIOSS.
Computational Fluid Dynamics (CFD): flow of liquids (e.g., water) and gases (e.g., air). Examples: aerodynamics, propulsion, reacting flows, multiphase flow, cooling/heat transfer. Codes: ANSYS FLUENT, STAR-CD, STAR-CCM+, CFD++, ANSYS CFX, AcuSolve, PowerFLOW.
Structural Mechanics: ANSYS Mechanical, AFEA, Abaqus/Standard (beta), LS-DYNA implicit, Marc
Fluid Dynamics: AcuSolve, Moldflow, Culises (OpenFOAM), Particleworks, CFD++, LS-DYNA CFD
Electromagnetics: Nexxim, EMPro, CST MS, XFdtd, SEMCAD X, Xpatch
Research Evaluation: FLUENT/CFX, STAR-CCM+, HFSS
All codes are parallel and scale across multiple CPU cores.
Fair GPU vs. CPU comparisons should be made CPU socket to GPU socket; the comparisons presented here are against a 4-core Nehalem socket.
Applications: ANSYS CFD (FLUENT and CFX); ANSYS Mechanical; HFSS; Abaqus/Standard; Abaqus/Explicit; LS-DYNA; MD Nastran; Marc; Adams; STAR-CD; STAR-CCM+; RADIOSS; NX Nastran; PAM-CRASH; PAM-STAMP; CFD++; AcuSolve; Moldflow
DGEMM performance for matrix dimensions in multiples of 64
cuBLAS 3.1: NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL 10.2.4.32: quad-core Intel Xeon 5550, 2.67 GHz
cuBLAS 3.2: NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL 10.2.4.32: quad-core Intel Xeon 5550, 2.67 GHz
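To make the benchmarked call concrete, here is a minimal cuBLAS DGEMM sketch in CUDA C++; the matrix size (4096, a multiple of 64), the fill values, and the omission of error checking are assumptions for illustration, not part of the benchmark setup above.

```cpp
// Minimal cuBLAS DGEMM sketch: C = alpha*A*B + beta*C for square matrices
// whose dimension is a multiple of 64, the regime highlighted in the charts.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 4096;                           // multiple of 64 (illustrative)
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> hA((size_t)n * n, 1.0), hB((size_t)n * n, 2.0), hC((size_t)n * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Column-major DGEMM, no transposition; leading dimensions equal n.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Sizes that are multiples of 64 line up with cuBLAS's internal blocking, which is the regime the charts call out.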
Upper threshold: fronts too large for a single GPU's memory need multiple GPUs. Lower threshold: fronts too small to overcome PCIe data-transfer costs stay on the CPU cores.
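A hypothetical sketch of that placement decision is shown below; the function, enum, and threshold names are illustrative assumptions, not the solver's actual API.

```cpp
#include <cstddef>

// Illustrative decision for a multifrontal direct solver: where to factor one front.
// Names and thresholds are assumptions for the sketch, not an actual solver interface.
enum class FrontTarget { CpuCores, SingleGpu, MultiGpu };

FrontTarget choose_front_target(std::size_t front_bytes,
                                std::size_t gpu_free_bytes,
                                std::size_t lower_threshold_bytes) {
    if (front_bytes < lower_threshold_bytes)
        return FrontTarget::CpuCores;   // too small: PCIe transfer cost would dominate
    if (front_bytes > gpu_free_bytes)
        return FrontTarget::MultiGpu;   // too large for one GPU's memory
    return FrontTarget::SingleGpu;      // large enough to amortize the transfer
}
```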
HP Z800 Workstation + NVIDIA Tesla C2050 GPU
V12sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 finite elements, static nonlinear analysis, one load step, direct sparse solver.
[Chart: V12sp-5 solution time in seconds (lower is better) for 1, 2, 4, and 6 cores on one socket and 12 cores on two sockets; labeled times include 1616, 1062, 668, and 521 s, with speedup annotations of 1.6x, 1.5x, 1.6x, and 1.3x.]
[Chart: V12sp-5 solution time (lower is better), CPU vs. CPU + GPU, for 1, 2, 4, and 6 cores on one socket and 12 cores on two sockets; labeled times include 1616, 1062, 668, 606, 521, 494, 448, 438, and 398 s, with GPU speedup annotations of 4.4x, 3.3x, 2.4x, 1.5x, and 1.3x.]
[Chart: V12sp-5 solution time (lower is better) for 1, 2, 4, and 6 cores on one socket and 8 cores on two sockets; labeled times include 1412, 830, 690, and 593 s, with speedup annotations of 1.8x, 1.7x, 1.2x, and 1.1x.]
[Chart: V12sp-5 solution time (lower is better), CPU vs. CPU + GPU, for 1, 2, 4, and 6 cores on one socket and 8 cores on two sockets; labeled times include 1412, 690, 593, 471, 426, 411, and 390 s, with GPU speedup annotations of 4.6x, 3.0x, 1.7x, and 1.5x.]
Xeon X5560 2.8 GHz Nehalem, 4 cores (dual socket) vs. Xeon X5560 2.8 GHz Nehalem, 4 cores + Tesla C2050
Results from HP Z800 Workstation, 2 x Xeon X5560 2.8 GHz CPUs, 48 GB memory, MKL 10.25; Tesla M2050, CUDA 3.1
[Chart: V12sp-5 solution time vs. host memory; the 24 GB and 32 GB configurations run out-of-memory, the 48 GB configuration runs in-memory (34 GB required for the in-memory solution). CPU times: 1524, 1155, and 830 s; CPU + GPU times: 1214, 682, and 426 s, for speedups of 1.3x, 1.7x, and 2.0x.]
[Chart: the same V12sp-5 memory study expressed as percentage gains, with annotations of 32%, 78%, 39%, and 60% on the CPU and CPU + GPU runs as host memory grows from 24 GB to 32 GB and to 48 GB (in-memory; 34 GB required).]
Historically, hardware was very expensive relative to ISV software and people; software budgets are now roughly 4x hardware budgets.
It is increasingly important that hardware choices drive cost efficiency in people and software.
[Chart: GPU speedup factors (higher is better) of 3.4x, 3.0x, and 3.7x for models whose CPU profiles are 75%, 71%, and 80% time in the solver, respectively. Model S4b: 5 MM DOFs, 1.03E+13 FP ops.]
[Chart: Engine Model total time (lower is better). CPU: 5825 s (858 s outside the solver + 4967 s in the solver; CPU profile 85% in solver). CPU + GPU: 2659 s (850 s + 1809 s), a 2.8x speedup in the solver.]
Xeon 5550 2.67 GHz Nehalem (dual socket) vs. Xeon 5550 2.67 GHz Nehalem + Tesla C2050
Results from HP Z800 Workstation, 2 x Xeon X5550 2.67 GHz CPUs, 48 GB memory, MKL 10.25; Tesla C2050 with CUDA 3.1
[Chart: Engine Model total time (lower is better) at 4 and 8 cores. 4 cores: 5825 s CPU vs. 2659 s CPU + GPU (2.2x); 8 cores: 3224 s vs. 1881 s (1.7x); the 8-core GPU run is annotated as 41% faster than the 4-core GPU run.]
[Chart: OUTER3 model total time (lower is better) on 1, 2, 4, and 8 CPU cores: 2030, 1085, 605, and 350 s, with step speedups of 1.9x, 1.8x, and 1.7x. Note: the CPU scales to 8 cores for a 5.8x benefit over 1 core.]
[Chart: OUTER3 model total time with a GPU added: speedups of 4.8x, 3.3x, 2.4x, and 1.6x over the matching 1-, 2-, 4-, and 8-core CPU-only runs (e.g., 240 s on 4 cores + GPU and 215 s on 8 cores + GPU). Note: more cores still speed up the total time.]
Model geometry is decomposed, and partitions are sent to independent compute nodes on a cluster.
Compute nodes operate in distributed parallel fashion, using MPI communication to complete a solution for each time step (nodes N1, N2, N3, N4).
With GPU acceleration, the decomposition is the same, but each compute node is paired with a GPU (N1+G1, N2+G2, N3+G3, N4+G4); the nodes still cooperate through MPI each time step, and a global solution is assembled once the full time duration is completed.
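A minimal sketch of this pattern, assuming one MPI rank per partition and one GPU per rank; the halo size, step count, and final reduction are illustrative placeholders, and the physics kernel itself is omitted.

```cpp
// Sketch (not any ISV's actual code): one MPI rank per partition, one GPU per
// rank, halo exchange with neighboring partitions at every explicit time step.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Map rank Ni to GPU Gi on its node.
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0) cudaSetDevice(rank % ngpus);

    const int halo = 1024;                        // illustrative halo size
    std::vector<double> send(halo, 0.0), recv(halo, 0.0);
    const int left  = (rank + nranks - 1) % nranks;
    const int right = (rank + 1) % nranks;

    for (int step = 0; step < 100; ++step) {
        // ... advance this partition's elements on the GPU (kernels omitted) ...

        // Exchange interface/halo data with neighboring partitions.
        MPI_Sendrecv(send.data(), halo, MPI_DOUBLE, right, 0,
                     recv.data(), halo, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // A global quantity (e.g., an energy balance) can be reduced at the end.
    double local = 0.0, global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```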
Each partition would be mapped to a GPU, with shared-memory OpenMP parallelism providing a second level of parallelism in a hybrid model.
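A minimal sketch of that second level, assuming OpenMP threads inside a rank handle element-level assembly while the partition's heavy solve is offloaded to the rank's GPU; all function bodies here are illustrative placeholders, not an ISV implementation.

```cpp
// Hybrid-model sketch: within one MPI rank, OpenMP threads work on the
// partition in shared memory while the partition's solve goes to the GPU.
#include <omp.h>
#include <cuda_runtime.h>
#include <vector>
#include <cmath>

// Placeholder for per-element work done by CPU threads (stand-in for real assembly).
static void assemble_element(int e, std::vector<double>& rhs) {
    rhs[e] = std::sin(static_cast<double>(e));
}

// Placeholder for the partition solve that would launch CUDA kernels.
static void gpu_solve_partition(std::vector<double>& rhs) {
    double* d = nullptr;
    cudaMalloc(&d, rhs.size() * sizeof(double));
    cudaMemcpy(d, rhs.data(), rhs.size() * sizeof(double), cudaMemcpyHostToDevice);
    // ... solver kernels would run here ...
    cudaFree(d);
}

void advance_partition_one_step(int num_elements) {
    std::vector<double> rhs(num_elements, 0.0);

    // Second level of parallelism: shared-memory OpenMP threads within the rank.
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < num_elements; ++e)
        assemble_element(e, rhs);

    // The partition's heavy solve is offloaded to the rank's GPU; the first
    // level of parallelism (MPI across partitions) surrounds this routine.
    gpu_solve_partition(rhs);
    cudaDeviceSynchronize();
}
```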
[Chart: CFD wall-clock comparisons for FAST3D and OVERFLOW; bar values include 549, 279, and 165.]
FAST3D: Rotor Wake Modeling with a Coupled Eulerian and Vortex Particle Method [AIAA-2010-0312], Chris Stone, Ph.D., Intelligent Light
Parallel CFD 2009 | May 2009 | NASA Ames, Moffett Field, CA, USA
OVERFLOW: Acceleration of a CFD Code with a GPU, Dennis Jespersen, NASA Ames Research Center
[Diagram: CFD code landscape. Explicit solvers (usually compressible): TurboStream, Veloxi, S3D. Implicit solvers (usually incompressible): FEFLO and ISV codes such as AcuSolve and Moldflow, including unstructured grids. Annotated GPU speedups of roughly ~15x, ~8x, ~4x, and ~2x; example applications include aircraft aerodynamics, a chemical mixer, and automotive climate control.]
USC Information Sciences Institute: Dr. Bob Lucas, Director of Numerical Methods
ACUSIM (now a division of Altair Engineering): Dr. Farzin Shakib, Founder and President