1 Introduction

High Performance Computing
Why it is important
Introduction
HPC = ?
NO
Introduction
Software development has not evolved as fast as hardware capability and network capacity.
The gap between nominal and sustained performance of computing systems keeps widening unless codes are manually optimised, which limits portability to other systems.
Development and maintenance of software for advanced computing systems is becoming increasingly effort-intensive, requiring dual expertise, both on the application side and on the system side.
…
In order to program the next generation of computing systems, everyone must become a parallel programmer!
What is Parallel Computing?
Simultaneous use of multiple resources to solve a
computational problem.
In scientific codes, there is often a large amount of work to be done, and it is often regular to some extent, with the same operation being performed on many data.
for (i=0; i<n; i++)
    a[i] = b[i] + c[i];
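Because every iteration of the loop above is independent, it is a natural candidate for parallel execution. The following is a minimal sketch, assuming an OpenMP-capable compiler; the function name `vector_add` is chosen here for illustration and does not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise sum a[i] = b[i] + c[i].
 * No iteration depends on another, so the loop can be split across threads.
 * The pragma takes effect only when OpenMP is enabled at compile time
 * (e.g. -fopenmp with gcc); otherwise it is ignored and the loop runs serially. */
void vector_add(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

The serial and parallel builds produce the same result, which makes this a convenient first test when checking that a parallel toolchain works.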

Why use Parallel Computing?
• Save time
• Solve larger problems
  • Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer
  • More memory available
  • Higgs discovery “only possible because of the extraordinary achievements of Grid computing” (Rolf Heuer, CERN DG)
• Use of non-local resources
  • Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce. E.g. SETI@home.
Current CPUs and GPUs
• Modern CPUs have vector instructions that can perform multiple instances of an operation simultaneously. On Intel processors these are known as Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX). GPUs are essentially vector processors.
• By far the most common parallel computer architecture is MIMD (Multiple Instruction Multiple Data), usually programmed in Single Program Multiple Data (SPMD) style: the processors execute multiple, possibly differing instructions, each on their own data. There is a great variety in MIMD computers. Some of the differences concern the way memory is organized and the network that connects the processors.
Parallel Programming
The simplest case is the embarrassingly parallel problem, where little or no effort is required to speed up the process
• No dependency/communication between the parallel tasks
Examples:
• Distributed relational database queries
• Rendering of computer graphics
• Event simulation and reconstruction in particle physics
• Brute-force searches in cryptography
• Ensemble calculations of numerical weather prediction
BUT normally you have to work (hard) to parallelize an application!
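As an illustration of the embarrassingly parallel case, here is a minimal sketch of a brute-force search. It assumes OpenMP 3.1 or later for the `min` reduction. `toy_hash` and `brute_force_search` are hypothetical names, and the hash is a toy stand-in rather than a real cryptographic function. Each candidate key is checked independently; the only coordination is the final reduction that keeps the smallest matching key.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy "hash" standing in for an expensive check; not a real cipher. */
static uint32_t toy_hash(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x;
}

/* Return the smallest key in [0, limit) whose hash equals target,
 * or limit if no key matches. Iterations share no data, so they can be
 * distributed over threads freely; without OpenMP the loop runs serially. */
uint32_t brute_force_search(uint32_t target, uint32_t limit)
{
    assert(limit > 0);
    uint32_t found = limit;
    #pragma omp parallel for reduction(min : found)
    for (long long k = 0; k < (long long)limit; k++) {
        if (toy_hash((uint32_t)k) == target && (uint32_t)k < found) {
            found = (uint32_t)k;
        }
    }
    return found;
}
```

Most real applications are not this simple: as soon as tasks need each other's results, communication and synchronization have to be designed explicitly.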

To summarize
In order to write efficient scientific codes, it is important to understand the resource architecture.
The difference in speed between two codes that compute the same result can range from a few percent to orders of magnitude, depending only on factors relating to how well the algorithms are coded for the specific architecture.
Clearly, it is not enough to have an algorithm and “put it on the computer”: some knowledge of computer architecture is useful, sometimes crucial.
JUNE 2013 -NOVEMBER 2015
62%
65%
85%
88% of the cores 46% of the cores Homogeneous cores

NOVEMBER 2016
74%
The ShenWei 26010 is a 260-core, 64-bit RISC chip that exceeds 3 teraflops at maximum tilt, putting it on par with Intel’s Knights Landing Xeon Phi.
74%
JUNE 2019
https://en.wikichip.org/wiki/ibm/microarchitectures/power9
Today
 https://www.arm.com/blogs/blueprint/fujitsu-a64fx-arm
 https://www.top500.org/news/fugaku-holds-top-spot-exascale-remains-elusive/
Cloud and High Performance Computing
101
• The advantage of pay-as-you-go computing has been an industry goal for many years.
• The Globus Project has shown the power of Grid computing.
• Cloud computing takes Grid computing to a whole new level by using virtualization to encapsulate an operating system (OS) instance.
Operational costs 1: energy costs
Total cost of ownership: total cost of acquisition plus operating costs
MFLOPS/W KW
A supercomputer’s lifetime energy costs almost equal its investment costs
Tianhe-2: ≈ 20 MW ≈ $20 million/year for electricity
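The electricity figure can be sanity-checked with a short calculation. The price of about 0.11 US$/kWh used below is an assumption chosen for illustration; it is not stated in the slides.

```latex
\[
E_{\text{year}} = 20\,\text{MW} \times 8760\,\text{h} = 175.2\,\text{GWh}
\]
\[
C_{\text{year}} \approx 175.2 \times 10^{6}\,\text{kWh} \times 0.11\,\tfrac{\text{US\$}}{\text{kWh}}
\approx 19.3 \times 10^{6}\,\text{US\$}
\]
```

Under that assumed price, continuous operation at 20 MW lands close to the quoted 20 million US$ per year.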

Operational costs 2: human resources and tools
HARDWARE
DEVELOPMENT
TOOLS
DEVELOPERS
• Development tools
  • GNU gcc – free
  • Intel Parallel Studio XE – from 699 US$, now free (oneAPI)
  • PGI Accelerator Workstation – from 759 US$, again now free
  • NVIDIA HPC SDK – free
Raw numbers or real performance?
• A workstation with dual 12-core CPUs and 4 GPUs
  • 139 GFLOPs SP / 69 GFLOPs DP per CPU
  • 3.5 TFLOPs SP / 1.2 TFLOPs DP per GPU
  • = 14.3 TFLOPs SP / 4.9 TFLOPs DP at about 12,000 US$
  • 1.6 kW
• First position in the TOP500 list in June 2001, at the bottom of the list in June 2008
• How can programmers exploit such performance?
  • programming paradigms and languages
  • development tools
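The aggregate peak figures quoted for the workstation follow directly from the per-device numbers. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>

/* Aggregate peak of a node: sum of the per-device peaks, in GFLOPs. */
double node_peak_gflops(int n_cpus, double gflops_per_cpu,
                        int n_gpus, double gflops_per_gpu)
{
    assert(n_cpus >= 0 && n_gpus >= 0);
    return n_cpus * gflops_per_cpu + n_gpus * gflops_per_gpu;
}

/* Energy efficiency in GFLOPs per watt. */
double gflops_per_watt(double gflops, double watts)
{
    assert(watts > 0.0);
    return gflops / watts;
}
```

With the slide's values, `node_peak_gflops(2, 139.0, 4, 3500.0)` gives 14278 GFLOPs (about 14.3 TFLOPs SP) and `node_peak_gflops(2, 69.0, 4, 1200.0)` gives 4938 GFLOPs (about 4.9 TFLOPs DP). These are peak values only; what an application actually sustains is a separate question.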
Raw numbers or real performances ?
Architecture of test workstation (fp32)
[Diagram: two 6-core CPUs sharing main memory, plus one GPU with its own memory]
• 2 Intel Xeon E5645 (115 Gflops, 9.6 per core), CPU bus 25.6 GB/s, 64 GB main memory
• nVidia GTX 580 (1581 Gflops), memory bus 192 GB/s, 1.5 GB GPU memory

Architecture of test workstation (fp64)
[Diagram: same layout as above]
• 2 Intel Xeon E5645 (58 Gflops), CPU bus 25.6 GB/s, 64 GB main memory
• nVidia GTX 580 (198 Gflops), memory bus 192 GB/s, 1.5 GB GPU memory

Parallel Computing
Performance Metrics
Let T(n,p) be the time to solve a problem of size n using p processors
• Speedup: S(n,p) = T(n,1)/T(n,p)
• Efficiency: E(n,p) = S(n,p)/p
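The two definitions above translate directly into code. A minimal sketch, with hypothetical function names, taking measured run times as input:

```c
#include <assert.h>

/* Speedup S(n,p) = T(n,1) / T(n,p) for a fixed problem size n. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency E(n,p) = S(n,p) / p. */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / p;
}
```

For example, a run that takes 100 s on one processor and 25 s on eight has a speedup of 4 and an efficiency of 0.5.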
Amdahl’s Law
Maximal Speedup = 1/(1-P), P parallel portion of code
Amdahl’s Law
Speedup = 1/((P/N)+S),
N no. of processors, S serial portion of code
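A minimal sketch of both forms of Amdahl's law, using S = 1 - P so that the two formulas above share a single parameter. The function names are hypothetical.

```c
#include <assert.h>

/* Amdahl speedup on N processors: 1 / ((P / N) + S), with S = 1 - P,
 * where P is the parallel fraction of the code (0 <= P <= 1). */
double amdahl_speedup(double parallel_fraction, int n_procs)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(n_procs > 0);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (parallel_fraction / n_procs + serial_fraction);
}

/* Upper bound as N grows without limit: 1 / (1 - P). Requires P < 1. */
double amdahl_max_speedup(double parallel_fraction)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With P = 0.75 the speedup can never exceed 4, however many processors are used, because the serial quarter of the run time remains.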
Amdahl Was an Optimist
Parallelization usually adds communications/overheads
To summarize
0 < Speedup ≤ p
0 < Efficiency ≤ 1
Linear speedup : speedup = p.

Amdahl was a Pessimist
Superlinear speedup is very rare. Some reasons for speedup > p (efficiency > 1):
• The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk. An important reason for using parallel computers.
• In developing the parallel program a better algorithm was discovered; the older serial algorithm was not the best possible. A useful side-effect of parallelization.
• In general, the time spent in the serial portion of the code is a decreasing fraction of the total time as problem size increases.
The lesson is
• Linear speedup is rare, due to communication overhead, load imbalance, algorithm/architecture mismatch, etc.
• Further, essentially nothing scales to arbitrarily many processors.
• However, for most users, the important question is: Have I achieved acceptable performance on my software for a suitable range of data and the resources I’m using?
The N-Body simulation
N-Body – sequential algorithm
• Maximum theoretical performance: 9.6 GFLOPs (single core – SIMD instructions – 551 US$ CPU, 700 US$ icc)
• All-pairs algorithm: O(N^2) × timesteps
• We consider timestep=100 and N=1K, 10K

Compiler     GFLOPs          Efficiency        MFLOPs/US$
             1K      10K     1K      10K       1K    10K
gcc (5.1)    1.59    1.67    16.6%   17.4%     3     2
Intel        2.70    5.71    28.1%   59.5%     2     5
N-Body – OpenMP and MPI
• Maximum theoretical performance:
  57.6 GFLOPs (1 CPU – 6 cores)
  115 GFLOPs (2 CPUs – 12 cores)
• Intel compiler only (cost disregarded)
• OpenMP is easier
• Think parallel

Compiler          GFLOPs          Efficiency        MFLOPs/US$
                  1K      10K     1K      10K       1K    10K
OpenMP - 1 cpu    25.49   33.1    44.3%   57.5%     23    30
OpenMP - 2 cpus   35.50   63.24   30.8%   54.9%     32    57
MPI - 1 cpu       27.21   29.12   47.2%   50.6%     25    26
MPI - 2 cpus      50.27   59.93   43.6%   52.0%     46    54
N-Body – CUDA and OpenACC

• Maximum theoretical performance: 1581 GFLOPs (1 GPU – 512 cores – 499 US$)
• OpenACC is much, much easier
• Performance with and without fastmath (rsqrtf function), N=10K

Algorithm             GFLOPs    Efficiency   MFLOPs/US$
CUDA                  167.71    10.6%        336
OpenACC               147.57    9.3%         185
CUDA - fastmath       434.46    27.5%        871
OpenACC - fastmath    211.80    13.4%        265
CUDA reference        597.11    37.8%        1197
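The Efficiency and MFLOPs/US$ columns can be reproduced from the measured GFLOPs. The sketch below uses hypothetical function names. It assumes that efficiency means the fraction of theoretical peak and that the cost is the price of the computing unit. The CUDA rows are consistent with the 499 US$ GPU price; the OpenACC rows evidently use a different cost basis, which the slides do not state.

```c
#include <assert.h>

/* Fraction of the theoretical peak actually achieved, in percent. */
double percent_of_peak(double measured_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * measured_gflops / peak_gflops;
}

/* Price/performance in MFLOPs per US dollar. */
double mflops_per_dollar(double measured_gflops, double cost_usd)
{
    assert(cost_usd > 0.0);
    return 1000.0 * measured_gflops / cost_usd;
}
```

For the plain CUDA row, `percent_of_peak(167.71, 1581.0)` is about 10.6 and `mflops_per_dollar(167.71, 499.0)` is about 336, matching the table.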

Performance/Price (10K)
Programmers and tools make the difference!
[Bar chart, logarithmic axis from 0.1 to 10000. Series: GFLOPs, MFLOPs/$ (computing units), MFLOPs/$ (whole system). Bars: Sequential - gcc, Sequential - Intel, OpenMP - Intel (1 and 2 cpu), MPI - Intel (1 and 2 cpu), CUDA, OpenACC, CUDA - fastmath, OpenACC - fastmath, CUDA reference]

A real-world example
SeisSol earthquake simulation SW
The extensive optimization and the complete parallelization of the 70,000 lines of SeisSol code result in a peak performance of up to 1.42 petaflops.
This corresponds to 44.5 percent of SuperMUC’s theoretically available capacity.
Aspects we will not talk about
• We are not green: 1 week of molecular dynamics simulation on 512 cores = 3200 kWh, corresponding to
  • 1600 kg of CO2
  • 340 € energy bill
  • 13000 km by car
• A national supercomputing facility has a yearly CO2 footprint comparable to a takeoff of a Saturn V
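The conversion factors implied by these figures can be derived directly from the numbers on the slide; nothing below comes from an external source.

```latex
\[
\bar{P} = \frac{3200\,\text{kWh}}{7 \times 24\,\text{h}} \approx 19\,\text{kW}
\qquad
\frac{\bar{P}}{512\ \text{cores}} \approx 37\,\text{W per core}
\]
\[
\frac{1600\,\text{kg CO}_2}{3200\,\text{kWh}} = 0.5\,\tfrac{\text{kg CO}_2}{\text{kWh}}
\qquad
\frac{340\,\text{€}}{3200\,\text{kWh}} \approx 0.106\,\tfrac{\text{€}}{\text{kWh}}
\qquad
\frac{1600\,\text{kg CO}_2}{13000\,\text{km}} \approx 123\,\tfrac{\text{g CO}_2}{\text{km}}
\]
```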
Today
Nvidia Jetson Dev Kit: 59 $, 472 GFLOPs FP16, 10 W, 45x70 mm
https://developer.nvidia.com/embedded/jetson-modules
Example: Functional Magnetic Resonance Imaging
• The science on that is pretty well established. They knew how to take the data that was coming from the MRI, and they could compute on it and create a model of what’s going on inside the brain. But in 2012, when we started the project, they estimated it would take 44 years on their cluster.
• They parallelized their code and saw huge increases in performance.
• But they also looked at it algorithmically, with machine learning and AI.
• They put it all together and ended up with a 10,000X increase in performance. They went from something requiring a supercomputing project at a national lab to something that could be done clinically inside a hospital in a couple of minutes.
https://www.princeton.edu/news/2017/02/23/princeton-intel-collaboration-breaks-new-ground-studies-brain
https://www.hpcwire.com/2017/06/08/code-modernization-bringing-codes-parallel-age/
Again on SW
https://www.nextplatform.com/2021/12/06/stacking-up-amd-mi200-versus-nvidia-a100-compute-engines/?mc_cid=11aeb90192&mc_eid=e50c89e962
Again on SW
Libraries
Conclusions
The efficient exploitation of current heterogeneous HPC solutions requires a good understanding of HW and SW features (architectures, instruction sets, SDKs, …)
• Not only HW
• Skilled developers
• State-of-the-art software libraries and programming tools
Good tools and developers are worth the money
The course aims at presenting the basics to let you become good HPC developers!
