Intro Parallel Programming Paradigms

The document provides an overview of parallel programming paradigms, focusing on performance determinants such as CPU speed, data movement, and workload distribution. It discusses various architectures, including distributed and shared memory, and highlights the importance of MPI (Message Passing Interface) for process communication in parallel computing. Additionally, it emphasizes the need for application-specific solutions and the significance of benchmarking and understanding hardware properties.

Overview on Parallel Programming Paradigms
Ivan Girotto – igirotto@ictp.it
Information & Communication Technology Section (ICTS)
International Centre for Theoretical Physics (ICTP)
What Determines Performance?
• How fast is my CPU?
• How fast can I move data around?
• How well can I split work into pieces?
  – Very application specific: never assume that a good solution for one problem is as good a solution for another
  – Always run benchmarks to understand the requirements of your applications and the properties of your hardware
  – Respect Amdahl's law

Parallel Architectures
• Distributed Memory: multiple nodes, each with its own memory and CPUs, connected through a network
• Shared Memory: multiple CPUs attached to a single shared memory
[Diagram: distributed-memory nodes linked by a network vs. CPUs sharing one memory.]
Multiple Socket CPUs

Paradigm at Shared Memory /1
[Diagram: three threads, each with its own program counter (PC) and private data, all accessing a common region of shared data.]
Paradigm at Shared Memory /2
• Usually indicated as Multithreading Programming
• Commonly implemented in scientific computing using the OpenMP standard (directive based)
• Thread management overhead
• Limited scalability
• Write access to shared data can easily lead to race conditions and incorrect data (see the sketch below)
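For illustration only, a minimal OpenMP sketch in C (not part of the original slides; the file name and compiler flag are assumptions): the reduction clause gives every thread a private partial sum, avoiding the race condition that direct updates of a shared variable would cause.

/* sum.c - hypothetical example; compile with: gcc -fopenmp sum.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* each thread accumulates a private copy of sum; OpenMP combines them */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}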

Parallel Programming Paradigms
• MPI (Message Passing Interface)
  – A standard defined for portable message passing
  – It is available in the form of a library which includes interfaces for expressing the data exchange among processes
  – A framework is provided for spawning the independent processes (i.e., mpirun)
  – Process communication is via the network
  – It works on both shared and distributed memory architectures
  – Ideal for distributing memory among compute nodes (a minimal example is sketched below)
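As a hedged illustration (not taken from the slides), this is the skeleton nearly every MPI program starts from: each process queries its own rank and the total number of processes.

/* hello_mpi.c - illustrative sketch; build with: mpicc hello_mpi.c -o hello_mpi
 * run with: mpirun -np 4 ./hello_mpi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID of this process     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes     */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down the MPI environment */
    return 0;
}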

MPI Program Design
• Multiple, separate processes (local and/or remote) run concurrently; they are coordinated and exchange data through "messages" => a "share nothing" parallelization
• Best for coarse-grained parallelization: distribute large data sets; replicate small data
• Minimize communication, or overlap communication and computing for efficiency => Amdahl's law (a non-blocking sketch follows)
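One common way to overlap communication and computing is with non-blocking point-to-point calls; a hedged, illustrative sketch (the function and variable names are assumptions, not from the slides):

/* overlap.c - illustrative only: start a halo exchange, compute on interior
 * data while the messages are in flight, wait only when the halo is needed */
#include <mpi.h>

void exchange_and_compute(double *halo_out, double *halo_in, int n,
                          int neighbour, double *interior, int m) {
    MPI_Request reqs[2];

    MPI_Irecv(halo_in,  n, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < m; i++)    /* useful work that does not need the halo */
        interior[i] *= 2.0;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* halo data is now valid    */
}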
What is MPI?
• A standard, i.e. there is a document describing how the API (constants & subroutines) are named and should behave; multiple "levels": MPI-1 (basic), MPI-2 (advanced), MPI-3 (new)
• A library or API to hide the details of low-level communication hardware and how to use it
• Implemented by multiple vendors
• Open source and commercial versions
• Vendor specific versions for certain hardware
• Not binary compatible between implementations

Programming Parallel Paradigms
• Are the tools we use to express the parallelism for a given architecture
• They differ in how programmers can manage and define key features like:
  – parallel regions
  – concurrency
  – process communication
  – synchronism
MPI Inter-Process Communications
• MPI on multi-core CPUs, 1 MPI process per core: stresses the network and stresses the OS
• Many MPI codes (QE) are based on ALLTOALL / MPI_BCAST: messages = processes × processes
• We need to exploit the hierarchy: re-design applications to mix message passing and multi-threading
[Diagram: nodes connected by a network, one MPI process per core.]
The Hybrid Mode
[Diagram: four nodes connected by a network, illustrating the hybrid layout of message passing between nodes and multi-threading within a node.]
The Intel Xeon E5-2665 (Sandy Bridge-EP, 2.4 GHz), ~8 GBytes of memory:

mpirun -np 8 pw-gpu.x -inp input file


The Intel Xeon E5-2665 (Sandy Bridge-EP, 2.4 GHz), ~8 GBytes of memory:

mpirun -np 1 pw-gpu.x -inp input file


The Intel Xeon E5-2665 (Sandy Bridge-EP, 2.4 GHz), ~8 GBytes of memory:
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=$OMP_NUM_THREADS
mpirun -np 2 pw-gpu.x -inp input file
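These three launch lines appear to contrast a pure-MPI run (8 ranks), a serial run (1 rank) and a hybrid run (2 ranks × 4 OpenMP threads) on the same node. As an assumed illustration of what a hybrid code looks like inside (not taken from the slides):

/* hybrid.c - hypothetical MPI + OpenMP sketch; each rank spawns
 * OMP_NUM_THREADS threads; only the master thread calls MPI (FUNNELED) */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}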

Workload Management: system level, high-throughput

Python: ensemble simulations, workflows

MPI: domain partition

OpenMP: node-level shared memory

CUDA/OpenCL/OpenACC: floating point accelerators

Type of Parallelism
• Functional (or task) parallelism: different people are performing different tasks at the same time

• Data parallelism: different people are performing the same task, but on different, equivalent and independent objects (see the sketch below)
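As a hedged illustration of data parallelism in the SPMD style used later in these slides (the function and variable names are my own, not from the original): each process works out which slice of the data it owns from its rank and the total number of processes.

/* block.c - illustrative partition of n items over `size` processes;
 * rank and size would come from MPI_Comm_rank / MPI_Comm_size          */
void my_block(int n, int rank, int size, int *first, int *last) {
    int base = n / size;              /* minimum items per process        */
    int rest = n % size;              /* leftover items spread over ranks */
    *first = rank * base + (rank < rest ? rank : rest);
    *last  = *first + base + (rank < rest ? 1 : 0);   /* exclusive end    */
}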

Process Interactions
• The effective speed-up obtained by the parallelization depends on the amount of overhead we introduce in making the algorithm parallel
• There are mainly two key sources of overhead:
  1. Time spent in inter-process interactions (communication)
  2. Time some process may spend being idle (synchronization)

Effect of Load-Unbalancing
[Diagram: processes reach the synchronization point ("all here?") at different times, so the faster ones sit idle waiting for the slowest.]

Mapping and Synchronization

Amdahl's Law
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

Example: 1,000,000 cores, P = 0.999999 (serial fraction = 0.000001).
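Spelling the arithmetic out with the full Amdahl expression (my own worked numbers, not on the slide):

% Amdahl's law for parallel fraction P on p processing elements
\[
  S(p) = \frac{1}{(1-P) + P/p}, \qquad
  S(10^{6}) = \frac{1}{10^{-6} + 0.999999\cdot 10^{-6}} \approx 5\times 10^{5}
\]
% i.e. even with a serial fraction of only 0.000001, one million cores
% deliver roughly half of the asymptotic limit 1/(1-P) = 10^6.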

How Do We Evaluate the Improvement?
• We want to estimate the amount of the introduced overhead => T_o = n_pes * T_P − T_S (with T_S the serial time and T_P the parallel time on n_pes processing elements)
• But to quantify the improvement we use the term Speedup:

  S_P = T_S / T_P
Speedup  

Efficiency
• Only embarrassingly parallel algorithms can obtain an ideal Speedup
• The Efficiency is a measure of the fraction of time for which a processing element is usefully employed (a short worked example follows):

  E_P = S_P / p
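A worked example with made-up numbers: if a serial run takes T_S = 120 s and the same job on p = 8 processes takes T_P = 20 s, then S_P = 120 / 20 = 6 and E_P = 6 / 8 = 0.75, i.e. each process is usefully employed for 75% of the time; the remaining 25% is lost to communication and idle waiting.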
   
Efficiency  

Amdahl's Law And Real Life
• The speedup of a parallel program is limited by the sequential fraction of the program
• This assumes perfect scaling and no overhead

Scaling - QE-CP on Fermi BGQ @ CINECA

Easy Parallel Computing
• Farming, embarrassingly parallel
  – Executing multiple instances of the same program with different inputs/initial conditions
  – Reading large binary files by splitting the workload among processes
  – Searching elements in large data-sets
  – Other parallel executions of embarrassingly parallel problems (no communication among tasks)

• Ensemble simulations (weather forecast)

• Parameter space (find the best wing shape)

Single Program on Multiple Data
• Running the same program (set of instructions) on different data
• Same model adopted by the MPI library
• A parallel tool is needed to handle the different processes working in parallel
• The MPI library provides the mpirun application to execute parallel instances of the same program
$ mpirun -np 12 my_program.x

[Diagram: the 12 processes are distributed across the two nodes mynode01 and mynode02.]
[igirotto@mynode01 ~]$ mpirun -np 12 /bin/hostname
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02

Parallel Operations in Practice
• Reading in parallel and computing in parallel are always allowed
• Parallel writing is extremely dangerous!
• To control the parallel flow, each process should be unique and identifiable (ID)
• The Open MPI implementation of the MPI library provides a series of environment variables defined for each MPI process
OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI_COMM_WORLD

OMPI_COMM_WORLD_RANK - the MPI rank of this process

OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.

OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.

OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.

OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node looking across ALL jobs.

http://www.open-mpi.org
In  Python  
import os
myid = os.environ['OMPI_COMM_WORLD_RANK']
[...]

In  BASH  
#!/bin/bash
myid=${OMPI_COMM_WORLD_RANK}
[...]

[igirotto@mynode01 ~]$ mpirun ./myprogram.[py/sh...]

Possible Applications
• Executing multiple instances of the same program with different inputs/initial conditions
• Reading large binary files by splitting the workload among processes
• Searching elements in large data-sets
• Other parallel executions of embarrassingly parallel problems (no communication among tasks)
Conclusions
• Task Farming is a simple model to parallelize simple problems that can be divided into independent tasks
• The mpirun application helps to easily run multiple processes, and includes the environment setting
• Load balancing remains a main problem, but moving from serial to parallel processing can substantially speed up simulation time
Task Farming
• Many independent programs (tasks) running at once
  – each task can be serial or parallel
  – "independent" means they don't communicate directly
  – processes possibly driven by the mpirun framework

[igirotto@localhost]$ more my_shell_wrapper.sh


#!/bin/bash
#example for the OpenMPI implementation
./prog.x --input input_${OMPI_COMM_WORLD_RANK}.dat

[igirotto@localhost]$ mpirun -np 400 ./my_shell_wrapper.sh

Master/Slave
[Diagram: a master process coordinating four worker processes (W1–W4).]
Parallel I/O
[Diagram: processes P0–P4 sharing a single file system and a single I/O bandwidth.]
Parallel I/O
[Diagram: processes P0–P3 each with its own I/O bandwidth to a separate file system.]
Parallel I/O
[Diagram: processes P0–P3 performing I/O through MPI I/O and parallel I/O libraries (HDF5, NetCDF, etc.) on top of a parallel file system.]

What If You Want to Learn How to Program All This?!

• Introductory School on Parallel Programming and Parallel Architecture for High Performance Computing | (smr 2877)
• 3 October 2016 - 14 October 2016

What If You Want to Master All This?!


