Overview of Parallel Computing
Shawn T. Brown
Senior Scientific Specialist, Pittsburgh Supercomputing Center
Overview:
Why parallel computing?
Parallel computing architectures
Parallel programming languages
Scalability of parallel programs
The most obvious and useful way to do this is to build bigger computers from collections of smaller ones.
But this is not the only way to exploit parallelism.
There are chips that already have 4-way SSE, with more coming.
Parallelism in a desktop
This presentation is being given on a parallel computer!
Multi-core chips
Cramming multiple cores into a socket allows vendors to provide solutions that offer more computational performance for less cost, both in money and in power.
Intel announced an 80-core tiled research architecture a few months back, and a new MIT startup is making 60-core tiled architectures. We are likely to see 16-32 cores per socket in the next 10 years.
Parallel disks
There are lots of applications for which several TB of data must be stored for analysis. Build a large filesystem from a collection of smaller hard drives.
Parallel Supercomputers
Building larger computers from smaller ones
Connected together by some sort of fast network
Infiniband, Myrinet, Seastar, etc.
Shared-Memory Processing
Each processor can access the entire data space
Pros
Easier to program
Amenable to automatic parallelism
Can be used to run large-memory serial programs
Cons
Expensive
Difficult to implement on the hardware level
Limited number of processors (currently around 512)
Shared-Memory Processing
Programming
OpenMP, Pthreads, Shmem
Columbia (NASA)
20 512-processor Altix computers
Combined total of 10,240 processors
Examples
Multiprocessor Desktops
Xeons vs. Opterons
Multi-core processors
SGI Altix
Intel Itanium 2 dual-core processors linked by the so-called NUMAFlex interconnect
Up to 512 processors (1,024 cores) sharing up to 128 TB of memory
Distributed-Memory Processing
Each node in the computer has a locally addressable memory space
The computers are connected together via some high-speed network
Infiniband, Myrinet, Giganet, etc.
Pros
Cons
Harder to program
More difficult to manage
Memory management
Capability computing
Creating large supercomputers to enable computation on large scale
Running the entire machine to perform one task
A good, fast interconnect and balanced performance are important
Usually specialized hardware and operating systems
Networks
The performance of a distributed memory architecture is highly dependent on the speed and quality of the interconnect.
Latency
The time to send a 0 byte packet of data on the network
Bandwidth
The rate at which a very large packet of information can be sent
Topology
The configuration of the network that determines how many processing units are directly connected.
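As a rough illustration (not in the original slides), a common first-order model combines latency and bandwidth: the time to send an n-byte message is approximately

$$ T(n) \;\approx\; \alpha + \frac{n}{\beta} $$

where $\alpha$ is the latency and $\beta$ the bandwidth; small messages are dominated by $\alpha$, large ones by $n/\beta$.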
Networks
Commonly overlooked but important:
How much outstanding data can be on the network at a given time.
Highly scalable codes use asynchronous communication schemes, which require a large amount of data to be on the network at a given time.
Balance
If either the network or the compute nodes perform way out of proportion to the other, it makes for an unbalanced system.
Networks
Infiniband, Myrinet, GigE
Networks designed to connect a smaller number of processors
Clusters
Thunderbird (Sandia National Labs)
Dell PowerEdge Series Capacity Cluster
4096 dual 3.6 GHz Intel Xeon processors
6 GB DDR-2 RAM per node
4x InfiniBand interconnect
There is a catch
Harnessing this increased power requires advanced software development
That is why you are here and interested in parallel computers. Whether it be the PS3 or the CRAY-XT3, writing highly scalable parallel code is a requirement.
With multi-core chips and bigger distributed machines, it is only going to get more difficult for beginning programmers to write highly scalable software. Hackers need not apply!
Message Passing
Users make calls that explicitly share information between execution entities
OpenMP (www.openmp.org)
Protocol designed to provide automatic parallelization through compiler pragmas
Mainly loop-driven parallelism
Best suited to desktop and small SMP computers
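A minimal sketch of the loop-level parallelism OpenMP provides (illustrative only, not taken from the slides; the array names are arbitrary):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void)
    {
        /* the pragma asks the compiler to split the iterations across threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }

Compile with the compiler's OpenMP flag (e.g. -fopenmp for GCC); without it the pragma is ignored and the loop runs serially.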
One-sided vs Two-sided
Whether one or both processes are involved in the communication.
[Diagram: a two-sided message (message id + data payload) traversing the network interface, host CPU, and memory]
MPI
Message Passing Interface
A message-passing library specification
An extended message-passing model
Not a language or compiler specification
Not a specific implementation or product
MPI is a standard
A list of rules and specifications
How it is implemented is left up to the individual implementations
Virtually all parallel machines in the world support an implementation of MPI
Many more sophisticated parallel programming languages are written on top of MPI
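A minimal two-sided exchange using the standard MPI calls (an illustrative sketch, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* two-sided: this send must be matched by a receive on rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.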
MPI Implementations
Because MPI is a standard, there are several implementations
MPICH - http://www-unix.mcs.anl.gov/mpi/mpich1/
Freely available, portable implementation
Available on everything
OpenMPI - http://www.open-mpi.org/
Includes the once popular LAM-MPI
MPI-2
Supports one-sided puts and gets to remote memory, as well as parallel I/O and dynamic process management
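A sketch of an MPI-2 one-sided put (illustrative only; assumes at least two ranks): rank 0 writes into memory that the other ranks have exposed through a window, without the target posting a receive.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every rank exposes one int as a remotely accessible window */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            int value = 42;
            /* one-sided: the target rank makes no matching call */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);   /* completes the transfer on both sides */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }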
Shmem
Efficient implementation of globally shared pointers and one-sided data management
Inherent support for atomic memory operations
Also supports collectives, generally with less overhead than MPI
Global address space: any thread/process may directly read/write data allocated by another
Partitioned: data is designated as local or global
[Diagram: partitioned global address space across processes p0, p1, ..., pn, each holding local (l), global (g), and private variables such as x and y]
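A small sketch in the OpenSHMEM flavor of the API (the names below follow the OpenSHMEM spec and may differ from vendor SHMEM libraries): each PE puts its id directly into a symmetric variable on its neighbor.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        static int target;          /* symmetric: same address on every PE */

        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        int src = me;
        /* one-sided put into the next PE's copy of 'target' */
        shmem_int_put(&target, &src, 1, (me + 1) % npes);
        shmem_barrier_all();        /* also ensures remote puts are visible */

        printf("PE %d received %d\n", me, target);
        shmem_finalize();
        return 0;
    }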
Strong Scalability
Can we solve a fixed-size problem faster by adding processors?
[Plot: scaling of performance with processor count]
Weak Scalability
How big a problem can we solve as we add processors?
Memory Scaling
We just looked at Performance Scaling
The speedup in execution time.
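For reference (standard definitions, not from the slides): with $T_1$ the one-processor time and $T_p$ the time on $p$ processors,

$$ S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, $$

where $S$ is the speedup and $E$ the parallel efficiency; strong scaling asks how $S(p)$ grows for a fixed problem size.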
When one programs for a distributed architecture, the memory per node is terribly important.
Replicated Memory (BAD)
Identical data that must be stored on every processor
Improving Scalability
Serial portions of your code limit the scalability
Amdahl's Law
If there is an x% serial component, speedup cannot be better than 100/x.
Variants:
If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
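Written out (the standard form of Amdahl's law, added here for reference): if a fraction $s$ of the run time is serial,

$$ S(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,} \;\le\; \frac{1}{s}, $$

so with $s = x/100$ the speedup can never exceed $100/x$, however many processors $N$ you use.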
Problem decomposition
A parallel algorithm can only be as fast as the slowest chunk. It is very important that one recognize how the algorithm can be broken apart.
Inherent to the algorithm
Decisions must be made for performance
Communication
Transmitting data between processors takes time.
Asynchronous vs. synchronous:
Whether computation can be done while data is on its way to destinations.
Scalable communication
Asynchronous: overlap communication and computation to hide the communication time (see the sketch below)
Nearest-neighbor: asymptotically linear as the number of processors grows
With no barriers: stopping computation is bad!
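A sketch of the asynchronous, nearest-neighbor pattern with MPI nonblocking calls (illustrative only; halo_exchange, the 1-D array layout, and the neighbor ranks are assumptions, not from the slides):

    #include <mpi.h>

    /* Exchange one ghost cell with the left and right neighbors, then
       compute on interior points while the messages are in flight. */
    void halo_exchange(double *u, int n, int left, int right, MPI_Comm comm)
    {
        MPI_Request req[4];

        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(&u[n - 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(&u[n - 2], 1, MPI_DOUBLE, right, 0, comm, &req[3]);

        /* ... update u[2] .. u[n-3] here: they do not depend on the halo ... */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        /* only now touch u[1] and u[n-2], which needed the neighbor data */
    }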
Three approaches:
Chunk: wait for the 2nd-dimension FFTs to finish; minimize # messages
Slab: wait for the chunk of rows destined for one proc to finish; overlap with computation (slab = all rows in a single plane with the same destination)
Pencil: send each row as it completes; maximize overlap and match the natural layout (pencil = 1 row)
Joint work with Chris Bell, Rajesh Nishtala, and Dan Bonachea. Reproduced with permission from Kathy Yelick (UC Berkeley).
[Plot: NAS FT performance (MFlops per processor) on Myrinet (64 procs), InfiniBand (256), Elan3 (256, 512), and Elan4 (256, 512), reaching roughly 0.5 Tflops]
Slab is always best for MPI; small-message cost too high
Pencil is always best for UPC; more overlap
Reproduced with permission from Kathy Yelick (UC Berkeley).
Load imbalance
Your parallel algorithm can only go as fast as its slowest parallel work. Load imbalance occurs when one parallel component has more work to do than the others.
Load Balancing
There are strategies to mitigate load imbalance. Let's look at a loop:
for (i = 0; i < N; i++) {
    /* do work that scales with N */
}
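One way to spread such a loop more evenly (an illustrative sketch, not from the slides; do_work is a hypothetical routine whose cost varies by iteration) is to hand iterations out dynamically instead of in fixed blocks, here with OpenMP:

    #include <omp.h>
    #include <stdio.h>

    #define N 10000

    /* hypothetical task whose cost grows with i, so a static split would
       leave whichever thread owns the last block with the most work */
    static double do_work(int i)
    {
        double s = 0.0;
        for (int j = 0; j <= i; j++)
            s += 1.0 / (j + 1.0);
        return s;
    }

    int main(void)
    {
        double total = 0.0;

        /* dynamic schedule: iterations are dealt out a few at a time, so
           threads that finish cheap iterations come back for more */
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += do_work(i);

        printf("total = %f\n", total);
        return 0;
    }

The bag-of-tasks model described next applies the same idea across distributed-memory processes.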
Bag of Tasks
The so-called bag of tasks is a way to divide the work dynamically.
Also called Master/Worker or Server/Client models.
Essentially...
One process acts as the server
It divides up the initial work amongst the rest of the processes (workers)
When a worker is done with its assigned work, it sends back its processed result
If there is more work to do, the server sends it out
Continue until no work is left
[Diagram: a master process connected to workers 1, 2, ...]
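A sketch of this model in MPI (illustrative only; do_task, the tags, and the task count are assumptions; it also assumes more tasks than workers):

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static int do_task(int id) { return id * id; }   /* hypothetical work */

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                              /* master / server */
            int next = 0, active = 0, result;
            MPI_Status st;

            /* seed every worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++, active++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

            /* hand remaining tasks to whichever worker reports back */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    int stop = 0;
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                                      /* worker / client */
            int task, result;
            MPI_Status st;

            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                result = do_task(task);
                MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }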
Back to example
The previous model is an example of dynamic load balancing.
Providing some means to morph the work distribution to the problem at hand.
Increasing scalability
Minimize serial sections of code
Beat Amdahl's law