
Overview of Parallel Computing

Shawn T. Brown
Senior Scientific Specialist Pittsburgh Supercomputing Center

Overview:
Why parallel computing?
Parallel computing architectures
Parallel programming languages
Scalability of parallel programs

Why parallel computing?


At some point, building a more powerful computer with a single set of components requires too much effort.

Scientific applications need more!

Parallelism is the way to get more power!


Building more power from smaller units.
Build up memory, computing, and disk to make a computer that is greater than the sum of its parts

The most obvious and useful way to do this is to build bigger computers from collections of smaller ones
But this is not the only way to exploit parallelism.

Parallelism inside the CPU


On single chips
SSE SIMD instructions
Allows a single CPU to execute the same instruction on multiple pieces of data at once
The Opteron has 2-way SSE
Peak performance is 2X the clock rate because it can theoretically perform two operations per cycle

There are chips that already have 4-way SSE, with more coming.
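As a rough illustration of what 2-way SSE looks like at the instruction level, here is a minimal sketch using SSE2 double-precision intrinsics; the function name and the assumption that the array length is a multiple of 2 are purely for illustration:

/* Minimal sketch: add two arrays of doubles two elements at a time with
   SSE2 intrinsics.  Assumes n is a multiple of 2 for simplicity. */
#include <emmintrin.h>

void add_arrays(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);   /* load two doubles */
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_add_pd(va, vb);    /* one instruction, two additions */
        _mm_storeu_pd(&c[i], vc);
    }
}

In practice the compiler often generates such instructions automatically; the point is that each SSE instruction does two double-precision operations per cycle.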

Parallelism in a desktop
This presentation is being given on a parallel computer!
Multi-core chips
Cramming multiple cores into a socket allows vendors to provide solutions that offer more computational performance at lower cost, both in money and in power.

Quad-core chips just starting to come out


AMD - Barcelona
Intel - Penryn

Intel announced an 80-core tiled research architecture a few months back, and a new MIT startup is making 60-core tiled architectures. Core counts are likely to reach 16-32 per socket in the next 10 years.

Parallel disks
There are lots of applications for which several TB of data need to be stored for analysis. Build a large filesystem from a collection of smaller hard drives.

Parallel Supercomputers
Building larger computers from smaller ones
Connected together by some sort of fast network
InfiniBand, Myrinet, SeaStar, etc.

Wide variety of architectures


From the small laboratory cluster to the biggest supercomputers in the world, parallel computing is the way to get more power!

Shared-Memory Processing
Each processor can access the entire data space
Pros
Easier to program
Amenable to automatic parallelism
Can be used to run large-memory serial programs

Cons
Expensive
Difficult to implement at the hardware level
Limited number of processors (currently around 512)

Shared-Memory Processing
Programming
OpenMP, Pthreads, Shmem

Examples
Multiprocessor desktops
Xeon vs. Opteron, multi-core processors

SGI Altix
Intel Itanium 2 dual-core processors linked by the so-called NUMAflex interconnect
Up to 512 processors (1,024 cores) sharing up to 128 TB of memory

Columbia (NASA)
Twenty 512-processor Altix computers, a combined total of 10,240 processors

Distributed Memory Machines


Each node in the computer has a locally addressable memory space
The nodes are connected together via some high-speed network
InfiniBand, Myrinet, Giganet, etc.

Pros

Really large machines
Cheaper to build and run

Cons
Harder to program
More difficult to manage
Memory management falls to the programmer

Capacity vs. Capability


Capacity computing
Creating large supercomputers to facilitate large throughput of small parallel jobs
Cheaper, slower interconnects
Clusters running Linux, OS X, or Windows
Easy to build

Capability computing
Creating large supercomputers to enable computation on large scale
Running the entire machine to perform one task
A good, fast interconnect and balanced performance are important
Usually specialized hardware and operating systems

Networks

The performance of a distributed memory architecture is highly dependent on the speed and quality of the interconnect.
Latency
The time to send a 0 byte packet of data on the network

Bandwidth
The rate at which a very large packet of information can be sent

Topology
The configuration of the network that determines how many processing units are directly connected.
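A standard first-order cost model (not from the original slides) ties latency and bandwidth together: the time to send an $n$-byte message is roughly

T(n) \approx \alpha + \frac{n}{\beta}

where $\alpha$ is the latency and $\beta$ is the bandwidth. Small messages are dominated by latency, large ones by bandwidth.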

Networks
Commonly overlooked, important things
How much outstanding data can be on the network at a given time.
Highly scalable codes use asynchronous communication schemes, which require a large amount of data to be on the network at a given time.

Balance
If either the network or the compute nodes perform way out of proportion to the other, it makes for an unbalanced system.

Hardware level support


Some routers support features like network memory and hardware-level operations, which can greatly increase performance.

Networks
InfiniBand, Myrinet, GigE
Networks designed more for running on smaller numbers of processors

SeaStar (Cray), Federation (IBM), Constellation (Sun)


Networks designed to scale to tens of thousands of processors.

Clusters
Thunderbird (Sandia National Labs)
Dell PowerEdge series capacity cluster
4,096 dual 3.6 GHz Intel Xeon processors
6 GB DDR-2 RAM per node
4x InfiniBand interconnect

System X (Virginia Tech)


1,100 dual 2.3 GHz PowerPC 970FX processors
4 GB ECC DDR400 (PC3200) RAM
80 GB S-ATA hard disk drive
One Mellanox Cougar InfiniBand 4x HCA
Running Mac OS X

MPP (Massively Parallel Processing)


Red Storm (Sandia National Labs)
12,960 dual-core 2.4 GHz Opteron processors
4 GB of RAM per processor
Proprietary SeaStar interconnect provides machine-wide scalability

IBM BlueGene/L (LLNL)


131,072 700 MHz processors
256 MB of RAM per processor
Compute speed balanced with the interconnect

There is a catch
Harnessing this increased power requires advanced software development
That is why you are here and interested in parallel computers. Whether it be the PS3 or the Cray XT3, writing highly scalable parallel code is a requirement.
With multi-core chips and ever bigger distributed machines, it is only going to get more difficult for beginning programmers to write highly scalable software. Hackers need not apply!

Parallel Programming Models


Shared Memory
Multiple processors sharing the same memory space

Message Passing
Users make calls that explicitly share information between execution entities

Remote Memory Access


Processors can directly access memory on another processor

These models are then used to build more sophisticated models


Loop-driven data parallelism
Function-driven parallelism (task-level)

Shared Memory Programming


SysV memory manipulation
One can explicitly create and manipulate shared memory segments.

Pthreads (Posix Threads)


Lower level Unix library to build multi-threaded programs
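A minimal sketch of the Pthreads style (not from the original slides); the thread count and the work routine are illustrative placeholders:

/* Spawn a few POSIX threads, each running work(), then join them. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static void *work(void *arg)
{
    long id = (long) arg;                 /* thread index passed from main() */
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, work, (void *) i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Compile with -pthread.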

OpenMP (www.openmp.org)
A specification designed to provide automatic parallelization through compiler pragmas
Mainly loop-driven parallelism
Best suited to desktops and small SMP computers
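A minimal sketch of OpenMP's loop-driven style (not from the original slides); the scale() routine is a made-up example, and the single pragma asks the compiler to split the iterations across threads:

/* Scale an array in parallel: OpenMP divides the iterations among threads. */
#include <omp.h>

void scale(double *a, double s, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= s;
}

Built with an OpenMP flag (e.g. -fopenmp); without it, the pragma is ignored and the loop simply runs serially.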

Caution: Race Conditions


A race condition occurs when two threads change the same memory location at the same time, so the result depends on which thread gets there first.
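A minimal OpenMP sketch of the problem and one standard fix, the reduction clause; the counter and loop bound are illustrative:

/* The first loop has a race: many threads do an unsynchronized
   read-modify-write on the same variable, so increments can be lost.
   The second loop gives each thread a private copy and combines them. */
#include <stdio.h>

int main(void)
{
    long racy = 0, safe = 0, n = 1000000;

    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        racy++;                          /* race condition */

    #pragma omp parallel for reduction(+:safe)
    for (long i = 0; i < n; i++)
        safe++;                          /* correct: per-thread copies reduced at the end */

    printf("racy = %ld (often < %ld), safe = %ld\n", racy, n, safe);
    return 0;
}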

Distributed Memory Programming


No matter what the model, data must be passed from one memory space to the next.
Synchronous vs. asynchronous communication
Whether computation and communication are mutually exclusive
Whether computation and communication are mutually exclusive

One-sided vs Two-sided
Whether one or both processes are involved in the communication.
[Figure: a two-sided message carries a message id plus the data payload and must be matched by a receive on the host CPU at the destination; a one-sided put carries a destination address plus the data payload, so the network interface can deposit it directly into remote memory without involving the host CPU.]
Asynchronous and one-sided communication are the most scalable.

MPI
Message Passing Interface
A message-passing library specification
An extended message-passing model
Not a language or compiler specification
Not a specific implementation or product

MPI is a standard
A list of rules and specifications
How it is implemented is left up to the individual implementations
Virtually all parallel machines in the world support an implementation of MPI
Many more sophisticated parallel programming languages are written on top of MPI
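A minimal sketch of two-sided MPI message passing (not code from the talk): rank 0 sends an integer to rank 1; error handling is omitted for brevity:

/* Typically built and launched with something like: mpicc send.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}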

MPI Implementations
Because MPI is a standard, there are several implementations
MPICH - http://www-unix.mcs.anl.gov/mpi/mpich1/
Freely available, portable implementation
Available on everything

OpenMPI - http://www.open-mpi.org/
Includes the once popular LAM-MPI

Vendor specific implementations


CRAY, SGI, IBM

Remote Memory Access


Implemented as puts and gets into and out of remote memory locations
Sophisticated under-the-hood memory management.

MPI-2
Supports one-sided puts and gets to remote memory, as well as parallel I/O and dynamic process management
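A minimal sketch of MPI-2 one-sided communication (an illustration, not code from the talk), assuming a single-int window for simplicity: rank 0 puts a value directly into memory exposed by rank 1, which never posts a receive:

/* Each rank exposes one int through an MPI window; rank 0 writes rank 1's copy. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = -1, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&buf, (MPI_Aint) sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open an access epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                       /* complete the put */

    if (rank == 1)
        printf("rank 1 now holds %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}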

Shmem
Efficient implementation of globally shared pointers and one-sided data management
Inherent support for atomic memory operations
Also supports collectives, generally with less overhead than MPI

ARMCI (Aggregate Remote Memory Copy Interface)


A remote memory access interface that is highly portable, supporting many of the features of Shmem, with some optimization features

Partitioned Global Address Space



Global address space

Global address space: any thread/process may directly read/write data allocated by another
Partitioned: data is designated as local or global
[Figure: a global address space spanning threads p0 through pn, with private variables (x, y, l) on each thread and globally shared data (g) visible to all.]

By default, object heaps are shared and program stacks are private.

SPMD languages: UPC, CAF, and Titanium


All three use an SPMD execution model
The emphasis in this talk is on UPC and Titanium (which is based on Java)

Dynamic languages: X10, Fortress, Chapel and Charm++


Slide reproduced with permission from Kathy Yelick (U.C. Berkeley)

Other Powerful languages


Charm++
Object-oriented parallel extension to C++
A run-time engine allows work to be scheduled across the computer
Highly dynamic, with extreme load-balancing capabilities
Completely asynchronous
NAMD, a very popular MD simulation engine, is written in Charm++

Other Powerful Languages


Portals
Completely one-sided communication scheme
Zero-copy, with OS and application bypass
Designed to have MPI and other languages layered on top
Intended from the ground up to scale to 10,000s of processors

Now you have written a parallel code: how good is it?


Parallel performance is defined in terms of scalability
Strong Scalability
Can we get faster for a fixed problem size?
[Figure: strong-scaling curve for LeanCP (32 water molecules at 70 Ry) on BigBen (Cray XT3), real vs. ideal speedup against the number of processors, up to ~2500.]

Now you have written a parallel code: how good is it?


Parallel performance is defined in terms of scalability

Weak Scalability
How big of a problem can we do?

Memory Scaling
We just looked at Performance Scaling
The speedup in execution time.

When one programs for a distributed architecture, the memory per node is terribly important.
Replicated Memory (BAD)
Identical data that must be stored on every processor

Distributed Memory (GOOD)


Data structures that have been broken down and stored across nodes.

Improving Scalability
Serial portions of your code limit the scalability

Amdahl's Law
If there is an x% serial component, the speedup cannot be better than 100/x.
Variants
If you decompose a problem into many parts, the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
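Stated as a formula (a standard form of the law, added here for reference): if $p$ is the parallelizable fraction of the runtime and $N$ is the number of processors, the speedup is

S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

so a 5% serial component ($p = 0.95$) caps the speedup at 20, in line with the 100/x rule above.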

Problem decomposition
A parallel algorithm can only be as fast as the slowest chunk. It is very important to recognize how the algorithm can be broken apart.
Some decomposition is inherent to the algorithm
Other decisions must be made for performance reasons

Communication
Transmitting data between processors takes time.
Asynchronous vs. synchronous
Whether computation can be done while data is on its way to destinations.

Barriers and Synchronization


These say stop and enforce sequential execution in portions of code.

Global vs. Nearest Neighbor Communications


Global communication involves large sets of processors
Nearest-neighbor communication is point-to-point communication between processors close to each other

Scalable communication
Asynchronous
Overlap communication and computation to hide the communication time (see the sketch below)
Nearest-neighbor
Communication cost stays asymptotically linear as the number of processors grows
No barriers
Stopping computation is bad!
Asynchronous, nearest-neighbor, barrier-free communication is the most scalable way to write code.
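As an illustration of the first point, here is a minimal sketch (not from the original slides) of overlapping communication and computation with nonblocking MPI calls; the buffer sizes, neighbor rank, and function name are placeholders:

/* Start the exchange, compute on independent data, wait only when needed. */
#include <mpi.h>

void exchange_and_work(double *send_buf, double *recv_buf, int n,
                       int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Irecv(recv_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ... compute on data that does not depend on recv_buf ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... now recv_buf is safe to use ... */
}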

Communication Strategies for 3D FFT

Three approaches:
Chunk: wait for all 2nd-dimension FFTs to finish, then send; minimizes the number of messages (chunk = all rows with the same destination)
Slab: wait for the chunk of rows destined for one processor to finish; overlaps communication with computation (slab = all rows in a single plane with the same destination)
Pencil: send each row as it completes; maximizes overlap and matches the natural layout (pencil = 1 row)

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Reproduced with permission from Kathy Yelick (UC Berkeley)

NAS FT Variants Performance Summary

[Figure: best MFlops per thread for the NAS FT benchmark variants (Chunk, i.e. NAS FT with FFTW; best NAS Fortran/MPI; best MPI, always slabs; best UPC, always pencils) on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512); the best runs reach roughly 0.5 Tflops.]

Slab is always best for MPI; the small-message cost is too high.
Pencil is always best for UPC; it achieves more overlap.
Reproduced with permission from Kathy Yelick (UC Berkeley)

These ideas make a difference.


Molecular dynamics uses a large 3D FFT to perform the PME procedure. For very large systems, on very large processor counts, pencil decomposition is better. For the Ribosome molecule (2.7 million atoms) on 4096 processors, the pencil decomposition is 30% faster than the slab.

Load imbalance
Your parallel algorithm can only go as fast as its slowest parallel work. Load imbalance occurs when one parallel component has more work to do than the others.

Load Balancing
There are strategies to mitigate load imbalance. Let's look at a loop:
for (i = 0; i < N; i++) {
    /* do work that scales with N */
}

There are a couple of ways we could divide up the work


Statically
Just divide up the work evenly between processors

Bag of Tasks
The so-called bag of tasks is a way to divide up the work dynamically.
Also called Master/Worker or Server/Client models.

Essentially...
One process acts as the server (master)
It divides up the initial work amongst the rest of the processes (workers)
When a worker is done with its assigned work, it sends back its processed result
If there is more work to do, the server sends it out
Continue until no work is left
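A schematic MPI bag-of-tasks program (a sketch under simplifying assumptions, not code from the talk); the one-integer task encoding and the do_work() routine are placeholders, and it assumes there are at least as many tasks as workers:

/* Rank 0 is the master; everyone else is a worker fed one task at a time. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS 100
#define TAG    0

static int do_work(int task)             /* placeholder "work" */
{
    return task * task;
}

static void master(int nworkers)
{
    int task = 0, done = 0, result;
    MPI_Status st;

    /* Seed every worker with one task (assumes nworkers <= NTASKS). */
    for (int w = 1; w <= nworkers; w++, task++)
        MPI_Send(&task, 1, MPI_INT, w, TAG, MPI_COMM_WORLD);

    /* Whoever finishes first gets the next task: dynamic load balancing. */
    while (done < NTASKS) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &st);
        done++;
        if (task < NTASKS) {
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG, MPI_COMM_WORLD);
            task++;
        } else {
            int stop = -1;               /* no more work for this worker */
            MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG, MPI_COMM_WORLD);
        }
    }
}

static void worker(void)
{
    int task, result;
    for (;;) {
        MPI_Recv(&task, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (task < 0)                    /* stop signal from the master */
            break;
        result = do_work(task);
        MPI_Send(&result, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        master(size - 1);
    else
        worker();
    MPI_Finalize();
    return 0;
}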

[Figure: a master process handing out tasks to a pool of worker processes.]

Back to example
The previous model is an example of dynamic load balancing.
It provides some means to adapt the work distribution to the problem at hand.

[Figure: the example loop distributed dynamically over 4 processors.]

Increasing scalability
Minimize serial sections of code
Beat Amdahl's law

Minimize communication overhead


Overlap computation and communication with asynchronous communication models
Choose algorithms that emphasize nearest-neighbor communication
Choose the right language for the job!

Dynamic load balancing

Some other tricks of the trade


Plan out your code beforehand.
Transforming a serial code to parallel is rarely the best strategy.

Minimize I/O and learn how to use parallel I/O


I/O is very expensive time-wise, so use it sparingly
Do not (and I repeat) do not use scratch files!

Parallel performance is mostly a series of trade-offs


Rarely is there one way to do the right thing.
