Programação Paralela e Distribuída

This document introduces parallel computing concepts including Flynn's taxonomy for classifying computer architectures based on the number of instruction and data streams. It discusses parallel processing models such as SIMD, SPMD, MIMD and MPMD and how they differ in terms of instruction and data streams. It also covers parallel computer communication methods including shared memory, distributed memory and hybrid systems. Examples of parallel algorithms and how to implement them using control versus data parallelism are provided.


Programação Paralela e Distribuída

Introduction to Parallel Computing


Prof. Fábio M. Costa
Programa de Pós-Graduação em Ciência da Computação
Instituto de Informática
Universidade Federal de Goiás
2nd Semester / 2016

Parallel Computing Architectures - Overview

Classification of Computer Architectures:

Control Flow: Single and Multiple Instruction Streams
Communication: Shared and Distributed Memory

Flynn's Taxonomy

Parallel computers: Control Flow


SISD (single instruction stream, single data stream): these
computers correspond to the conventional sequential
computer.
MISD (multiple instruction stream, single data stream):
these computers are rare and can be thought of as
systolic (pipelined) array computers.
SIMD (single instruction stream, multiple data stream):
computers of this kind have a single control unit that
dispatches the same instruction to various processors
(which work on different data).
MIMD (multiple instruction stream, multiple data stream): these
computers have processors, each with its own control unit.

SIMD Processors
Some of the earliest parallel computers such as the Illiac IV,
MPP, DAP, CM-2, and MasPar MP-1 belonged to this class
of machines.
Variants of this concept have found use in co-processing units,
such as the MMX units in Intel processors, and in DSP chips,
such as the Sharc, and more recently on GPUs.
SIMD relies on the regular structure of computations (such as
those in image processing).
It is often necessary to selectively turn off operations on certain
data items. For this reason, most SIMD programming
paradigms allow for an activity mask, which determines if a
processor should participate in a computation or not.
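A minimal sketch of the activity-mask idea in C (illustrative only; the array names and data are assumptions, and a single loop stands in for the SIMD lanes):

    /* Illustrative only: emulates an activity mask on a single CPU. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        int mask[N] = {1, 0, 1, 0, 1, 0, 1, 0};  /* 1 = lane participates */

        /* Conceptually, all N lanes receive the same "double the value"
           instruction; the mask turns the operation off for selected lanes. */
        for (int i = 0; i < N; i++) {
            if (mask[i]) {
                data[i] *= 2;
            }
        }

        for (int i = 0; i < N; i++) {
            printf("%d ", data[i]);
        }
        printf("\n");
        return 0;
    }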

MIMD Processors
In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
A variant of this, called single program multiple data streams
(SPMD), executes the same program on different
processors.
It is easy to see that SPMD and MIMD are closely related in
terms of programming flexibility and underlying
architectural support.
Examples of such platforms include current generation Sun
Ultra Servers, SGI Origin Servers, multiprocessor PCs,
workstation clusters, and the IBM SP.

SPMD Model
(Single Program Multiple Data)
Each processor executes the same program
asynchronously.
They can execute different instructions within the same
program using instructions similar to:
if myNodeNum = 1 do this, else do that
Synchronization takes place only when processors need
to exchange data
SPMD is an extension of SIMD (relax synchronized
instruction execution) and a restriction of MIMD (use only
one source/object code)
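A minimal SPMD sketch in C with MPI (illustrative, not taken from the slides; it assumes a standard MPI installation): every process runs the same program and branches on its rank, much like the myNodeNum test above.

    /* Illustrative SPMD sketch: same program, behavior selected by rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            printf("I am the coordinator among %d processes\n", size);
        } else {
            printf("I am worker %d\n", rank);
        }

        /* Processes synchronize only when they need to exchange data. */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }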

SIMD-SPMD Comparison
In SPMD, multiple autonomous processors
simultaneously execute the same program, but at
independent points
In SIMD, processors execute the program in lockstep
(same instruction at the same time)
With SPMD, tasks can be executed on general
purpose CPUs
SIMD requires vector processors to manipulate data
streams.

SIMD-MIMD Comparison
SIMD computers require less hardware than MIMD
computers (single control unit).
However, since SIMD processors are specially designed,
they tend to be expensive and have long design cycles.
Not all applications are naturally suited to SIMD
processors.
In contrast, platforms supporting the SPMD paradigm can
be built from inexpensive off-the-shelf components with
relatively little effort in a short amount of time.

MPMD Model
(Multiple Program Multiple Data)
MPMD is the equivalent of having different
programs executing on different processors
(ex. Client/Server)
(This will be covered in the Distributed Programming
part of the course)

Parallel computers: Communication


There are two primary forms of data exchange
between parallel tasks - accessing a shared data
space and exchanging messages.
Platforms that provide a shared data space are
called shared-address-space machines, or
multiprocessors.
Platforms that support messaging are called
message passing machines, or multicomputers.

Shared-Address-Space Computers
Part (or all) of the memory is accessible to all processors.
Processors interact by modifying data objects stored in
this shared-address-space.
If the time taken by a processor to access any memory
word in the system (either global or local) is identical,
the platform is classified as a uniform memory access
(UMA) machine; otherwise, it is a non-uniform memory
access (NUMA) machine.

NUMA and UMA Shared-Address-Space Computers
The distinction between NUMA and UMA platforms is
important from the point of view of algorithm design.
NUMA machines require locality from underlying
algorithms for performance.
Programming these platforms is easier since reads and
writes are implicitly visible to other processors.
However, read-write to shared data must be coordinated.
Caches in such machines require coordinated access to
multiple copies. This leads to the cache coherence
problem.

Shared-Address-Space vs. Shared Memory Computers
It is important to note the difference between the
terms shared address space and shared
memory.
We refer to the former as a programming
abstraction and to the latter as a physical
machine attribute.
It is possible to provide a shared address space
using a physically distributed memory.

Shared Memory
One or more memories.
Global address space (all system memory visible to all
processors).
Transfer of data between processors is usually implicit: just
read from (or write to) a given memory address (OpenMP).
Cache-coherency protocol to maintain consistency between
processors.
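A minimal shared-memory sketch in C with OpenMP (illustrative; it assumes a compiler with OpenMP support, e.g. built with -fopenmp): threads communicate implicitly by reading and writing the shared array.

    /* Illustrative OpenMP sketch: implicit communication via shared memory. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* All threads see the same array 'a'; no explicit messages are sent. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
        }

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }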

Distributed Memory
Each processor has access to its own memory only.
Data transfer between processors is explicit, user calls
message passing functions.
Common Libraries for message passing: MPI, PVM
User has complete control/responsibility for data placement and
management.
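A minimal distributed-memory sketch in C with MPI (illustrative; run with at least two processes, e.g. mpirun -np 2): data moves only through explicit message-passing calls.

    /* Illustrative MPI sketch: explicit data transfer between processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 3.14;
            /* Process 0 explicitly sends the value to process 1. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }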

Distributed Shared Memory (Shared-Address-Space)
Single address space with implicit communication
Hardware support for read/write to non-local memories,
cache coherency.
Latency for a memory operation is greater when
accessing non-local data than when accessing data
within a CPU's own memory.

Hybrid Systems
Distributed memory system where each node is a
multiprocessor with shared memory.
Most common architecture for current generation of
parallel machines.

Message-Passing Computers
These platforms comprise a set of processors, each with its
own (exclusive) memory.
Instances of such a view come naturally from clustered
workstations and non-shared-address-space
multicomputers.
These platforms are programmed using (variants of) send
and receive primitives.
Libraries such as MPI and PVM provide such primitives.

Message Passing vs. Shared Address Space Computers
Message passing requires little hardware
support, other than a network.
Shared address space platforms can easily
emulate message passing. The reverse is
more difficult to do (in an efficient manner).

Flynn-Johnson classification of computers

Approaches to Parallelism

Approaches to Parallelism
Dividing the processing
Discovering the maximum possible
parallelism
Approaches
Data-centered: Data parallelism
Process-centered: Control parallelism

Approaches to Parallelism
Functional Decomposition: Control Parallelism

First: divide the processing into parts


Second: determine how to associate data
to processing

Approaches to Parallelism
Domain decomposition: Data parallelism
First: divide the data into parts
Second: determine how to associate
processing with data
Focusing on the largest and/or most
accessed data structure in the program

Approaches to Parallelism
Checklist for data parallelism:
The number of primitive tasks is at least an
order of magnitude greater than the number
of processors
Redundant processing and data structure
storage is minimized
Primitive tasks are all the same size
The number of tasks increases with the size of
the problem

Example: Eratosthenes' Sieve


A classical algorithm to obtain the prime
numbers ≤ n
Use multiples of the prime numbers (2, 3, 5,
...) to remove composite numbers
Terminates when the multiples of the greatest
prime number ≤ √n have been marked

Sequential Algorithm
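A minimal sequential sieve sketch in C (an illustrative version; the value of n and the variable names are assumptions):

    /* Illustrative sequential Sieve of Eratosthenes for primes <= n. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int n = 30;
        char *composite = calloc(n + 1, 1);    /* 0 = still assumed prime */

        for (int p = 2; p * p <= n; p++) {     /* stop at the greatest prime <= sqrt(n) */
            if (!composite[p]) {
                for (int m = p * p; m <= n; m += p) {
                    composite[m] = 1;          /* mark multiples of p */
                }
            }
        }

        for (int i = 2; i <= n; i++) {
            if (!composite[i]) {
                printf("%d ", i);
            }
        }
        printf("\n");
        free(composite);
        return 0;
    }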

Sequential Algorithm (cont'd)

5 is the greatest prime number ≤ √30 ≈ 5.5 => terminates

How to solve it using Control Parallelism?


Any suggestions?

Solution using Control Parallelism


Algorithm: Each processor looks for the
next prime number and marks its multiples
Problems:
Two processors may use the same prime
number to walk the sieve => waste of time
(though no error is caused)
A processor may mark multiples of a composite
number

How to solve it using Data Parallelism?


Any suggestions?

Solution using Data Parallelism


Algorithm: First processor finds the next prime
number and sends it to the others, which in turn
work together to mark the multiples, each
processor working on a distinct data segment
Note:
All prime numbers ≤ √n must be stored on the first processor
Each processor receives no more than ⌈n/p⌉ data
items
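A rough data-parallel sketch of this scheme in C with MPI (illustrative only; it assumes n - 1 is divisible by the number of processes and that process 0's block holds all primes ≤ √n):

    /* Illustrative data-parallel sieve: process 0 broadcasts each prime,
       every process marks multiples within its own block of 2..n. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, size;
        int n = 100;                            /* find primes <= n (assumed) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int count = (n - 1) / size;             /* block size per process */
        int low = 2 + rank * count;             /* first value in this block */
        char *marked = calloc(count, 1);        /* 1 = composite */

        int p = 2;
        while (p * p <= n) {
            /* First multiple of p inside this block, no smaller than p*p. */
            long first = (long)p * p;
            if (first < low) {
                first = low + ((p - low % p) % p);
            }
            for (long m = first; m < low + count; m += p) {
                marked[m - low] = 1;            /* mark multiples of p */
            }
            if (rank == 0) {
                /* Process 0 picks the next unmarked number as the next prime. */
                int next = p + 1;
                while (marked[next - low]) next++;
                p = next;
            }
            MPI_Bcast(&p, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }

        /* Count local primes and reduce to a global total on process 0. */
        int local = 0, total = 0;
        for (int i = 0; i < count; i++) {
            if (!marked[i]) local++;
        }
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("%d primes <= %d\n", total, n);

        free(marked);
        MPI_Finalize();
        return 0;
    }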

Problem
Propose parallel solutions to calculate the
following vector-based expression:
k1*A + k2*B
where k1 and k2 are constants, and A and B
are arrays of size n.
Present two solutions: one that explores
control parallelism and another one that
explores data parallelism

Possible solution: Control Parallelism


A solution using control parallelism could
use 2 processors.
Each processor would compute one of the
terms of the expression (i.e., k1*A or k2*B).
Then one of the processors would compute
the sum of the terms.
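A possible control-parallel sketch in C with OpenMP sections (illustrative; the array contents and sizes are assumptions): each section plays the role of one of the two processors.

    /* Illustrative control parallelism: one task per term of k1*A + k2*B. */
    #include <omp.h>
    #include <stdio.h>

    #define N 4

    int main(void) {
        double k1 = 2.0, k2 = 3.0;
        double A[N] = {1, 2, 3, 4}, B[N] = {5, 6, 7, 8};
        double t1[N], t2[N], result[N];

        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < N; i++) t1[i] = k1 * A[i];   /* "processor" 1 */

            #pragma omp section
            for (int i = 0; i < N; i++) t2[i] = k2 * B[i];   /* "processor" 2 */
        }

        /* One of the processors then adds the two terms. */
        for (int i = 0; i < N; i++) result[i] = t1[i] + t2[i];

        for (int i = 0; i < N; i++) printf("%g ", result[i]);
        printf("\n");
        return 0;
    }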

Possible solution: Data Parallelism


For a solution using data parallelism, we could use
n processors, assigning one array element to
each processor, and computing the expression in
3 steps:
In the first step, each processor multiplies
constant k1 by its element.
In the second step, the second term of the
expression would be computed similarly.
In the third step, each processor adds the two
partial results computed in steps 1 and 2.
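A possible data-parallel sketch in C with OpenMP (illustrative; one loop iteration stands in for one of the n processors, and the array contents are assumptions):

    /* Illustrative data parallelism: each "processor" handles one element i. */
    #include <omp.h>
    #include <stdio.h>

    #define N 8

    int main(void) {
        double k1 = 2.0, k2 = 3.0;
        double A[N], B[N], result[N];

        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double t1 = k1 * A[i];      /* step 1: first term */
            double t2 = k2 * B[i];      /* step 2: second term */
            result[i] = t1 + t2;        /* step 3: add the partial results */
        }

        for (int i = 0; i < N; i++) printf("%g ", result[i]);
        printf("\n");
        return 0;
    }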

Parallel vs Distributed Programming


Parallel Programming: improves performance
of a single application, making it possible to
handle a huge amount of data that otherwise
wouldn't fit into a single memory. Does not deal
with issues such as security, failures etc.
Distributed Programming: allows the
cooperation of more than one application/task.
Must deal with security, fault tolerance etc.

Parallel Programming
Shared Memory:
Pthreads (task parallelism, SPMD, a few threads - SMT, Simultaneous
Multithreading)
OpenMP (data parallelism, SPMD) - higher level of abstraction
CUDA/OpenCL (data/task parallelism, SPMD, massive multithreading). Exploits
data parallelism using a SIMD-like approach, without having to resort to
vector code. Instead, it uses a SIMT (Single Instruction, Multiple Threads) model.

Distributed Memory:
MPI (data/task parallelism, SPMD)
MapReduce (data parallelism, SPMD) - (higher level abstractions). Data
parallelism with large chunks: SIMD-like, where the operation is a function
and the data item is a data partition.

Distributed Programming
Shared Memory
Pthreads (task parallelism, SPMD): different tasks (functions) of
a single application cooperate to improve performance (e.g., a
spreadsheet: user interface + calculation + backup saving, etc.;
see the sketch below)
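A minimal task-parallel sketch in C with Pthreads (illustrative; the task functions are hypothetical stand-ins for the spreadsheet tasks above):

    /* Illustrative Pthreads sketch: a few threads running different tasks. */
    #include <pthread.h>
    #include <stdio.h>

    void *ui_task(void *arg)     { printf("user interface task\n"); return NULL; }
    void *calc_task(void *arg)   { printf("calculation task\n");    return NULL; }
    void *backup_task(void *arg) { printf("backup task\n");         return NULL; }

    int main(void) {
        pthread_t t1, t2, t3;
        pthread_create(&t1, NULL, ui_task, NULL);
        pthread_create(&t2, NULL, calc_task, NULL);
        pthread_create(&t3, NULL, backup_task, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        return 0;
    }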

Distributed Memory
Sockets (task parallelism, MPMD)
RPC and its higher level abstractions: Java RMI, CORBA (task
parallelism, MPMD)
Message-Oriented Middleware (JMS) (data/control driven)
Publish-Subscribe (DDS) (data-driven)
Tuple Spaces (data-driven)
