Computer Architecture II - Parallel Computing
CPE 713 Core 3 Units
Parallel Computing
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
The compute resources might be:
A single computer with multiple processors;
An arbitrary number of computers connected by a network;
A combination of both.
The computational problem should be able to:
Be broken apart into discrete pieces of work that can be solved simultaneously;
Execute multiple program instructions at any moment in time;
Be solved in less time with multiple compute resources than with a single compute resource.
The Real World is Massively Parallel
The Universe is Parallel:
Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a temporal sequence.
Uses for Parallel Computing:
•Science and Engineering: To model difficult/complex problems in many areas of science and engineering:
•Atmosphere, Earth, Environment
•Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
•Bioscience, Biotechnology, Genetics
•Chemistry, Molecular Sciences
•Geology, Seismology
•Mechanical Engineering - from prosthetics to spacecraft
•Electrical Engineering, Circuit Design, Microelectronics
•Computer Science, Mathematics
The von Neumann Architecture
o Read/write, random access memory is used to store both program instructions and data
Program instructions are coded data which tell the computer to do something
Data is simply information to be used by the program
o Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
o Arithmetic unit performs basic arithmetic operations
o Input/Output is the interface to the human operator
Concepts and Terminology:
Flynn’s Classical Taxonomy (1966)
Distinguishes multi-processor architecture
by instruction and data. Each dimension
can only be Single or Multiple. There are 4
possible classifications
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data
Flynn’s Classical Taxonomy: SISD
Serial computer.
Only one instruction and one data stream is acted on during any one clock cycle.
Oldest and the most common type of computer even today, e.g. a PC.
Flynn’s Classical Taxonomy: SIMD
All processing units execute the same instruction at any given clock cycle.
Each processing unit operates on a different data element.
A type of parallel computer.
Doing the same operation repeatedly over a large data set; this is commonly done in signal processing.
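As an illustration, the short C program below applies the same operation (a gain multiply) to every element of an array of samples; this is the kind of regular, element-wise loop a SIMD machine or a vectorizing compiler can execute as single instructions over multiple data elements. The array size and gain value are arbitrary choices for this sketch.

/* A loop amenable to SIMD execution: one operation applied to many data
 * elements. A vectorizing compiler (e.g. gcc -O3) can turn it into SIMD
 * instructions. */
#include <stdio.h>

#define N 1024

int main(void)
{
    float signal[N], scaled[N];

    for (int i = 0; i < N; i++)        /* fill with a dummy signal */
        signal[i] = (float)i;

    for (int i = 0; i < N; i++)        /* same operation on every element */
        scaled[i] = signal[i] * 0.5f;  /* candidate for SIMD vectorization */

    printf("scaled[10] = %f\n", scaled[10]);
    return 0;
}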
Flynn’s Classical Taxonomy: MISD
Different instructions operate on a single data element.
Very few practical uses exist for this classification, and it is rarely used. This is a type of parallel computer.
Examples: multiple cryptography algorithms attempting to crack a single coded message; multiple frequency filters operating on a single signal stream.
Flynn’s Classical Taxonomy: MIMD
Can execute different instructions on different data elements.
Most common type of parallel computer.
Many MIMD architectures also include SIMD execution sub-components.
Examples: supercomputers, multi-core PCs, networked parallel computer clusters/grids.
Examples of MIMD Architectures
Data Parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure.
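A minimal sketch of loop-level data parallelism, assuming OpenMP is available (compile with gcc -fopenmp): the loop iterations, and therefore the elements of the array, are distributed across threads, each performing the same operation on its share of the data. The array size and the arithmetic are placeholders.

/* Data parallelism: distribute the elements of a large array across threads. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Each thread processes a different chunk of the index range. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[N-1] = %f (up to %d threads)\n", b[N - 1], omp_get_max_threads());
    return 0;
}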
Task Parallelism
Task parallelism is the characteristic of a parallel program that entirely different calculations can be performed on either the same or different sets of data. This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
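A sketch of task parallelism using POSIX threads (compile with gcc -pthread): two entirely different calculations, here a sum and a maximum chosen only for illustration, run concurrently on the same data set.

/* Task parallelism: two different calculations run at the same time on the
 * same data. */
#include <stdio.h>
#include <pthread.h>

#define N 1000

static double data[N];
static double sum_result, max_result;

static void *compute_sum(void *arg)      /* task 1: sum of the data */
{
    (void)arg;
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += data[i];
    sum_result = s;
    return NULL;
}

static void *compute_max(void *arg)      /* task 2: maximum of the data */
{
    (void)arg;
    double m = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] > m)
            m = data[i];
    max_result = m;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    for (int i = 0; i < N; i++)
        data[i] = (double)i;

    pthread_create(&t1, NULL, compute_sum, NULL);
    pthread_create(&t2, NULL, compute_max, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("sum = %f, max = %f\n", sum_result, max_result);
    return 0;
}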
Parallel Computer Memory Architectures: Shared Memory (SM) Architecture
All processors access all memory as a single global address space.
Data sharing is fast.
Lack of scalability between memory and CPUs.
Shared memory machines are classified as UMA (Uniform Memory Access) or NUMA (Non-Uniform Memory Access) based on memory access time.
Parallel Computer Memory Architectures: Distributed Memory
Each processor has its own memory.
Scalable; no overhead for cache coherency.
Programmer is responsible for many details of communication between processors and synchronization between tasks.
Parallel Computer Memory Architectures
Uniform Memory Access (UMA) is sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
Likewise, if cache coherency is maintained in Non-Uniform Memory Access (NUMA), it may also be called CC-NUMA - Cache Coherent NUMA.
Parallel Programming Models
Exist as an abstraction above hardware and memory architectures.
Examples:
Shared Memory
Threads
Message Passing – a distributed memory model that can also be used on a shared memory machine
Data Parallel
Parallel Programming Models: Shared Memory Model
Appears to the user as a single shared memory, regardless of the underlying hardware implementation.
Mechanisms such as locks and semaphores may be used to control access to the shared memory.
Program development can be simplified since there is no need to explicitly specify communication between tasks.
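A minimal sketch of the shared memory model using POSIX threads (compile with gcc -pthread): every thread reads and writes the same global variable, and a mutex, one of the locking mechanisms mentioned above, controls access to it. The thread count and iteration count are arbitrary.

/* Shared memory model: all threads see one counter; a lock protects it. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define ITERS 100000

static long counter = 0;                          /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);                /* acquire the lock */
        counter++;                                /* protected shared update */
        pthread_mutex_unlock(&lock);              /* release the lock */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("counter = %ld\n", counter);           /* NTHREADS * ITERS */
    return 0;
}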
Parallel Programming Models: Threads Model
A single process may have multiple, concurrent execution paths.
Typically used with a shared memory architecture.
Programmer is responsible for determining all parallelism.
A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other threads.
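A sketch of the threads model, again assuming pthreads: the main program creates several concurrent execution paths, and each thread runs a subroutine of the main program with its own argument.

/* Threads model: one process, several concurrent execution paths. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static void *do_work(void *arg)            /* the subroutine a thread executes */
{
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, do_work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}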
Parallel Programming Models: Message Passing Model
Tasks exchange data by sending and receiving messages.
Typically used with distributed memory architectures.
Data transfer requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation.
MPI (Message Passing Interface) is the interface standard for message passing.
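A minimal MPI sketch of the send/receive pairing described above, assuming an MPI implementation such as MPICH or Open MPI (compile with mpicc, run with mpirun -np 2): rank 0 sends one integer and rank 1 posts the matching receive.

/* Message passing model: explicit send and matching receive between tasks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* the send must be matched by a receive on the other task */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}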
Parallel Programming Models: Data Parallel Model
Tasks perform the same operations on a set of data, with each task working on a separate piece of the set.
Works well with either shared memory or distributed memory architectures.
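A sketch of the data parallel model on distributed memory, again using MPI: every task applies the same operation to its own contiguous piece of the index range, and a reduction combines the partial results. The problem size and arithmetic are placeholders, and N is assumed to divide evenly by the number of tasks.

/* Data parallel model: same operation, each task on a separate piece of the set. */
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                       /* this task's share of the data */
    double local_sum = 0.0, global_sum = 0.0;

    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local_sum += (double)i;                 /* same operation, different piece */

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}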
Designing Parallel Programs: Automatic Parallelization
The compiler analyzes the code and identifies opportunities for parallelism.
The analysis includes attempting to compute whether or not the parallelism actually improves performance.
Loops are the most frequent target for automatic parallelization.
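The sketch below contrasts two loops an auto-parallelizing compiler would analyze differently: the first has independent iterations and is a safe target, while the second has a loop-carried dependence that forces serial execution. With GCC, -ftree-parallelize-loops=4 asks the compiler to attempt automatic parallelization; exact flags vary between compilers.

/* Independent vs. dependent loop iterations from the compiler's point of view. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* Independent iterations: the compiler may run these in parallel. */
    for (int i = 0; i < N; i++)
        a[i] = i * 2.0;

    /* Loop-carried dependence: b[i] needs b[i-1], so the iterations cannot
     * be reordered and the loop stays serial. */
    b[0] = 1.0;
    for (int i = 1; i < N; i++)
        b[i] = b[i - 1] + a[i];

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}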
Designing Parallel Programs: Manual Parallelization
Understand the problem.
A parallelizable problem: calculate the potential energy for each of several thousand independent conformations of a molecule, then find the minimum energy conformation.
A non-parallelizable problem: the Fibonacci series, where every calculation depends on the results of the preceding ones.
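The sketch below contrasts the two cases: the conformation loop's iterations are independent, so they could be distributed across processors with only the minimum combined at the end, while each Fibonacci term depends on the two before it and must be computed in order. The potential_energy function is a hypothetical stand-in, not a real molecular model.

/* Parallelizable (independent iterations) vs. non-parallelizable (dependent). */
#include <stdio.h>

#define NCONF 10000

static double potential_energy(int conf)    /* placeholder for a real calculation */
{
    return (double)((conf * 37) % 1001);
}

int main(void)
{
    /* Parallelizable: every conformation is independent; only the minimum
     * would need to be combined across processors. */
    double min_e = potential_energy(0);
    for (int c = 1; c < NCONF; c++) {
        double e = potential_energy(c);
        if (e < min_e)
            min_e = e;
    }

    /* Not parallelizable: each term depends on the previous two, so the
     * terms must be produced in order. */
    long fib[40] = {0, 1};
    for (int i = 2; i < 40; i++)
        fib[i] = fib[i - 1] + fib[i - 2];

    printf("min energy = %f, fib(39) = %ld\n", min_e, fib[39]);
    return 0;
}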
Designing Parallel Programs: Domain Decomposition
Each task handles a portion of the data set.
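A small sketch of a 1-D block domain decomposition: given N data elements and NTASKS tasks, each task computes the bounds of the contiguous block it owns. The sizes and names are illustrative only.

/* Domain decomposition: split N elements into one contiguous block per task. */
#include <stdio.h>

#define N 100
#define NTASKS 4

int main(void)
{
    int base = N / NTASKS;        /* minimum block size */
    int rem  = N % NTASKS;        /* leftover elements go to the first tasks */

    for (int task = 0; task < NTASKS; task++) {
        int start = task * base + (task < rem ? task : rem);
        int count = base + (task < rem ? 1 : 0);
        printf("task %d handles elements [%d, %d)\n", task, start, start + count);
    }
    return 0;
}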
Designing Parallel Programs: Functional Decomposition
Each task performs a function of the overall work.
Designing Parallel Programs: Limits & Costs of Parallel Programming (Algorithm)