
Computer Architecture II - Parallel Computing


Advanced Computer Architecture
Parallel Computing
CPE 713 Core 3 Units

Department of Computer Engineering


CPE 713
Course Lecturer
Dr. Eustace Dogo
Lecture Time: 11:30am
Location: CPE Board Room
Student Evaluation: CA=40%,
Exams=60%
Overview
Parallel Computing – what is it & its uses?
Concepts and Terminology
Parallel Computer Memory Architectures
Parallel Programming Models
Designing Parallel Programs
Parallel Algorithm Examples
Conclusion
Concepts and Terminology:
What is Parallel Computing?
Traditionally software has been written for
serial computation.
 To be run on a single computer having a single Central Processing Unit (CPU);
 A problem is broken into a discrete series of instructions.
 Instructions are executed one after another.
 Only one instruction may execute at any moment in time

Parallel computing is the simultaneous use of multiple compute resources
to solve a computational problem.
 To be run using multiple CPUs
 A problem is broken into discrete parts that can be solved concurrently (“in parallel”)
 Each part is further broken down to a series of instructions
 Instructions from each part execute simultaneously on different CPUs
Traditional/Serial
Computing

Parallel Computing
The compute resources might be:
 A single computer with multiple processors;
 An arbitrary number of computers connected by a network;
 A combination of both.
The computational problem should be able to:
 Be broken apart into discrete pieces of work that can be solved
simultaneously;
 Execute multiple program instructions at any moment in time;
 Be solved in less time with multiple compute resources than with a
single compute resource.
The Real World is Massively Parallel
The Universe is Parallel:
Parallel computing is an evolution of serial computing that attempts to emulate
what has always been the state of affairs in the natural world: many complex,
interrelated events happening at the same time, yet within a temporal sequence.
Uses for Parallel Computing:
•Science and Engineering: To model difficult/complex problems in many areas
of science and engineering:
•Atmosphere, Earth, Environment
•Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
•Bioscience, Biotechnology, Genetics
•Chemistry, Molecular Sciences
•Geology, Seismology
•Mechanical Engineering - from prosthetics to spacecraft
•Electrical Engineering, Circuit Design, Microelectronics
•Computer Science, Mathematics

Industrial and Commercial: faster computers for processing of large
amounts of data in sophisticated ways.
•Databases, data mining
•Oil exploration
•Web search engines, web based business services
•Medical imaging and diagnosis
•Pharmaceutical design
•Financial and economic modeling
•Management of national and multi-national corporations
•Advanced graphics and virtual reality, particularly in the entertainment industry
•Networked video and multi-media technologies
•Collaborative work environments
Concepts and Terminology:
Why Use Parallel Computing?
Saves time – wall clock time
Potentially saves cost
Overcomes memory constraints, making it possible to solve large/complex problems
Provides concurrency
Overcomes limits of serial computing (computer architecture is increasingly
relying on hardware-level parallelism: multiple execution units, pipelined
instructions, multi-core)
It is the future of computing
Limits to frequency scaling due to physical constraints
Concepts and Terminology:
von Neumann Architecture
Named after the Hungarian mathematician John von
Neumann, who set out the general requirements for
an electronic computer in 1945.
Virtually all computers have followed this basic
design. Parallel computers still follow this basic
design/architecture, just with multiple units.
Comprised of four main components:
o Memory
o Control Unit
o Arithmetic Logic Unit
o Input/output

o Read/write, random access memory is used to store both program instructions and data
 Program instructions are coded data which tell the computer to do something
 Data is simply information to be used by the program
o Control unit fetches instructions/data from memory, decodes the instructions and then
sequentially coordinates operations to accomplish the programmed task.
o Arithmetic Logic Unit performs basic arithmetic operations
o Input/Output is the interface to the human operator
Concepts and Terminology:
Flynn’s Classical Taxonomy (1966)
Distinguishes multi-processor computer
architectures along two independent
dimensions: Instruction and Data. Each
dimension can only be Single or Multiple,
giving 4 possible classifications:
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data
Flynn’s Classical Taxonomy:
SISD
Serial Computer
Only one instruction
and one data stream
is acted on during any
one clock cycle
Oldest and the most
common type of
computer even today,
e.g. a PC
Flynn’s Classical Taxonomy:
SIMD
All processing units
execute the same
instruction at any given
clock cycle.
Each processing unit
operates on a different
data element.
A type of Parallel
Computer
Doing the same operation
repeatedly over a large
data set. This is
commonly done in
signal processing
Flynn’s Classical Taxonomy:
MISD
Different instructions operate on a single data
element.
This is a type of parallel computer, but there are
very few practical uses for this classification and
it is rarely implemented.
Example: Multiple cryptography algorithms attempting to crack a single coded
message.
multiple frequency filters operating on a single signal stream
Flynn’s Classical Taxonomy:
MIMD
Can execute different
instructions on different data
elements.
Most common type of
parallel computer.
Many MIMD architectures
also include SIMD execution
sub-components.
Examples: Supercomputers,
Multi-core PCs, Networked
parallel computer
clusters/grids
Examples of MIMD Architectures

IBM POWER5, HP/Compaq AlphaServer, Intel IA-32, AMD Opteron, Cray XT3, IBM BG/L


Concepts and Terminology:
General Terminology
Task – A logically discrete section of
computational work
Parallel Task – Task that can be executed
by multiple processors safely
Communications – Data exchange
between parallel tasks
Synchronization – The coordination of
parallel tasks in real time
Concepts and Terminology:
More Terminology
Granularity – The ratio of computation to
communication
 Coarse – High computation, low communication
 Fine – Low computation, high communication
Parallel Overhead – Amount of time required to
coordinate parallel tasks rather than doing useful work
 Synchronizations
 Data Communications
 Software overhead - Overhead imposed by
compilers, libraries, tools, operating systems, etc.
Concepts and Terminology:
More Terminology
Massively Parallel – Hardware that comprises a
given parallel system having many processors,
numbering in the hundreds of thousands

Scalability – The ability of hardware/software to
show a proportionate increase in speed with the
addition of more processors. Factors include:
hardware (memory, CPU and network
communication bandwidth); application
algorithm; parallel overhead; characteristics of
the specific application code
Concepts and Terminology:
More Terminology
Pipelining - Breaking a task into steps
performed by different processor units,
with inputs streaming through, much like
an assembly line; a type of parallel
computing.
Symmetric Multi-Processor (SMP) -
Hardware architecture where multiple
processors share a single address space
and access to all resources; shared
memory computing
Types of Parallelism
Bit-level Parallelism
Historically, speed-up in computer architecture was driven by doubling the
computer word size – the amount of information the processor can
manipulate per cycle.
Increasing the word size reduces the number of instructions the
processor must execute to perform an operation on variables whose
sizes are greater than the length of the word. For example, where
an 8-bit processor must add two 16-bit integers, the processor must
first add the 8 lower-order bits from each integer using the standard
addition instruction, then add the 8 higher-order bits using an add-
with-carry instruction and the carry bit from the lower order addition;
thus, an 8-bit processor requires two instructions to complete a
single operation, where a 16-bit processor would be able to
complete the operation with a single instruction
4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit
microprocessors.
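As an illustration of the two-instruction sequence described above, the C sketch below emulates a 16-bit addition using only 8-bit operations and an explicit carry, and compares the result with a single native 16-bit add (the function name and the test values are illustrative, not from the slides).

#include <stdint.h>
#include <stdio.h>

/* Emulate a 16-bit addition with 8-bit operations plus a carry,
 * mirroring what an 8-bit processor must do in two instructions. */
static uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b) {
    uint8_t lo = (uint8_t)(a & 0xFF) + (uint8_t)(b & 0xFF);     /* low-order add            */
    uint8_t carry = (lo < (uint8_t)(a & 0xFF)) ? 1 : 0;         /* carry out of low byte    */
    uint8_t hi = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* add-with-carry, high byte */
    return ((uint16_t)hi << 8) | lo;
}

int main(void) {
    uint16_t a = 0x1234, b = 0x0BCD;
    uint16_t direct = (uint16_t)(a + b);   /* a 16-bit processor does this in one instruction */
    printf("two 8-bit steps: 0x%04X, single 16-bit add: 0x%04X\n",
           add16_with_8bit_ops(a, b), direct);
    return 0;
}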
Instruction-level Parallelism
A computer program is, in essence, a stream of instructions executed
by a processor. These instructions can be re-ordered and combined
into groups which are then executed in parallel without changing the
result of the program. This is known as instruction-level parallelism

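A small, hypothetical C fragment can make the idea concrete: the statements in the first function are mutually independent and may be overlapped by a superscalar or out-of-order processor, while the second function forms a dependency chain with little instruction-level parallelism to exploit.

/* Illustration only (not from the slides): independent vs. dependent instructions. */
void independent(int *a, int *b, int *c, int x, int y, int z) {
    *a = x + 1;   /* no data dependence between these three statements, */
    *b = y * 2;   /* so the hardware (or compiler) may overlap them     */
    *c = z - 3;
}

int dependent(int x) {
    int a = x + 1;    /* each result feeds the next instruction,   */
    int b = a * 2;    /* so there is little parallelism to extract */
    int c = b - 3;
    return c;
}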
Data Parallelism
Data parallelism is parallelism inherent in program loops, which
focuses on distributing the data across different computing nodes to
be processed in parallel. Parallelizing loops often leads to similar
(not necessarily identical) operation sequences or functions being
performed on elements of a large data structure.
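A minimal sketch of data parallelism, assuming OpenMP is available (the slides do not prescribe a particular tool): the same element-wise operation is applied across large arrays and the loop iterations are distributed over threads. Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data parallelism: the same operation (element-wise add) is applied to
     * different pieces of the arrays; OpenMP splits the iterations across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}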
Task Parallelism
Task parallelism is the characteristic of a parallel program that entirely
different calculations can be performed on either the same or
different sets of data. This contrasts with data parallelism, where the
same calculation is performed on the same or different sets of data.
Task parallelism does not usually scale with the size of a problem.
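By contrast, a task-parallel sketch (again using OpenMP sections as one possible mechanism; the helper functions are made up for illustration) runs two entirely different calculations on the same data set at the same time.

#include <omp.h>
#include <stdio.h>

/* Two different calculations on the same data set, run as independent tasks. */
static double sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

static double sumsq(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * x[i];
    return s;
}

int main(void) {
    static double data[1000000];
    for (int i = 0; i < 1000000; i++) data[i] = i * 0.001;

    double s = 0.0, sq = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        s = sum(data, 1000000);      /* task 1 */
        #pragma omp section
        sq = sumsq(data, 1000000);   /* task 2 */
    }
    printf("sum = %f, sum of squares = %f\n", s, sq);
    return 0;
}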
Parallel Computer Memory
Architectures:
Shared Memory (SM) Architecture
All processors access
all memory as a single
global address space.
Data sharing is fast.
Lack of scalability
between memory and
CPUs.
Shared memory
machines are classified
as UMA or NUMA, based
on memory access times.
(Figures: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA) diagrams)
Parallel Computer Memory
Architectures:
Distributed Memory
Each processor has
its own memory.
Is scalable, no
overhead for cache
coherency.
Programmer is
responsible for many
details of
communication
between processors
& synchronization
between tasks.
Parallel Computer Memory
Architectures
Uniform Memory Access (UMA) is sometimes
called CC-UMA - Cache Coherent UMA. Cache
coherent means that if one processor updates a
location in shared memory, all the other
processors know about the update. Cache
coherency is accomplished at the hardware
level.
Likewise, if cache coherency is maintained in
Non-Uniform Memory Access (NUMA), it may
also be called CC-NUMA - Cache Coherent
NUMA.
Parallel Programming Models
Exist as an abstraction above hardware
and memory architectures
Examples:
 Shared Memory
 Threads
 Message Passing – distributed memory
model (can even be implemented on a shared memory machine)
 Data Parallel
Parallel Programming Models:
Shared Memory Model
Appears to the user as a single shared
memory, regardless of the underlying
hardware implementation.
Mechanisms such as locks and
semaphores may be used to control
access to the shared memory.
Program development can be simplified
since there is no need to explicitly specify
communication between tasks.
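A minimal pthreads sketch of the shared memory model (pthreads is one common implementation; the example is not taken from the course material): all threads update the same variable in the shared address space, and a lock serializes access so that no updates are lost. Build with -pthread.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* shared memory access    */
        pthread_mutex_unlock(&lock);  /* leave critical section  */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
    return 0;
}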
Parallel Programming Models:
Threads Model
A single process may have multiple,
concurrent execution paths.
Typically used with a shared memory
architecture.
Programmer is responsible for determining
all parallelism.
A thread's work may best be described as
a subroutine within the main program. Any
thread can execute any subroutine at the
same time as other threads.
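A possible illustration of the threads model in C with pthreads (the slice-summing subroutine is a made-up example): a single process creates several threads, each executing the same subroutine on its own slice of data held in the process's shared address space.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define LEN 800

static double data[LEN];            /* shared by all threads of the process */
static double partial[NTHREADS];    /* each thread writes its own slot      */

/* The subroutine each thread executes on its own slice of the shared data. */
static void *slice_sum(void *arg) {
    long id = (long)arg;            /* thread index passed as argument */
    int chunk = LEN / NTHREADS;
    double s = 0.0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    for (int i = 0; i < LEN; i++) data[i] = 1.0;

    /* One process, NTHREADS concurrent execution paths over shared memory. */
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&threads[id], NULL, slice_sum, (void *)id);

    double total = 0.0;
    for (int id = 0; id < NTHREADS; id++) {
        pthread_join(threads[id], NULL);
        total += partial[id];
    }
    printf("total = %.1f (expected %d)\n", total, LEN);
    return 0;
}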
Parallel Programming Models:
Message Passing Model
Tasks exchange data by sending and receiving
messages.
Typically used with distributed memory architectures.
Data transfer requires cooperative operations to be
performed by each process. Example - a send operation
must have a receive operation.
MPI (Message Passing Interface) is the interface
standard for message passing.
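A minimal MPI sketch of the cooperative send/receive pairing described above (illustrative only): task 0 sends a value and task 1 posts the matching receive. Build with mpicc and run with at least two processes, e.g. mpirun -np 2 ./a.out.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* cooperative send */
    } else if (rank == 1) {
        double value;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* matching receive */
        printf("rank 1 received %f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}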
Parallel Programming Models:
Data Parallel Model
Tasks perform the same operations on a set
of data, each task working on a separate piece
of the set.
Works well with either shared memory or
distributed memory architectures.
Designing Parallel Programs:
Automatic Parallelization
Automatic
 Compiler analyzes code and identifies
opportunities for parallelism
 Analysis includes attempting to compute
whether or not the parallelism actually
improves performance.
 Loops are the most frequent target for
automatic parallelism.
Designing Parallel Programs:
Manual Parallelization
Understand the problem
 A Parallelizable Problem:
Calculate the potential energy for each of several
thousand independent conformations of a
molecule. When done find the minimum energy
conformation.
 A Non-Parallelizable Problem:
The Fibonacci series
 All calculations are dependent: F(n) = F(n-1) + F(n-2),
so each term requires the previously computed terms.
Designing Parallel Programs:
Domain Decomposition
Each task handles a portion of the data set.
Designing Parallel Programs:
Functional Decomposition
Each task performs a function of the overall work
Designing Parallel Programs:
Limits & Costs of parallel programming (Algorithm)

Amdahl's Law states that potential program
speedup is defined by the fraction of code (P)
that can be parallelized:

                 1
    speedup = -------
               1 - P

If P = 0, speedup = 1 (no speedup)
If P = 1, speedup = ∞ (infinite)
If P = 50%, speedup = 2x
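The formula can be tabulated directly; the small C program below evaluates the speedup bound for a few example parallel fractions (the fractions beyond the slide's own examples are arbitrary inputs, not figures from the course).

#include <stdio.h>

/* Amdahl's law as stated above: speedup = 1 / (1 - P),
 * for an unlimited number of processors. */
static double amdahl(double P) { return 1.0 / (1.0 - P); }

int main(void) {
    double fractions[] = { 0.0, 0.25, 0.50, 0.90, 0.95, 0.99 };
    for (int i = 0; i < 6; i++)
        printf("P = %.2f  ->  max speedup = %.1fx\n",
               fractions[i], amdahl(fractions[i]));
    /* P = 0.50 prints 2.0x, matching the slide's example;
     * P = 1 is excluded because the formula diverges (infinite speedup). */
    return 0;
}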
Designing Parallel Programs:
Limits & Costs of parallel programming (Algorithm)
Introducing the number of processors (N) performing the parallel
fraction of work, the relationship becomes:

                   1
    speedup = -----------
               P
              ---  +  S
               N

where P = parallel fraction, N = number of processors and S = serial
fraction (this is Amdahl's law generalized to N processors).
It soon becomes obvious that there are limits to the scalability of
parallelism.
Both Amdahl's and Gustafson's laws assume that the running time of the
sequential portion of the program is independent of the number of
processors.
Designing Parallel Programs:
Limits & Costs of parallel programming (Algorithm)

Amdahl's law assumes that the entire
problem is of fixed size, so that the total
amount of work to be done in parallel is
also independent of the number of
processors, whereas Gustafson's law
assumes that the total amount of work to
be done in parallel varies linearly with the
number of processors.
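The contrast can be made concrete by evaluating both models for a growing processor count. The sketch below assumes a 95% parallel fraction (an arbitrary choice) and uses the fixed-size formula from the earlier slide together with the standard scaled-speedup form of Gustafson's law, speedup = S + P×N.

#include <stdio.h>

int main(void) {
    const double P = 0.95, S = 1.0 - P;
    printf("%8s %15s %18s\n", "N", "fixed (Amdahl)", "scaled (Gustafson)");
    for (int N = 2; N <= 2048; N *= 4) {
        double fixed  = 1.0 / (S + P / N);   /* fixed problem size          */
        double scaled = S + P * N;           /* work grows linearly with N  */
        printf("%8d %15.2f %18.2f\n", N, fixed, scaled);
    }
    /* The fixed-size speedup flattens out near 1/S = 20x,
     * while the scaled speedup keeps growing with N. */
    return 0;
}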
Parallel Algorithm Examples:
Array Processing
Serial Solution
 Perform a function on a 2D array.
 Single processor iterates through each
element in the array
Possible Parallel Solution
 Assign each processor a partition of the array.
 Each process iterates through its own
partition.
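One possible realization of the parallel solution, assuming OpenMP and an arbitrary element-wise function (square root, chosen only for illustration): the 2D array is partitioned by rows and each thread iterates over its own partition. Compile with -fopenmp and link with -lm.

#include <math.h>
#include <omp.h>
#include <stdio.h>

#define ROWS 1000
#define COLS 1000

int main(void) {
    static double a[ROWS][COLS];

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = i * COLS + j;

    /* Each thread is assigned a block of rows and iterates through it. */
    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = sqrt(a[i][j]);

    printf("a[%d][%d] = %f\n", ROWS - 1, COLS - 1, a[ROWS - 1][COLS - 1]);
    return 0;
}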
Parallel Algorithm Examples:
Odd-Even Transposition Sort
Basic idea is bubble sort, but concurrently
comparing odd indexed elements with an
adjacent element, then even indexed
elements.
If there are n elements in the array and n/2
processors, the algorithm is effectively O(n).
Parallel Algorithm Examples:
Odd Even Transposition Sort
Initial array: Worst case scenario.
 6, 5, 4, 3, 2, 1, 0
6, 4, 5, 2, 3, 0, 1 Phase 1
4, 6, 2, 5, 0, 3, 1 Phase 2
4, 2, 6, 0, 5, 1, 3 Phase 1
2, 4, 0, 6, 1, 5, 3 Phase 2
2, 0, 4, 1, 6, 3, 5 Phase 1
0, 2, 1, 4, 3, 6, 5 Phase 2
0, 1, 2, 3, 4, 5, 6 Phase 1
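A serial C reference implementation of the odd-even transposition sort traced above; the comments indicate where a parallel version with n/2 processors could perform each phase's compare-exchanges simultaneously.

#include <stdio.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void odd_even_sort(int a[], int n) {
    for (int phase = 0; phase < n; phase++) {
        /* Alternate between odd-indexed pairs ("Phase 1" in the trace)
         * and even-indexed pairs ("Phase 2"). */
        int start = (phase % 2 == 0) ? 1 : 0;
        /* The compare-exchanges below are independent of one another,
         * so n/2 processors could perform them concurrently. */
        for (int i = start; i + 1 < n; i += 2)
            if (a[i] > a[i + 1])
                swap(&a[i], &a[i + 1]);
    }
}

int main(void) {
    int a[] = { 6, 5, 4, 3, 2, 1, 0 };   /* worst case from the slide */
    odd_even_sort(a, 7);
    for (int i = 0; i < 7; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}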
Other Parallelizable Problems
The n-body problem
Floyd’s Algorithm
 Serial: O(n^3), Parallel: O(n log p)
Game Trees
Divide and Conquer Algorithms
Conclusion
Parallel computing is fast.
There are many different approaches and
models of parallel computing.
Parallel computing is the future of
computing.
