Module 1: Parallelism Fundamentals (Motivation, Key Concepts and Challenges, Parallel Computing)

A parallel programming model is an abstraction of parallel computer architecture with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality (how well a range of different problems can be expressed for a variety of different architectures) and on its performance (how efficiently the compiled programs execute).


Figure: Serial Computing

Figure: Parallel Computing


Motivations
Challenges in Parallel Processing
• It is not always obvious where to “split” the workload, or whether a split is even possible.
• If you don’t use it, you lose it: programs not specifically written for a parallel architecture run no more efficiently on parallel systems.
Challenges in Parallel Processing
• Connecting your CPUs
• Dynamic vs Static—connections can change from one
communication to next
• Blocking vs Nonblocking—can simultaneous connections be
present?
• Connections can be complete, linear, star, grid, tree, hypercube,
etc.
• Bus-based routing
• Crossbar switching—impractical for all but the most expensive
super-computers
• 2X2 switch—can route inputs to different destinations
Challenges in Parallel Processing
• Dealing with memory
• Various options:
• Global Shared Memory
• Distributed Shared Memory
• Global shared memory with separate cache for processors

• Potential Hazards:
• Individual CPU caches or memories can become out of sync with each other: the “cache coherence” problem.

• Solutions:
• UMA/NUMA machines
• Snoopy cache controllers
• Write-through protocols
Scientific Computing Demand
• Ever increasing demand due to need for more accuracy,
higher-level modeling and knowledge, and analysis of
exploding amounts of data
– Example area 1: Climate and Ecological Modeling goals
• By 2010 or so:
– Simply improving resolution, simulated time, and physics increases the compute requirement by factors of 10^4 to 10^7. Then …
– Reliable global warming, natural disaster and weather prediction
• By 2015 or so:
– Predictive models of rainforest destruction, forest sustainability, effects of climate change on ecosystems and on food webs, global health trends
• By 2020 or so:
– Verifiable global ecosystem and epidemic models
– Integration of macro-effects with localized and then micro-effects
– Predictive effects of human activities on earth’s life support systems
– Understanding earth’s life support systems
Scientific Computing Demand
• Example area 2: Biology goals
– By 2010 or so:
• Ex vivo and then in vivo molecular-computer diagnosis
– By 2015 or so:
• Modeling based vaccines
• Individualized medicine
• Comprehensive biological data integration (most data co-analyzable)
• Full model of a single cell
– By 2020 or so:
• Full model of a multi-cellular tissue/organism
• Purely in-silico developed drugs; personalized smart drugs
• Understanding complex biological systems: cells and organisms to
ecosystems
• Verifiable predictive models of biological systems
Engineering Computing Demand
• Large parallel machines are a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural mechanics,
electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
• in all of the above
• entertainment (movies), architecture (walk-throughs, rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Application Trends
• Demand for cycles fuels advances in hardware, and vice versa
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder: most demanding applications
• Range of performance demands
– Need range of system performance with progressively increasing cost
– Platform pyramid
• Goal of applications in using parallel machines: Speedup

    Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time, so

    Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
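As a minimal illustration of this definition (the timings and the helper name speedup below are illustrative, not from the slides):

    def speedup(t_serial: float, t_parallel: float) -> float:
        """Fixed-problem-size speedup: Time(1 processor) / Time(p processors)."""
        return t_serial / t_parallel

    # Hypothetical timings: 120 s on one processor, 8 s on 32 processors.
    print(speedup(120.0, 8.0))   # 15.0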
Goals of Parallel Programming
Performance: Parallel program runs faster than
its sequential counterpart (a speedup is
measured)
Scalability: as the size of the problem grows,
more processors can be “usefully” added to
solve the problem faster
Portability: The solutions run well on different
parallel platforms
Communication and Co-ordination
Message-based communication
• One-to-one
• Group communication
– One-to-All Broadcast and All-to-One Reduction
– All-to-All Broadcast and Reduction
– All-Reduce and Prefix-Sum Operations
– Scatter and Gather
– All-to-All Personalized Communication
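These patterns map directly onto MPI collectives. A minimal sketch, assuming mpi4py and NumPy are available (run with, e.g., mpiexec -n 4; the array contents are only illustrative):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()

    # One-to-all broadcast: the root sends the same block to everyone.
    data = np.arange(4, dtype=np.int64) if rank == 0 else np.empty(4, dtype=np.int64)
    comm.Bcast(data, root=0)

    # All-to-one reduction: element-wise sum collected at the root.
    total = np.empty(4, dtype=np.int64) if rank == 0 else None
    comm.Reduce(data, total, op=MPI.SUM, root=0)

    # All-reduce: every rank gets the reduced result; Scan gives prefix sums over ranks.
    everywhere = np.empty(4, dtype=np.int64)
    comm.Allreduce(data, everywhere, op=MPI.SUM)
    prefix = np.empty(4, dtype=np.int64)
    comm.Scan(data, prefix, op=MPI.SUM)

    # Scatter/gather: the root deals out one block per rank, then collects them back.
    blocks = np.arange(p * 2, dtype=np.int64).reshape(p, 2) if rank == 0 else None
    mine = np.empty(2, dtype=np.int64)
    comm.Scatter(blocks, mine, root=0)
    gathered = np.empty((p, 2), dtype=np.int64) if rank == 0 else None
    comm.Gather(mine, gathered, root=0)

    # All-to-all broadcast: every rank ends up with every rank's block.
    allblocks = np.empty((p, 2), dtype=np.int64)
    comm.Allgather(mine, allblocks)

    # All-to-all personalized: rank i sends a distinct element to every rank j.
    send = np.full(p, rank, dtype=np.int64)
    recv = np.empty(p, dtype=np.int64)
    comm.Alltoall(send, recv)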
Basic Communication Operations:
Introduction
• Many interactions in practical parallel programs occur
in well-defined patterns involving groups of
processors.
• Efficient implementations of these operations can
improve performance, reduce development effort and
cost, and improve software quality.
• Efficient implementations must leverage underlying
architecture. For this reason, we refer to specific
architectures here.
Basic Communication Operations:
Introduction
• Group communication operations are built using point-
to-point messaging primitives.
• Recall from our discussion of architectures that communicating a message of size m over an uncongested network takes time ts + tw·m.
• We use this as the basis for our analyses. Where
necessary, we take congestion into account explicitly
by scaling the tw term.
• We assume that the network is bidirectional and that
communication is single-ported.
4.1. One-to-All Broadcast and All-to-One
Reduction
• One processor has a piece of data (of size m) it
needs to send to everyone.
• The dual of one-to-all broadcast is all-to-one
reduction.
• In all-to-one reduction, each processor has m
units of data. These data items must be
combined piece-wise (using some associative
operator, such as addition or min), and the
result made available at a target processor.
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction among processors.


One-to-All Broadcast and All-to-One
Reduction on Rings
• Simplest way is to send p-1 messages from the
source to the other p-1 processors - this is not
very efficient.
• Use recursive doubling: the source sends a message to a selected processor. We now have two independent problems defined over the two halves of the machine.
• Reduction can be performed in an identical
fashion by inverting the process.
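A small sketch of that schedule (the helper name broadcast_schedule is ours; p is assumed to be a power of two). Under the ts + tw·m model above, the whole broadcast takes roughly (ts + tw·m)·log2 p, ignoring per-hop costs:

    def broadcast_schedule(p: int, source: int = 0):
        """(sender, receiver) pairs active in each step of a recursive-doubling broadcast."""
        have = {source}          # nodes that already hold the message
        distance = p // 2        # the first message jumps half-way around the ring
        steps = []
        while distance >= 1:
            pairs = [(node, (node + distance) % p) for node in sorted(have)]
            have.update(dst for _, dst in pairs)
            steps.append(pairs)
            distance //= 2       # each half is now an independent sub-problem
        return steps

    # Eight-node ring with node 0 as the source: 3 steps, as in the figure that follows.
    for step, pairs in enumerate(broadcast_schedule(8), start=1):
        print(step, pairs)
    # 1 [(0, 4)]
    # 2 [(0, 2), (4, 6)]
    # 3 [(0, 1), (2, 3), (4, 5), (6, 7)]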
One-to-All Broadcast

One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
All-to-One Reduction

Reduction on an eight-node ring with node 0 as the destination of the reduction.
Who Needs Communications?
• The need for communications between tasks depends upon your problem
• You DON'T need communications
– Some types of problems can be decomposed and executed in parallel with virtually no
need for tasks to share data. For example, imagine an image processing operation where
every pixel in a black and white image needs to have its color reversed. The image data
can easily be distributed to multiple tasks that then act independently of each other to
do their portion of the work.
– These types of problems are often called embarrassingly parallel because they are so straightforward. Very little inter-task communication is required (see the sketch after this slide).
• You DO need communications
– Most parallel applications are not quite so simple, and do require tasks to share data
with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
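A minimal sketch of the embarrassingly parallel pixel-inversion example, assuming NumPy and Python's multiprocessing module (the chunk count and image size are arbitrary):

    import numpy as np
    from multiprocessing import Pool

    def invert_chunk(chunk: np.ndarray) -> np.ndarray:
        """Reverse the color of every pixel in this task's portion of the image."""
        return 255 - chunk

    if __name__ == "__main__":
        image = np.random.randint(0, 256, size=(1024, 1024), dtype=np.uint8)
        chunks = np.array_split(image, 8)          # distribute rows to 8 independent tasks
        with Pool(processes=8) as pool:
            inverted = np.vstack(pool.map(invert_chunk, chunks))
        # No task ever needed data held by another task: essentially no communication.
        assert np.array_equal(inverted, 255 - image)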
Factors to Consider (1)
• There are a number of important factors to consider
when designing your program's inter-task
communications
• Cost of communications
– Inter-task communication virtually always implies overhead.
– Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
– Competing communication traffic can saturate the available
network bandwidth, further aggravating performance
problems.
Factors to Consider (2)
• Latency vs. Bandwidth
– Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed in microseconds.
– Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed in megabytes/sec.
– Sending many small messages can cause latency to
dominate communication overheads. Often it is more
efficient to package small messages into a larger message,
thus increasing the effective communications bandwidth.
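A rough illustration with hypothetical network parameters (1 µs startup cost, roughly 1 GB/s of bandwidth), plugged into the simple model tcomm = tstartup + n·tdata used later in these notes:

    # Hypothetical parameters: 1 us startup latency, 1 ns per byte (~1 GB/s link).
    t_startup, t_data = 1e-6, 1e-9

    def transfer_time(nbytes: int, nmsgs: int = 1) -> float:
        """Time to move nbytes split across nmsgs messages: nmsgs*t_startup + nbytes*t_data."""
        return nmsgs * t_startup + nbytes * t_data

    many_small = transfer_time(1_000_000, nmsgs=1000)   # 1000 messages of 1 KB each
    one_large  = transfer_time(1_000_000, nmsgs=1)      # a single 1 MB message
    print(f"{many_small * 1e6:.0f} us vs {one_large * 1e6:.0f} us")   # ~2000 us vs ~1001 us

Packaging the small messages roughly doubles the effective bandwidth here, because the startup cost is paid only once.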
Factors to Consider (3)
• Visibility of communications
– With the Message Passing Model, communications
are explicit and generally quite visible and under
the control of the programmer.
– With the Data Parallel Model, communications
often occur transparently to the programmer,
particularly on distributed memory architectures.
The programmer may not even be able to know
exactly how inter-task communications are being
accomplished.
Factors to Consider (4)
• Synchronous vs. asynchronous communications
– Synchronous communications require some type of
"handshaking" between tasks that are sharing data. This can be
explicitly structured in code by the programmer, or it may
happen at a lower level unknown to the programmer.
– Synchronous communications are often referred to as blocking
communications since other work must wait until the
communications have completed.
– Asynchronous communications allow tasks to transfer data
independently from one another. For example, task 1 can
prepare and send a message to task 2, and then immediately
begin doing other work. When task 2 actually receives the data
doesn't matter.
– Asynchronous communications are often referred to as non-
blocking communications since other work can be done while
the communications are taking place.
– Interleaving computation with communication is the single
greatest benefit for using asynchronous communications.
Factors to Consider (5)
• Scope of communications
– Knowing which tasks must communicate with each other
is critical during the design stage of a parallel code. Both
of the two scopings described below can be implemented
synchronously or asynchronously.
– Point-to-point - involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two
tasks, which are often specified as being members in a
common group, or collective.
Collective Communication
Types of Synchronization
• Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In
other cases, the tasks are automatically released to continue their work.
• Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access
the protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns
the lock releases it.
– Can be blocking or non-blocking
• Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example,
before a task can perform a send operation, it must first receive an
acknowledgment from the receiving task that it is OK to send.
– Discussed previously in the Communications section.
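A minimal sketch of barrier and lock synchronization, using Python threads as the “tasks” (thread-based rather than the process model the notes generally assume):

    import threading

    N_TASKS = 4
    barrier = threading.Barrier(N_TASKS)   # every task blocks here until the last one arrives
    lock = threading.Lock()                # serializes access to the shared counter
    shared = {"count": 0}

    def task(rank: int) -> None:
        # ... each task does its own work first ...
        barrier.wait()                     # nobody proceeds until all N_TASKS have arrived
        with lock:                         # only one task at a time touches the global data
            shared["count"] += 1

    threads = [threading.Thread(target=task, args=(r,)) for r in range(N_TASKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared["count"])                 # 4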
Speed-up, Amdahl’s Law,
Gustafson’s Law, efficiency, basic
performance metrics
Concurrency/Granularity
• One key to efficient parallel programming is
concurrency.
• For parallel tasks we talk about the granularity – size
of the computation between synchronization points
– Coarse – heavyweight processes + IPC (interprocess communication: PVM, MPI, …)
– Fine – Instruction level (eg. SIMD)
– Medium – Threads + [message passing + shared memory
synch]
One measurement of granularity
• Computation to Communication Ratio
– (Computation time)/(Communication time)
– Increasing this ratio is often a key to good
efficiency
– How does this measure granularity?
• ↑ CCR = coarser grain
• ↓ CCR = finer grain
Communication Overhead
• Another important metric is communication
overhead – time (measured in instructions) a zero-
byte message consumes in a process
– Measure time spent on communication that cannot be
spent on computation
• Overlapped Messages – portion of message lifetime
that can occur concurrently with computation
– time bits are on wire
– time bits are in the switch or NIC
Many little things add up …
• Lots of little things add up that add overhead
to a parallel program
– Efficient implementations demand
• Overlapping (aka hiding) the overheads as much as
possible
• Keeping non-overlapping overheads as small as
possible
Speed-Up
• S(n) =
– (Execution time on a single CPU)/(Execution time on N parallel processors)
– ts /tp
– Serial time is for best serial algorithm
• This may be a different algorithm than a parallel version
– Divide-and-conquer Quicksort O(NlogN) vs. Mergesort
Linear and Superlinear Speedup
• Linear speedup = N, for N processors
– Parallel program is perfectly scalable
– Rarely achieved in practice
• Superlinear Speedup
– S(N) > N for N processors
• Theoretically not possible
• How is this achievable on real machines?
– Think about physical resources of N processors
Space-Time Diagrams

• Shows comm. patterns/dependencies


• XPVM has a nice view.
Figure: space-time diagram (process vs. time), showing computing, overhead, message, and waiting intervals for each process.
What is the Maximum Speedup?
• f = fraction of computation (algorithm) that is serial
and cannot be parallelized
– Data setup
– Reading/writing to a single disk file
• ts = f·ts + (1-f)·ts
= serial portion + parallelizable portion
• tp = f·ts + ((1-f)·ts)/n
• S(n) = ts/(f·ts + ((1-f)·ts)/n)
= n/(1 + (n-1)·f)   (Amdahl’s Law)
• Limit as n → ∞: S(n) → 1/f
Example of Amdahl’s Law
• Suppose that a calculation has a 4% serial
portion, what is the limit of speedup on 16
processors?
– 16/(1 + (16 – 1)*.04) = 10
– What is the maximum speedup?
1/0.04 = 25
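A quick numerical check of this example (the helper name amdahl_speedup is ours):

    def amdahl_speedup(f: float, n: int) -> float:
        """Amdahl's Law: speedup on n processors when a fraction f of the work is serial."""
        return n / (1 + (n - 1) * f)

    print(amdahl_speedup(0.04, 16))   # 10.0 -> speedup on 16 processors
    print(1 / 0.04)                   # 25.0 -> upper bound as n grows without limit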
More to think about …
• Amdahl’s law works on a fixed problem size
– This is reasonable if your only goal is to solve a
problem faster.
– What if you also want to solve a larger problem?
• Gustafson’s Law (Scaled Speedup)
Gustafson’s Law
• Fix execution time on a single processor as
– s + p = serial part + parallelizable part = 1
• S(n) = (s + p)/(s + p/n)
= 1/(s + (1 – s)/n) = Amdahl’s law
• Now instead let s + π = 1 be the execution time on a parallel
computer, with π = the parallel part. Then the scaled speedup is
– Ss(n) = (s + π·n)/(s + π) = n + (1 – n)·s
More on Gustafson’s Law
• Derived by fixing the parallel execution time
(Amdahl fixed the problem size -> fixed serial
execution time)
– For many practical situations, Gustafson’s law
makes more sense
• Have a bigger computer, solve a bigger problem.
• Amdahl’s law turns out to be too conservative
for high-performance computing.
Efficiency
• E(n) = S(n)/n * 100%
• A program with linear speedup is 100%
efficient.
Example questions
• Given a (scaled) speed up of 20 on 32
processors, what is the serial fraction from
Amdahl’s law?, From Gustafson’s Law?
• A program attains 89% efficiency with a serial
fraction of 2%. Approximately how many
processors are being used according to
Amdahl’s law?
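A small sketch that works these questions out numerically by inverting the formulas from the preceding slides (the helper names are ours):

    def amdahl_serial_fraction(speedup: float, n: int) -> float:
        """Invert Amdahl's Law S = n/(1 + (n-1)f) for the serial fraction f."""
        return (n / speedup - 1) / (n - 1)

    def gustafson_serial_fraction(scaled_speedup: float, n: int) -> float:
        """Invert Gustafson's Law Ss = n + (1-n)s for the serial fraction s."""
        return (scaled_speedup - n) / (1 - n)

    def amdahl_processors(efficiency: float, f: float) -> float:
        """Invert E = 1/(1 + (n-1)f) for the processor count n."""
        return (1 / efficiency - 1) / f + 1

    print(amdahl_serial_fraction(20, 32))     # ~0.019  (about 2% serial, per Amdahl)
    print(gustafson_serial_fraction(20, 32))  # ~0.387  (about 39% serial, per Gustafson)
    print(amdahl_processors(0.89, 0.02))      # ~7.2    (roughly 7 processors)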
Evaluation of Parallel Programs
• Basic metrics
– Bandwidth
– Latency
• Parallel metrics
– Barrier speed
– Broadcast/Multicast
– Reductions (eg. Global sum, average, …)
– Scatter speed
Bandwidth
• Various methods of measuring bandwidth
– Ping-pong
• Measure multiple roundtrips of message length L
• BW = 2*L*<#trials>/t
– Send + ACK
• Send #trials messages of length L, wait for single
ACKnowledgement
• BW = L*<#trials>/t
• Is there a difference in what you are measuring?
• Simple model: tcomm = tstartup + n·tdata
Ping-Pong
• All the overhead (including startup) are
included in every message
• When message is very long, you get an
accurate indication of bandwidth

Figure: ping-pong timeline between processes 1 and 2.
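A minimal ping-pong measurement, assuming mpi4py and NumPy (run with exactly two ranks, e.g. mpiexec -n 2; the message size and trial count are arbitrary):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    L = 1 << 20                      # message length: 1 MiB
    trials = 100
    buf = np.zeros(L, dtype=np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(trials):          # measure multiple round trips
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        # BW = 2*L*<#trials>/t: each round trip moves the message twice, and
        # every single message pays the full startup overhead.
        print(f"ping-pong bandwidth ~ {2 * L * trials / elapsed / 1e6:.1f} MB/s")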
Send + Ack
• Overheads of messages are masked
• Has the effect of decoupling startup latency
from bandwidth (concurrency)
• More accurate with a large # of trials

Figure: send + ACK timeline between processes 1 and 2.
In the limit …
• As messages get larger, both methods converge to
the same number
• How does one measure latency?
– Ping-pong over multiple trials
– Latency = t/(2*<#trials>)
• What things aren’t being measured (or are being
smeared by these two methods)?
– Will talk about cost models and the start of LogP analysis
next time.
Gustafson-Barsis’s Law example

A parallel program takes 134 seconds to run on 32 processors. The total time spent in the sequential part of the program was 12 seconds. What is the scaled speedup?
Here α = (134 − 12)/134 = 122/134, so the scaled speedup is

(1 − α) + α·N = (1 − 122/134) + (122/134) × 32 = 29.224

This means that the program is running approximately 29 times faster than it would run on one processor..., assuming it could run on one processor.
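The same computation in a few lines (the numbers are those of the example above):

    n = 32
    t_total, t_serial = 134.0, 12.0
    alpha = (t_total - t_serial) / t_total      # fraction of the run spent in parallel code
    scaled_speedup = (1 - alpha) + alpha * n    # Gustafson-Barsis scaled speedup
    print(f"{scaled_speedup:.3f}")              # 29.224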

Comparison of Shared vs. Distributed Memory
