Module 1: Parallelism Fundamentals. Motivation, Key Concepts, and Challenges. Parallel Computing
• Potential Hazards:
• Individual CPU caches or memories can become out of sync
with each other: the “cache coherence” problem
• Solutions:
• UMA/NUMA machines
• Snoopy cache controllers
• Write-through protocols
Scientific Computing Demand
• Ever-increasing demand, driven by the need for more
accuracy, higher-level modeling and knowledge, and analysis
of exploding amounts of data
– Example area 1: Climate and Ecological Modeling goals
• By 2010 or so:
– Simply increasing resolution, simulated time, and physics fidelity raises the
computational requirement by factors of 10^4 to 10^7. Then …
– Reliable global warming, natural disaster and weather prediction
• By 2015 or so:
– Predictive models of rainforest destruction, forest sustainability, effects of climate
change on ecosystems and on food webs, global health trends
• By 2020 or so:
– Verifiable global ecosystem and epidemic models
– Integration of macro-effects with localized and then micro-effects
– Predictive effects of human activities on earth’s life support systems
– Understanding earth’s life support systems
Scientific Computing Demand
• Example area 2: Biology goals
– By 2010 or so:
• Ex vivo and then in vivo molecular-computer diagnosis
– By 2015 or so:
• Modeling based vaccines
• Individualized medicine
• Comprehensive biological data integration (most data co-analyzable)
• Full model of a single cell
– By 2020 or so:
• Full model of a multi-cellular tissue/organism
• Purely in-silico developed drugs; personalized smart drugs
• Understanding complex biological systems: cells and organisms to
ecosystems
• Verifiable predictive models of biological systems
Engineering Computing Demand
• Large parallel machines are a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural mechanics,
electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
• in all of the above
• entertainment (movies), architecture (walk-throughs, rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Application Trends
• Demand for cycles fuels advances in hardware, and vice versa
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder: most demanding applications
• Range of performance demands
– Need range of system performance with progressively increasing cost
– Platform pyramid
• Goal of applications in using parallel machines: Speedup
Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time, so
Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)
Goals of Parallel Programming
Performance: Parallel program runs faster than
its sequential counterpart (a speedup is
measured)
Scalability: as the size of the problem grows,
more processors can be “usefully” added to
solve the problem faster
Portability: The solutions run well on different
parallel platforms
Communication and Co-ordination
Message-based communication
• One-to-one
• Group communication
– One-to-All Broadcast and All-to-One Reduction
– All-to-All Broadcast and Reduction
– All-Reduce and Prefix-Sum Operations
– Scatter and Gather
– All-to-All Personalized Communication
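For concreteness, these group operations map directly onto MPI collectives. The sketch below is a minimal illustration assuming an MPI installation (compiled with mpicc, launched with mpirun); it is not part of the original slides, and the variable names are ours.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int x = rank + 1;                        /* each process holds one data item */
    int bcast_val = (rank == 0) ? 42 : 0;

    /* One-to-All Broadcast: root 0 sends its value to everyone */
    MPI_Bcast(&bcast_val, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-to-One Reduction: combine everyone's x with + at root 0 */
    int sum = 0;
    MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* All-Reduce and Prefix-Sum */
    int allsum = 0, prefix = 0;
    MPI_Allreduce(&x, &allsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == p - 1)
        printf("rank %d got broadcast value %d\n", rank, bcast_val);
    if (rank == 0)
        printf("reduce sum=%d  allreduce sum=%d  my prefix=%d\n", sum, allsum, prefix);

    MPI_Finalize();
    return 0;
}

Scatter, Gather, and All-to-All Personalized Communication have analogous collectives (MPI_Scatter, MPI_Gather, MPI_Alltoall), omitted here for brevity.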
Basic Communication Operations:
Introduction
• Many interactions in practical parallel programs occur
in well-defined patterns involving groups of
processors.
• Efficient implementations of these operations can
improve performance, reduce development effort and
cost, and improve software quality.
• Efficient implementations must leverage underlying
architecture. For this reason, we refer to specific
architectures here.
Basic Communication Operations:
Introduction
• Group communication operations are built using point-
to-point messaging primitives.
• Recall from our discussion of architectures that
communicating a message of size m over an
uncongested network takes time ts + tw·m.
• We use this as the basis for our analyses. Where
necessary, we take congestion into account explicitly
by scaling the tw term.
• We assume that the network is bidirectional and that
communication is single-ported.
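As a worked illustration of this cost model, a one-to-all broadcast implemented by recursive doubling (the standard textbook algorithm, assumed here rather than taken from these slides) sends log2 p messages along the critical path, so

\[
T_{\text{one-to-all}} \approx (t_s + t_w m)\,\log_2 p .
\]

For example, with assumed values t_s = 50 µs, t_w = 0.5 µs per word, m = 1000 words, and p = 64, this gives (50 + 500) × 6 = 3300 µs.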
4.1. One-to-All Broadcast and All-to-One
Reduction
• One processor has a piece of data (of size m) it
needs to send to everyone.
• The dual of one-to-all broadcast is all-to-one
reduction.
• In all-to-one reduction, each processor has m
units of data. These data items must be
combined piece-wise (using some associative
operator, such as addition or min), and the
result made available at a target processor.
One-to-All Broadcast and All-to-One
Reduction
[Figure: timeline of the broadcast/reduction, showing message overhead, waiting, and computing phases]
What is the Maximum Speedup?
• f = fraction of computation (algorithm) that is serial
and cannot be parallelized
– Data setup
– Reading/writing to a single disk file
• ts = fts + (1-f) ts
= serial portion + parallelizable portion
• tp = fts + ((1-f) ts)/n
• S(n) = ts/(fts + ((1-f) ts)/n)
= n/(1 + (n-1)f) Amdahl’s Law
Limit as n -> ∞ is 1/f
Example of Amdahl’s Law
• Suppose that a calculation has a 4% serial
portion. What is the speedup on 16
processors?
– 16/(1 + (16 – 1)*0.04) = 10
– What is the maximum speedup?
1/0.04 = 25
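A quick numerical check of this example, as a minimal C sketch (the function name and layout are ours, not from the slides):

#include <stdio.h>

/* Amdahl's law: S(n) = n / (1 + (n - 1) * f) */
static double amdahl_speedup(double f, double n) {
    return n / (1.0 + (n - 1.0) * f);
}

int main(void) {
    double f = 0.04;                                   /* 4% serial portion */
    printf("S(16) = %.2f\n", amdahl_speedup(f, 16));   /* -> 10.00          */
    printf("limit = %.2f\n", 1.0 / f);                 /* -> 25.00          */
    return 0;
}

Compiling and running it prints S(16) = 10.00 and limit = 25.00, matching the values above.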
More to think about …
• Amdahl’s law works on a fixed problem size
– This is reasonable if your only goal is to solve a
problem faster.
– What if you also want to solve a larger problem?
• Gustafson’s Law (Scaled Speedup)
Gustafson’s Law
• Fix execution time on a single processor as
– s + p = serial part + parallelizable part = 1
• S(n) = (s +p)/(s + p/n)
= 1/(s + (1 – s)/n) = Amdahl’s law
• Now let 1 = s + π = execution time on a parallel
computer, with π = parallel part.
– Ss(n) = (s + π·n)/(s + π) = n + (1 − n)·s
More on Gustafson’s Law
• Derived by fixing the parallel execution time
(Amdahl fixed the problem size -> fixed serial
execution time)
– For many practical situations, Gustafson’s law
makes more sense
• Have a bigger computer, solve a bigger problem.
• Amdahl’s law turns out to be too conservative
for high-performance computing.
Efficiency
• E(n) = S(n)/n * 100%
• A program with linear speedup is 100%
efficient.
Example questions
• Given a (scaled) speed up of 20 on 32
processors, what is the serial fraction from
Amdahl’s law?, From Gustafson’s Law?
• A program attains 89% efficiency with a serial
fraction of 2%. Approximately how many
processors are being used according to
Amdahl’s law?
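One possible way to work these out, using the formulas above (our arithmetic, shown only as a sketch):

\[
\begin{aligned}
\text{Gustafson: } & 20 = n + (1-n)s = 32 - 31s \;\Rightarrow\; s = \tfrac{12}{31} \approx 0.39,\\[2pt]
\text{Amdahl: } & 20 = \frac{32}{1 + 31f} \;\Rightarrow\; f = \frac{32/20 - 1}{31} \approx 0.019,\\[2pt]
\text{Efficiency: } & 0.89 = \frac{1}{1 + (n-1)(0.02)} \;\Rightarrow\; n - 1 \approx 6.2,\ \text{so roughly 7 processors.}
\end{aligned}
\]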
Evaluation of Parallel Programs
• Basic metrics
– Bandwidth
– Latency
• Parallel metrics
– Barrier speed
– Broadcast/Multicast
– Reductions (e.g., global sum, average, …)
– Scatter speed
Bandwidth
• Various methods of measuring bandwidth
– Ping-pong
• Measure multiple roundtrips of message length L
• BW = 2*L*<#trials>/t
– Send + ACK
• Send #trials messages of length L, wait for single
ACKnowledgement
• BW = L*<#trials>/t
• Is there a difference in what you are measuring?
• Simple model: tcomm = tstartup + n·tdata
Ping-Pong
• All of the overhead (including startup) is
included in every message
• When the message is very long, you get an
accurate indication of bandwidth
[Figure: ping-pong message timeline between processes 1 and 2]
Send + Ack
• Overheads of messages are masked
• Has the effect of decoupling startup latency
from bandwidth (concurrency)
• More accurate with a large # of trials
[Figure: send + ack message timeline between processes 1 and 2]
In the limit …
• As messages get larger, both methods converge to
the same number
• How does one measure latency?
– Ping-pong over multiple trials
– Latency = t/(2*<#trials>)
• What things aren’t being measured (or are being
smeared by these two methods)?
– Will talk about cost models and the start of LogP analysis
next time.
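A minimal ping-pong benchmark sketch in C with MPI, combining the bandwidth and latency formulas above (two ranks assumed; the message length, trial count, and variable names are ours):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN    (1 << 20)   /* message length L in bytes (assumed value) */
#define TRIALS 100         /* number of round trips                     */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(LEN);
    memset(buf, 0, LEN);

    double t0 = MPI_Wtime();
    for (int i = 0; i < TRIALS; i++) {
        if (rank == 0) {                       /* process 1 in the figure */
            MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                /* process 2: echo it back */
            MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        /* BW = 2*L*#trials / t ; one-way latency ~ t / (2*#trials) */
        printf("bandwidth = %.2f MB/s\n", 2.0 * LEN * TRIALS / t / 1e6);
        printf("latency   = %.2f us\n", t / (2.0 * TRIALS) * 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

For a latency estimate one would normally rerun with a very small LEN; a large LEN, as configured here, is what gives an accurate bandwidth figure.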
Gustafson-Barsis’s Law example
(1 − α) + α·N = (1 − 122/134) + (122/134) × 32 = 29.224
Comparison of Shared vs. Distributed Memory