1 An Introduction to Parallel Computer Architecture
This chapter is not designed for a detailed study of computer architecture. Rather, it is a cursory review of concepts that are useful to understand the performance issues in parallel programs. Readers may well need to refer to a more detailed treatise on architecture to delve deeper into some of the concepts.¹,²

¹ John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 2017
² José Duato, Sudhakar Yalamanchili, and Lionel Ni. Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2003

1.1 Parallel Organization

Question: What are execution engines and how are instructions executed?

There are two distinct facets of parallel architecture: the structure of the processors, i.e., the hardware architecture, and the structure of the programs, i.e., the software architecture. The hardware architecture has three major components:
hardware and software architectures are, in principle, independent
of each other. In practice, however, certain software organizations are
more suited to certain hardware organizations. We will discuss these
graphs and their relationship later in the textbook.
Another way to categorize the hardware organization was proposed by Flynn³ and is based on the relationship between the instructions different processors execute at a time. This is popularly known as Flynn's taxonomy.

³ M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948–960, 1972
example, each pair in eight pairs of numbers may be added and eight
sums produced. Thus, there are as many output streams as input
streams. Such operations are sometimes referred to as vector opera-
tions. (Usually, the number of data-streams is limited by the number
of execution units available, but also see SIMT in the summary at the
end of the chapter.)
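As a concrete illustration (not from the text) of such a vector operation, the following C++ snippet uses x86 AVX intrinsics to add eight pairs of single-precision numbers, producing eight sums from one instruction stream; it assumes an AVX-capable CPU and compilation with AVX enabled (e.g., -mavx).

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // one instruction loads 8 floats
    __m256 vb = _mm256_load_ps(b);      // one instruction loads 8 floats
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction produces 8 sums
    _mm256_store_ps(c, vc);

    for (int i = 0; i < 8; ++i) std::printf("%g ", c[i]);
    std::printf("\n");
}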
MISD: Multiple Instruction, Single Data

The only other possible category in this taxonomy has multiple processors, each with a separate instruction stream. All operate simultaneously on the same operand from a single data-stream. This is a rather specialized situation, and a general study of this category is not common. (Sometimes, the same data-stream is processed by different processors, either for redundancy or with differing objectives. For example, in an aircraft, one instruction stream may be analysing data for anomalies, while another uses it to control pitch, and yet another simply encodes and records the data.) These can often be studied as multiple SISD programs.
Modern parallel computers are generally designed with a mix of
SIMD and MIMD architectures. SIMD provides high efficiency at a
lower cost because only a single instruction stream needs to be man-
aged, but when vector operations are not required, meaning there
is an insufficient number of data-streams available, the execution
engines can be underutilized.
Figure 1.1: Shared-memory vs Distributed-memory architecture

to execute instructions on its behalf.

Figure 1.3: Computing system
Thus, both the cluster as well as a single node are common exam-
ples of the MIMD architecture.
Figure 1.4: Computing core
The core's functional pipeline⁹ is illustrated in Figure 1.4. Each stage of the pipeline passes its results on to the next stage on completion and immediately seeks its next task from the previous stage. The front-end of a core's controller fetches instructions from memory, decodes them, and schedules them on one of several execution units.

⁹ Defined: A pipeline is like an assembly line: a sequence of sub-operations that together complete a given operation.
may occur in the same clock cycle. It is important to note that execution units themselves are pipelined, and a pipeline holds multiple instructions in its different stages simultaneously. However, an instruction cannot begin to be executed until its operands are available. Note that some input operand of an instruction may be the output of a previous one. Such an operand is available only after that earlier instruction completes. The later instruction is said to depend on the earlier one, as shown:
1 R1 = read address A;
2 R2 = read address B; // Independent of instruction 1: can begin before 1 is complete.
3 R3 = R1+R2; // Depends on instruction 1 and 2
And all that is only at a rather high level of abstraction. The point
is that the architecture’s details are intricate, but the following reper-
cussions are important to note. The ‘execution’ of an instruction
takes finite and variable time. Not only does a computing system
have many cores, potentially executing different parts of the same
program at any given instant, but each core also has multiple instruc-
tions in flight at any given time. These in-flight instructions do not
necessarily follow each other sequentially through the various stages
of the core’s pipeline, but they retire sequentially. This parallel execu-
tion, or start, of multiple instructions in the same clock cycle is called
instruction level parallelism.
From the discussion in this section, it should be clear that even a
single core follows the MIMD principle at some level. It can indeed
execute multiple instructions (on its multiple execution units) in the
same step. Some of these execution units process only a single data-
stream and are examples of SISD. At the same time, some modern
cores also contain execution units that are SIMD. Intel’s AVX and
AVX2 and nVIDIA’s SMX are examples of such execution units.
A memory access may not complete for a relatively long period after the request is made, possibly delaying the start of subsequent instructions.
Hence, it is common for hardware to maintain copies of a subset
of the data in fast local memory, called cache. Indeed, an entire cache
hierarchy – a series of caches – is maintained with an eye towards
the cost. A cache too small may not be of much help, and a cache
copy. Data re-use and locality of use within a program are common reasons why this is possible.
With a cache hierarchy, if a data item is not found in level i cache,
i.e., it is a cache-miss, it is allocated space in that cache. That space is
populated by bringing the item from level i + 1 (and recursively from
higher levels if necessary). This means that any data previously resi-
dent in that allocated space in level i must be evicted first, possibly by
updating its proxies at higher levels. The performance of a program’s
memory operations depends on the allocation and eviction policies.
Some systems allow the program to control both policies. More often,
though, a fixed policy is available.
For example, in direct-mapped caches, the cache-location of an item
is uniquely determined by its memory address. Another item already
occupying that location must be evicted to bring in the new item
before the core can access it. In the more pervasive associative caches,
an item is allowed to be placed in one of several cache locations. If all
those candidate locations are occupied, one must be vacated to make
space for the new item. The cache replacement policy governs which item is evicted. The FIFO (first in, first out) eviction policy dictates that the item that came into the cache before the other candidates is evicted. Under the LRU (least recently used) eviction policy, the evicted entry is the one whose most recent access is older than that of every other candidate.
Even if a fixed policy is in effect, programs can be written to adapt to it. For example, a program may ensure that multiple SIMD cores that share a cache do not incessantly evict each other's data. Suppose direct-mapped cache addressing is used. In a cache with k locations, memory address m occupies cache location m%k. This means that up to k contiguous memory items read simultaneously by k SIMD cores map to distinct cache locations and do not evict one another.
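A tiny C++ sketch (not from the book; k and the base address are arbitrary) of the m % k placement rule just described:

#include <cstdio>

int main() {
    const int k = 8;              // hypothetical number of cache locations
    const int base = 1000;        // arbitrary starting address
    for (int core = 0; core < k; ++core) {
        int m = base + core;      // core 'core' reads the item at address base+core
        std::printf("core %d: address %d -> cache location %d\n", core, m, m % k);
    }
    // The k contiguous addresses produce k distinct residues modulo k,
    // so no two cores compete for the same cache location.
}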
Figure 1.5: Cache coherence: two cores are shown with separate caches. R1 is a local register in each core. x and y are memory items.

Each cache level is divided into cache-lines: equal-sized blocks of contiguous bytes. The policies are implemented in terms of entire lines. So organizing a cache into lines helps reduce the hardware cost of the query about whether a data item accessed by a core is in that cache, i.e., whether there is a cache-hit or a cache-miss. However, dealing in cache-lines means that an entire cache-line must be fetched in order to access a smaller memory item. This acts to prefetch certain data, in case the other items in that cache-line are accessed in the near future.
Caches impose significant complexity in parallel computing environments. Note in Figure 1.3 that each computing core has its own cache. These multiple cores may retain their own copies of some data, or write to it. This duplication can lead to different parts of the same program executing on those cores seeing different – and hence inconsistent – data in the same memory location at the 'same time.' Such inconsistency is problematic if each part assumes that there is only one data item in one memory location at a time. Keeping the copies consistent is called cache coherence. Coherence is maintained by ensuring that two cores do not modify their copies concurrently. If a core modifies its copy, other copies are invalidated or updated with the new value.
propagated to all its cached copies. The appearance is similar to the case where the item is directly accessed from the memory un-cached. This does not preclude two concurrent changes to an item leading to unpredictable results. For example, in Figure 1.5, P2 could write the value of its register R1 into x. This value would be 6 if P2's read of x completes before P1's write to x. P1's increment would thus come undone. Furthermore, the interplay between cache-coherent accesses of two or more different items can also violate expectations that are routine in a sequential program. Such violations occur because the order in which updates to two items x and y become visible to one core is different from the order in which they may have been made.
We will later study this larger issue of memory-wide consistency
in more detail in section 4.2. One must understand the type of mem-
ory consistency guaranteed by a parallel programming environment
to design programs that execute correctly in that environment. In
fact, some programming environments even allow incoherent caches
in an attempt to bolster performance. After all, coherence comes at
a performance cost. Such environments leave it to the program to
manage consistency as needed. We will see such examples in Chapter
6.
Recall that caches operate in units of cache-lines, meaning co-
herence protocols deal in lines. If item x in P2's cache needs to be
invalidated, its entire line — including items not written by P1 — is
invalidated. This is called false-sharing and is discussed in Chapter 6.
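A minimal C++ illustration of false sharing (not from the book; array sizes and iteration counts are arbitrary): two threads update adjacent array elements that typically share a cache-line, so coherence traffic ping-pongs the line between the cores even though no data is actually shared.

#include <thread>
#include <cstdio>

int counters[2];                  // counters[0] and counters[1] likely share a cache-line

void work(int idx) {
    for (int i = 0; i < 10000000; ++i)
        counters[idx]++;          // each thread touches only its own element
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join(); t1.join();
    std::printf("%d %d\n", counters[0], counters[1]);
    // Padding each counter to its own cache-line (e.g., with alignas(64))
    // would typically remove the slowdown caused by false sharing.
}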
Figure 1.3). The general architecture of GPUs is shown in Figure 1.6.
GPU cores are organized in a hierarchy of groups. GPU execution en-
gines comprise SIMD cores. For example, one engine may consist of,
say, 32 floating-point execution units, and all may be used to execute
the next floating-point instruction in some instruction stream with its
32 data-streams. Just like CPU execution units, each core of the SIMD
group consists of a pipeline of sub-units.
Similarly, another execution unit may cause read or write of, say,
32 memory addresses. GPUs have memory separate from the CPU
memory. This memory is accessible by all GPU cores. Due to a higher
number of concurrent operations, GPU memory pipelines tend to
be even longer (i.e., they are deep pipelines) than CPU’s memory
pipelines, even as the cache hierarchy may have fewer levels. On the
other hand, GPU’s execution unit pipelines are often shorter than
CPU’s, and the imbalance in GPU memory and compute latencies is
significant. (See Section 6.5 for its impact on GPU programming.)
Stream Processors (SPs) are grouped into clusters variously called
streaming multi-processors (SM), or compute-unit (CU). SPs within
an SM usually share an L0 or L1 level cache local to that SM. In addi-
tion, SMs may also contain a user-managed cache shared by its cores.
This cache is referred to as scratchpad, local data-share (LDS), or
sometimes merely shared-memory. Sometimes, groups of SMs may
be further organized into ‘super-clusters,’ for example, for sharing
graphics-related hardware. Several of these super-clusters may share
higher levels of cache. At other times, the processors of an SM (or CU)
may be partitioned into multiple subsets, each subset operating in
SIMD fashion. Thus, there is a hierarchy of cores and a hierarchy of
caches. Again, due to the possible replication of data into multiple
local caches, their coherence is an important consideration.
many more execution units. They also are likely to have somewhat
smaller memory and cache, particularly on a per-core basis. Many
more simultaneous memory reads and writes need to be sustained
by GPUs, and hence they need to place a greater emphasis on efficient
memory operations. The hierarchical organization of cores and
caches into clusters aids this effort.
For example, each SM has a separate shared-memory unit (see
block marked local cache in Figure 1.6), and each shared-memory
unit may be further divided into several banks. Each bank of each
unit can be accessed simultaneously. The SIMD nature of instruc-
tions allows a program to control the banks accessed by a single
instruction and thus improve its memory performance. For example,
a 32-core SIMD instruction could read up to 32 contiguous elements
of an array in parallel if those elements reside in different banks.
Similarly, all items accessed by an instruction could occupy the same
cache-line.
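The following small C++ sketch (not from the book; it assumes 32 banks and that consecutive words fall in consecutive banks) illustrates the access pattern just described: 32 SIMD lanes reading consecutive array elements hit 32 distinct banks, so all accesses can proceed in parallel.

#include <cstdio>

int main() {
    const int banks = 32;
    for (int lane = 0; lane < 32; ++lane) {
        int index = lane;                 // lane reads element a[lane]
        std::printf("lane %2d -> bank %2d\n", lane, index % banks);
    }
    // A strided access such as a[lane * 32] would map every lane to bank 0,
    // serializing all 32 reads.
}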
execute program instructions, but they all have the ability to con-
sume, produce, or collate data. The memory controller is an example.
Multiple cores connect to the memory controller using a network.
They send requests to the controller, which returns the response after
performing memory operations on the cores’ behalf.
Sometimes, a unit contains multiple connections, each of which
common for general-purpose networks to employ modular design
and populate ports into switches and employ switched networks.
Internal, on-chip networks on devices like CPUs and GPUs can often
be direct networks instead.
Routing

Messages are routed either using circuit switching or packet switching. For circuit switching, the entire path between the sender and the recipient is reserved and may not be shared by any other pair until that communication is complete. For packet switching, each switch routes incoming packets¹² 'towards' the recipient at each step. Sometimes switches are equipped with buffers to store and then forward packets in a later step. This is useful to resolve contention when two messages from two different sources arrive at the same time-step and are required to be forwarded onto the same link on the way to their respective destinations. One alternative is to drop one of the messages and require it to be re-transmitted. That is a high-level overview. We will not discuss detailed routing issues in this book.

¹² Defined: A packet is a small amount of data. Larger 'messages' may be subdivided into multiple 'packets.' We will use these terms interchangeably.
Links
Most network topologies support bi-directional links that can carry
data in both directions simultaneously. These are called full-duplex
links. It is in many ways similar to having two simplex links instead
– simplex links are unidirectional. In contrast, half-duplex links are
bi-directional but carry data in only one direction at a time. In this
section, we will not separately discuss the duplex variants of the de-
scribed topologies, but it may be easier to understand the discussion
Figure 1.7: Completely connected network

be required in addition to n − 1 ports per node. This is expensive.
Another convenient interconnect is a bus (Figure 1.8): a pervasive
channel to which each end-point attaches. This method is cost effec-
tive but hard to extend over large distances. Bus communication is
also slowed by a large number of end-points, as only one end-point
may send its data on a fully shared bus at one time, and some bus
access arbitration is required for conflict-free communication.
Figure 1.8: A Bus network
One measure of such conflict is whether a network is blocking. A
nonblocking network exhibits no contention or conflict for any com-
bination of sender-recipient pairs as long as no two senders seek to
communicate with the same recipient. In other words, disjoint pairs
a network is the highest degree among its nodes. In common high-degree networks, many nodes have high degrees. However, extremes are possible, where only one or a few nodes have a high degree. For example, in a star network, a 'hub' is connected to every other node. Any programs on the hub in such cases have to account for its high degree. A hub can also be a source of high contention, not unlike a bus.
long latency.
communicating. Again, a bisection width of n/2 is not guaranteed to allow all pairs to proceed, as the network may be blocking, and there may be other conflicts along the paths between different pairs.
Torus Network
A simple network that reduces the bus bottleneck is a ring (see
along each dimension and dn links, each node having 2d ports. We call such tori k-way tori. The diameter of such a network is kd/2: the furthest node from a node is at a distance of k/2 along each dimension. The bisection width is 2k^(d−1): a (d−1)-dimensional slice through the middle would divide the nodes into two.
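As a quick check of these formulas: a 4 × 4 2D torus (k = 4, d = 2, n = 16) has dn = 32 links and 2d = 4 ports per node, a diameter of kd/2 = 4, and a bisection width of 2k^(d−1) = 8.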
One benefit of the Torus is its short link lengths, except for the wrap-around links – all dk^(d−1) of them. In the context of networks inside a chip, not only is the long delay in long links undesirable, but the variable delay due to variable link lengths is also a significant impediment to speed and throughput. It is possible to lay tori out to alleviate the link-length variability problem at a slight cost to the overall lengths. Figure 1.12 demonstrates one simple strategy, but we will not discuss these in detail here. Regardless, laying out high link counts, particularly on a plane, on a few planar layers, or even in 3D, is quite complicated.
Torus is a blocking network. Consider, for example, a message
from node (1, 1) to node (2, 2) at the same time as a message from
node (2, 1) to node (1, 2). Both must employ a common link (unless
a longer path is taken but there may be similar conflicts on other
edges). It is possible to create a nonblocking Torus network, but that
Figure 1.11: A 4 × 4 2D Torus, showing conflicting routes from node (1,1) to (2,2) and from node (2,1) to (1,2)
Hypercube Network

The Hypercube network¹³ is an alternative to the Torus. A Hypercube of dimension d + 1, d ≥ 0, is constructed by combining two copies of d-dimensional Hypercubes by mutually connecting by a link the ith node of one copy to the ith node of the other copy, for all i (see Figure 1.13). A 0-dimensional Hypercube is a single node with no links and index 0. After combining, the nodes from one copy retain their previous index numbers and those from the other copy are renumbered to 2^d + i, where i is a given node's previous index number. Thus, an n-node network is recursively constructed by adding n/2 links to two n/2-node networks.

¹³ Jon S. Squire and Sandra M. Palais. Programming and design considerations of a highly parallel computer. In Proceedings of the AFIPS Spring Joint Computer Conference, May 1963
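A consequence of this numbering is that two Hypercube nodes are directly linked exactly when their indices differ in one bit. The short C++ sketch below (illustrative, not from the book) lists each node's neighbours this way.

#include <cstdio>

int main() {
    const int d = 3;                            // 3-dimensional Hypercube, 8 nodes
    for (int i = 0; i < (1 << d); ++i) {
        std::printf("node %d neighbours:", i);
        for (int b = 0; b < d; ++b)
            std::printf(" %d", i ^ (1 << b));   // flip bit b of the index
        std::printf("\n");
    }
}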
Cross-bar Network
The Cross-bar seeks to reduce the cost of the completely connected
network. A Cross-bar switch has 2n ports connecting n nodes as
shown in Figure 1.14.
Figure 1.14: Cross-bar: dots designate that crossing wires are closed (meaning connected). Other crossings are open.

The Cross-bar switch can connect at most one pair of cross-wires, and no other source may set any junction in its column. Similarly, every destination owns its row and no junction in row d is set unless d is a destination. Thus the junction in row s and column d is reserved for the exclusive use of s to d communications.
The complexity and expense of the Cross-bar can be ameliorated
by using a modular multi-stage connector at the expense of latency.
Shuffle-exchange Network
Shuffle-exchange networks are of many types. Let us consider an
example. Omega network is a multi-stage network, as shown in
Figure 1.15.
Figure 1.15: Omega network: Output of switches are shuffled into the inputs at the next stage. The first half of the links connect consecutively to the left input ports of the next level switches. The second half connect to the right ports.

All the switches used have a pair of input and a pair of output ports. Each switch can be separately controlled to either let both its inputs pass-through straight to its corresponding outputs or to swap them. This is really a 2 × 2 Cross-bar, also called a Banyan switch element, as shown in the inset in Figure 1.15 (although other implementations are possible). In the figure, cross-connects (or exchanges) are set up for swap (i.e., cross). Connecting the other diagonal junctions instead would result in pass-through (i.e., bar).
An n-node Omega network requires log n stages¹⁵, with n/2 switches per stage. The output of a stage is shuffled into the input of the next stage – the left half of the links connect consecutively to the left input of each switch and the right half of the links connect to the right input of consecutive switches. In other words, if we number the outputs from left to right, output i, for i < n/2, connects to the left input of switch i. Output i, for i ≥ n/2, similarly connects to the right input of switch i − n/2.

¹⁵ Logarithm base 2 is implied in this book.
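As a sketch of this shuffle (illustrative C++, not the book's code; the helper name omega_shuffle is made up), the mapping from an output port to the next stage's switch and input port can be computed directly:

#include <cstdio>
#include <utility>

// Returns {switch index, 0 = left input, 1 = right input} at the next stage.
std::pair<int,int> omega_shuffle(int i, int n) {
    if (i < n / 2) return {i, 0};        // left half feeds left inputs
    return {i - n / 2, 1};               // right half feeds right inputs
}

int main() {
    const int n = 8;                      // an 8-port Omega network
    for (int i = 0; i < n; ++i) {
        auto conn = omega_shuffle(i, n);
        std::printf("output %d -> switch %d, %s input\n",
                    i, conn.first, conn.second == 0 ? "left" : "right");
    }
}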
Omega networks are examples of a family of multi-stage shuffle-exchange networks¹⁶ like the Butterfly¹⁷ or Benes¹⁸. Different members of the family mainly have different shuffle patterns. Figure 1.16 shows a butterfly topology, for example. It contains (log n − 1) shuffle stages consisting of n exchange switches each. This leads to a slightly lower diameter than Omega (log n vs log n + 1) and a higher bisection width (2n vs n) at the cost of almost doubling the number of links (2n log n vs n log n + n). One practical advantage of the Omega network is that the shuffle pattern does not change from stage to stage, allowing a more modular design. Also note that although the diagrams apparently show uni-directional data flow, it does not have to be. This is demonstrated later in this chapter.

¹⁶ H. S. Stone. Parallel processing with the perfect shuffle. IEEE Trans. Comput., C-20(2):153–161, 1971
¹⁷ Thomas J. LeBlanc, Michael L. Scott, and Christopher M. Brown. Large-scale parallel programming: experience with the BBN Butterfly parallel processor. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1988
¹⁸ V. E. Benes. Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, 1965
Figure 1.16: Butterfly Network
Clos Network

Clos networks take a different approach to reduce the cost of the Cross-bar. The main idea is to reduce the size and complexity by dividing the ports into smaller groups, say of size k, and use a Cross-bar within the smaller groups as shown in Figure 1.17. In a way, Clos is also a generalization of the shuffle-exchange network. Recall that exchange is but a 2 × 2 Cross-bar. Clos allows larger cross-bars. In this three-stage network, the shuffle is a perfect r-way shuffle, for a chosen r. The ith output of switch j is connected to the jth input of switch i of the next stage.

The bottom stage uses k × l cross-bars. The middle stage uses r × r cross-bars, r = ⌈n/k⌉. The top stage uses l × k cross-bars. Clos has shown¹⁹ that if l ≥ 2k − 1, this network is nonblocking, retaining the contention-free routing of the Cross-bar. For a large number of ports n, a Clos network requires multiple but significantly smaller cross-bars than a full n × n Cross-bar at the cost of a few more links. For example, a 1,024-node Cross-bar requires 1,048,576 cross-connects. In contrast, we can use 64 16 × 31 cross-bars in the first stage, 31 64 × 64 cross-bars in the second stage, and 64 31 × 16 cross-bars in the third stage for a total of only 190,464 cross-connects and 1,984 additional links, albeit of smaller lengths than those inside a 1024 × 1024 Cross-bar.

¹⁹ Charles Clos. A study of non-blocking switching networks. Technical Report 2, Bell Labs, 1953
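The arithmetic in the example above can be checked with a small sketch (hypothetical C++, not from the book) that tallies cross-points for a three-stage Clos network against a full Cross-bar:

#include <cstdio>

int main() {
    const long n = 1024;            // total ports
    const long k = 16;              // ports per ingress/egress cross-bar
    const long r = n / k;           // number of ingress (and egress) cross-bars
    const long l = 2 * k - 1;       // middle-stage cross-bars (nonblocking bound)

    long clos = r * (k * l)         // first stage: r cross-bars of size k x l
              + l * (r * r)         // middle stage: l cross-bars of size r x r
              + r * (l * k);        // third stage: r cross-bars of size l x k
    long full = n * n;              // a single n x n Cross-bar

    std::printf("Clos cross-points: %ld, full Cross-bar: %ld\n", clos, full);
    // Prints 190464 vs 1048576 for the parameters above.
}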
Tree Network

One of the simplest networks to design and route is a binary tree, as shown in Figure 1.18. Possibly, the root can be removed and its two children directly connected. Network complexity is small: the link count is only 2n − 3 for n nodes. Switches are simple three-port connectors able to route between any two ports in one step. The tree
Figure 1.17: A Clos network
and vice versa. This problem can be addressed by adding more links
at the higher levels of the tree. For example, double the number of
links going up at the level above the leaf, quadruple above that, and
so on. See Figure 1.19. So modified, it is called the Fat tree network.
Of course, the tree need not necessarily be a binary tree but may have
any degree d > 1.
links to be full-duplex and fold the figure down the middle, as shown in Figure 1.20. The middle stage r × r cross-bars of Figure 1.17 now look like the top row of Figure 1.20. On folding, this row now has r 'output' links on the same side as r 'input' links, making 2r full-duplex links. All l switches at this stage are folded. After folding, the cross-bars in the top and the bottom stages of Figure 1.17 occupy the bottom row of Figure 1.20. Thus the two rows of Figure 1.20 together make for a root node with 2n links. Given duplex links, the network becomes symmetric. Any port can send or receive in a given step – possibly both if the links are full-duplex.

Figure 1.20: Folded Clos Network

We can re-interpret Figure 1.20 as demonstrated in Figure 1.21, integrating the l r × r cross-bars of the top row into a single node with l links to each of the 2r nodes in the bottom row. The topology reduces to that of a three-level Fat tree topology. The root's degree is 2r, and that of each switch in the bottom row is k. In this configuration, the root is often referred to as the spine and its children as leaf switches. If l = k, k links between each leaf switch and the spine are sufficient
Network Comparison
We have discussed a few popular network topologies in this section.
Each has its pluses and minuses. Broadly speaking, the performance
increases with increasing complexity and cost. Ideally, these details
are hidden from a parallel application programmer, whose main
Table 1.1: Network Comparison

Network                                Link count     Diameter    Bisection width
Completely connected                   n(n−1)/2       1           (n/2) × (n/2)
d-dimensional k-way Torus, k^d = n     nd             kd/2        2k^(d−1)
Fat Tree (Binary)                      n log n        2 log n     n/2
HyperCube                              (n log n)/2    log n       n/2
1.7 Summary
Parallel processors are ubiquitous. These include CPUs with near 10
or 20 cores, GPUs with a few thousand cores, and clusters with up
to a million cores and more. Each core usually accepts a sequence of
instructions and executes them ostensibly in that order. Each instruc-
tion may execute on a single set of operands (scalar operation) or
an array of them at a time (vector operation). Some of these instruc-
tions read from or write to memory locations. Cores communicate
• Parallel MIMD cores execute independent instructions. SIMD
cores all execute the same instruction simultaneously on different
data.
• Somewhat refined terminology is also in vogue. SIMT – single in-
struction multiple threads – architecture allows a variable number
of virtual cores to apparently execute an instruction ‘simultane-
ously.’ If the available number of physical cores is smaller than
the requested SIMT-width, each SIMT instruction is serialized into
multiple SIMD instructions.
• Similarly, SPMD (single program multiple data) and MPMD
(multiple programs multiple data) are variants of MIMD, except
the definition works at the level of the entire user program, rather
than that of individual instructions. For example, in an SPMD
architecture, the same program is executed on multiple cores, and
at each core it operates on its own data. The executions together
solve a problem. It is up to the program to determine at the time
of execution which part of the solution each core undertakes.
Textbooks on parallel architecture²⁰ are good sources for a deeper study of these topics. GPU architecture evolves at such a fast pace that any textbook²¹ quickly becomes out of date. However, architecture vendors usually release white papers and programming guides, which are up-to-date sources of detailed information. A detailed analysis of design and performance issues of interconnects has been discussed in several books²².

Large-scale computing systems have many components. They add up to large power consumption. That is a major concern in high-performance computing. A large number of components also translates into a large chance of failure: even one component failing could abort a long-running program if the failure is not handled. Significant effort is devoted to designing low-power and fault-tolerant architecture. Programs designed to take advantage of these features can reduce power consumption and can respond to certain failures. These topics are out of the scope of this book, but several overviews of recent techniques have been published²³. Please refer to these to learn about such topics.

²⁰ John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 2017; Smruti R. Sarangi. Computer Organisation and Architecture. McGraw Hill India, 2017; and Kai Hwang. Computer Architecture and Parallel Processing. McGraw Hill Education, 2017
²¹ David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010; and Nicholas Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley Professional, 2013
²² F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992; José Duato, Sudhakar Yalamanchili, and Lionel Ni. Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2003; and Sudhakar Yalamanchili. Interconnection Networks, pages 964–975. Springer US, Boston, MA, 2011. ISBN 978-0-387-09766-4
²³ Sparsh Mittal. A survey of architectural techniques for near-threshold computing. ACM Journal on Emerging Technologies in Computing Systems, 12(4), 2015; Sangyeun Cho and Rami Melhem. On the interplay of parallelization, program performance, and energy consumption. Parallel and Distributed Systems, IEEE Transactions on, 21:342–353, 04 2010; Kenneth O'Brien, Ilia Pietri, Ravi Thouti Reddy, Alexey L. Lastovetsky, and Rizos Sakellariou. A survey of power and energy predictive models in HPC systems and applications.

Exercise

1.1. What is the NUMA memory configuration?

1.2. What is false-sharing?

1.3. What are the reasons multiple levels of cache may be employed?

1.4. Does memory in a UMA configuration with four attached cores require four ports for the four cores to attach to? If a single port exists, how can the four cores connect to the same port?
pipeline stages: stage1 to stage7. Each stage is able to complete its operation in a single clock-cycle. What is the instruction latency? Suppose a new instruction may start only two clock-cycles after the previous one does. What is the maximum execution throughput?
1.11. All SIMD cores perform the same operation in any clock cycle. However, branches can complicate this. For example, consider a group of 32 SIMD cores executing the following program. (id is the core number in the range 0..31.)

6 } else {
7     a[id] = a[id] - afactor * aincr;
8 }

All cores can execute the test on line 3 and then branch to their corresponding lines (4 or 7), depending on the result of the test. However, if some cores take the branch to line 4 and others to line 7, they have different instructions to execute next. The groups execute them taking turns. In each turn, the non-executing subset remains idle. In the example above, the first group executes line 4, and then the second group executes line 7.
1.13. Assume a single-level cache with a cache-line of 16 integers. What is the total number of memory operations performed in the following code? What percent of those operations are cache-hits? Assume the cache holds 1024 lines, and there is only one processor. Assume direct mapping of addresses, such that the integer at index i always maps to the cache location i%160, given that the cache can hold up to 160 integers. (This would be in the cache-line number (i%160)/16.)
What are the reasons the progress of these SIMD-units can get
arbitrarily out of pace with each other?
items are requested by two cores in the same cycle, these requests are in conflict, and they are issued serially. Assume that data[i] resides in bank i%32. Also assume that i, j, and tmp are local to each core and reside, respectively, in three of the registers of that core.

int data[][33];
for(int i=0; i<32; i++)
    for(int j=i+1; j<32; j++) {
        int tmp = data[i][id];
        data[i][id] = data[id][i];
        data[id][i] = tmp;
    }

Prove that the code above causes no bank conflict.
1.20. Consider the addressing scheme shown in Figure 1.18 but for a
tree network with degree 32. Devise the routing algorithm. Find
the maximum number of links traversed by any packet.
1.21. For the tree network in Exercise 1.20 find the maximum net-
work latency observed if each device sends one packet to one other
device. Assume that a packet takes one time-step to traverse each
link. Two packets may not traverse the same link at any time-step.
In addition, a node may only perform a single operation on a sin-
gle link at one time-step. Thus, it may accept one packet at any
time-step on one of its links, or send one out on one link.
1.23. Show that the Butterfly network shown in Figure 1.16 is equiva-
lent to an Omega network.
computing systems. You may choose appropriate r, l, and k.
2 Parallel Programming Models
Question: How are execution engines and data organized into a parallel program?
Question: What are some common types of parallel programs?

You have the hardware and understand its architecture. You have a large problem to solve. You suspect that a parallel program may be of help. To write one, an understanding of the software infrastructure is required. In this chapter, we will discuss the general organization of parallel programs, i.e., typical software architecture. Chapter 5 elaborates this further and discusses how to design solutions to different types of problems.
As we have noted, truly sequential processors hardly exist, but
they execute sequential programs perfectly well. Some parts of the se-
quential program may even be executed in parallel, either directly by
the hardware’s design, or with the help of a parallelizing compiler.
On the other hand, we are likely to achieve severely sub-par perfor-
mance by relying solely on the hardware and the compiler. With
only a little more thought, it is often possible to simply organize a
sequential program into multiple components and turn it into a truly
parallel program.
This chapter introduces parallel programming models. Parallel programming models characterize the anatomy or structure of parallel programs.

Figure 2.1: Distributed-memory programming model

The distributed-memory programming model is demonstrated in Figure 2.1. Each execution part – let us call it a fragment¹ – is able to address one or more memory areas. However, addresses accessed

¹ Defined: A fragment is a sequence of executed instructions.
Fragment 0              Fragment 1
x = 5;                  send(0, 10);
receive(1, y);          receive(0, y);
send(1, x+y);
The first argument to the receive and send functions is the name of
the fragment to which the second argument is communicated. Both
fragments have variables called x and y, but they mean different data
and are not shared. Note that Fragment 0, on its second line, is ready
to receive in its variable y, some data from Fragment 1. At the same
time, Fragment 1 on line 1 sends the value 10 to Fragment 0. Both
must execute complementary instructions. Managing such hand-
FT
shakes is an important part of distributed-memory programming.
Later in their codes, Fragment 0 sends back the sum of its variables x
and y (i. e., 15) to Fragment 1, which it receives in its variable y.
We will study enhancements to this model where the synchro-
nization in handshakes is loose, or where explicit send and receive
functions are not required. We will also see examples of higher-order
communication primitives that allow more intricate data transfer
patterns involving more than two participants, e. g., scatter-gather and
reduce.
Fragment 0 Fragment 1
x = 1; while(x == 0);
x = 2;
Figure 2.2: Shared-memory programming model

by fragments, the oft-implicit assumption breaks that a memory location remains what it was when it was last read (or written) by a given fragment. Not accounting for this possibility can – and does – have disastrous consequences. Consider the following listing for the system shown in Figure 2.2:
Atomic:
1. Location x = 1;
2. Location y = 5;
but not the value 5 in y – assuming that x had a value other than 1,
and y had a value other than 5 before this fragment updated those
variables. This holds as long as the other threads also use atomic
operations to access x and y.
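A minimal shared-memory sketch of this guarantee in C++ (not the book's notation; it uses a lock to make the pair of updates a single atomic step, so a reader taking the same lock sees either both old or both new values):

#include <mutex>
#include <thread>
#include <cstdio>

int x = 0, y = 0;
std::mutex m;

void writer() {
    std::lock_guard<std::mutex> lock(m);   // the "Atomic:" block
    x = 1;
    y = 5;
}

void reader() {
    std::lock_guard<std::mutex> lock(m);
    std::printf("x=%d y=%d\n", x, y);      // prints 0 0 or 1 5, never 1 0
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}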
In general, synchrony has to do with time or time-steps. Recall that this time-step is related to clock ticks. However, there may not be a universal clock in a parallel system. Indeed, unsynchronized clocks ticking at different rates are the norm. Nevertheless, a fragment can observe the impact of other fragments' activities. For example, a recipient observes the sender's activity. Similarly, a reader observes a writer's activity. Of course, each fragment directly observes its own activities. Synchronization can then be defined by the ordering of such observations by any fragment with respect to its own steps. The two accesses in the example above being atomic, other fragments that access x and y atomically must always observe both updates
definition, but it is useful. A parallel program consists of many such
tasks. In the extreme case, a task may consist of a single instruction’s
execution, but given that hardware comprises sequential execution
engines, it is common for parallel programs to consist of longer
tasks. The number of steps in a task relative to those in the complete
parallel program is called its granularity. Coarse-grained tasks are
relatively longer; fine-grained tasks are shorter. We will discuss their
trade-offs in more detail in Chapter 5.
We have informally used the term ‘executing fragment’ in the
previous sections. In general, these fragments could be parallel
constructs in those cases, but an executing task is always sequential
FT
by our definition. The relationships among the tasks are encoded in
directed edges of the task graph. These edges are of two types:
bi-directional. Some tasks may share memory with each other, while
other tasks share data only through communication edges.
Task graph programming is based on, e.g., primitives to create,
start, terminate, suspend, merge, or continue tasks. We defer detailed
discussion about how tasks start, where they execute, and how edges
are managed to chapter 6, where we will discuss practical tools for
task graph programming. In particular, we will discuss higher-level
primitives that create and manage multiple tasks in one shot, e.g.,
fork-join and task-arrays. We will also discuss in chapter 5 how to
decompose a problem into tasks in the first place.
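As a concrete (and entirely illustrative) sketch of the fork-join primitive mentioned above, the following C++ fragment forks two independent tasks and joins them into a third, dependent one, mirroring a three-node task graph; the task functions are hypothetical.

#include <future>
#include <cstdio>

int produce_a() { return 20; }   // hypothetical independent task
int produce_b() { return 22; }   // hypothetical independent task

int main() {
    auto a = std::async(std::launch::async, produce_a);  // fork task A
    auto b = std::async(std::launch::async, produce_b);  // fork task B
    int sum = a.get() + b.get();                         // join: dependent task
    std::printf("sum = %d\n", sum);
}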
2.4 Variants of Task Parallelism
Figure 2.5: Task Pipeline
Actors are another abstraction over the task graphs. Unlike stream
processing, which focuses on data, actors stand for independent and
arbitrary computational chunks. In both, however, the computation
is local to an actor or stream processing unit — and it only maintains
its local state. Of course, information about their states can be passed
to each other as data.
behavior, which operates on each message. Actions can be:
• send a finite number of messages to other actors known to this
actor
2.5 Summary
programs collaboratively determine which tasks execute when. Programs accept input from others and produce output to others. Such interactions can be encoded in a task graph program, which a task graph processor executes.
• Pipelined operation – limits may be imposed on the structure of
the task graph. For example, tasks may be processed in a strict
pipeline, with a fixed role for each task. Alternatively, a set of
tasks may produce ‘data,’ while another set accepts the data and
performs some operation on it. This amounts to a pipeline of
groups. Such organization often requires a work-queue, to which
task-generators add data, and from which task-executors remove
items.
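A rough C++ sketch of the work-queue organization described in the last bullet above (illustrative only; the generator/executor names and item counts are made up):

#include <queue>
#include <mutex>
#include <condition_variable>
#include <thread>
#include <cstdio>

std::queue<int> work;
std::mutex m;
std::condition_variable cv;
bool done = false;

void generator() {                       // task-generator adds data
    for (int i = 0; i < 8; ++i) {
        { std::lock_guard<std::mutex> lock(m); work.push(i); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
}

void executor(int id) {                  // task-executor removes items
    while (true) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !work.empty() || done; });
        if (work.empty()) return;        // no more items will arrive
        int item = work.front(); work.pop();
        lock.unlock();
        std::printf("executor %d processed item %d\n", id, item);
    }
}

int main() {
    std::thread g(generator), e1(executor, 1), e2(executor, 2);
    g.join(); e1.join(); e2.join();
}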
interaction³ can help improve the insight into many correctness issues. Shared memory style programming on distributed memory hardware requires a careful design of the memory consistency guarantees and synchronization primitives. We will discuss these issues in Chapter 5. There are many examples of shared memory style programming using distributed memory hardware.⁴ Task graphs are a powerful way to model parallelism, but expressing explicit graphs in programs can be expensive. Task graphs have historically been used for performance analysis and scheduling.⁵ They are often used as an internal representation of middleware.⁶,⁷ Programming APIs⁸ for applications to specify explicit task graphs are also emerging.

³ C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666–677, 1978; and Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular actor formalism for artificial intelligence. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence, IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann Publishers Inc.
⁴ Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '05, pages 519–538, New York, NY, USA, 2005. Association for Computing Machinery; Tarek El-Ghazawi and Lauren Smith. UPC: Unified Parallel C. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, page 27–es, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 0769527000; J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a portable "shared-memory" programming model for distributed memory computers.

Exercise

2.1. What is shared-memory programming?

2.2. What is distributed-memory?
Fragment 0              Fragment 1
x = 0;                  x = 1;
y = 5;
if(x == 0)              if(x == 1)
    y = 10;                 y += 20;
output x + y;           output x+y;
2.5. What is output by the fragments in Exercise 2.4 if they use
distributed memory (and in what order)?
2.6. Shared memory programs can suffer from subtle memory consistency issues. Exercise 2.4 is an example. One could, however, convert a shared-memory program to distributed-memory by ensuring that the memory is partitioned, allocating one partition to each executing fragment. What other changes may be required in the program? What are the shortcomings of this approach?
Fragment 0 Fragment 1
while(x == 0); while(x == 1);
y += 5; y += 50;
x = 1; x = 0;
Fragment 0 Fragment 1
send(1, x); send(0, x);
receive(1, y); receive(0, y);
output x+y; output x+y;
registers to store operands for later use.) Propose a technique to
perform the send-receive hand-shake to enable such out-of-order
instruction execution.
2.10. The tasks on the longest chain of dependencies in a task graph
are said to form its critical path. Tasks on this chain must proceed
one after the other. The maximum concurrency of a task graph is
the maximum number of tasks that can execute in parallel with
each other. For the tasks in Figure 2.6, compute their critical paths,
and the maximum concurrency. You may assume that the inter-
task communication occurs only at the task start and end times.
Figure 2.6: Example Task Graphs
2.12. Provide the task graph for the reduction algorithm in Section 3.2, assuming each node is a task.
atomic{
- lines of code -
}
3 Parallel Performance Analysis
Programs need to be correct. Programs also need to be fast. In order to write efficient programs, one surely must know how to evaluate efficiency. One might take recourse to our prior understanding of efficiency in the sequential context and compare observed parallel performance to observed sequential performance. Or, we can define parallel efficiency independent of sequential performance. We may yet draw inspiration from the way efficiency is evaluated in a sequential context. Into that scheme, we would need to incorporate the impact of an increasing number of processors deployed to solve the given problem.

Question: How do you reason about how long an algorithm or program takes?

Efficiency has two metrics. The first is in an abstract setting, e.g., asymptotic analysis¹ of the underlying algorithm. The second is concrete – how well does the algorithm's implementation behave in practice on the available hardware and on data sizes of interest. Both are important.

¹ The notion of asymptotic complexity is not described here. Readers not aware of this tool should refer to a book. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 1990

There is no substitute for measuring the performance of the real implementation on real data. On the other hand, developing and testing iteratively on large parallel systems is prohibitively expensive.
cache sizes, etc., but taking a cue from the sequential analysis style,
we will use a simplified model of a parallel system.
3.1 Simple Parallel Model

sized local memory locations, which are not accessible to other processors.

3. Each processor can read from or write to any local memory location.

and programs for one specific machine. They must be flexible and support variable p. See Section 3.5 for a more detailed explanation.
This model is simple and more useful than it may first seem. Its
major shortcoming is that the time taken by the network in message
transmission is not modeled. The cost of synchronization is also
ignored. Instead, it assumes that if a message addressed to proces-
sor i is sent by some other processor, it arrives instantaneously and
processor i spends 1 time-unit reading it. In effect, processor i may
receive a message at any time, and only the unit time spent in re-
ceiving is counted. This model works reasonably well in practice for
programs based on the distributed-memory model. A more precise
model accounts for the message transmission delay as well as the
synchronization overhead.
those two shortcomings. At the same time, it avoids modeling synchronizations in too great a detail. The BSP model limits synchronization to defined points after every few local steps. Thus, recognizing that synchronization is an occasional requirement, it groups instructions into super-steps. A super-step consists of any number of local arithmetic or memory steps, followed by one synchronization step. Just as in the simple model, an arbitrary number of processors is available per super-step. We continue to denote their count by p. Each processor has access to an arbitrary number of local memory locations.

Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990
1. Super-steps proceed in synchrony: all processors complete step s
before any starts step s + 1.
Figure 3.1: BSP computation model
only of p.

Thus the super-step time is Ls + t·hs + Ss. The total execution time is the sum over all parallel super-steps s: ∑_s (Ls + t·hs + Ss).
total number of messages. We have also seen in chapter 1 that not
all pairs have equal latency or throughput. BSP model also ignores
that messages may overlap local computation. A cleverly written
program attempts to hide communication latency by performing
other computation concurrently with the communication.
Assuming complete concurrency between computation and communication, we can account for the overlap by replacing Ls + t·hs with max(Ls, t·hs). This would not impact asymptotic analysis, as the big-O complexity remains the same. It is desirable for a computational model to abstract away many complexities – particularly ones that vary from system to system. The role of the model is to help with a gross analysis of the parallel algorithm. This algorithm may then be suitably adapted to the actual hardware architecture, at which point some of the abstracted details can be reconsidered.
BSP Example

Let us consider an illustrative example of performance analysis using the BSP model. Take the problem of computing the dot product of two vectors.

Assume that the n elements of vectors A and B are initially equally divided among p processors. The vector segments are in arrays referred to locally as lA and lB in all processors. The number of elements in each local array = n/p. Assume n is divisible by p and consider the following code:

Output:
A · B = ∑_{i=0}^{n−1} A[i] × B[i]
Solution:

forall⁵ processor i < p { // in parallel
    { // Super-step local computation:
        int lc = 0; // Local at each processor
        int lC[p]; // Only needed at processor 0. Used for receipt.

    { // Super-step local computation
        for(int idx=1; idx<p; idx++)
            lc += lC[idx];
        output lc;
    } { // The barrier is implicit.
    }
}
}

⁵ forall means that all indicated processors perform the loop in parallel. The range of the forall index variable (i here), along with an optional condition, indicates how many processors are used. The use of the index variable i in the enclosed body indicates what each processor does. We sometimes omit the keyword processor to emphasize the data-parallelism.
while(p > 0) { // Super-step loop
    // Data sent in the previous step have now been received into lc2
    forall processor i < p { // Only those that remain active
        { // Super-step local computation
            lc += lc2; // Accumulate the received value
            if(i >= p/2)
                send lc to processor i - p/2 // 2nd half sends to 1st half
}
Figure 3.2: Binary tree like computation tree

Defined: values from all processors are combined to produce a single scalar value. This is called reduction.

Now, there are more super-steps. The super-step loop has log p iterations. In this variant of reduction, the number of processors employed in each super-step halves from that at the previous step, until it goes down to 1 in the final step. In this example, each active processor sends a single message in each iteration. Thus the total time is again Θ(n/p + p + log p) = Θ(n/p + p):

• The first super-step takes k1·(n/p) local time, k2·(p/2) communication time, and k3·p synchronization time.
programming model. Like the BSP model, the PRAM model also
assumes an arbitrary number of processors, p, each with an arbitrary
number of constant-sized local memory locations. Further:
Thus, there is a barrier after each local step. While unrealistic in
comparison to BSP, this leads to simpler analysis.
chronous sub-steps, each taking a constant time:
The processors that are active at any step depend on the algo-
rithm. Not all active processors are required to perform each sub-
step. Some processors may remain idle in some sub-step.
The imposition of lock-step progress eliminates the need for
explicit synchronization by the program, but it may yet result in
conflicting writes by two processors to the same memory location
• Priority-CRCW: If wi = wj, the smaller of i and j succeeds. If more than two processors conflict, the smallest indexed processor among all conflicting processors has the priority and its value is written.
Figure 3.3: PRAM computation model
difference is in their execution times and the simplicity of designing algorithms. Priority-CRCW is the most useful since any algorithm of the other models can be executed in this model as is, without any translation. We could choose this model for our design. However, in practice, this model is the furthest from practical hardware, and hides more cost than the others. Detecting and prioritizing conflicts of an arbitrary number of processors in constant time is not feasible.

Torben Hagerup, and Tomasz Radzik. New simulations between CRCW PRAMs. In J. Csirik, J. Demetrovics, and F. Gécseg, editors, Fundamentals of Computation Theory, pages 95–104, Berlin, Heidelberg, 1989. Springer Berlin Heidelberg. ISBN 978-3-540-48180-5

Joseph Jájá. Introduction to Parallel Algorithms. Pearson, 1992
Comparatively, Common-CRCW and Arbitrary-CRCW are safer models to design algorithms with, being more representative of the hardware. However, the cost of supporting conflicting reads and writes can be non-trivial in a distributed-memory setting, where the EREW model may be more effective.

Regardless, all models assume perfect synchrony, which is hard to achieve in hardware in constant time for a large number of processors. This means that communication and synchronization costs are not accounted for in PRAM analysis.
Each step of PRAM takes a constant time-unit. The total time taken is
then proportional to the number of PRAM steps.
There is a local step in PRAM, much as in BSP. The commu-
nication step maps to reads and writes. Processors ‘send’ in the
PRAM model by writing to a shared location. ‘Recipients’ read from
there. In a sense, the (read, local step, write) triplet is analogous to
BSP super-step, except each of the three sub-steps is synchronous in
PRAM, whereas only the full super-step is synchronous in BSP. Also,
the cost of synchronization is hidden in PRAM, while BSP accounts
for synchronization and also allows arbitrary but local super-steps.
PRAM Example

Input: Arrays A and B with n integers each in shared memory.

Output:
A · B = ∑_{i=0}^{n−1} A[i] × B[i]

Solution:

int C[p]; // C is a shared int array of size p
forall processor i < p {
    C[i] = 0;
    for(int idx=0; idx<n/p; idx++)
        C[i] += A[i*n/p+idx] * B[i*n/p+idx];
}
forall processor i == 0 {
    for(int idx=1; idx<p; idx++)
        C[0] += C[idx];
    output C[0];
}
forall processor i == 0
    output C[0];
The first loop is unchanged from the previous version and takes time Θ(n/p). The second loop takes Θ(1) time per iteration and log p iterations, taking total time Θ(log p). The last step takes Θ(1) time by processor 0. Notice that the total time based on this analysis, i.e., Θ(n/p + log p), is different from the time taken by the analogous algorithm in the BSP model. This is because the extra messages passed in the reduction variant are exposed and counted in the BSP model. This count remains hidden in the PRAM model because more processors are able to perform more shared-memory accesses in parallel in the same time-step. In this aspect, PRAM is like the simple parallel model. In the case of shared-memory hardware, this unit time-step for shared-memory read is a reasonable assumption. Note that we sometimes allow p to be a suitable function of n for unified analysis. For example, if p = Θ(n) in the example above, the time complexity is Θ(log n).

For a distributed-memory setting, PRAM is simpler, but BSP may be better suited. Particularly so for algorithms that are communication-
times, and for algorithms, we talk of the number of notional steps as
described above.
1.
Latency and Throughput

The time taken to complete one program (call its execution a job), measured from the time it began, is called the elapsed time or job latency. Often, many jobs are executed on a parallel system. They may be processed one at a time from a queue, or several could execute concurrently on a large parallel system. These could be unrelated programs, related programs, or different executions of the same program. In all cases, the number of jobs retired per unit time is known as the job throughput. Job throughput is related to average job latency: if jobs take less time on average, more jobs are processed per unit time. However, the latency of individual jobs may vary wildly from job to job without impacting the throughput. The worst-case latency, i.e., the longest latency of any job, is an important metric.
Speed-up

    S = t1(n1, p1) / t(n, p)        (3.1)

Like before, n is the size of the input and p is the number of processors deployed by an algorithm; n1 and p1 are, respectively, the input size and processor count for the baseline algorithm whose time is t1. Although not explicit in the notation, S is clearly a function of P, P1,

    Spar = t(n, 1) / t(n, p)        (3.2)
P using p processors with respect to it using p1 processors, p1 < p, for the same input size should be greater than 1. (In reality, however, early learners often find this hard to achieve at first. It does get better in due course.)
Cost

Speed-up can increase with increasing p. On the other hand, deploying more processors is costly. We define the cost C of a parallel program as the product of its time and the processor count:

    C = t(n, p) × p        (3.4)
Efficiency

Another way to express the 'quality' of speed-up is efficiency. Expected speed-up over a sequential program is higher for a higher value of p. The quality of this speed-up, or the speed-up efficiency E, is the maximum speed-up per deployed processor:

    E = Smax / p        (3.5)
operations can depend heavily on this latency. Consequently, even small improvements in memory access latency can improve the program's performance. There can also be other scenarios, e.g., a parallel "multi-pronged" search may serendipitously converge to a solution quicker. The tools we develop next are designed in a more idealized setting and ignore these real effects. Regardless, they are meaningful and may generally be used even in the presence of these effects.
Scalability

Scalability is related to efficiency and measures the ability to increase the speed-up linearly with p. In particular, if the efficiency of program P remains 1 with increasing processor count p, we say it scales perfectly with the size of the computing system. Most problems cannot be solved this efficiently, and those that can are often said to be embarrassingly parallel. Indeed, the program may begin to slow down for larger values of p, as shown in Figure 3.4 for p = 17 and n = 10^4. This can happen due to several reasons. For example,
Figure 3.4: Efficiency curve: speed-up vs. processor count
A program is considered scalable if the efficiency E does not reduce with increasing p – it remains constant. This means the efficiency curve remains linear, even if its slope may be somewhat less than 1. We refine this quantitative measure of scalability next.
Iso-efficiency

The iso-efficiency of a scalable program indicates how (and if) the problem size must grow to maintain efficiency on increasingly larger computing systems. Iso-efficiency is, in reality, a restating of the sequential execution time as a function of p, the processor count. Recall from Eq 3.3 and 3.5:

    E(n, p) = t1(n, 1) / (t(n, p) × p)

and

    I(p) = t1(n, 1) = t(n, p) × p − ō(n, p)        (3.8)

Holding the efficiency fixed is thus equivalent to requiring t1(n, 1) = K ō(n, p), where K = E(n, p)/(1 − E(n, p)) remains constant. In other words, if the overhead grows rapidly with increasing p, the problem size also must grow as rapidly to maintain the same efficiency. That indicates poor iso-efficiency.
For illustration, consider the BSP example of parallel reduction in Section 3.2: t(n, p) = Θ(n/p + p). We know the optimal sequential algorithm is linear in n: t1(n, 1) = Θ(n). This means:

    ō(n, p) = Ω(p²)
    ⇒ I(p) = K Ω(p²)

This means that the problem size must grow at least quadratically with increasing p to maintain constant efficiency. Check that in the PRAM model, I is bounded sub-quadratically in p (see Exercise 3.11).
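A quick numeric check of this quadratic growth, in C; the constant c and the range of p are illustrative assumptions, not values from the text:

#include <stdio.h>

/* Parallel time of the BSP reduction example: t(n, p) = n/p + p (constants dropped). */
static double t_par(double n, double p) { return n / p + p; }

int main(void) {
    /* If the problem size grows as n = c * p^2, the efficiency
       E = t1 / (p * t(n, p)) = n / (n + p^2) stays fixed at c / (c + 1). */
    double c = 4.0;                      // illustrative constant
    for (double p = 10; p <= 10000; p *= 10) {
        double n = c * p * p;
        double E = n / (p * t_par(n, p));
        printf("p = %6.0f  n = %12.0f  E = %.3f\n", p, n, E);
    }
    return 0;
}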
Note that by Equations 3.6 and 3.7, for embarrassingly parallel
The final metric we will study is called parallel work. This is the total
sum of work done by processors actually employed at different steps
of an algorithm. Recall, the cost is the time taken by an algorithm
multiplied by the maximum number of processors available for use
at any step. Work is a more thorough accounting of the processors
actually used. In other words, parallel work required for input of size
n,
    W(n) = ∑_{s=1}^{t(n,p)} p_s(n),        (3.10)
where p_s(n) processors are active at step s. Note that we allow the number of active processors to be a function of the input size n. Each processor takes unit time per step, and the algorithm takes t(n, p) steps. Note also that in t(n, p), p varies at each step; we leave this intricacy out of the notation for p. The value of p at each step is specified for algorithms, however.
As an example, the initial number of processors assumed in the binary tree reduction algorithm is n/2. The algorithm requires log n steps, but the number of active processors halves at each step. For instance, in the first step of the PRAM algorithm, n/2 processors each perform unit work (a single addition in this example), n/4 processors are used in the second step, and so on. Thus the total work W(n) is:

    W(n) = ∑_{s=0}^{log n − 1} n/2^(s+1) = n − 1
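As a quick check of this sum, a few lines of C that add up the halving processor counts (the specific n is arbitrary and assumed to be a power of two):

#include <stdio.h>

int main(void) {
    long n = 1 << 20;          // assume n is a power of two
    long work = 0;
    for (long active = n / 2; active >= 1; active /= 2)
        work += active;        // each active processor does unit work per step
    printf("n = %ld, total work = %ld (= n - 1)\n", n, work);
    return 0;
}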
Let us assume the PRAM model to take a specific example, but other models are equally compliant. Step s of the original algorithm takes Θ(1) using p_s processors. In its execution, step s is scheduled on Pr processors, taking ⌈p_s/Pr⌉ steps. The total number of steps is:

    ∑_{s=1}^{t(n,p)} ⌈p_s(n)/Pr⌉ ≤ ∑_{s=1}^{t(n,p)} (p_s(n)/Pr + 1) = (1/Pr) ∑_{s=1}^{t(n,p)} p_s(n) + ∑_{s=1}^{t(n,p)} 1 = W(n)/Pr + t(n, p)        (3.11)
The work and time both impact the actual performance. For many algorithms t(n, p) = O(W(n)), and hence the work is the main determinant of the execution time. Another useful way to think about this is that with Pr processors, the algorithm takes time O(W(n)/Pr), for Pr ≤ W(n)/t(n, p).
We can also now define the notion of work optimality. A parallel algorithm is called work-optimal if W(n) = O(t1(n, 1)). Further, a work-optimal algorithm for which t(n, p) is a lower bound on the running time and cannot be further reduced is called work-time optimal.
3.6 Amdahl's Law

Question: Is this the best performance achievable?

There are certain limits to the speed-up and scalability of algorithms. Sometimes the problem itself is limited by its definition. Such limits may exist, e.g., because there may be dependencies that reduce or preclude concurrency. Recall that concurrency is a prerequisite for
may only begin after a certain minimum number of large boxes are
loaded.
Here is a more ‘computational’ example, called the prefix sum
problem.
Solution:
B[0] = A[0];
for(int i=1; i<n; i++)
    B[i] = A[i] + B[i-1];
This solution has each iteration i depend on the value of B[i-1] computed in the previous iteration. Thus, different entries of B cannot be filled in parallel; rather, the entire loop is sequential. We will later see that this is a shortcoming of the chosen algorithm and not a limitation of the problem itself. There do exist parallel solutions to this problem.
Amdahl’s law12 is an idealization of such sequential constraints. 12
Gene M. Amdahl. Validity of the
Suppose fraction f of a program is sequential. That may be because single processor approach to achieving
large scale computing capabilities.
of inherent limits to parallelization or because that fraction was In Proceedings of the April 18-20, 1967,
simply not parallelized. The fraction is in terms of the problem size (i. Spring Joint Computer Conference, AFIPS
FT
e., the fraction of time taken by the sequential program). This implies
that fraction f would take time at least t1 (n, 1) f . Assuming that the
’67 (Spring), page 483–485, New York,
NY, USA, 1967. Association for Comput-
ing Machinery. ISBN 9781450378956
rest is perfectly parallelizable, it can be speeded up by factor up to
p. This means that time t(n, p) taken by a parallel program can be
t (n,1)
no lower than t1 (n, 1) f + 1 p (1 f ). This implies a maximum
speed-up of:
t1 (n, 1) 1
Smax = = (3.12)
RA
t1 (n,1) 1 f
t1 (n, 1) f + p (1 f) f+ p
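The bound of Eq 3.12 is easy to tabulate; a short C sketch for the f = 0.1 case discussed next (the processor counts are arbitrary):

#include <stdio.h>

/* Maximum speed-up under Amdahl's law (Eq 3.12). */
static double amdahl(double f, double p) { return 1.0 / (f + (1.0 - f) / p); }

int main(void) {
    double f = 0.1;                          // sequential fraction
    int counts[] = {10, 100, 1000, 10000};
    for (int i = 0; i < 4; i++)
        printf("p = %5d  Smax = %.2f\n", counts[i], amdahl(f, counts[i]));
    // As p grows, Smax approaches 1/f = 10; p = 100 already gives about 9.2.
    return 0;
}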
For example, if one-tenth of a program is sequential (f = 0.1), the parallel speed-up could never be more than 10. It would seem that there is little benefit of using, say, more than a hundred processors, which yield a speed-up greater than 9. This is rarely true in practice. First, the formula assumes an efficiency of 1. If the efficiency is less, even the speed-up of 9 likely requires many more than a hundred processors. Second, for weakly scaling solutions, larger problems could be
solved efficiently on larger machines, even if the small problem does
Figure 3.5: Maximum speed-up possible with different processor counts (in idealized setting)
Note that the fractions f used by Amdahl and Gustafson are different. In Amdahl's treatment, f represents the fraction of a sequential program that is not parallelized, and f does not vary with p, whether the problem size n grows or not. In Gustafson's treatment, f accounts for the overheads of parallel computation. This fraction, relative to the parallel execution time, remains constant even as n and p change. This effectively means that the time spent in the sequential part reduces in proportion to that spent in the parallel part. In Amdahl's
Figure 3.6: Maximum speed-up possible by scaling problem size with processor count (in idealized setting)

In practice, it is possible that Gustafson's f does not remain constant but grows more slowly than envisaged by Amdahl. This would lead to a sub-linear growth of speed-up with increasing p, but pos-
up is less than the maximum possible. That can happen due to the overheads of parallelization. In that sense, f may thus generically represent the overhead ō.
3.9 Summary

programs that perform well on all n and p, or at least many n and p. It is not practical to measure the performance on all instances. Rather, one must argue about the performance on n and p that are anticipated.
Hence, modeling and analyzing performance are pre-requisites for writing efficient parallel programs. This chapter discusses a few abstract models of computation, which can be used to express and analyze parallel algorithms. It also introduces practical metrics to evaluate parallel programs' design and performance in comparison to, say, sequential programs, and as it relates to the number of processors used. These lessons include:
PRAM, but it does not require complete lock-step progress of processors. Instead, processors may take an arbitrary number of local steps before synchronizing. Further, data is exchanged by the processors explicitly – there is no shared memory. BSP counts the number of messages communicated. The lack of per-step synchrony does not make algorithms much more complicated than in the PRAM model, but the communication overhead is counted. BSP does not consider the size or batching of messages.
to be cost-effective because the speed per processor is high.

only if a small number of processors are used. As the number of processors grows, so do the overheads of synchronizing them, exchanging data, or simply waiting for certain action by other processors. This overhead can be detrimental to both efficiency and cost. The more processors there are, the more such overhead. In fact, the overhead from using too many processors can outweigh the entire benefit of the extra execution engines. Scalable programs limit such overheads. As a result, they continue to get faster with more processors. Some even continue to maintain the speed-up per processor, i.e., they continue to remain efficient, for large values of p.
faster with more processors. On the other hand, the parallel components do get faster. Consequently, the sequential components start to dominate the total execution time, limiting total speed-up.

scenario.

overhead by observing the speed-up with an increasing number of processors. Growth of this overhead with an increasing number of processors while keeping the problem size constant indicates that the overhead is significant. This suggests that attempts to reduce overhead may be useful.
(i.e., banks) and only one word may be accessed from each module in one time-step. Limitations of perfect synchrony have also been addressed.19, 20
The BSP model also addresses both the synchrony and communication shortcomings of the PRAM model. The BSPRAM model21 attempts to combine the PRAM and BSP models. Others, like the LogP model (Culler et al., 1993), account for the message cost more realistically by considering detailed parameters like the communication bandwidth and overhead and message delay. Barrier is still supported but not required. Others have also focussed on removing the synchronous barrier by supporting higher-level communication
delay. All these models can simulate each other and are equivalent in that sense. That may be the reason why the simplest models like PRAM and BSP have gained prevalence. However, the models do differ in their performance analysis. A case can be made that a more realistic model discourages algorithms from taking steps that are costly on real machines by making such cost explicit in the model. More importantly, though, it is the awareness of the differences between the model and the target hardware that drives good algorithm design.

28th Annual Symposium on Foundations of Computer Science, SFCS '87, page 204–216, USA, 1987. IEEE Computer Society. ISBN 0818608072
17 Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, 1990. ISSN 0304-3975
18 Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21(4):339–374, November
Besides designing efficient algorithms suitable for specific hardware and software architecture, one must also select the number of processors before execution begins. Large supercomputers may be available, but they are generally partitioned among many applications. It is important for applications not to oversubscribe to processors. As many processors should be used as provide the best speed-up and efficiency trade-off. Sometimes speed-up can reduce with large p. At other times speed-up increases, but the efficiency reduces rapidly beyond a certain value of p. In many applications, the size of the problem, n, can also be configured. Further, the memory reserved for an application, m, may also be configured. Optimally choosing S, E, p, n, and m is hard. A study of time and memory constrained scaling22, 23 is useful in this regard. In particular, the Sun-Ni law24 extends Amdahl's and Gustafson's laws to study limits on scaling due to memory limits.
Multiple studies25, 26, 27 have shown the utility of optimizing the product of efficiency and speed-up: E · S. Several of these conclude that there exists a maximum value of p beyond which the speed-up inevitably plateaus or decreases for a given problem. In general, seeking to obtain an efficiency of 0.5 provides a good trade-off between speed-up and efficiency.28, 29

22 John L. Gustafson, Gary R. Montry, and Robert E. Benner. Development of parallel methods for a 1024-processor hypercube. SIAM Journal on Scientific and Statistical Computing, 9(4):609–638, 1988
23 Patrick H. Worley. The effect of time constraints on scaled speedup. SIAM Journal on Scientific and Statistical Computing, 11(5):838–858, 1990
24 X.H. Sun and L.M. Ni. Scalable problems and memory-bounded speedup. Journal of Parallel and Distributed Computing, 19(1):27–37, 1993. ISSN 0743-7315
25 David J. Kuck. Parallel processing of ordinary programs. In Morris Rubinoff and Marshall C. Yovits, editors, Advances in Computers, volume 15, pages 119–179. Elsevier, 1976
26 D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Trans. Comput., 38(3):408–423, March 1989. ISSN 0018-9340

Exercise

3.1. Consider the following steps in a 3-processor PRAM. Explain the effect of each instruction for each of the following models. Note that some instruction may be illegal under certain models; indicate so. All variables are in shared memory.
P0 P1 P2
x = 5; x = 5; x = z;
y = z; y = z; y = z;
also valid for p-processor Priority-CRCW PRAM.
3.5. Show that every BSP algorithm can be converted to a PRAM
algorithm.
sorting algorithm of O(log n)?
3.12. The following table lists execution times of two different solutions (Program 1 and Program 2) to a problem. The execution times were recorded with varying number of processors p and varying input size n. This table applies to many following questions.

Input size n   Processor      Time t(n, p) (minutes)
(million)      count p        Program 1    Program 2
1              1              12           12
1              10             3.5          5.28
1              50             3.2          17.0
1              100            3.0          26.5
1              500            3.1          126.6
10             1              22           22
10             10             8.6          10.5
10             50             7.1          11.9
10             100            7.0          31.5
10             500            7.2          126.2
50             1              263          263
50             10             63.2         64.9
50             50             43.2         57.9
50             100            40.6         59.9
50             500            40.3         158.6
100            1              1021         1021
100            10             189          191
100            50             110.5        125
3.17. Referring to the table in Exercise 3.12, find the maximum speed-up S of Program 1 over Program 2 for n = 10 million.

3.18. Referring to the table in Exercise 3.12, find the efficiency E of Program 1 and Program 2 for n = 10 million and p = 100.
3.21. Discuss how well Amdahl’s law and Gustafson’s law hold for
Programs 1 and 2 for the table in Exercise 3.12. Do they accurately
estimate the bounds on the speed-up?
3.22. Refer to the table in Exercise 3.12. Using the Karp-Flatt metric,
estimate the overhead (including any sequential components) in
Program 2 for each value of p and n = 10 million. Discuss how the
Interaction between concurrently executing fragments is an essential characteristic of parallel programs and the major source of difference between sequential programming and parallel programming. Synchronization and communication are the two ways in which fragments directly interact, and these are the subjects of this chapter. We begin with a brief review of basic operating system concepts, particularly in the context of parallel and concurrent execution. If you already have a good knowledge of operating systems concepts, browse lightly or skip ahead.

Question: Who controls the executing fragments? How do different executing fragments interact and impact each other's execution?
4.1 Threads and Processes

fork, a copy of the parent's address space may be created for the child. The child then owns the copy, which is hidden from the parent. Creating such copies is time-consuming. Hence non-copy variants are sometimes called light-weight processes.

(It is possible for unrelated processes to also share address space with each other. We will not discuss their details.)
There is thus a spectrum of relationships between processes, but we will use this broad distinction: each process has its own address space, and it comprises one or more threads that all share that address space. A process that only executes sequentially has a single thread, whose code shares the address space with no other thread. In this sense, a single-threaded process may be conveniently called a thread, and we will commonly use the term thread when referring to a sequential execution. In other words, an executing fragment is a part of some thread's execution. Two threads from different processes do not share an address space, but through page-mapping mechanisms, they may yet be able to share memory.
There are intricacies we will not delve into. For example, the execution of kernel threads is scheduled directly and separately by the operating system, whereas user threads may be scheduled
of instructions at its own pace. Recall that in a parallel system, no universal clock may be available. We will assume that there is a universal time that continually increases, but threads may not have any way to know this universal time at any instant. Rather, threads have their own local clocks, possibly ticking at a rate different from other threads' clocks. Even within a thread, there may be an arbitrary lag between its two consecutive instructions (e.g., if the execution is interrupted after the execution of the first). Thus, events occurring in concurrent threads at independent times impact the shared state and progress of a parallel program.
The order in which these events occur is non-deterministic.2 Consequently, the behavior of the program may be non-deterministic. The program must always produce the expected result even in the presence of such non-determinism. If this non-determinism can lead to incorrect results, we call this a race condition. A race condition happens when the relative order of events impacts correctness.3 Here is a simple example of a race condition:

2 Defined: Non-determinism implies not knowing in advance. For example, non-deterministic order of two events means the order in which they occur changes unpredictably from execution to execution.
3 Recall from Chapter 1 that events are not necessarily instantaneous and the order may not even be well defined. More generally, we say the relative timing of two events impacts correctness.

Listing 4.1: Race Condition
counter$ = counter$ + 1;
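The statement above is not atomic: it typically executes as a fetch, an increment, and a store. A sketch of one interleaving that loses an update, assuming counter$ initially holds 7 (the register names r1 and r2 are illustrative):

// Thread i                          // Thread j
r1 = counter$;     // reads 7
                                     r2 = counter$;     // also reads 7
r1 = r1 + 1;       // r1 is now 8
                                     r2 = r2 + 1;       // r2 is now 8
counter$ = r1;     // writes 8
                                     counter$ = r2;     // writes 8; one increment is lost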
code execution may solve the problem. For example, thread i may be prevented from starting this sequence during the interval any other thread j executes the three-step sequence. This eliminates the overlap and the race condition. The threads' relative order no longer impacts the correctness. In other words, in the middle of thread i incrementing counter$, no other thread accesses4 it. Thus, if thread j follows thread i, it necessarily sees the value stored by thread i, and

4 Access refers to a fetch or store of a value at an address.
Sequential Consistency

Recall that a CPU core may execute several instructions of a code fragment in parallel, but it ensures that they appear to execute in the sequence in which they are presented – the program order. For example, if instructions numbered i and i + 1 in some code fragment do not depend on each other (as inferred by the compiler/hardware logic), instruction i + 1 may be completed before i is. On the other hand, if there is a dependency, even if parts of their execution do overlap, the results of the instructions are the same as they would be if instruction i + 1 started only after instruction i completed. A way to reconcile parallel execution with strict ordering is that execution of instructions may well overlap with others, but each 'takes effect' instantaneously and these instants are in an expected order. For
the first instruction may take effect when the value 5 has appeared
some examples.
We start by defining the notion of sequential equivalence, or sequential consistency in the shared state. Consider the following listing, assuming the values in A$ are 0 initially, and two threads with threadID 0 and 1, respectively, execute:
operations by every thread. The global sequence is consistent with thread i only if:

1. If thread i executes operation o1 before o2, o1 appears before o2 also in the global sequence.
This defines that a read of address x in every thread returns the value
of the most recent write to x in that global sequence. In this sequence,
we do not worry about the time when an operation takes effect
but only their order. Operations of different threads are allowed
to fully interleave. However, every thread’s view of the order in
which its operations take effect is consistent with every other thread’s
view. The following are examples of sequentially consistent and inconsistent executions.

Figure 4.1: (a) Sequentially consistent execution; (b) Sequentially inconsistent execution
sequentially consistent, the platform is sequentially consistent.
Figure 4.1(b) and (c) are both inconsistent executions because
no consistent global order exists. This can be seen by following the
arrows in red. These arrows form a cycle, meaning there is no way to
order them in a sequence. For example, in Figure 4.1(b) 1.3 occurred
before 0.3; otherwise, it would have read the value 3 in x. Of course,
0.3 always must occur before 0.4, which occurred before 1.2 in this
execution, because 1.2 is the read-effect of 0.4. This could occur in an
execution if the update to variable y becomes quickly visible to other
threads, while updates to x travel slower.
This variance in the update speed can also occur due to caches in
case of shared-memory – even if the caches are coherent. Updates
indeed reach all threads, just not in the same order. Keeping all
updates in order implies slower updates hinder the faster ones.
Moreover, in-order updates do not guarantee sequential consistency,
as Figure 4.1(c) shows. Each thread updates different variables –
x and y, respectively. So, there is no inherent order between 0.2
and 1.2. Still, there exists a cycle as red arrows show, and hence no
consistent sequential order. In this case, both updates are just slow.
Thread 0                          Thread 1
while(! ready$);                  data$ = generate();
x = data$;                        ready$ = true;
sequence. We next discuss a few common relaxations to the requirement of complete sequential equivalence. The general idea is to allow the platform to guarantee somewhat relaxed constraints, thus supporting higher performance. The programmer is then responsible for enforcing any other ordering constraints if required, using synchronization techniques discussed later in this chapter.
Causal Consistency
The idea of causal consistency is to limit the consistent ordering
constraint only to what are called causally related operations. In
particular, there is no requirement of a consistent global sequence of
all operations to exist. Rather, each thread views the write operations
of other threads in a causally consistent order, meaning two causally
related operations are viewed by every thread in the order of their
causality, which is defined as follows:
1. All writes of one thread are causally related after that thread’s
earlier reads and writes
not be consistently ordered. The example in Figure 4.1(b) remains FIFO inconsistent, but the example in Figure 4.2(a) exhibits FIFO consistency, even though it violates causal consistency, and hence also sequential consistency. In this figure, the red arrows demonstrate transitive causality, which forces a relationship between otherwise concurrent writes 0.2 and 1.3. Note that 1.3 must occur before 0.1, as 0.1 is its read-effect, and similarly, 1.2 must occur after 0.2. We may assume in these examples that the initial value of variables is, say, 0.
Figure 4.2(a) is a FIFO consistent execution because there is only a single write by thread 0, and it can be viewed anywhere before 1.2 in thread 2's view. The two writes to y in thread 1 must appear in that same order in thread 0's view. (1.1 → 1.3 → 0.1 → 0.2) is a FIFO consistent order.
The notion of processor consistency is a slight tightening of constraints. In addition to a consistent ordering of all writes by a given thread, all threads must also view all writes to the same variable in the same order. FIFO and processor consistency are both weaker than causal consistency and allow the execution in Figure
before 1.3.
Figure 4.1(c) is neither FIFO consistent nor processor consistent. Thread 1 must see all writes from thread 0 in order (0.1 → 0.2 → 0.3 → 0.4). However, it sees 0.4 before 1.2 but 0.3 after 1.3.
Note that a guarantee of FIFO consistency is sufficient to prove the correctness of Listing 4.4 in its every execution.

Figure 4.2: FIFO consistent executions; (b) Processor inconsistent execution
Weak consistency

Finally, there is a practical notion of consistency called weak consistency, under which minimal guarantees are made by the programming platform. The responsibility of maintaining consistency is instead left to the programmer. This follows the principle 'programmer knows best' and allows the system to make aggressive optimizations. A programmer in need of enforcing order between two operations then must employ special primitives; an example is flush. Another possibility is to enforce sequential or some other form of consistency only on specially designated variables or resources, called synchro-
Thread 0                          Thread 1
while(! ready$);                  data$ = generate();
memory_fence();                   memory_fence();
x = data$;                        ready$ = true;
Fences slow down memory operations, and the code above may be overkill, but it guarantees correctness even if caches are not coherent. A fence ensures that caches are flushed and an updated value of ready$ is indeed fetched by thread 0. Further, the memory fence of thread 1 ensures that the data read by thread 0 is indeed the updated data written by thread 1.
Linearizability

Linearizability is stronger than sequential consistency. It guarantees not only that all operations have a global order consistent with all threads' execution, but also that each operation completes within a known time interval. In particular, the operation is supposed to take global effect at some specific instant between the invocation and completion of each operation by its thread. It thus requires the notion of a real-time central clock and requires arguments about an operation having completed before a certain real time t. As one consequence, if an execution is linearizable with respect to each variable, the overall execution also becomes linearizable. (We will not prove this statement here.) Sequentially consistent sub-executions cannot be composed in this manner to produce a longer sequentially consistent execution.
There is a related notion of serializability, mostly used in the context of databases. It is an ordering constraint on transactions.
4.3 Synchronization

about the order of events. A read-effect is so, whether the write completed immediately before the read or somewhat earlier. Hence, we focus on enforcing consistency using synchronization to impose order between selected events of two or more threads. We use two types of synchronization: exclusion and inclusion. Exclusion synchronization precludes concurrent execution of two or more threads – rather, of two or more specific events within those threads. (An event is a sequence of execution steps.) Inclusion synchronization, on the other hand, ensures co-occurrence of events. We will now examine a few important synchronization concepts and tools. Later, we will see some examples.
Synchronization Condition
With some support from the hardware, operating systems provide
several basic synchronization primitives. However, their context is
only the system controlled by the operating system. If multiple op-
erating systems are involved, additional primitives must be built,
possibly using these basic primitives. In any such primitive, once the
execution of a thread encounters a synchronization event, it requires
certain conditions to be satisfied before it may proceed further. Other
fragment executions may impact those conditions. There are two
types of conditions: shared conditions and exclusive conditions. Mul-
tiple threads waiting for a shared condition may all see it when the
condition becomes satisfied, and therefore all continue their execu-
tion. Only one of the waiting threads may continue if the condition
is exclusive. There are also hybrids, which allow a fixed number of
waiting threads to continue. Usually, the choice of continuing threads
RA
Protocol Control
There is usually a coordination protocol involving multiple syn-
chronization events a thread must follow before it can complete the
synchronized activity. There are two classes of protocols: centralized
protocol and distributed protocol. In centralized protocols, there is a
coordinating entity, e.g., another thread, an operating system, or
some piece of hardware. This centralized controller flags a thread
ahead or stops it, not unlike what a traffic signal does. In parallel
computation involving a large number of synchronizing threads,
such a centralized controller is often a bottleneck. Failure
of the coordinator also can be disastrous for the entire program. In
distributed protocols, there is no centralized controller. Rather the
threads themselves follow a set of steps synchronizing each other.
This may involve the use of multiple passive shared resources, e.g.,
memory locations.
As a simple example, concurrent operations on a queue (e.g.,
insertion or removal) by multiple threads may be activities. Checking
if a queue is full is a synchronization event. Checking if there are
ongoing removals could be another event. A protocol is the set of
events designed to ensure that multiple threads may safely add and
remove elements without being misled by any transient variables (set
by another thread).
Progress
A synchronization event is nothing but a sequence of instructions executed by a thread, often via a function call. There are two parts to this call: checking if the condition is satisfied, and then waiting or (eventually) proceeding past the event, depending on the result. Atomicity is required for exclusive conditions because two different threads must not both observe the condition as satisfied and both proceed. This requires some coordination among competing threads, and even
the test for the condition may itself be impacted by the state of a
different thread. Still, it is possible to implement synchronization in a
way that allows the test to safely complete independent of action by
other threads. Of course, the synchronization protocol still applies,
and actions to be taken when the condition is satisfied may still be
taken only if the condition is satisfied. Such methods are called non-
blocking. In particular, a non-blocking function completes in finite
time even in the presence of indefinite delays, or failure, in other
threads’ execution.
This notion of non-blocking functions applies to contexts other
than synchronization as well, e.g., data communication or file IO.
Although similar, this notion is slightly different from that of non-
blocking network topology discussed in chapter 1. There, messages
between one pair of nodes could progress without being blocked by
messages between a different pair, i.e., one message was not blocked
by another as long as the communicators were separate.
A blocking function merely does not return until the synchro-
nization condition is satisfied. The execution proceeds to the next
instruction after this return, just as it would after every other func-
wait-free synchronization later in this chapter.
Separate from whether a function is non-blocking is the issue of
how the condition-checking and progress are managed. In busy-wait
based implementation, the fragment repetitively checks until the
condition turns favorable. A busy-wait loop can result in blocking
if the condition checking depends on action by other threads. The
other alternative is signal-wait. The calling thread is suspended until
the conditions become favorable again, after which an external entity
like the operating system wakes the thread and makes it eligible
for execution. The signal-wait mechanism is blocking by definition,
as the thread can make no progress in the absence of action by the
external entity.
When there are many more threads than the number of cores
available to execute them, the busy-wait strategy can waste com-
puting cycles in repetitively testing and failing – particularly if the
synchronization event involves a large number of threads or long syn-
chronized activities. On the other hand, the latency of such tests is
usually much lower than that of signal-wait. In any case, synchroniza-
tion overhead is not trivial. This overhead includes the time spent
Synchronization Hazards
some other condition. Effectively, they all indefinitely wait for each other. There is a famous abstraction called the dining philosophers problem demonstrating deadlocks. A modified version goes like this. Consider five philosophers sitting around a table with five forks alternately laid between them. Philosophers meditate and eat alternately, but they may eat only with two forks. After they eat, they clean both forks and put them back in their original setting. Each philosopher eats and meditates for arbitrarily long periods. Their eat-meditate lifecycle goes on indefinitely. No more than two philosophers may eat at the same time (maybe because the food cannot be supplied quickly enough).
Consider the following protocol. Philosophers pick any available fork on their left and then their right when hungry. If both are picked, they eat. Once full, they put the forks down one at a time and go back to meditating. If only one fork is available, they pick it up. If they do not have two, they meditate some more before checking again. They do so repeatedly until they get both forks. They then eat before replacing the forks.
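A pseudocode sketch of this protocol, one instance per philosopher; the helper names (available, pick, holding, and so on) are hypothetical, and each individual fork test is assumed to be atomic:

// Philosopher k; left$ and right$ denote the forks beside philosopher k.
while (true) {
    meditate();
    // Hungry: acquire forks, keeping any fork already picked up.
    while (!(holding(left$) && holding(right$))) {
        if (!holding(left$)  && available(left$))  pick(left$);
        if (!holding(right$) && available(right$)) pick(right$);
        if (!(holding(left$) && holding(right$)))
            meditate();                    // check again later; this step can deadlock
    }
    eat();                                 // only ever with both forks
    clean(left$);  put_down(left$);        // forks are put down one at a time
    clean(right$); put_down(right$);
}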
This protocol ensures that no philosopher eats with only a single fork. It also ensures that only up to two philosophers can be eating at any given time. Note that a philosopher who is using two forks ensures that neither neighbor may have two forks. A synchronization protocol that guarantees the required behavior at all times, as this eating protocol does, is called safe. What happens in this protocol, however, if all philosophers pick up the forks to their left almost simultaneously, and then wait for the fork to their right to become free? Since no one got two, no one eats, and no fork is set down, no matter how many times they check to their right. This is deadlock.
lack of starvation are called fair. Notice that in the previous example, a dining philosopher could starve in the listed protocol, for there is no guarantee that they check during the period their neighbor has left the fork down on the table. The neighbor is allowed to pick the fork back up.
Considering the traffic light example, the basic purpose of syn-
Figure 4.3: Traffic Light Deadlock
chronization is that no two vehicles may be in each other’s path
(following behind is allowed, reversing is not). A protocol like ‘go
only on green’ is safe as long as the signals are properly coordinated.
There may be a deadlock, however, if a slow vehicle that enters the
intersection on green is not able to get through before the light turns
green for the cross-traffic. See Figure 4.3 for a deadlocked configura-
tion. One may modify the protocol to prevent these deadlocks. For
example, vehicles could go on green only if there is no cross-vehicle
in the intersection. Now, there would be no deadlock, but starvation
is possible. Too many slow vehicles could ensure that a waiting ve-
hicle continually sees a vehicle in the intersection while its light is
green (and it turns red again before any progress is made). Further-
more, the vehicle throughput may reduce. This situation arises in
many synchronization protocols, and simple solutions to prevent
deadlocks often risk starvation.
Lock
A simple synchronization tool is lock. Each lock has a name known to
all participating threads. The simplest locks are exclusive. A thread is
allowed two main operations on a lock: acquire (also called lock) and
release (also called unlock).
thread waits until the current holder releases x. Acquire operation
blocks until the acquisition is successful (although non-blocking
variants exist also). If two concurrent threads attempt to acquire an
available lock, only one succeeds (if the lock is exclusive).
Release(counter_lock$)
Peterson’s Algorithm
Peterson’s algorithm guarantees mutual exclusion between two
threads. Both execute code 4.7, which may be executed any number
of times by each thread. This method employs shared variables
ready$ (an array of size 2) and defer$ to achieve exclusion. Assume
the two threads are identified by IDs 0 and 1, respectively. The value
of the ID is always found in an automatic private variable threadID.
Private variables, even if they have the same name in each thread,
are local to each thread and thus not shared. Initially, ready$[0] =
ready$[1] = false.
Listing 4.7: Peterson’s algorithm for two thread mutual exclusion
1 other_id = 1 - threadID;
2 ready$[threadID] = true; // This thread wants in
3 defer$ = threadID; // This thread defers
4 while(ready$[other_id] && defer$ == threadID); // Busy-wait
5 // Critical Section goes here
Safety means that if a thread exits its loop and enters the critical
section, the other thread is guaranteed to not enter until the first is
out of the critical section.
Figure 4.4 demonstrates the possible order of operations on the
shared memory. In the top row are operations by thread 0 and the
bottom row has those by thread 1. By our assumption that each
occurs before i.j′ for all j′ > j. Solid arrows from i.j to i.j′ help visualize this. (Recall that i.j is the jth operation of thread i.) Note that we do not, in general, have any pre-determined order between i.j and k.l if i ≠ k. Accordingly, in our example, an operation like 1.2 could occur between any 0.j and 0.(j + 1). These possibilities are shown in dashed arrows and marked a–f. Regardless, 1.1 must occur before 1.2, which must occur before 1.3, and so on. No two operations on the same shared memory may occur simultaneously; otherwise, the value read or written would be undefined.
In a given execution, if possibilities a or b materialize, thread 0
must find defer$ == 0 at 0.4. Similarly, if possibilities c or d occur,
the value would be 1. Finally, if e or f materialize, thread 0 would
still find value 0. Let’s analyze each case.
hence before 1.3, thread 1 cannot get past its loop as it finds thread
0 ready and defer$ == 1.
Case e The behavior depends on 1.1. If 1.1 occurs before 0.3 (call this
case e1), neither thread 0 nor thread 1 may exit their respective
loops. Both find the other is ready and both find their own IDs
in defer$. However, this cannot last. As both execute the next
iteration of their busy-wait loops, thread 0 now finds 1 in defer$
and enters the critical section. Thread 1 does not.
If, on the other hand, 1.1 occurs after 0.3 (case e2), thread 0
enters the critical section finding that thread 1 is not ready. This
time thread 1 would not enter the critical section as it would find
defer$ == 1 at 1.4 and ready$[0] == 1 at 1.3 as long as thread 0
remains in the critical section.
Thus 1.1 may only occur before 0.3. This is similar to case e2.
is false when tested, if later thread 1 becomes ready and then sets
defer$ to 1, it is guaranteed to find defer$ to be 1 in its loop condi-
tion. Thus, thread 1 cannot enter the critical section until thread 0
stops being ready, post its execution of the critical section.
The same argument holds for thread 1. Peterson’s algorithm is
also deadlock-free and starvation-free.
A deadlock could occur only if both threads are indefinitely stuck
in their busy-wait loops. This would imply that thread 0 continually
finds defer$ == 0 and thread 1 continually finds defer$ == 1. Both
cannot be true because neither thread has a chance to change defer$
while busy-waiting.
No thread can starve either. Suppose without loss of generality that thread 0 does. This would imply that thread 1 is able to repeatedly complete its critical section, return for the next round, and overtake the busy-waiting thread 0. This means that each time thread 0 checks, defer$ == 0 and ready$[1] is true. But only thread 0 may ever set defer$ to 0, never thread 1. Rather, if thread 1 is able to repeatedly execute the protocol, it is obligated to set defer$ to 1 each time. How then does defer$ become 0 without
Bakery algorithm

Bakery algorithm is based on a ticket system. When a thread becomes ready, it takes a 'number,' and awaits its turn. The pseudo-code is as follows:
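A sketch of such a ticket protocol, numbered so that line 2 takes the number and line 3 busy-waits, matching the references in the discussion below; this is a reconstruction consistent with that discussion, not necessarily the exact code 4.8 (the max and the existence test are written informally):

1  ready$[threadID] = true;                        // declare intent before taking a number
2  number$[threadID] = 1 + max(number$[0..p-1]);   // take a ticket: one more than any seen
3  while(exists j != threadID such that ready$[j] &&
         (number$[j] < number$[threadID] ||
          (number$[j] == number$[threadID] && j < threadID)));  // busy-wait for our turn
4  // Critical section goes here
5  ready$[threadID] = false;                       // done; let threads with larger numbers in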
takes a number – one more than the maximum number taken by any thread. It then busy-waits until no ready thread has a smaller number. Note that finding the new number itself is not protected by a critical section or synchronization. This means two (or more) threads may obtain the same number on line 2 of code 4.8, say n. When this occurs, ID is used to break the tie: the thread with the lower ID exits the loop first. If that winning thread later wants to enter the critical section again, the next time its number is guaranteed to be greater than n after line 2. Thus number$[i] strictly increases every time thread i completes line 2.
To prove safety, we show that two threads may not be in the critical section at the same time. Suppose these are threads i and j with, say, i < j. When j exited its busy-wait loop on line 3, either number$[j] < number$[i] or ready$[i] == false.
If ready$[i] was false, thread i set ready$[i] to true later than thread j's condition test. This means thread i also computed its number after j's test and hence number$[i] > number$[j]. Hence, it could not have exited its loop, as ready$[j] remains true until after j completes the critical section.
If, instead, ready$[i] was true, number$[j] < number$[i] at j.3 (i.e., at line 3 for thread j), meaning i.2 occurred after j.2. Since j is in the critical section, ready$[j] would be true at i.3, and thread i could not exit its busy-wait loop.
Bakery algorithm is also deadlock-free and starvation-free. Deadlocks are avoided because there exists a total order on updates to number$, and some thread with the smallest number is always able to get past the busy-wait loop. At the same time, an increasing number ensures no thread is able to overtake one that got a number earlier. Such number-based design is common in many wait-free protocols.
The drawback of Bakery algorithm is the need for large shared arrays (ready$ and number$). It turns out that there exists no algorithm that can guarantee mutual exclusion with a smaller size using only shared-memory read and write operations.
result of the comparison. Here is an example function for an integer shared variable with address ref$:

boolean compareAndSwap(void *ref$, int expected, int newvalue) {
    // Do Atomically:
    int oldvalue;
    fetch *ref$, store the value in oldvalue;
    if(oldvalue == expected) {
        store newvalue into *ref$;
        return true;
    }
    return false;
}
The updates to *ref$ are seen in a consistent order by all threads using compareAndSwap. Many hardware-supported implementations of this function exist and are lock-free. Each call returns true if the old value was as expected, after writing the new value to the shared location. If the value was not as expected, the function returns false. Other variants exist, for example, ones that return the old value instead. (The caller may compare the old value to the expected value to decipher what happened inside the function.) This peculiar primitive can help implement a rich set of synchronization functions, including n-thread mutual exclusion as shown below (assume turn$ is initially -1):

turn$ = -1;
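For illustration, the statement turn$ = -1 can serve as the release step of one such compareAndSwap-based n-thread exclusion; a minimal sketch, assuming turn$ holds the ID of the thread in the critical section and -1 when it is free:

// Acquire: spin until we atomically change turn$ from -1 (free) to our ID.
while(!compareAndSwap(&turn$, -1, threadID));

// Critical section goes here.

// Release: mark the lock free again.
turn$ = -1;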
Transactional Memory
Transactional memory is an emerging paradigm that seeks to address
the challenges of Compare and Swap. The main idea is to define a set
of operations on the shared state as a transaction and ensure serial-
ization of these transactions. One way to ensure such serialization is,
of course, mutual exclusion. Another is to optimistically perform un-
synchronized operations on shared resources. These are performed in
a tentative sense, but the risk of races is detected by identifying other
instances of transactions. In case no race is detected, the transaction
is committed and considered complete. In case a race is detected, the
entire transaction is discarded and effectively rolled back. After the
discard, an alternate course is followed: simply retry the transaction
or apply mutual exclusion this time. Note that multiple conflicting
transactions would all be discarded and retried.
Transactions are a higher-order primitive than locks and compare
and swap. Hence, they may be easier for programmers: it could be as
simple as encapsulating a sequence of operations into a transaction
that appears to execute atomically. Further, transactions can also be
nested – a transaction can consist of sub-transactions, and only the
sub-transaction is discarded if it encounters a conflict. Transactions
can also be composed. However, they are not do-all. For example,
must reach the barrier event. Every member blocks at the event until it has surety that all the other members have reached their respective barrier events. This is quite clearly neither a non-blocking nor a lock-free operation. A stand-alone barrier for an n-member barrier group may be implemented as follows. Initially, numt$ is 0.

int num = numt$;                              // Read the then-current count
while(! compareAndSwap(&numt$, num, num+1))
    num = numt$;                              // Re-read count and retry incrementing
while(numt$ < n);                             // Busy-wait
Each thread reads the then-current value of numt$. If no other
thread has modified it in the interim, the thread writes the incre-
mented value into numt$. Otherwise, it re-reads the new value of
numt$ and retries incrementing it. It needs to retry no more than n
times before it must succeed because a successful thread does not
retry. Once the thread succeeds in registering its presence, it moves
to check if all threads have registered. It busy-waits until then. This
barrier may be used only once. It is possible to modify it so multiple
barriers can re-use the same variables. numt$ would need to be reset
to 0. But also note that threads may exit their busy-wait loops as soon
as numt$ equals n, but some could be delayed. Either a thread’s next
entry into the barrier must be prevented until the last one is out, or
the entries would need to be otherwise separated. Implementation is
left as an exercise (see Exercise 4.14).
Although other fancier versions exist, e.g., one that returns the sum of integer values supplied by the members, this simple version is instructive. It is related to consensus: having all threads reach the same value. Consensus is often used to argue about the power of synchronization primitives12. In the basic consensus problem, the returned value of vote must be the same for all members of a group,

12 Block-chains are based on the consensus problem. S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008. URL https://bitcoin.org/bitcoin.pdf

an arbitrary number of threads, meaning its consensus number is ∞. A sample implementation of vote follows. Assume one_value$ is initially -1.
Listing 4.13: Consensus

bool consensus (bool value) {
    compareAndSwap(&one_value$, -1, value);
    return one_value$;
}
only read/write) as well as the message-passing distributed-memory model. In particular, consensus is not guaranteed if threads (and messages) can be arbitrarily slow. For controlled environments, which a parallel computer system may be, the knowledge of the bound on delays is employed to achieve consensus in a practical manner, even in the presence of failure. In this book, we will not focus on fault-tolerant algorithms, which continue to provide synchronization and safety in the presence of failure.

14 Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, April 1985. ISSN 0004-5411
15 Herlihy, 1993
4.5 Communication
Note that scalable and consistent distributed shared memory im-
plementation is rather complex, and good performance can be hard
to achieve. For many applications, direct use of message-passing
primitives is easier to design, synchronize, and reason about. Note
that there is natural coordination required for two or more threads
to communicate among themselves. Inter-thread interactions are
more direct and more explicit compared to the shared-memory
model, where a passive memory location has no ability to detect
anomalous interactions. Hence, it is important to understand the
nature of message-passing based programming. Broadly speaking,
in the message-passing model, shared states and critical sections are
eschewed in favor of the synchronization implicit in communica-
tion, which has certain features of the barrier. We will see detailed
examples in Chapter 6.
As a practical matter, it is useful to realize that synchronization
across message-passing threads is likely to be slower as network
delays are generally much higher than local memory latency. We will
briefly review the communication system next. With this understand-
ing, we can devise more efficient inter-thread interaction. Common
Point-to-Point Communication
There are two essential components of communication. A thread
Sender and recipient each must have a local buffer where that
data must be stored. The sender sends from its *buffer, and the
recipient receives into its *buffer. This means that the recipient
must have sufficient space in its buffer to hold the entire data that
the sender sends. Possibly, this size is shared in advance. Or, the
communication could use fixed-size buffers, but that bounds the size
of each message. A sender would have to subdivide larger messages,
and that unnecessarily complicates program logic. Moreover, some
setup is required to send each message (for example, route setup or
buffer reservation on intermediate switches). Subdividing messages
may incur the overhead of repeated setup. The other big concern is
synchronization between the sender and the recipient. How does
Receive behave if the sender has not reached its corresponding Send
and vice-versa?
To answer such questions, let us delve deeper into how commu-
nications happen on the sender and recipient nodes. There is at
least one network interface card (NIC) in a computing system. A
special NIC processor is responsible for the actual sending of data
onto an attached link or receiving data on the link. Links are pas-
sive; hence, there must be two active execution units on both ends
of a link. These execution units are built into the NIC and usually
have their own buffers for temporary storage of data. This means
that an application program does not need to concern itself with the
transmission details, nor be forced to synchronize simultaneously
executing fragments on both ends of the link. For security and gen-
erality, the access to NIC operations is through the operating system,
and usually through several layers of software, which may have their
own limits on message or packet size. This, in turn, implies that the
user buffer may be subdivided into multiple packets and copied sev-
eral times (from user buffer to operating system buffer to NIC buffer).
Some of this copying is done by the operating system code on behalf
of the application, and some is managed by the DMA engine (see
Section 1.2) associated with the NIC. See Figure 4.5.
or both buffers with operating system owned buffers, but that leads
to its own complications.
One problem with DMA-based operations is their interference
with the virtual memory paging system. Page management is the
operating system’s domain and requires CPU instructions, but the
DMA engine’s job is to off-load the copying from the CPU. Hence
the operating system needs to lock or pin to real memory the pages
that are in use by DMA. Thus memory registration is a heavy-weight
operation, not to be repeated incessantly. Re-using registered mem-
ory for multiple data transfers is important. However, the size of the
actual message is known only at the Send event, and pre-registered
buffers could be too small to accommodate a transfer of the required
size. Algorithms exist for dynamic re-registration and pipelined re-
use of small parcels of memory16, but we will not discuss those in this textbook.

16 Tim Woodall, Galen Shipman, George Bosilca, Richard Graham, and Arthur Maccabe. High performance RDMA protocols in HPC. pages 76–85, 09 2006

RDMA or not, the separation of concerns between the application program and the network subsystem allows the program to 'fire and forget,' assuming that the entire message will be delivered 'as is' without any loss, corruption, or the need for further intervention or
6
called asynchronous.
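For concreteness, the sketch below shows one way such asynchronous operations look in MPI (covered in Chapter 6): MPI_Isend and MPI_Irecv return immediately, and a later MPI_Wait blocks until the buffer may be reused or the data has arrived. The buffer size and message tag are illustrative.

    /* A minimal sketch of asynchronous point-to-point communication
       using MPI's non-blocking primitives; compile with an MPI wrapper
       such as mpicc and run with two processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buffer[1024];              /* each side supplies its own buffer */
        MPI_Request req;

        if (rank == 0) {
            for (int i = 0; i < 1024; i++) buffer[i] = i;
            MPI_Isend(buffer, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... overlap useful computation here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer reusable only now */
        } else if (rank == 1) {
            MPI_Irecv(buffer, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            /* ... overlap useful computation here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* data guaranteed only now */
        }

        MPI_Finalize();
        return 0;
    }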
RPC
RPC, or remote procedure call, is a type of point to point communi-
cation but not described explicitly as a Send-Receive pair. Rather, a
thread makes a function call that looks similar to a local function call,
except the call is executed on a remote system. This means that the
arguments of the function are packed into a message and sent to the
designated recipient. The ID of the recipient may be a part of the call
or pre-registered with the function’s name. On receiving the mes-
sage, the recipient unpacks the arguments and calls a local function,
which in turn may make another RPC. Once the function execution
is complete, the function provider packs the value returned by the
function into another message and sends it back to the initiator of
the RPC. Synchronous RPC requires that the initiator only proceeds
beyond the call after receiving the results back. Asynchronous RPC,
not unlike asynchronous Send and Receive, allows the initiator to
continue execution beyond the RPC call without receiving the results.
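The sketch below illustrates the marshalling an RPC layer performs, with the network transfer elided: the caller packs the arguments into a message, the provider unpacks them, calls the local function, and packs the returned value for the reply. The function add and the byte layout are purely illustrative.

    #include <stdio.h>
    #include <string.h>

    static int add(int a, int b) { return a + b; }       /* the "remote" function */

    /* Caller side: pack arguments into a message buffer. */
    static size_t pack_args(char *msg, int a, int b)
    {
        memcpy(msg, &a, sizeof a);
        memcpy(msg + sizeof a, &b, sizeof b);
        return sizeof a + sizeof b;
    }

    /* Provider side: unpack arguments, call the local function, pack the result. */
    static size_t serve(const char *msg, char *reply)
    {
        int a, b;
        memcpy(&a, msg, sizeof a);
        memcpy(&b, msg + sizeof a, sizeof b);
        int result = add(a, b);
        memcpy(reply, &result, sizeof result);
        return sizeof result;
    }

    int main(void)
    {
        char msg[64], reply[64];
        pack_args(msg, 3, 4);    /* in a real RPC, msg would now be sent */
        serve(msg, reply);       /* executed by the provider on receipt  */
        int result;
        memcpy(&result, reply, sizeof result);
        printf("RPC result: %d\n", result);   /* prints 7 */
        return 0;
    }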
Collective Communication
Sometimes a more complex pattern of communication can exist
among a set of cooperating threads. Describing complex communica-
tion among a set in terms of several individual point-to-point pairs is
wasteful. Such higher-level primitives can again be built on top of the
Send-Receive primitive. These group level, or collective communication
primitives, are similar to the barrier: all threads in a group encounter
this event. However, unlike the barrier, they need not do so simul-
taneously. Rather, communication may be asynchronous and strict
synchronization mandated by the barrier is not necessary. Some com-
mon collective communication primitives are listed below. Chapter 6
describes specific implementations and contains some more detail.
Broadcast:
Message from one sender is received by many recipients
Scatter:
n messages from one sender are distributed to n recipients, one
each
Gather:
One message each from n senders is received by a single recipient
AlltoAll:
Each member of a group consisting of n + 1 threads scatters n
messages (one each to the other members) and consequently
gathers n messages (one from each of the other members).
Reduce:
Messages from n senders are combined, element by element, using an
operator (e.g., sum or max), and the combined result is delivered to a
single recipient
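As an illustration, the following sketch uses two MPI collectives (MPI is discussed in Chapter 6): a broadcast of a configuration value and a sum reduction to rank 0. The variable names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int config = (rank == 0) ? 42 : 0;
        MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* Broadcast */

        int mine = rank, total = 0;                           /* Reduce (+) */
        MPI_Reduce(&mine, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("config=%d, sum of ranks=%d\n", config, total);

        MPI_Finalize();
        return 0;
    }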
4.6 Summary
Memory operations take finite time, the length of which can vary significantly.
Moreover, multiple memory operations of each thread in a parallel
environment may overlap in execution. This can cause unexpected
program behavior because operations started earlier could end
later. Therefore, when evaluating program logic, it is important to
understand the guarantees provided by the programming platform.
In particular, one must not assume that if one thread observes the
effect of memory operation o1 before o2 , all other threads would
observe the same order. If the platform does not guarantee so, the
program must include explicit synchronization to ensure consistency
where needed. Missing synchronization often leads to errors that
may be hard to reproduce. Memory consistency errors are among
the most obscure. Correct execution in a large number of test cases
should not be taken as a proof of correctness. To reiterate the main
points:
• Addresses hold values and may be accessed by multiple code
fragments. The instructions of two or more code fragments that
share addresses may execute in parallel or interleave. Their ac-
cesses are hence concurrent, and their order of execution is non-
deterministic.
Inconsistency ensues even if, e.g., one thread views o2 before o3 and
another views o3 before o1 . Ordering respects transitivity.
Indeed, when inspecting shared-memory code, we often uncon-
sciously assume certain consistency. It’s important to know when
we may be over-assuming. A guarantee of consistency by the plat-
form has performance implications. Hence, in practice, popular
programming platforms only guarantee consistency on demand
– on certain variables at certain times. This allows the program to
increase performance when strict global ordering can be dispensed
with. An understanding of memory fences helps this endeavor.
ease of programming.
synchronization involving two or more threads. In contrast, some
synchronization is built into message passing – all participating
threads must take explicit action for each communication. These
actions may be synchronous (akin to a barrier) or asynchronous.
Nonetheless, there is a one-to-one matching of actions, meaning
that each action of a thread can be associated with a corresponding
action of partner threads. Communication through shared memory
is often fine-grained, whereas message passing is usually coarse-
grained. This is because message passing requires significant setup
and often requires successive copies to a pipeline of buffers.
whereas Mosberger20 analyzes the trade-offs of weaker consistency
models. Among the most successful high level message-passing in-
terfaces is MPI21,22, which we will discuss in some detail in Chapter
6. Common communication interface23 offers a deeper look at the
breadth of message-passing issues.
Programming Languages and Systems, 13:124–149, 1993
19 S.V. Adve and K. Gharachorloo. Shared memory consistency models: a tutorial. Computer, 29(12):66–76, 1996. doi: 10.1109/2.546611
20 David Mosberger. Memory consistency models. SIGOPS Oper. Syst. Rev., 27(1):18–26, January 1993. ISSN 0163-5980. doi: 10.1145/160551.160553. URL https://doi.org/10.1145/160551.160553
21 William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, 1996. ISSN 0167-8191. doi: https://doi.org/10.1016/0167-
Exercise
4.3. In most ways, a file shared by multiple threads acts like shared
memory. Consider a file system in which, instead of general write
operations, a thread may only append to the ‘end’ of a shared
file. Note that threads may share multiple files. The platform
guarantees that the data of two concurrent appends to one file
are serialized, meaning their data are not interleaved. Reading
threads may read from any address. What additional support from
the platform is necessary to ensure that the files are sequentially
consistent?
1   philosopher(place, numplaces):
2       left = (place-1)%numplaces
3       right = (place+1)%numplaces
4       Repeat:
5           lock(lock$[left])
6           lock(lock$[right])
7           Eat()
8           unlock(lock$[left])
9           unlock(lock$[right])
10          Ponder()
if (Ref$ == null)
    lock(lock1$)
    tmp = allocateMemory()
    initialize(tmp)
    Ref$ = tmp
    unlock(lock1$)
use(Ref$)

if (Ref$ == null)
    lock(lock1$)
    if (Ref$ == null)
        tmp = allocateMemory()
        initialize(tmp)
        Ref$ = tmp
    unlock(lock1$)
use(Ref$)
4.9. Consider the following code with shared variables A$ and B$.
1    x = 2*A$;
2    B$ = A$ + B$
Suppose the compiler optimizes away the second read of A$ on
line 2, and reuses instead the value read earlier at line 1 (that it
had saved in a register). Could that ever violate FIFO consistency
if the memory subsystem guarantees FIFO consistency?
A$[threadID] = 1
print A$[1-threadID]
if the value is an address, even if the address itself changed back
to v, the contents at address v could have changed. Propose a
modification to the compare-and-swap protocol that allows a
thread to verify that no change has been made to v.
4.14. Implement the function barrier described in Section 4.4, which
can be called by all members of a thread group any number of
times.
4.16. Memory fences are also known as memory barriers. How are
memory barriers different from (computation) barriers?
4.19. What is the difference between the terms lock-free and non-
blocking? Could a barrier event be lock-free? Could it be non-
blocking?
i Withdraw(accountNumber, amount)
ii Deposit(accountNumber, amount)
iii Transfer(accountNumberFrom, accountNumberTo, amount)
You may use Compare and Swap. Assume that the provided
account number is valid, and the same account number may be
used at multiple ATMs at one time. Neither the bank nor any
account holder should lose money.
5 Parallel Program Design
Question: How to devise the parallel solution to a given problem?
Question: What is the detailed structure of parallel programs?
Parallel programming is challenging. There are many parts interact-
ing in a complex manner: algorithm-imposed dependency, schedul-
ing on multiple execution units, synchronization, data communi-
cation capacity, network topology, memory bandwidth limit, cache
performance in the presence of multiple independent threads access-
ing memory, program scalability, heterogeneity of hardware. The
list goes on. It is useful to understand each of these aspects sepa-
rately. We discuss general parallel design principles in this chapter.
These ideas largely apply to both shared-memory style and message-
passing style programming, as well as task-centric programs.
At first cut, there are two approaches to start designing parallel
applications.
5.1 Design Steps
1. Decomposition: subdivide the solution into components.
Granularity
How large the components, or tasks, are relative to the size of the
overall problem. Fine-grained decomposition creates more tasks,
and hence more concurrency, which usually allows solutions to
scale well. Fine-grained decomposition also allows fine-grained
scheduling, which often leads to more flexibility, but scheduling itself
may become costly at too fine a granularity. Also, the finer-grained
the tasks are, the more inter-task communication or synchronization
may be required. There is a balance to achieve. Naturally, the amount
of memory available on each device is an important consideration in
task sizing. In some situations the entire data of even one task need
not fit in the main memory. Instead, they can be processed in batches.
In many other situations, the inability to fit the entire address-space
used by the task can lead to significant thrashing.1
1 Defined: When data in the address space of a process does not fit in the main memory, parts of it can be evicted and stored in a slower storage by the Virtual Memory manager. Constant swapping of data between the main memory and the slower storage is called Thrashing.
Consider matrix multiplication: C = A × B. Suppose A and B are
each n × n. There are n² tasks if Task_ij computes the element C[i, j]
with i and j in the range [0, n). Task_ij requires row i of matrix A and
column j of matrix B. n² tasks fetch 2n items each. Alternatively,
there are n tasks if Task_i computes row i of matrix C. In that case,
Task i requires row i of A and the entire matrix B. In this decomposi-
tion, n tasks fetch (n + n2 ) items each, thus requiring fewer fetches.
However, it has fewer tasks, and hence a lower degree of parallelism.
(We will discuss the characteristics of a task graph in more detail in
Section 5.2.) Yet another decomposition could have n³ tasks. Task_ijk
computes A[i, k] × B[k, j]. n³ tasks fetch 2 items each. However, they
do not compute C. Instead, Task′_ij adds the results of Task_ijk for all
k ∈ [0, n). This method uses the most tasks but also fetches the most
amount of data. Moreover, Task′_ij must wait for all such Task_ijk to
complete. This increases the length of the critical path.
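A minimal sketch of the n-task decomposition discussed above follows; Task i is realized as one call to compute_row, which reads row i of A and all of B (both assumed stored row-major in one-dimensional arrays) to produce row i of C.

    /* Task i: compute row i of C from row i of A and all of B. */
    void compute_row(int i, int n, const double *A, const double *B, double *C)
    {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];   /* row i of A, column j of B */
            C[i*n + j] = sum;
        }
    }

    /* One task per row; each task fetches n + n*n values, as discussed above. */
    void multiply(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            compute_row(i, n, A, B, C);           /* each call is one task */
    }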
Communication
Communication is costly. Hence, minimizing communication is an
important design goal. Tasks that inter-communicate more should
preferably execute on the same node or those ‘close’ to each other
if they are across a network. Execution of a step that depends on a
remote piece of data must wait for the data to be possibly requested
and the requested data to arrive. Request-based communication thus
incurs round-trip latency. The data that
thread x needs to send to thread y may be dispersed in its address
space, and not in contiguous locations. A good data structure re-
duces the need for repeatedly packing such dispersed data into a
1.
buffer. Coarse-grained communication design ensures that the com-
putation to communication ratio is high, meaning relatively fewer
messages are exchanged between large periods of local computation.
Also, communication can be point-to-point or collective. Even though
a single collective primitive has a larger overhead than a single point-
to-point transfer, they accomplish more. Tasks that admit collective
communication derive that benefit.
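The sketch below shows the kind of packing step a poor data layout forces before each send: dispersed (here, regularly strided) values are first copied into a contiguous buffer. The function name and the strided layout are illustrative.

    #include <stddef.h>

    /* Gather 'count' doubles spaced 'stride' apart, starting at 'src',
       into the contiguous buffer 'packed'. */
    void pack_strided(const double *src, size_t stride, size_t count,
                      double *packed)
    {
        for (size_t i = 0; i < count; i++)
            packed[i] = src[i * stride];
        /* 'packed' can now be handed to a single Send covering all the data. */
    }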
Synchronization
Synchronization among tasks also has significant overhead. Compu-
tation pauses until the synchronization is complete. The overhead
is even higher if tasks execute on nodes far from each other. Syn-
chronization requirements can often be reduced by using somewhat
larger-grained tasks. At other times, breaking dependency by com-
puting partial results in each task and deferring synchronization to a
later step can help. Compare, for example, the two critical sections:

critical {
    A$ = A$ + f(Y)
}

critical {
    A$ = A$ + f(A$)
}
Load balance
In assigning tasks to cores, one major concern is to keep all cores
busy until there is no more computation required. (This is mainly in
the performance-centric context. There exist other considerations. For
example, power conserving algorithms have different design goals.)
Cores idle either because task allocation is unbalanced or cores wait
too often for memory or network. A proper load balancing scheme
accounts for balancing the compute load as well as the memory and
network load. In general, fine-grained tasks are easier to load-balance
than coarse-grained ones, just as it is easier to pack sand into a bag
than odd-shaped toys. On the other hand, we have seen above that
fine-grained tasks may increase the need for communication and
synchronization.
Load balancing can be built into the design when all tasks are
known in advance, and their approximate computation load can
be estimated before starting their execution. For example, if all
solution to the problem – requires no communication or synchroniza-
tion. Thus the two objectives of increased concurrency and reduced
synchronization are often in conflict, and a trade-off is required.
Sometimes tasks are implicit in the way a problem or a parallel
algorithm is described. This natural decomposition occasionally
leads to mostly independent tasks. These tasks may be decomposed
further if finer-grained tasks are required than those suggested by the
algorithm. Consider a problem that amounts to computing function
f(X),

    f(X) = h(g1(X1), g2(X2), g3(X3), ...),                    (5.1)
where gi and h are other functions, X is an input vector, and Xi is a
subset of X. We may assume that the output of each function is also a
vector. A natural decomposition for this problem is to create tasks Gi
computing gi from each Xi , followed by one task H that computes h.
A preliminary task for generating Xi from X would also be required.
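One possible realization of this decomposition, sketched below with OpenMP tasks (Section 6.1), spawns one task per g_i and computes h only after all of them complete. The functions g1, g2, g3, h and their signatures are assumptions for illustration.

    #include <omp.h>

    extern double g1(const double *X1), g2(const double *X2), g3(const double *X3);
    extern double h(double y1, double y2, double y3);

    double f(const double *X1, const double *X2, const double *X3)
    {
        double y1, y2, y3, result;
        #pragma omp parallel
        #pragma omp single            /* one thread spawns the tasks G1, G2, G3 */
        {
            #pragma omp task shared(y1)
            y1 = g1(X1);
            #pragma omp task shared(y2)
            y2 = g2(X2);
            #pragma omp task shared(y3)
            y3 = g3(X3);
            #pragma omp taskwait      /* task H may start only after all Gi finish */
            result = h(y1, y2, y3);
        }
        return result;
    }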
Partitioning of data is often referred to as data decomposition or domain
decomposition. On the other hand, subdividing f into H and Gi is
referred to as functional decomposition. They often go hand in hand,
Domain Decomposition
Domain decomposition partitions data and each partition relates to
a task. Partitioning may be irregular, in which case Xi is an arbitrary
subset of X. In this case, domain decomposition requires evaluating
a complex function and is a task in its own right. The more common
case, however, is a regular pattern. For example, X may be organized
as an n-dimensional matrix. Let’s consider a two-dimensional matrix
as an example. There are two basic regular decompositions: block
Figure 5.1: Block and Cyclic decomposition of domain. Each square lists the block in which the corresponding data location is included.
decomposition and cyclic decomposition. Figure 5.1(a) shows block
decomposition. X_ij is a contiguous block of indexes. Partition (i, j)
is marked in the figure for each index of a 12 × 12 matrix. Each
similarly.
Cyclic decomposition shown in Figure 5.1(b) distributes the data
round-robin to tasks. Element i of Block b corresponds to index
i × BlockSize + b in each dimension. Cyclic decomposition often
balances load among tasks better than block distribution. It is useful
when, say, the matrix is processed iteratively, and only a sub-block
processor. In that case, at step i, SIMD tasks together process con-
tiguous indexes, e.g., i × BlockSize .. (i + 1) × BlockSize − 1,2 thus
2 Range “a..b” includes a and b
improving memory access locality.
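The index-to-task maps for these regular decompositions are simple arithmetic; the sketch below shows them in one dimension for p tasks, assuming p divides n and b is the block size.

    int block_task_of(int index, int BlockSize)      /* block decomposition  */
    {
        return index / BlockSize;                    /* contiguous chunks    */
    }

    int cyclic_task_of(int index, int p)             /* cyclic decomposition */
    {
        return index % p;                            /* round-robin          */
    }

    int block_cyclic_task_of(int index, int b, int p) /* block-cyclic        */
    {
        return (index / b) % p;                      /* round-robin over blocks */
    }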
Figure 5.2: Block-cyclic Domain Decom-
position
A combination of block and cyclic decomposition, shown in Figure
5.2, is another commonly used decomposition. This has the benefits
of contiguous blocks, but the blocks are smaller than in plain block
decomposition.
Irregular data, such as a graph, does not partition as neatly; pointer-based represen-
tations, like an adjacency list, are more common. Further, the graph
may require some per-vertex processing, some per-edge processing,
or even region partitioning. For example, a graph may represent a
connected set of triangles representing a surface. One may generate
tasks that process a subset of triangles, while other tasks process
a subset of vertices. Alternatively, the triangular topology may be
thought of as a general graph, and an entire connected subgraph –
triangles, vertices, and adjacencies – may be processed by a single
task. Such graph partitioning – or graph cut – is a common strategy
for task decomposition. The underlying assumptions are that:
nodes, edges, or both).
of input primitives, e.g., triangles, represents a scene. The algorithm
projects the scene onto a set of pixels on the screen, producing a color
per pixel. Each pixel may be produced (largely) independently of
other pixels. Thus, a decomposition based on the output pixels works
well. Similarly, two matrices may be multiplied using tasks, each of
which produces a block of the product matrix.
Functional Decomposition
While domain decomposition focuses on partitioning based on the
data that a task processes, functional decomposition focuses on the
computation that a task does. Domain decomposition is often con-
ducive to data-parallelism, while functional decomposition is conducive to task
parallelism. As seen in the example above (Equation 5.1), dividing f
into h and gi , is functional partitioning. Such dependence may even
be recursive, leading to recursive decomposition. For example:
    f(X) = g(X),                                      if X is a “leaf”
         = h(f(X_0), f(X_1), f(X_2), ..., f(X_k)),    otherwise
Figure 5.4: Recursive task decomposi-
tion
only a small fraction of the total computation. Sometimes, however,
the sizes of the tasks grow going up the tree, while the number of
tasks reduces. This requires attention. A secondary decomposition
may become necessary in that case, replacing the top few levels of the
tree with a different decomposition.
Recursive decomposition applies more generally, even when the
solution is not itself expressed recursively. It builds a task hierarchy,
perhaps using the divide and conquer paradigm. A solution is de-
vised in terms of a small number of largely independent tasks. These
tasks need not be of equal size, but they should preferably communi-
cate and synchronize with each other rarely. Each task is then further
divided into component subtasks until all remaining subtasks are of
the desired granularity. The quick-sort algorithm is a classic example
of recursive decomposition. At each recursion, an unsorted list is
RA
divided into two independent sublists such that all elements in the
first sublist are smaller than those in the second sublist. One task
is generated for each sublist, which sorts that sublist, possibly by
generating more tasks recursively.
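A sketch of this recursive decomposition using OpenMP tasks (Chapter 6) follows; each recursive call on a sublist larger than a cutoff becomes a new task, while the current task keeps the other sublist. The cutoff of 1000 elements is arbitrary.

    #include <omp.h>

    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    static int partition(int *v, int lo, int hi)      /* last element as pivot */
    {
        int pivot = v[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (v[j] < pivot) swap(&v[i++], &v[j]);
        swap(&v[i], &v[hi]);
        return i;
    }

    static void qsort_task(int *v, int lo, int hi)
    {
        if (lo >= hi) return;
        int mid = partition(v, lo, hi);
        #pragma omp task if(hi - lo > 1000)           /* one sublist becomes a new task */
        qsort_task(v, lo, mid - 1);
        qsort_task(v, mid + 1, hi);                   /* current task keeps the other */
        #pragma omp taskwait
    }

    void parallel_quicksort(int *v, int n)
    {
        #pragma omp parallel
        #pragma omp single
        qsort_task(v, 0, n - 1);
    }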
In many algorithms, tasks are designed beforehand and then
encoded directly into the parallel program. Not always, though. In-
stead, tasks may be generated dynamically as the algorithm proceeds.
This dynamic task generation may be by a master generator. More
generally, one or more initial tasks generate more tasks, which may,
in turn, generate yet more tasks and so on. For example, these may
be tasks that cumulatively traverse a solution space. This is known
state. Many of these problems can be abstracted as graph traversal
– except the graph may be generated as needed on the fly.
Figure 5.5: Exploratory decomposition
In such cases, multiple new tasks may explore the space concur-
rently, and they may lead to the same state, so proper synchronization is
required to ensure that an already explored part is not re-explored by
a different task.
Another question for exploratory task decomposition is whether
the discovery of each new path indeed requires a new task to explore
it. Some exploration may instead be included in the current task
itself. Such decision usually depends on the estimate of the size
of state space that the newly discovered paths lead to, or on the
estimate of the size of still unexplored space already included in
the current task. They may also depend on the location of the data
associated with the new paths. Data local to the task may be explored
by the task itself. For other data, new tasks may be preferred.
A special case of explorative decomposition is speculative decomposi-
tion. In an exploration, many, or even all, possible paths that must be
explored from a given task state could be known in advance. For ex-
ample, the task may be a subgraph, and the edges leading out of the
subgraph are known a priori. Not all of these ‘external’ edges eventu-
ally need to be traversed; speculative tasks are created when the likelihood that they re-
quire traversal is high. For example, in chess move exploration, given
a board configuration, certain future opponent moves may be highly
likely.
Another method to create tasks is pipeline decomposition. This is
similar to the hardware pipeline – a task takes one chunk of input
from its predecessor task, performs processing, and forwards the
results to its successor task. It next processes the next chunk of data
fetched from the predecessor. This structure is often used to hide
communication latency by overlapping data transfer with compu-
tation. Dependency edges in a task graph representing pipeline
imply that the dependent task may begin as soon as the first chunk
of data is released by its predecessor. They do not need to wait for
the predecessor to complete its processing. We refer to such task
edges as communication edges to differentiate them from regular de-
pendency edges. Communication edges usually go in both directions,
whereas dependency is uni-directional.
In practice, task decomposition need not be limited to one of the
types described above. Rather, these decompositions can be com-
bined. For example, data decomposition and recursive decomposition
concurrency of the task graph: the average number of tasks that may
be processed in parallel.
Another important metric is task cost variance. Tasks with similar
sizes are usually easier to schedule. Complex scheduling
algorithms have a cost, and that impacts the overall performance. A
related property is the execution homogeneity of a task. It is easier to
schedule tasks that maintain their load characteristic during their
execution. For example, a task that remains compute intensive during
its execution may be assigned to a fast processor. On the other hand,
a memory intensive task may be scheduled on a large memory
device. Also, multiple tasks may be scheduled to execute on a single
FT
device if they have a predominance of high latency operations like
data transfer.
Task graph degree is the maximum out-degree of any task in the
graph. A task with a large out-degree is one at whose completion a
large number of successor tasks must be spawned. This imposes a
large scheduling overhead, which can be particularly troublesome if
the task is on the critical path.
Further, keeping the interface between tasks clean and well de-
3. When does a task acquire data from its predecessors, and when
does it send data to its successors?
nicate?
A common objective is to minimize the makespan: the end-to-end time from when an application is started until its
last task completes.
With increasing cluster sizes, power consumption has also become
a major optimizing factor. Sometimes, increasing the makespan can
reduce the total power consumption noticeably. We do not explore
power issues in this textbook, but they are becoming increasingly
important, even if they make scheduling more complex.
Some problems allow tasks to be uniform, independent, and sized
arbitrarily. We have seen such an example above: matrix multipli-
cation. Mapping such tasks to P processors is simple. Size tasks in
a way to produce P tasks, and distribute them round-robin to
processors. Even these simple cases break down if the devices do not
all have the same capability (computation speed, memory bandwidth,
network bandwidth, etc.). We discuss general techniques next, where
the number of tasks is usually greater than the number of processors.
We assume that a task is mapped to a single device. For this
purpose we may not require that a task always be sequential. A
parallel task may occupy multiple computational devices, but it is
sufficient for this discussion to treat that set of devices as a single
RA
ideas apply to subsequent re-scheduling. In either case, there are two
important goals at each scheduling point:
tion and communication with other tasks have low overhead.
These two goals are integrally related to each other, and often in
conflict. Both locality and utilization should be high, but the best
localization might be achieved by mapping all tasks to the same
device, leading to severe load imbalance and low utilization. On
the other hand, utilization, or load balancing, is abstracted as the
bin-packing problem: group tasks into bins such that the size of each
bin is within an ε factor of others, for some small and fixed value of
ε. Perfect balancing may require assigning tightly coupled tasks to
separate devices.
The mapping of a task to some device may occur at any time after
it is spawned. It cannot begin to execute on that device until its task
dependencies are satisfied, and the device becomes available. (The
device becomes available when it has completed the tasks executed
earlier, unless it supports concurrent execution.) If the underlying
programming platform does not support dependencies, or if the ap-
plication program chooses to manage it directly, it spawns a task only
after its dependencies are satisfied. Similarly, if the platform does not
support mapping tasks to devices, the application program explicitly
executes the corresponding task on a specific device mapped by the
program itself.
In case the platform supports task graphs, a common semantics of
dependency edge is that a task may begin execution only after all its
predecessor tasks have completed. This is not true for communication
edges. Tasks with only communication edges leading to it may be
started at any time. Both communication edges and dependency
vide the task graph into components with a roughly equal number of
nodes in each component and the minimal cut between components.
This is similar to the algorithm for generating a task graph from a
data graph. The task graph is subdivided, and each subgraph is allo-
cated to a single device. Communication edges between subgraphs,
i.e., on the cut, imply inter-device communication. Edges between
tasks in the same subgraph imply intra-device communication and
hence impose a lower latency.
Algorithms that cut a static graph are generally not applicable to
dynamic task graphs. However, if tasks are dynamically generated,
they may be incrementally mapped in a greedy breadth-first fashion.
In the greedy approach, the initial set of tasks that do not depend
on any other task are mapped to devices in a round-robin fashion.
Tasks that depend on this initial set are mapped next, and so on.
In mapping each dependent task, edges leading to this task from
already mapped tasks are used in assigning a device to this task. It
is assigned to a device where most of its predecessors are, subject to
load balance. We discuss this next.
There are two common designs for managing the scheduling: push
scheduling and pull scheduling. In push scheduling, spawned
tasks are sent – or pushed – to a target device for execution. In
pull scheduling, devices themselves seek – or pull – tasks ready for
execution from some task pool.
Task scheduling can be centralized or distributed. In centralized
push scheduling, each task sends the basic information of the newly
spawned tasks to a central task scheduler executing on some device.
If each task takes roughly the same time as others, push scheduling can be as simple as round-robin
task distribution. On the other hand, tasks may have an affinity to
certain devices, e.g., if most input data for task A is on device i, per-
haps because its predecessor(s) executed on device i, it may have
1.
an affinity to device i. However, mapping task A to i may lead to
load imbalance. Besides, all devices may not have the same speed
or capacity. This is similar to packing different sized items (tasks)
into a set of different sized bins (devices), such that the resulting
packed-size of each bin is the same. This is an NP-complete problem.6
6 J. D. Ullman. NP-complete scheduling problems. J. Comput. Syst. Sci., 10(3):384–393, June 1975. ISSN 0022-0000
If tasks are spawned dynamically, or have bin affinity, the problem
is even harder. With dynamically spawned tasks, the scheduling is
said to be ‘online.’ Round-robin distribution is not efficient in the
presence of non-uniformity in task size, task affinity, task depen-
dence, or dynamic creation. At the same time, in most situations, a
large scheduling overhead defeats the main purpose: complete the
application program as quickly as possible.
Several heuristics are used to perform task scheduling. In most,
an estimate of the size of the task (the time it takes) and that of the
speed of the device is required. Creating these estimates reliably is
are many waiting tasks that could be mapped to it. It is often to-
wards the end of the application, when fewer tasks may remain to
be executed, that load imbalance begins to impact the performance.
Consequently, the following adjustments to the heuristics described
above are sometimes useful:
The estimated load imbalance may be defined as the ratio te/tl, where te
is the earliest time when any device would complete its assignment
and tl is the latest time when a device would complete its assignment.
A simple example of the load imbalance metric I is (tl − te)/twait, where
twait is the total time required by the waiting unmapped tasks. This
allows the estimated imbalance to be weighted by the amount of
unassigned work and works well for roughly equal-sized tasks.
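As a sketch, this metric is a one-liner once the scheduler maintains the three estimates:

    /* te: earliest completion time over devices, tl: latest completion time,
       twait: total time of still-unmapped tasks, all as estimated by the scheduler. */
    double imbalance(double te, double tl, double twait)
    {
        return (tl - te) / twait;
    }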
More complex strategies may be required for highly skewed task
sizes. For example, if there is extreme variance in the loads of tasks,
two or three large tasks may take longer than all others combined.
Subdividing large tasks is the only reasonable way to achieve load
balance in such a case. If subdivision is not feasible, such tasks must
Figure 5.6: Block-cyclic domain decom-
position with reducing block size
Distributed Push Scheduling
A centralized scheduler quickly becomes the bottleneck; it does
not scale well with an increasing number of tasks and devices. On
the other hand, each device scheduler directly mapping each task
spawned at that device, and then pushing that task to its mapped
device is prohibitively complex. Each device is a map-target of n − 1
schedulers in an n-device system. Without a shared knowledge of
the target’s load state, n − 1 independent scheduling decisions could
not be expected to balance the full load. This requires extensive
synchronization. A more common approach is to decompose the
centralized scheduler into k separate master schedulers, where k ≪ n.
Each spawned task is then pushed to one of these master sched-
ulers. This master assignment could be statically pre-determined or
Pull Scheduling
In pull scheduling, the target devices map tasks to themselves. They
Initially, tasks may be mapped using a simple push strategy like round-robin. When a
device completes its mapped tasks or nears the completion of its
tasks, it requests new tasks from another device scheduler. If that
other scheduler has a sufficient number of waiting tasks, it shares
some of them with the requesting scheduler. If it does not, it rejects
the request. This process is known as work stealing. If a request is
rejected, the requester may choose another scheduler to make a new
request.
Device schedulers do not generally monitor the status of other
device schedulers. This means that when a device has no remaining
tasks in its queue, it must guess which other device to request more
tasks from. The usual approach is to iteratively select a random vic-
tim device to steal from, until one has sufficient remaining work to
share. This implies that near the end of the application, all devices
start to attempt to steal. On large clusters, this can lead to a signifi-
cant loss of time before all device schedulers realize that there is no
more work remaining. Early stopping heuristics are commonly used.
For example, device schedulers may reduce the probability of subse-
quent steal attempts after each rejection, or wait for some time before
the next attempt to steal. This assumes that each rejection indicates
an increasing likelihood of further rejection, as a rejection implies a
lack of waiting tasks at the victim scheduler. Each device scheduler
may also monitor the number of steal requests it receives, and hence
estimate the state of other devices.
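The toy sketch below captures the stealing loop and a simple backoff heuristic: each “device” is reduced to a counter of waiting tasks, victims are chosen at random, and a device stops after a few consecutive rejections. A real scheduler would, of course, manage task queues on separate nodes running concurrently.

    #include <stdio.h>
    #include <stdlib.h>

    #define DEVICES 8

    static int waiting[DEVICES];            /* tasks queued at each device */

    /* Try to steal roughly half of a random victim's waiting tasks. */
    static int try_steal(int self)
    {
        int victim = rand() % DEVICES;
        if (victim == self || waiting[victim] < 2)
            return 0;                       /* rejection: victim has no surplus */
        int stolen = waiting[victim] / 2;
        waiting[victim] -= stolen;
        return stolen;
    }

    static void device_loop(int self)
    {
        int rejections = 0;
        while (1) {
            if (waiting[self] > 0) {        /* execute own tasks first */
                waiting[self]--;
                rejections = 0;
                continue;
            }
            int got = try_steal(self);
            if (got > 0) {
                waiting[self] += got;
                rejections = 0;
            } else if (++rejections > 4) {  /* several rejections: assume no work left */
                break;
            }
        }
    }

    int main(void)
    {
        for (int d = 0; d < DEVICES; d++) waiting[d] = rand() % 100;
        for (int d = 0; d < DEVICES; d++) device_loop(d);   /* sequential stand-in */
        printf("all device queues drained\n");
        return 0;
    }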
5.4 Input/Output
Storage appliances often comprise multiple disks, allowing parallel disk IO. Further, such storage ap-
pliances support multiple access paths – they may have multiple
controllers reached through different network routes. Thus there is
a certain degree of parallelism, but this parallelism is usually not
exposed to the client programs that read or write data.
True parallel file systems expose the parallel IO. They have each
file potentially striped across multiple storage targets (ST). There may
be a small number of dedicated storage targets, or there may be a
separate target at every node in the cluster. Programs access the
storage through storage servers (SS). A general architecture is shown
in Figure 5.7(a). Programs executing on devices, i.e., the clients,
thus have multiple paths to the storage targets through the multiple
storage servers.
Figure 5.7: Parallel IO
A file is divided into equal sized blocks, with the blocks dis-
tributed round-robin to the storage targets (see Figure 5.7(b)). The
size of the block may be specified for each file by the file creator. The
application program is aware of this structure and knows which
blocks of a file reside in which target. It can exploit this knowledge
to effect parallel IO. If clients read blocks resident in different targets,
the accesses can proceed completely in parallel. Similarly, writes
are also parallel. If the output produced by different devices can
be stored in a manner that parallel write is possible, each thread
can write its part of the output file (or files). Some programming
platforms provide collective calls for efficient parallel reading and
writing by multiple threads.
Note that the reading pattern is a bit different from the writing
during these operations the compute load of the thread is generally
low. A program may hide some file IO latency by overlapping two
or more tasks on the same device. For example, the program’s input
may be divided into chunks. Task A on device i may read its input
and then start processing it concurrently with task B’s read of the
next chunk on that same device.
used where available. With the help of such tools or without, effec-
tive debugging often requires writing programs in a way that aids
debugging.
One way to debug is to log critical events and state parameters
(variables) during the execution of the parallel program. After the
execution, the log is analyzed for anomalous behavior and reasons
for large waits and performance slowdown. This analysis may be
automated by scripts and programs that look for specific conditions,
e.g., large gaps in timestamp of certain events. This is also done
by an inspection of the logged text, or with the help of a graphical
visualization of timestamps or event counts. Note that multiple log
files are often employed – possibly one per task. Tagging the logged
events with a timestamp can help relate the approximate order of
events recorded at different nodes, particularly for visualization.
However, any such order must be taken as only a rough estimate
because clocks are not synchronized.
The importance of interactive tools like gdb that can ‘attach’ to a
running process at any time and inspect its state cannot be overstated.
This allows one to stop and start the execution at specific lines,
events, or variable conditions. The program itself can be written
in a way to maximize such debugging control. For example, special
debugging variables may be introduced, which record complex
conditions. Consider the following listing. The code waits in a loop
when a certain inconsistent state is discovered. This is one way to
ensure that the process waits on encountering suspicious condition.
if (idx1 > idx2 || (num1 < idx1 && idx1 < num2))   // suspicious state detected
    while (!debuggerReady);                        // spin until a debugger intervenes
Once the debugger attaches to this process, the related variables can
be inspected. After the inspection is completed, the debugger may set
the debuggerReady to true to step or continue beyond this point.
One of the most important arrows in the quiver of the parallel
programmer, across design methodologies, is performance profiling.
A profiler collects run-time statistics of an executing program and
produces information like the number of times a function is called,
the total time spent executing a block of code, the amount of data
FT
communicated, etc. Profiling helps highlight an executing program’s
hotspots – the parts that take the most time. These are the parts
that the programmer must focus on early in the development cycle.
Parallel program performance has three broad components: compute
performance, memory performance, and network performance.
Computation-centric profiling tools are more common but separate
memory and network profilers also exist.
Network and memory performance are ignored at the program’s
own peril. Even if only the computation profile is available, it can
RA
5.6 Summary
spent in communication and synchronization by reducing their over-
head, e.g., by using high-level primitives or by overlapping waiting
threads with other computing threads, keeping the processors busy.
A careful decomposition of the problem into tasks goes a long way in
controlling overheads.
Some general design principles introduced in this chapter include:
processors, such that their loads remain balanced. If the workload
of each task is uniform, or they can be estimated in advance, load
is easier to balance, particularly if all processors have the same
capability.
Relative workloads are not always known in advance. In such
cases, an initial allocation on the basis of estimated workloads may
yet be useful. Load rebalancing algorithms are required to adjust the
allocation when some processor completes its tasks well ahead of
others. This is especially true when new work is created on the fly. In
such situations, a work-queue or load-stealing algorithm is generally
advisable.
The book by Foster8 has a broad overview of parallel program
design. The one by Xu and Lau9 contains a broad coverage of load
balancing strategies. A survey by Jiang10 provides a good study
of task mapping strategies. It is a good starting point for further
reading. Tools like Metis11 have been widely used for cutting large
task graphs into sub-graphs for mapping.
8 Ian Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1995. ISBN 01575949. URL https://www.mcs.anl.gov/~itf/dbpp/text/book.html
9 Chengzhong Xu and Francis C. Lau. Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic Publishers, USA, 1997. ISBN 079239819X
10 Yichuan Jiang. A survey of task allocation and load balancing in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 27(2):585–599, 2016. doi: 10.1109/TPDS.2015.2407900
11 George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999

Exercise

5.1. Compute the measures i) Critical path length, ii) Average con-
currency, iii) Maximum concurrency (the maximum number of
processors that can execute different tasks in parallel) for the fol-
lowing task-graphs. Also work out a load balanced mapping of
these tasks on 4 processors.
(a) (b)
(a) Draw the task graph. Compute its critical path length and
average concurrency.
(b) Map the tasks to 16 processors, given that the initial list to sort
has 128 elements.
0 ≤ i < n. Task i requires row i of matrix A and all columns of
matrix B. Task i+1 requires row i + 1 of A and all columns of B.
And, so on. Distribution of A among tasks is simple: one row to
each task. However, all tasks require access to all columns.
(a) Compare the design where all tasks first receive all columns
of B (before starting their computation of C’s row) with the
design where they fetch one column at a time in sequence, and
compute one element of C while fetching the next column of B
in parallel with it.
(b) If the tasks proceed in parallel at roughly similar speeds, they
all require the same column of B at roughly the same time. If B
is in a cache shared by all tasks, this can be helpful. However,
suppose they do not, and B is instead distributed column-wise
among many nodes. All tasks fetch a column from the same
node causing that node to become a bottleneck. Suggest a
scheme to alleviate this contention.
5.4. In Exercise 5.3b, assume the matrices A and B are initially stored
5.6. We want to compute the transpose of matrix A laid out in the
row-major order in File1 of a parallel file system. This amounts to
writing out A in the column-major order into File2. Design tasks
and map them to nodes. Assume collective reads and writes just
as in Exercise 5.4.
Initialize L = n × n identity matrix, and U = A.
1   for i = 0..n-1
2       for j = i+1..n-1
3           (see below)
4           L[j,i] = U[j,i]/U[i,i]
5           for k = 0..n-1
6               U[j,k] = U[j,k] - L[j,i]*U[i,k]
5.8. Consider the task graph for reduction of n items (review Exer-
cise 2.12). There are n − 1 internal nodes, i.e., tasks, required to
reduce n elements. Suppose we divide the n items into B blocks of
size n/B items each. Each block can be reduced sequentially within
a single task. The single result of each task can then be reduced in
a tree-like manner. This requires B initial tasks, followed by B − 1
additional tasks in a tree-like graph. Assuming B ≪ n, describe
the trade-off between choosing different values of B.
m/B items each, and create B tasks, with Taski performing a binary
search for the m/B query items in block i. Assume for simplicity
that m is divisible by B. Discuss the impact of task granularity, the
value B, in comparison to m and n.
Note that B ≤ m in the design above. Can you create finer-grained
tasks, so that multiple tasks may cooperate in locating each query
item? Discuss the impact of this finer granularity.
5.10. Suppose our design calls for three kinds of tasks: Taska , Taskb ,
and Taskc . We know that all tasks of type Taska take time ta, Taskb
take time tb, and Taskc take time tc. Say, ta = 2tb = 4tc. Also given
is that tasks of each type interact heavily with other tasks of that
type. Devise a static load-balanced way to map 16 tasks of each
type to a total of 8 processors.
(a) all tasks are known in advance and initially in the queue
(b) tasks can be generated on the fly and threads may insert into
the queue
repeat N times:
for all positions (i,j) in an n×n array A$
temp = 0
for all positions (k,l), i-5 < k < i+5, j-5 < l < j+5, (modulo n arithmetic)
temp = temp + A$[k,l]
barrier
A$[i][j] = temp / 36.0;
5.16. In exercise 5.15, the code was changed to remove the bar-
rier (the results were incorrect and ignored). The performance
was then recorded as follows:
Processor     0    1    2    3
Time taken    29   40   34   37
Postulate three reasons to which the difference in times can be
attributed.
6 Middleware: The Practice of Parallel Programming
Question: Where do I begin to program? What building blocks can I program on top of?
We are now ready to start implementing parallel programs. This
requires us to know:
6.1 OpenMP
Preliminaries
C/C++ employs #pragma directives to provide instructions to the
compiler. OpenMP directives are all prefixed with #pragma omp
followed by the name of the directive and possible further options for
the directive as a sequence of clauses, as shown below. Each pragma
applies to the statement that follows it, which can be a structured
block of code enclosed in {}.
The example shows the parallel pragma with one optional clause
called num_threads, with a single argument n.
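A minimal sketch of such a directive, with n standing for any positive integer expression, is:

    #pragma omp parallel num_threads(n)
    {
        // This structured block runs once in each of the n threads of the group.
        int id = omp_get_thread_num();   // 0 for the parent, 1..n-1 for the children
    }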
and are deleted at the barrier. The parent thread then continues exe-
cution of the code following the pragma statement. This is called the
fork-join model. The argument n of the num_threads clause in Listing
6.1 specifies the number of threads in this group, including the par-
ent. Thus n − 1 child threads are created. We may call this group
a work-group or a barrier group.
system. Some control is accorded to the programmer through the
proc_bind clause, which allows certain threads to be assigned a
scheduling affinity towards a subset of cores.
The parallel pragma supports other clauses. These include control
over how the address space is shared and partitioned among the
threads.
Even though OpenMP threads share the address space, making each
variable visible to each thread increases the chance of inadvertent
conflicts. Hence, OpenMP supports two levels of visibility. Shared
variables are visible to an entire barrier group. Private variables are
visible only to a single thread. Hiding private variables from others
also allows the same name to be used in all threads. Otherwise, an ar-
ray of variables – one per thread – would be required. This two-level
visibility simplifies design sometimes but also reduces the flexibility
of allowing a variable to be shared between an arbitrary group of
threads. This visibility is controlled by clauses to the parallel pragma
New variables declared within the parallel code block are private
by default. Variables k and l are private to each thread. This effec-
tively creates n new copies of variables k and l, respectively, in the
process’s address space at the fork time, each visible to a different
thread. Similarly, there is a local m and n, private to each thread.
What happens to these copies at the join at the end of the parallel
region? They are discarded (but see the reduction clause later in
this section) and deallocated (but see threadprivate pragma). These
copies may also be initialized by the value of the original using the
firstprivate clause (in place of private). For example, in the listing
below the value of k in each thread, when they start, is 11; l remains
uninitialized.
Listing 6.3: Memory clauses for OpenMP parallel pragma
int l = 10, k = 11;
#pragma omp parallel firstprivate(k) private(l)
{
// k is 11 but l is uninitialized
}
Operations on shared variables are not guaranteed to be sequen-
tially consistent. Rather, OpenMP maintains a thread-local view of
each shared variable, which is allowed to diverge from other threads’
views. Thus incoherence between two cache copies is allowed. Mem-
ory flush primitives (see Section 4.2) are provided for the programmer
to control the consistency as required. OpenMP also allows flush to
be limited to certain variables; in that case, it is no longer a pure mem-
ory fence. The flush pragma, which may appear within the parallel
region, is as follows:
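For instance, with x and y standing for shared variables:

    #pragma omp flush(x, y)   // fence limited to the listed variables x and y
    #pragma omp flush         // no list: equivalent to a full memory fence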
Accesses to a listed variable x that occur, in program order, between
two flushes that include x must appear to start and
complete between those flushes. When no variables are listed, the
flush is equivalent to a pure memory fence, meaning all shared
memory is effectively flushed.
A flush is implied at all synchronization points. These include the
entry and exit from the parallel region. Other synchronization con-
structs are discussed later in this section. This is a good time to bring
up certain types of compiler optimization. Recall that instructions of
a thread can be executed out of order. They may even complete their
execution out of order if such re-ordered instructions do not contain
a data race: read-write or write-write conflict. A compiler reorders
instructions to increase both cache locality and instruction level paral-
lelism. However, a compiler analyzes only sequential sections of code.
In parallel execution, some other code executed by a different thread
may cause conflicts undetected by the compiler. Recall the Peterson’s
algorithm (Listing 4.7). A few lines are reproduced below.
1 // Assume shared ready and defer.
2 myID = omp_get_thread_num();
3 ready[myID] = true;
4 defer = myID;
Since there is no data race between lines 3 and 4, the compiler may
think them independent and reorder them. We have already seen that
the correctness depends critically on the correct order being main-
FT
tained. Some languages use the keyword volatile to indicate such
variables, instructing the compiler to keep them in order. OpenMP,
in particular, specifies that a read of a volatile variable v implies
a flush(v) preceding the read and a write to a volatile variable v is
implicitly followed by flush(v). If sequentializing all accesses to a vari-
able is not necessary, an explicit flush is a better option. The listing
below does so for the Peterson’s algorithm.
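A sketch of that approach, keeping the variable declarations of Listing 4.7, inserts explicit flushes around the accesses to ready and defer (the waiting loop follows the usual Peterson form):

    myID = omp_get_thread_num();
    ready[myID] = true;
    #pragma omp flush(ready, defer)      // order the store to ready before defer
    defer = myID;
    #pragma omp flush(ready, defer)      // publish defer before testing the peer
    while (defer == myID && ready[1 - myID]) {
        #pragma omp flush(ready, defer)  // re-read fresh values while waiting
    }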
OpenMP Reduction
Instead of allowing the final values of private variables to be dis-
carded at the join, the reduction clause combines them and then folds
the result into the original shared variable. Each private copy is
initialized to a value that is suitable for the reduction operation: 0 for
addition, 1 for multiplication, a large value for minima, a small value
for maxima, etc. In Listing 6.7, all private copies of k are initialized
to 0. Each thread adds its ID to its private copy of k. Thread 0 thus
leaves 0 in its copy of k, and thread 1 leaves 1. At the end of the paral-
lel region, the values in the two copies of k are reduced to 0 + 1 = 1 in
this example. Finally, the result is combined with the original k using
the same reduction function. Thus 1 is added to 10 leaving 11 in k
finally.
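A sketch consistent with this description (assuming omp.h is included and two threads are requested) is:

    int k = 10;
    #pragma omp parallel num_threads(2) reduction(+ : k)
    {
        // Each thread starts with a private k initialized to 0 (the identity of +).
        k += omp_get_thread_num();
    }
    // The private copies are summed (0 + 1) and added to the original: k == 11.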
There are a few predefined reduction operations. OpenMP pragma
to define new ones also exists (even if it is a bit awkward). An exam-
ple is shown in Listing 6.8.
// Set k here
}
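A sketch of such a user-defined reduction, with max as a user-supplied helper and INT_MIN assumed as the identity value for maximization, might read:

    #include <limits.h>
    #include <omp.h>

    static int max(int a, int b) { return a > b ? a : b; }

    #pragma omp declare reduction(mymax : int : omp_out = max(omp_in, omp_out)) \
                initializer(omp_priv = INT_MIN)

    int k = INT_MIN;
    #pragma omp parallel reduction(mymax : k)
    {
        // Set k here: each thread computes its partial maximum into its copy.
        k = max(k, omp_get_thread_num());
    }
    // k now holds the maximum over all threads' partial results.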
The declare pragma above sets up the reduction: both the oper-
ation as well the initialization of each private copy. This example
names the reduction operator as mymax, which expects integer vari-
ables. The actual operation is accomplished by calling the function
max on two partial results at a time. The function is called repetitively
as per the binary tree structure described in section 3.2. omp_in and
omp_out are internal names. The clause syntax indicates that two
OpenMP Synchronization
Low-level locks as well as higher-level critical sections and barriers
are supported by OpenMP. Locks do not employ directives but are
managed through function calls. This allows the program a bit more
flexibility in creating and manipulating locks compared to pragmas.
The non-blocking omp_test_lock returns false on failing to acquire the lock, and true on success. This allows the caller
to go on to some other work without getting blocked. For example:
Listing 6.9: Threads share work using non-blocking lock
omp_lock_t lock[N];
for (int i=0; i<N; i++) omp_init_lock(&lock[i]);
#pragma omp parallel // Each thread tries each lock in sequence
for(int item=0; item<N; item++) {
if(omp_test_lock(&lock[item])) {
workOnItem(item);
omp_unset_lock(&lock[item]);
}
} // Otherwise some other thread has this lock; move on
Each thread iterates over a list of items, seeking to process it. It at-
tempts to acquire a lock corresponding to the item, and processes the
item if it is able to. If it fails to acquire a lock, this means that some
other thread was able to acquire it and process the item. We may
#pragma omp critical(section1)
{ // Assume shared counter
counter ++;
}
In this example, section1 is the name given to this section. A thread
executing this code block has the guarantee that no thread executes
any block also named section1. Two critical sections with different
names are non-conflicting, meaning two threads are allowed to
be in differently named critical sections at the same time. Critical
section applies to all OpenMP threads within the process, not just the
members of the current barrier group. A critical section without the
name argument is globally critical. No two OpenMP threads may ever
overlap their execution in any globally critical section.
The programmer does not need to explicitly manage locks; that’s all. Locks and
critical sections are both blocking and slow.
The atomic pragma, in comparison, is a limited critical section, that
performs limited operations on a single shared-memory location.
These can often be performed internally using Compare and Swap
or other hardware supported feature and are more efficient than the
critical section.
Further, the atomic pragma also has the option to force sequential
consistency using the seq_cst clause:
#pragma omp atomic seq_cst capture
v = x++; // Atomic increment and capture, sequentially consistent
Without the seq_cst clause, only the variable x would be flushed.
Thread creation and synchronization is the minimal facility a
shared-memory parallel programming platform needs to support.
However, it is useful for parallel programs to be designed in terms
of tasks that share the overall work and threads that execute these
tasks. The parallel pragmas discussed above do not provide a very
flexible interface to do so. We will next discuss general work sharing
FT
constructs and task management.
of the barrier group. The loop needs to be in a form that the compiler
can statically subdivide: single clear initialization, single clear termi-
nation, and single clear increment. In case the loop is nested, only the
outermost loop is subdivided, unless the collapse clause is used. The
iteration variable for each loop is forced to be private, whether or not
it is declared outside the parallel region.
The for pragma also has clauses private and firstprivate, similar
to the parallel pragma. Thus private copies can also be created at
the beginning of the loop – one per thread sharing the loop. Again,
copies are created per thread, not per iteration. Correspondingly,
they are also discarded and deleted at the end of the loop, unless
the reduction clause is included in the for pragma. Additionally,
the for pragma also supports the lastprivate clause. If a variable
appears in a lastprivate clause, the value in the last thread’s private
copy of this variable is saved into the master copy at the end of the
loop. The last thread is the thread that happens to execute the last
iteration of the loop. Thus, the lastprivate clause allows the program
to refer to a private variable after the parallel loop just like it can
nowait clause. If the nowait clause is used, any thread that completes
its assigned iterations may proceed with execution of the code after
the loop. Note that a lastprivate update may not have yet happened
in this case, and a thread after its exit from the loop must not expect
it (unless it is the last thread).
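A sketch combining lastprivate with nowait, where N and process are assumed to be defined elsewhere:

    extern int N;                     // problem size, assumed defined elsewhere
    extern int process(int item);     // per-item work, assumed defined elsewhere

    int last = -1;
    #pragma omp parallel
    {
        #pragma omp for lastprivate(last) nowait
        for (int i = 0; i < N; i++)
            last = process(i);        // each thread writes its private copy
        // nowait: a thread reaching this point must not assume that the
        // master copy of 'last' has been updated yet (unless it ran i == N-1).
    }
    // After the parallel region, 'last' holds the value from iteration N-1.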
The features for assignment of iterations to threads is also rich and
using the ordered pragma as follows:
3 for(int item=0; item<N; item++) {
4 orderInsensitivePart1(item); // Concurrent with other iterations
5 #pragma omp ordered
6 doThisInOrder(item); // Called with item = 0 to N-1, in that order
7 orderInsensitivePart2(item); // Concurrent with other iterations
8 }
1 int completed = 0;
2 string resultOdd, resultEven;
3 #pragma omp parallel
4 {
5 #pragma omp for ordered(1) // One level of loop ordering
6 for(int item=0; item<N; item++) {
7 string x = workOnItem(item); // Concurrent with other iterations
8 #pragma omp ordered depend(sink:item-2) // Proceed if source-point of
Since it is quite common for the for pragma to be the only statement
in the parallel region, pragma parallel for is available as a shorthand
combining the two.
#pragma omp parallel for
The parallel for pragma accepts both parallel and for clauses.
Naturally, the nowait clause is not allowed on the combined pragma,
as a barrier is essential at the end of the parallel region. Without the
barrier, the master could proceed beyond the parallel region while
other threads are still computing, or might itself still be computing when
the other threads reach their ends and are deleted. That incurs a risk
of race conditions and premature memory deallocation.
Sometimes, parallelization suffers because
of certain steps that cannot be parallelized –
a step that must be done serially. One way
to manage this is a sequence of fork-join
primitives, as shown in the figure on the
right. This imposes significant thread creation
and deletion overhead. Instead, one may
temporarily suspend the parallelism using the
single pragma. This is also considered a type of work-sharing pragma,
one that forces all its work on one of the threads. Thus private and
nowait clauses are available. In the absence of nowait, there is an
implicit barrier at the end of the single block.
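A minimal sketch of this usage (phase1, combine, phase2, and the arrays are hypothetical): a serial step sandwiched between two parallel loops, without leaving the parallel region:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        partial[i] = phase1(i);           // Parallel work

    #pragma omp single                    // One thread performs the serial step
    total = combine(partial, N);          // Implicit barrier follows the single block

    #pragma omp for
    for (int i = 0; i < N; i++)
        result[i] = phase2(i, total);     // Parallel work resumes
}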
SIMD pragma
not deviate from each other. They all execute one instruction in one
clock cycle in the SIMD fashion (see Section 1.1). Thus SIMD may be
nested in the for pragma, and there exists a combined for simd pragma.
Several clauses of SIMD are similar to work-sharing pragmas.
The simd pragma has additional clauses to control the vectoriza-
tion itself. In particular, the safelen(d) clause instructs that iterations
more than d apart must not be executed concurrently in SIMD fashion.
float sum = 0;
#pragma omp parallel for simd reduction(+:sum) schedule(simd:static, 8)
for(int item=0; item<N; item++)
sum += workOnItem(item);
It assigns certain iterations to each thread. The iterations assigned
to each thread may now be combined into SIMD groups. The schedule
clause ensures that threads are assigned iterations in chunks of 8, so
they may be vectorized. The kind of schedule, simd:static, ensures that
if the chunk size (8 in this example) is not a multiple of simdlen, it
is increased to make it a multiple to ensure that all SIMD lanes are
utilized.
Note that the function workOnItem is presumably not a SIMD
instruction. Hence, it may not make sense to vectorize this function
– unless it does, of course. A vectorizable function needs to be spe-
cially compiled using vector instructions. OpenMP has a declare simd
pragma to accomplish this. This pragma instructs the compiler to
generate vectorized code and is demonstrated in the following listing.
#pragma omp declare simd
float workOnItem(int item)
{
return In[item] * acos(item/N);
}
engines on the same core, which would have otherwise remained idle.
The OpenMP simd clause provides additional controls to help the
programmer guide the vectorization when automatic vectorization
may fail.
Tasks
for irregular problems, including graph and tree processing.
}
it is executing.
Tasks can be nested, and the inner tasks follow the same semantics.
The listing below shows an example:
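A minimal sketch of the task-generating loop discussed below (taskQ, dequeue, and the Task type are illustrative; taski and workOnTask follow the discussion):

Task *taski;
#pragma omp parallel
{
    #pragma omp single                        // One thread generates the tasks
    while ((taski = dequeue(taskQ)) != NULL) {
        #pragma omp task firstprivate(taski)  // Each task captures its own taski
        workOnTask(taski);
    }
}   // Implicit barrier: all generated tasks complete before the region ends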
The task-generating loop is enclosed in the single pragma. It could sometimes be faster for all threads to
participate in task generation. It may not be in this case though, if the
dequeue operation needs to be under mutual exclusion. The single
version avoids any synchronization on the queue. No significant
slowdown due to this sequential queue processing would be incurred
if the number of threads available in the group is relatively small and
the thread performing the single block is able to generate tasks for
them quickly. Also note that the for pragma would not be suitable
for a loop of the kind shown in Listing 6.13 because the number of
iterations is not known in advance.
The task requires a private reference to the task, taski, because
the task creator would go on to the next iteration of the while loop,
updating the variable taski. The task, when scheduled, calls function
workOnTask, listed below.
the child in this case. Tasks also have a depend clause for explicit
dependency enforcement. A task cannot be scheduled to begin until
all other tasks that it depends on are completed. Both tasks and
parallel constructs allow disabling of thread and task forks using
the if clause. This allows, for example, recursive programs to stop
creating too many fine-grained threads or tasks at the lower levels
of the recursion tree, when each task becomes small enough to be
processed sequentially. The overhead of creating too many tasks can
impede speed-up.
As discussed above, OpenMP is designed for threads executing
on a single node that share variables. For a cluster of nodes, each
with separate memory, it must be augmented by sharing of data
across nodes. We discuss MPI next, which focuses primarily on the
message-passing paradigm.
6.2 MPI
MPI entails somewhat lower-level structures. For example, a common
template is:
forall process
Read a section of the input
iterate:
Exchange additional data that is required
Perform share of computation
Collate and produce partial output
barrier
Collate and produce the final output
We next demonstrate some basic components of MPI with the help of
sample code.
MPI_Send(vec, 4, MPI_INT, destID, messagetag, MPI_COMM_WORLD);
// buffer, count, type, destination, tag, comm
}
if(ID == 0) {
MPI_Status status;
MPI_Recv(vec, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
&status); // Return receipt information
// buffer, count, type, source, tag, comm
}
}
MPI_Finalize();
As a general rule, all MPI names are prefixed with MPI_. MPI
functions return an integer code: MPI_SUCCESS on success, or an
error code. We do not check the return values in our sample code for
brevity.
Compare the Send and Receive functions to the primitives in Listing
4.14. A few additional arguments are used in the MPI functions.
Notice the odd usage of variable vec. It is used in both the send
and the receive. This is somewhat common with MPI programs.
view). MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards that
match any ID and any tag, respectively.
Note that the data types or size need not match. In fact, the recipi-
ent is free to re-interpret the data in terms of a different type – maybe
as an array of 4 MPI_INTs in this case. The send's count parameter
determines the actual size of the data to be sent. The receive’s size
argument indicates the maximum number of received data elements
that would fit in its buffer. If the matching send passes more data
than can fit, this is an error, which is reported in the status parameter
of the receive call. The status indicates the actual number of data
items received. It also includes the IDs and tags of the matched send,
which may be initially unknown to the recipient if wildcard matching
is used. See the listing below, expanded from the example in Listing
6.15 for the recipient.
Message-Passing Synchronization
We next turn to the synchronization implicit in the communication.
As discussed in Section 4.5, because systems provide intermediate
The buffered version is called MPI_Bsend. MPI_Bsend uses
buffers provided by the program before the send call. Functions
MPI_Buffer_attach and MPI_Buffer_detach are available for buffer
management. This allows the user to provide larger buffers than MPI
may allocate. If the buffer turns out to be insufficient to complete an
MPI_Bsend, the send fails.
MPI_Send is a generic version of send, also called standard mode
send. It is likely to incur lower latency than the other variants. It may
employ MPI’s or OS’s internal buffers, or wait for the receive to be
called (and thus the receive buffer to become available). For example,
it may eagerly send small messages but seek permission from the
recipient before sending larger ones, allowing the recipient to provi-
sion buffers as required. MPI_Send does, in any case, guarantee that
once it returns, the message has been extracted from the send buffer
and is ‘on its way.’ This means that the sender is free to overwrite
the send buffer any time after the return from MPI_Send, and the
original message would still be received by a matching recipient.
This property holds for all the four variants of Send. Hence, they are
called blocking versions – the return is blocked until the send buffer
is emptied.
There is only one MPI_Recv, as the synchronization semantics
are driven by the sender. There also exist non-blocking variants
for each of the four types of sends (and the one receive). These are,
respectively, MPI_Isend, MPI_Issend, MPI_Ibsend, MPI_Irsend, and
MPI_Irecv. These return immediately after a local set-up, without
guaranteeing any message progress. The sender may not modify
the send buffer after these return, until a later assurance of buffer
emptying. Similarly, the recipient may not start to read from the
receive buffer immediately after the MPI_Irecv call returns. It must
wait for a later proof of receipt. These assurances and proofs are
delivered via a request object, which is returned as a part of the send
and receive calls, as follows.
MPI_Request request;
MPI_Irecv(vec, 4, MPI_INT, 1, 99, MPI_COMM_WORLD, &request);
// The recipient may proceed with code that does not require vec
MPI_Status status;
MPI_Wait(&request, &status); // On return status has receipt info
// vec is ready to be used now.
MPI_Wait blocks in this code until the receipt is complete, meaning
that it follows the semantics of MPI_Recv – the operation that
generated the request. Similarly, an MPI_Wait on a send request has
the same semantics as the original type of send. In the following
example, MPI_Wait has MPI_Ssend semantics: it does not return until
the matching receive has been called. Like all versions of send, an
MPI_Wait does not return until the values in the send buffer have
been copied out and saved in an intermediate or final buffer.
Non-blocking communication, in addition to eliminating certain types
of deadlocks (discussed later in this section), also allows processes to
perform computation while waiting for communication to complete, as
long as this computation is independent of the communication. This
communication-computation overlap is an important technique to prevent
long-latency operations, such as communication, from creating idle
periods and hence bottlenecks. This is called latency hiding and is
discussed further in Chapter 5.
MPI_Issend(vec, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &request);
// May not modify vec yet
MPI_Status status;
MPI_Wait(&request, &status); // returns after receive has been found
// AND data has been copied out; vec may be reused.
and sent through multiple function calls. In such cases, the array
based wait or test functions can be useful.
If the message sizes vary significantly and dynamically, the recip-
ient’s pre-allocation of a large enough buffer to hold any sized mes-
sage can be unreasonable. A peek at the incoming message allows
the recipient to know what sized buffer to provision before actually
reading the full message. There is a blocking variant, MPI_Probe,
as well as a non-blocking one, MPI_Iprobe. They need to provide
the source and tag information in order to perform the required
matching. An example is shown below:
int flag;
MPI_Status status;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int count;
MPI_Get_count(&status, MPI_INT, &count);
int *vec = allocateINTs(count);
MPI_Recv(vec, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
performComputation(vec);
could allow the type to be chosen based on a probe of the incoming
message, say, by using different tags for different types. The recipient
may read using the appropriate type then.
Understand that sends and receives eventually require synchro-
nization. Fairness is not guaranteed and deadlocks can occur. If, say,
one receive matches two sends, either can be selected to provide
its data. There is no guarantee that the un-selected one would be
selected at its next match. Late coming matches could continually
get selected ahead of it, causing that un-selected match to starve.
On the other hand, if two operations match, at least one of them is
guaranteed to proceed. Apart from this, there is also a possibility
of deadlock if processes have both sends and receives. Both may
depend on the other to complete, which it might not be able to in the
absence of copy-out buffers. For example, the following code could
deadlock:
if (ID == 0) {
MPI_Send(vec1, LargeCount, MPI_INT, 1, 99, MPI_COMM_WORLD);
MPI_Recv(vec2, LargeCount, MPI_INT, 1, 99, MPI_COMM_WORLD, &status);
}
if (ID == 1) {
MPI_Send(vec1, LargeCount, MPI_INT, 0, 99, MPI_COMM_WORLD);
MPI_Recv(vec2, LargeCount, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}
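One common remedy (a sketch, not the only one) is for one of the two processes to reverse the order of its calls, so that every send finds a concurrently posted receive:

if (ID == 0) {
    MPI_Send(vec1, LargeCount, MPI_INT, 1, 99, MPI_COMM_WORLD);
    MPI_Recv(vec2, LargeCount, MPI_INT, 1, 99, MPI_COMM_WORLD, &status);
}
if (ID == 1) {   // Receive first, then send: the cyclic wait is broken
    MPI_Recv(vec2, LargeCount, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
    MPI_Send(vec1, LargeCount, MPI_INT, 0, 99, MPI_COMM_WORLD);
}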
collation of data on the sender side from its data structures and then
the distribution of incoming data into the recipient structure is a pro-
gramming chore, and it would be a good idea for the programming
platform to automate much of it.
The challenge here is to allow arbitrary data structures on the
sender and the recipient. This is somewhat easier for regular and
semi-regular data structures, which MPI facilitates. We will later see
in Section 6.3 a more flexible approach, which supports irregular
patterns better. Regardless, the programmer must not lose sight of
the fact that this collation and distribution has a cost, even if it is per-
formed by the platform. Programs with regular data structures that
share large chunks of contiguous data are communication friendly.
MPI allows defining of new types in terms of its built-in types
and other user-defined types, in contiguous or non-contiguous
arrangements. This allows specific non-contiguous parts of a buffer
to be transferred and then stored non-contiguously at the recipient.
We show a few illustrative examples.
MPI_Datatype intArray4;
MPI_Type_contiguous(4, MPI_INT, &intArray4); // 4 contiguous ints
Listing 6.18: MPI data type with uniform size and stride
MPI_Datatype fColumn2x5; // 4x, 2-float blocks, with stride 5
MPI_Type_vector(4, 2, 5, MPI_FLOAT, &fColumn2x5);
// No. of blocks, No. of Items/block, block stride, type of items
MPI_Type_commit(&fColumn2x5);
if (ID == 0) // Matrix may be float*
MPI_Send(&matrix[3], 1, fColumn2x5, 1, 99, MPI_COMM_WORLD);
if (ID == 1)
MPI_Recv(matrix, 1, fColumn2x5, 0, 99, MPI_COMM_WORLD, &status);
but also increasing complexity. The simplest type that suits the given
requirement is likely to yield the best performance. The general
constructor is MPI_Type_create_struct, which takes an array of types
and their element counts, each of which can start at arbitrary offsets.
In the following code, a new type tightly packs 2 ints, 1 float, and 1
fColumn2x5 (constructed earlier), leaving no gaps.
int blockCount[3] = {2, 1, 1};
MPI_Datatype basetypes[3] = {MPI_INT, MPI_FLOAT, fColumn2x5};
MPI_Aint byteOffset[3] = {0, 0, 0};
for(int i=0; i<2; i++) {
int typesize;
MPI_Type_size(basetypes[i], &typesize);
byteOffset[i+1] = byteOffset[i] + (blockCount[i] * typesize);
}
MPI_Datatype newtype;
MPI_Type_create_struct(3, blockCount, byteOffset, basetypes, &newtype);
// No. of blocks, No. of items in blocks, start of blocks, type in blocks
so that the same function is used on both ends. For example, the
broadcast is achieved as follows:
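A sketch of such a call, reusing the vec buffer of the earlier examples:

MPI_Bcast(vec, 4, MPI_INT, 0, MPI_COMM_WORLD);
// buffer, count, type, root, comm -- every process in the group makes this call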
The source of the broadcast is called the root, the process with rank
0 in this example. All members of the group must call this function.
The buffer argument for the root acts as a send buffer, and that for
some process may expect one data item of type intArray4 (as con-
structed in the previous section), while others expect four items of
type MPI_INT.
MPI_Bcast is blocking, and care must be taken to avoid deadlocks.
The following would deadlock unless all members of MPI_COMM_WORLD
have the same value for root in their first call (and similarly a com-
mon value in the second call). That is, if some member's first call
matches another member's second call, there could be a deadlock.
MPI_Bcast(vec, 4, MPI_INT, root1, comm1);
MPI_Bcast(vec, 4, MPI_INT, root2, comm2);
Similarly, a deadlock could also occur if members use a different
order of the communicators. Like Bcast, MPI_Scatter and MPI_Gather
are asymmetric. MPI_Ibcast is the non-blocking version. That can
sometimes help. Other collectives described below also have non-
blocking variants, but only blocking examples are provided in the
listings here.
There is a difference of note between the non-collective non-
blocking functions and the collective ones. The blocking-ness of a
function is not considered in matching point-to-point primitives, but
it is for collective primitives. Non-blocking collective primitives only
match other non-blocking collective primitives. Recall also that there
is no tag in collective primitives to separate message streams. This
means that collective primitives must be encountered in a consistent
order across a group. In particular, for two consecutive collective
primitives A and B encountered by a process for a communicator, all
other processes in the group must encounter the one matching B after
the one matching A.
On the other hand, if the root does not need its vec copied to its allvec, it
may provide the constant MPI_IN_PLACE in place of vec in its call to
MPI_Gather. Non-roots are not recipients, though. They may provide
a NULL pointer for the receive buffer (allvec in this example). Their
parameters for the receive type and the count are also unused.
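A sketch of such a gather, using the vec and allvec names of this discussion (the allocation is illustrative):

int numProcs;
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
int *allvec = (ID == root) ? malloc(numProcs * 4 * sizeof(int)) : NULL;
// Each rank contributes 4 ints; the root receives 4 ints from every rank,
// stored in allvec in rank order
MPI_Gather(vec, 4, MPI_INT, allvec, 4, MPI_INT, root, MPI_COMM_WORLD);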
Figure 6.4: Data broadcast, gather, and scatter. (a) Broadcast (b) All to all: scatter + gather (c) Scatter (d) Gather
If all processes require a copy of the gathered data, the root may
broadcast it after gather. Or, they could all call MPI_Allgather instead
of MPI_Gather, which can accomplish it more efficiently. Another
variant, MPI_Gatherv, allows non-uniform gather: different senders
may send different numbers of items. MPI_Scatter performs the
reverse operation. MPI_Alltoall, as the name suggests, does both.
It allows exchange from each process to each other process in one
function. All processes scatter an array among the group and also
gather from each member into an array. This effects a transpose of
the distributed data.
Each sender distributes data from its send buffer, vecout, round-
robin to the group, a block of sendcount items per recipient. Each
recipient stores the received messages in the order of ranks of their
sources. Like other collective communication, different processes may
send different parameter values for type and count, but the total data
sent must be equal to the total data received. MPI_Alltoall requires
that all members have data to send in equal measure. If there are
different sizes to send, MPI_Alltoallv or MPI_Alltoallw can be used. In
these cases, the recipient may not immediately know where to store
incoming data from process i until it knows the sizes of data sent by
all processes with rank less than i. To remove this shortcoming, these
functions also take an explicit starting location for each rank's data
in the receive buffer.
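A sketch of the uniform case (vecout follows the text above; vecin and sendcount are illustrative):

// Each rank sends sendcount ints to every rank and receives sendcount ints
// from every rank; vecout and vecin each hold numProcs*sendcount elements
MPI_Alltoall(vecout, sendcount, MPI_INT, vecin, sendcount, MPI_INT,
             MPI_COMM_WORLD);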
MPI Barrier
All collective operations are semantically equivalent to a set of sends
and receives, probably implemented in an optimized manner. Collec-
tive communication primitives do require that each member of the
group call those functions. There is only loose synchronization – the
calls do not need to overlap in time. Only the order among calls is
enforced. For example, the root process for an MPI_Bcast call may
return once its data is copied out. When the root returns, it has no
guarantee that matching broadcasts, i.e., matching receive events,
have started. MPI_Barrier synchronizes. It has no arguments other
than the communicator.
MPI Reduction
Sometimes, partial computation is performed in parallel. Their re-
sults need to be combined. This may be done by gathering the partial
results at one place, and then sequentially combining the partial re-
sults. However, recall from Section 3.2 that results could be combined
in O(log n) steps in parallel, whereas the sequential combination
takes O(n) steps, to combine n things. Similarly, collective operations
can also be completed efficiently by using a binary tree structure. It
makes sense then, that reduction may be completed as a part of the
gather process itself. MPI_Reduce does that. Similarly, prefix sum
(see Section 3.6 and Section 7.1) may also be performed efficiently in
parallel using MPI_Scan and MPI_Exscan.
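For example, an element-wise sum over arrays of four ints may be performed as follows (a sketch, using the partialSum and finalSum buffers of the listing further below):

MPI_Reduce(partialSum, finalSum, 4, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
// send buffer, recv buffer (significant only at root), count, type, op, root, comm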
The first MPI_INT of each array, at different processes, is reduced
and the result stored in the first location of the receive buffer of the
root. Similarly, the
second, third, and fourth MPI_INTs are reduced and received in the
corresponding slots in root’s receive buffer. MPI_SUM and several
other constants refer to pre-defined operations. MPI_SUM performs
addition. Users may define new operations – these are objects of type
MPI_Op and encode binary operations, which must be associative.
These operations are performed in parallel, but the rank-order is
preserved, i.e., rank0_data op rank1_data op rank2_data · · · . We will
take an example below. Naturally, the operation must be well defined
for the type of data, and the data type and count must be the same at
all processes.
void reduceFunction(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
int *left = (int *)in, *right = (int *)inout;
for(int i = 0; i < *len; i++) {
right[i] += left[i];
}
}
MPI_Op myAdd;
int commute = 1; // Does my operation commute?
MPI_Op_create(reduceFunction, commute, &myAdd);
MPI_Reduce(partialSum, finalSum, 4, MPI_INT, myAdd, root,
MPI_COMM_WORLD);
This function assumes the type of operands, which is passed in the
MPI_Reduce call, to always be MPI_INT, and casts the pointers
accordingly. A more general function may do different things
dynamically based on the operand's
type. The reduction function reduceFunction is written to perform
multiple reduction operations at a time. This reduces the number of
function calls, but note that the implementation is not required to call
reduceFunction on all four elements of partialSum and finalSum in
a single call. Large vectors may be subdivided to overlap the reduc-
tion function with communication. Also, if the reduction operation
is known to be commutative, the MPI implementation may switch
the order of in and out buffers, i.e., the rank order is not strictly
preserved.
The reduction may be combined with other collectives. For exam-
ple, MPI_Allreduce ensures that each member of the group receives
the reduced data. Semantically, it is equivalent to reduction to a root
followed by a broadcast. Similarly, MPI_Reduce_scatter scatters the
reduced data among the group.
One-sided Communication
tive window. Once attached, this block of its address space becomes
exposed to other members of the group. It’s a window through which
any process in the group can ‘reach into’ another process’s address
space – and read from it or write into it. This is known as remote
memory access. The initiator of the operation is in charge of specifying
the precise location within the exposed buffer, which is to be accessed
through the window. The target, whose exposed part of the address
space is accessed remotely, need not participate.
A group may create multiple windows, each process attaching
a contiguous region of its address space to each window. Since the
semantics of one-sided communication is different from the send-
receive protocol, these are named differently: put to write into a
remote address space and get to fetch from it. The following listing
illustrates get and put.
// All members collectively create a window and attach a local buffer
7 MPI_Win_allocate(numBytes, sizeof(int), info, MPI_COMM_WORLD,
&win1buffer, &win1); // Allocate buffer, attach to window
// buffer size, element size, info, comm, buffer pointer, window
8 int sendable[4] = {0, 1, 2, 3}; // Private to this process’s address space
9 initializeBuffer(win1buffer); // The first integer is a scaling factor
10 MPI_Barrier(MPI_COMM_WORLD); // Wait for all to initialize their win1buffer
11 int scale;
12 MPI_Get(&scale, 1,MPI_INT, (ID-1)%numProcs,0, 1,MPI_INT, win1); // Nonblock
13 // source: buffer, count,type; target: ID,atIndex, count,type; window
attachment of buffer is also possible using MPI_Win_create_dynamic
once followed by MPI_Win_attach and MPI_Win_detach any number
of times.
In the listing above, line 12 on process i asynchronously fetches
the first element (which we know is an integer) from the window
of the previous process (i − 1), modulo the group size. This value,
now in the variable scale, is used to scale all elements of the array
sendable by all processes. This scaled sendable is next sent by every
member of the group to the next process, (i + 1) modulo the group
size, to be stored starting at offset 1, meaning the four integers are
put at indexes 1 through 4 of the array win1buffer. Since there is no data
race between the get and the put, no synchronization should be
required between them.
Nonetheless, remote memory access functions are non-blocking.
Since the get on line 12 is non-blocking, the variable scale may not
actually have received the data until much after this line. MPI_Win_fence
on line 16 ensures that the data is indeed available in the variable
scale before it is used on line 18. Similarly, a return from put does
not immediately mean that the data has been saved in the remote
process i may have a view of process j’s exposed address space that
differs from the view of other processes of that same address space.
Local caching is allowed. User controlled synchronization is required
to maintain consistency. This synchronization may be in the form of
group-wide memory fence (as in the previous example), pair-wise
synchronization, or locks on the target process’s window.
Group-wide memory fences are similar to shared-memory fences
and ensure that any earlier get and put operations on a window
are completed and ordered before operations that appear after
the fence. The flag parameter of the fence primitive is for the pro-
gram to send hints that help improve performance in certain cases.
For example, the program may indicate that there is no put event
between this fence and the subsequent one by specifying the flag
MPI_MODE_NOPUT. Such hints allow the MPI implementation to
forego certain synchronizations. In all cases, outstanding get and put
calls at a process complete before the fence returns on that process.
In particular, an outstanding put must complete at the initiator be-
fore its fence returns, allowing it to reuse its buffer, but it may not
have been written at the target yet. The matching fence on the target
returns after the operation has completed there.
Synchronization may also be performed in smaller groups, par-
ticularly between a single initiator and its target, using a protocol
involving four events: post, start, wait, and complete. The correspond-
ing functions are MPI_Win_post, MPI_Win_start, MPI_Win_wait, and
MPI_Win_Complete. The initiator’s post matches with the target’s
start. This handshake forms a mutual fence. Subsequent get or put
events by the initiator are strictly ordered after these fences. Later, ini-
tiator’s complete matches the target’s wait, marking the completion
of all gets and puts since the post-start handshake. This is demon-
strated in Figure 6.5. The target’s local read from an exposed buffer,
values that the target writes into its exposed buffer before it posts are
guaranteed to be ready for the initiator to get between its start and
complete. Naturally, MPI_Win_wait is blocking. The non-blocking
variant MPI_Win_test also exists. On the initiator side, a return from
MPI_Win_complete indicates that the buffer may be reused in case
of a put, and the data has arrived in case of a get. Group-wide fences
are often simpler to use than this four-event protocol, but also less
efficient, particularly when no fence flag applies.
Lastly, MPI supports lock-based synchronization. This is the most
like shared-memory synchronization, in that the lock is with respect
to a window and can be controlled exclusively by the initiator; there is
no event required on the target side, like a post or a group-wide
fence. An example is shown below. This is a coarse-grained lock –
the entire exposed buffer on the given rank is locked at once. In fact,
MPI_Win_lock_all locks all the ranks at once.
int flag = 0;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, targetRank, flag, win1);
// lock_type, rank, flag, window
MPI_Put(sendable, 4, MPI_INT, targetRank, 1, 4, MPI_INT, win1);
MPI_Win_unlock(targetRank, win1);
When a full lock is not required, the flush primitive is also avail-
able. MPI_Win_flush does not return until the outstanding get and
put for the corresponding window and rank return. Note that lock-
ing also implies a flush: at MPI_Win_lock all outstanding get and put
are flushed.
Remote access is not just limited to get and put. As on other shared-
memory platforms, read-modify-write operations are supported. For
example, MPI_Accumulate operates (using MPI_Op) on a variable
visible remotely through a window. This merges three messages –
one requesting the data, another returning the original value, and
the last to write back the updated value – into as few as one, reduc-
ing the latency significantly. Furthermore, MPI_Accumulate by two
different processes through the same window to the same location
exposed by a third process are serialized, i.e., the accumulations
appear to be applied in some sequential order.
do {
// Assumed reconstruction: newval is the value this process wants to install
MPI_Compare_and_swap(&newval, &expected, &oldval, MPI_INT, target, 0, win1);
MPI_Win_flush(target, win1);
} while(oldval != expected);
// Limited to one item of certain types. There is no count parameter.
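A sketch of MPI_Accumulate itself, reusing the sendable buffer, targetRank, and win1 from the earlier one-sided examples:

MPI_Win_lock(MPI_LOCK_SHARED, targetRank, 0, win1);
// Add the 4 local ints into the target's exposed buffer, starting at index 1;
// concurrent accumulates to the same location are serialized by MPI
MPI_Accumulate(sendable, 4, MPI_INT, targetRank, 1, 4, MPI_INT, MPI_SUM, win1);
MPI_Win_unlock(targetRank, win1);   // Completes the operation at origin and target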
MPI File IO
Any large parallel program is likely to read large input and write
large output. Parallel file systems allow multiple clients to read and
write simultaneously. It stands to reason, then, that MPI processes
may benefit from reading and writing data in parallel. In principle,
FT
the storage may be considered akin to a process, which could broad-
cast, scatter, or gather data. MPI does not provide such interfaces.
Instead, it supports a lower-level interface. MPI allows processes to
see a file as a sequence of structured data, i.e., MPI_Datatype. The
type that the file consists of is called its etype, short for elementary
type. The file is effectively treated as an array of etypes.
Collective IO on this file provides an opportunity for parallel file
access. Each process has its own view of a file. Although less com-
Figure 6.6: Two views of the same file
MPI_Type_create_resized(int1K, lower, extent, &filetype);
// Resize to artificially create holes
MPI_Type_commit(&filetype);

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
// Comm, file name, access mode, implementation hint, file handle
MPI_File_set_view(fh, rank*byteBlock, MPI_INT, filetype, "native", MPI_INFO_NULL);
// file handle, start, etype, filetype, data rep, hint
blocks, based on its view. The view consists of blocks of 1024 ints
separated by the blocks of other processes. The starting point of the
file is also offset according to the rank. In reading and writing (and
seeking), the holes in the view are skipped. For example, if a process
were to read 1028 integers, its file handle would point to the fifth
integer of the second block in its view, meaning the next read would
read from that point. Each process's file pointer is independent of
those of other processes, and proceeds according to that process's reads or
writes. MPI also supports global file pointers, i.e., shared file pointers:
one process’s read advances the shared file pointer for everyone.
File read and write are analogous to receive and send. Unlike
file open and set view, MPI_File_read and MPI_File_write are
not collectives. Collective versions do exist: MPI_File_read_all and
MPI_File_write_all. These allow the system to efficiently perform
combined and parallel IO on behalf of multiple processes. The
file access is sequentially consistent in this case. With separate file
pointers, consistency can be demanded through the use of the function
MPI_File_set_atomicity, albeit at the cost of performance. The tradi-
tional weak consistency through the collective MPI_File_sync is also
supported. IO completion does not cross sync boundaries, and syncs
are seen in the same order by all processes. File open and close are
sync primitives by default.
The IO primitives in Listing 6.24 are all blocking. Non-blocking
IO is similar to non-blocking send/recv and is accessed through
functions like MPI_File_iwrite and MPI_File_iread.
MPI_Group allGrp, first3Grp;
MPI_Comm first3Com;
int incl_ranks[3] = {0, 2, 4};
MPI_Comm_group(MPI_COMM_WORLD, &allGrp);
MPI_Group_incl(allGrp, 3, incl_ranks, &first3Grp);
MPI_Comm_create(MPI_COMM_WORLD, first3Grp, &first3Com);
In this example, each member of the group of MPI_COMM_WORLD
creates a new group. These groups' memberships coincide – ranks
0, 2, and 4 with respect to group allGrp. These members will have ranks
0, 1, and 2, respectively. Members of a group are always ranked.
For new groups, these ranks are created by maintaining an order
consistent with the constituent groups. Thus, the set operations are
not commutative: the members of the first group are ordered before
that of the second.
MPI_Comm_create in the listing above creates a communicator
first3Com. Communication can then occur within the context of
first3Com, which will involve the members of group first3Grp. Not
all communicators have a single group. Communication may also
be from one group to another. Such inter-group communicators are
called inter-communicators and are particularly useful for client-server
style work subdivision. The main focus of this discussion is intra-
group communicators.
Group creation is a local activity. Each process may define its own
groups. However, members of the group that may communicate on
a communicator must all take cognizance. Communicator creation
is group-wide collective. All MPI_COMM_WORLD participants
in the example above must call MPI_Comm_create. Their groups
need not be identical, but each needs to be a subset of the group
corresponding to the original communicator. In particular, different
subsets of the original group may create their own communicators.
For consistency, it is important that in
MPI_Comm_create_group is more efficient – it is collective with respect
only to the new communicator. If p’s newgroup includes q, q must
also call MPI_Comm_create_group along with p.
MPI Dynamic Parallelism
Instead of reorganizing already existing processes, MPI also supports
dynamic creation of processes. MPI_Comm_spawn is a collective
that creates children processes, but a single member of the parent
group, the root, determines the spawn parameters. Other members
must still call MPI_Comm_spawn to receive the handle to the inter-
communicator, with which they may communicate to the child
group.
int root = 0;
MPI_Info info;
MPI_Comm child_com;
int spawnerrors[numPROC];
MPI_Comm_spawn(command, argv, numPROC, info, root,
MPI_COMM_WORLD, &child_com, spawnerrors);
MPI_Init(&argc, &argv);
MPI_Comm parent_com;
MPI_Comm_get_parent(&parent_com);
if (parent_com == MPI_COMM_NULL) {
// This is a top level process
before its MPI_Finalize.
Point to point, and even collective, communication is somewhat low
level. The program must keep track of explicit ranks to communicate.
Sometimes it is easier to communicate directly in terms of data
relationships. For example, if an n × n matrix is divided into m × m
blocks and distributed block-wise among processors, one might want
to receive the left extremal column from the right block, or the right
extremal column from the left block (as shown in Figure 6.7). The
neighbors are known, standard send and recv can proceed as before.
MPI defines two types of topologies: d-dimensional grid or a
general graph. The following listing imposes a 2D grid topology,
using MPI_Cart_create, a collective primitive. All processes of a
communicator must call this function with the same parameter
values. Grids can wrap-around in torus configuration, using the
periodic boolean flag, specified separately for each dimension. The
topology creation functions return a new communicator, to which the
topology is associated. For optimization of communication, an MPI
implementation may re-number the processes, providing a new ID in
the new communicator, unless the caller requests that the processes
retain their ranks.
Listing 6.25: MPI Processes in a Grid Topology
int ID0;
MPI_Comm_rank(MPI_COMM_WORLD, &ID0);
MPI_Comm newcomm;
int mxm[2] = {m, m}, periodic[2] = {true, true}, rerank = true;
MPI_Cart_create(MPI_COMM_WORLD, 2, mxm, periodic, rerank, &newcomm);
// initial comm, no. of dimensions, size per dimension, wrap-around?, rank rename?, new comm
int recv[4];
MPI_Neighbor_allgather(&ID0, 1, MPI_INT, recv, 1, MPI_INT, newcomm);
6.3 Chapel
these locales are set up by providing arguments when the execution
begins. These locales are referred symbolically as an array Locales in
the source code. Tasks are assigned to locales for execution.
Chapel also provides the illusion of a single monolithic address
space across multiple processes and nodes. The actual data remains
distributed among nodes under a layer of abstraction called Parti-
tioned Global Address Space (PGAS, for short). The PGAS abstraction
maps certain addresses to the local memory, or the given process’s
address space, and certain others to a different process’s. Language
support is required to do this seamlessly because traditional lan-
guages only map variable names to local addresses. Library-based
PGAS tools also exist; they provide function-based access to non-local
memory.
the current execution.) One may subsequently query the location of
an index using distributedA[i][j].locale. There are other built-in
distributions, analogous to MPI datatypes. Similarly, gather and
scatter can be effected by reading from and writing to appropriate
locations in the array distributed across locales. Separate from dmaps,
scope-based allocation of variables on a certain locale is also sup-
ported. Before we take an example of that type, some understanding
of the execution model is required. We discuss this first.
Chapel Tasks
Chapel’s syntax differs somewhat from C/C++. (Parts of it are
similar to Python.) A study of Chapel’s documentation would be
required for readers trying to use Chapel. The goal of this section,
apart from introducing the notion of PGAS, is to get a taste of a par-
allel programming language that seeks to let the programmer focus
mainly on algorithm design and not on the low-level bookkeeping –
hopefully, with little loss of performance. Some details about Chapel
first (compare these to OpenMP):
• Variables (and constants) have static types, but they can be in-
ferred and need not always be declared.
The task forker used the sync keyword to wait for its tasks (and
their nested tasks). We demonstrate the tasking, sync, and locales
capabilities in the example below. The following code sequentially
dequeues tasks from a queue, creating a task per item, to be executed
at one of the available locales. Compare this listing to Listings 6.13 and 6.14,
which accomplish similar results with OpenMP.
sync {
while((taski = taskQ.dequeue()) != nil) { // Process next item on the queue
begin { on Locales[loc] do workOnTask(taski); } // Create task on a locale
loc = (loc + 1) % numLocales; // Distribute round-robin
}
} // sync implies: wait for all tasks generated in the block
// Initialization
var qvar: atomic bool; // false by default
// Use
while(! qvar.compareExchange(false, true)); // Set to true, if false
do_criticalSection();
qvar.write(false);
Chapel does not include explicit private and shared variable designa-
tion. Other than the explicit location using domain dmaps described
earlier, the location and scope of a variable can also be implicit.
4 var second = 2; // This variable is on Locales[1]
5 // The following loop executes on Locales[1]
6 coforall loc in Locales { // Create concurrent tasks, 1 per iteration
7 on loc { // On Locales[loc]
8 var local = distA[0] + distA[here.id*8] + second; // Fetch non-local
9 }
10 } // An implicit join with children tasks here.
11 }
12 on distA[local] do {computeSomething();} // Compute wherever the data is
this has limited utility. As argued before in this book, the structure of
parallel algorithms can be significantly different from that of a
sequential algorithm. While some pre-determined sequential patterns can
be converted into an efficient parallel program, it remains impractical
in a general setting. Chapel does not set out to derive such paral-
lelism, but rather to allow the programmer to devise the parallelism
and then express it at a high level. It still has some way to go before
its runtime is as efficient as hand-tuned MPI applications in com-
munication. Not all of its main features have been included in this
section. For example, it does have equivalents of reduction, single,
barrier, and other synchronization primitives. It also has modern
programming language features like iterators, zippering of iterations,
promotion of functions from scalars to vectors, etc.
6.4 Map-Reduce
OpenMP, MPI, and Chapel were all designed primarily with compute-
intensive workloads in mind. They focus on ways for the program
to distribute arbitrary computation. In contrast, the map-reduce
paradigm was designed with more data-centered computation in
mind. This paradigm focuses on the distribution and collation of data
with a small number of primitives: map and reduce.
Map-reduce is built around the idea of large-scale data-parallel
computation – each data item is operated upon. This is the map
operation. For generality, the map primitive is not one-in one-out:
it may generate any number of data items. The program is nothing but a map
function such that map(item) → {item set}. By itself, the map paradigm
is quite limited; it is suitable only for purely data parallel solutions.
In data analysis, statistical properties of the data items are usually
required. These are often computed using reduction. That forms the
second step of the map-reduce paradigm. The program includes a
reduce function such that reduce({item set}) → item. The final item is
the result.
Admittedly, mapping each data item and then reducing the entire
Map(K, V) → list(Ki, Vi)
Map and Reduce are functions with fixed input and output patterns
and user-provided implementation. Given a single <key-value> pair,
electronic item K.
Given the two primitives, a rather complex analysis may be done
by chaining together a series of map-reduce operations.
Parallel implementation
Map-reduce is a high-level programming model. The program needs
no reference to hosts, locations, processes, or threads. While one
can implement general solutions using map-reduce, it works best
where underlying operations are naturally similar to map and reduce.
The programmer does not need to provide parallel constructs as the
parallelism is built into the map and reduce primitives. The input is
expected to be a set of <key-value> pairs, with a Map operation to
be performed on each. All these maps are independent of each other
and may be performed in parallel. The results of the map have to
be sorted by the Keys. Sorting, as we will see later, parallelizes well.
Once the values are sorted into bins, one bin per key, each bin may
be reduced in parallel. Thus the Map and Reduce functions need
to be merely sequential. The parallelism comes from having a large
processes that execute the Reduce for each key and “shuffle” the
corresponding values to each Reducer.
stores, machine learning frameworks, etc. We will limit our discus-
sion to the basic map-reduce program structure as implemented in
the Hadoop framework 4, 5. Hadoop is a library-based utility widely
available with Java as the base language.
4 Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2009. ISBN 0596521979, 9780596521974
5 Apache Software Foundation. Hadoop project, 2020. URL http://hadoop.apache.org
Hadoop
outValueType outvalue;
for (inValueType inval : invalues) { // Iterate over invalues
accumulate(outvalue, inval);
}
// Possibly iterate producing multiple outvalues
context.write(outkey, outvalue);
}
}
produce and accumulate are the only user functions required in a stage.
An opaque Context handle is used to generate the output by both
mapper and reducer. Two classes, myMapper and myReducer in this
example, implement map and reduce, respectively. These classes are
registered with the framework using a provided class Job before the
job is launched.
An application may chain multiple map-reduce stages by using
a sequence of Jobs with input and output set accordingly. Hadoop
also supports a two-step reduction. The keys emanating from a single
mapper may be reduced at the mapper itself. Cross-mapper keys are
then reduced at a reducer. This strategy of combining the values at
a mapper first decreases the size of the data shuffled from mappers
to reducers. Thus, Hadoop may well be called a map-combine-reduce
framework. The program provides a combiner class, just like it
provides the Reducer class. For many applications, the reducer class
may also double as the combiner class.
5. GPUs of the day, due to their relatively low memory size and
indirect access to disk storage, are poor at context switching and
virtual paging. This imposes significant limits on the program.
OpenMP GPU Off-load
OpenMP. It provides a simple programming model based on the
computation off-load paradigm, which suits the clear separation be-
tween CPUs and GPUs at both architectural and OS levels. Programs
start as a part of a CPU process, and specific functions are designated
to be executed on the GPU. This may be thought of as a variant of
RPC, as shown in Figure 6.8. We will refer to the GPU part of the
code as the device part and the CPU part as the host part. Both belong
to the same process. Hence, they can conceivably share a common
address space. Sharing variables between the host and the device is
not always efficient, however. A more common strategy is to treat the
host and each GPU on a node as distributed-memory processors with
explicit copying of shared data. In OpenMP terms, this is similar to
the device code always using private variables. In some GPU architec-
tures, inter-GPU sharing is efficient. Shared memory may sometimes
be practicable for that part. OpenMP does not expose this shared
style, though.
Unlike MPI_Win’s explicit get and put primitives, the map clause
of the target pragma is used to create the linkage between the device
variables and their host counterparts. Map options allow original
variables’ values to be copied to the corresponding device variables at
the beginning of the task. They also allow variables to be copied back
from the device at the end of the task. Note that OpenMP implemen-
tations are allowed to omit physical device copies, and directly share
the original copies instead, given that the device and the host func-
tions share the same address space. Variables shared in this manner
are copied to the device on access by a device instruction and may
be cached on the device. The usual caveats about data races apply. It
is, hence, useful to think of the host variables and their correspond-
ing device variable to be separate copies that are synchronized only
before and after the task, depending on the map options. At other
times, they may diverge from each other. Mapped variables generally
should not be accessed in the host code concurrently with the device
code. Traditional clauses private and firstprivate may be used instead
of map. These variables are always copied to the device.
The following listing illustrates the off-loading style.
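A minimal sketch of this style, assuming an element-wise addition over arrays left, right, and result of length N:

#pragma omp target map(to: left[0:N], right[0:N]) map(from: result[0:N])
{
    #pragma omp parallel for        // These threads execute on the device
    for (int i = 0; i < N; i++)
        result[i] = left[i] + right[i];
}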
The code above off-loads a task containing the parallel for loop to
device 0. (Device 0 is the initial default. Function omp_get_num_devices
may be used to determine the number of attached devices.) The task
is in-line by default; there is an implicit barrier at the end of the con-
struct. The threads created by the enclosed parallel pragma execute
on the device, while the host task or thread encountering the con-
struct waits. If the nowait clause is specified in the target pragma, the
target task is forked and scheduled for later asynchronous execution,
while the parent task continues beyond the pragma. Note that sepa-
rate pragmas must be used for each device – by using separate code,
or by iterating over a block of code, using different device ID in each
iteration.
Arrays left, right, and result are originally accessible to the
encountering host task. (Target tasks must not encounter target
pragmas.) The to parameter in the map clause indicates that the
device’s private copies of arrays left and right are initialized from
the original host values. The from parameter does the opposite: at
the completion of the device code, the values in the result array are
copied from the device variable back to the host variable. If transfers
in both directions are required, map(tofrom: ...) may be used instead.
If a variable is used in the device code but is neither listed as a
private (or firstprivate) nor in a map clause, implicit copy rules apply.
Scalar values (int, float, etc.) are firstprivate by default, meaning
they are copied from the host to the device for each target task they
are used in. Non-scalars (arrays, structs, objects, etc.) are mapped
tofrom, if not listed on any map, private, or firstprivate clauses. All
scalars may also be mapped in both directions by using the default-
map(tofrom:scalar) clause.
In addition to maps on the target pragma, a variable may also be
persistently mapped, so it does not need to be re-copied for each
device task. We explore this next.
}
#pragma omp end declare target
device code.
Sometimes, mapping – whether implicit or explicit – of each
variable at each target task generation can be wasteful. Not all target
tasks require every variable to be copied in or copied back. Rather, it
may be possible that input variables are copied in before the first task
in a sequence of tasks, and the output variables are copied out after
the last task in the sequence. Target data pragma solves this problem.
Map clauses on the target data pragma apply to its entire code block,
which may contain target pragmas, as shown below.
Listing 6.30: OpenMP GPU off-load with reduced data copying
1 // Initialize: int size; float X[size], Y[size]; Allocate: *diff
2 #pragma omp target data map(alloc:diff[0:size]) map(to:X[0:size])
3 map(tofrom:Y[0:size])
4 {
5 #pragma omp target // size is firstprivate for task
6 {
7 #pragma omp parallel for
8 for (int i = 1; i < size-1; i++) // All GPU threads share size
9 diff[i] = (Y[i+1] - Y[i-1]) / (X[i+1] - X[i]);
10 }
11
12 // host code can go here
13 #pragma omp target // size is again firstprivate
{
14 #pragma omp parallel for
15 for (int i = 1; i < size-1; i++) // All GPU threads share size
16 Y[i] += shift(diff[i], X[i+1]-X[i]);
17 }
18 }
second task. Finally, only Y must be copied back to the host after the
second task. This is controlled by the target data pragma (line 2) that
contains the two tasks in its block.
Both tasks rely on the device data mapped by the target data
pragma, which is the primary regulator for its listed maps. In this
example, line 2 maps X to the device, ensuring that the device version
enclosed pragma like so:
Figure 6.9: Pointer Mapping
It is worth noting that a pointer really has two aspects: the address
value in the pointer variable itself and the data stored at that address.
For example, in the code above, the pointer diff may contain the
value A, meaning the floating-point array values are stored starting
at address A. See Figure 6.9. Mapping diff to the device, and thus
initializing the device copy of diff also with the value A, would be
incorrect, unless the device directly accesses the host memory. A is
the address of the data on the host. Hence, the data at A must itself
be mapped to the device, and the device's diff must be initialized
with B, the device address to which host address A is mapped. In OpenMP,
mapping diff maps both the pointer and its referred data. The pointer, being
a scalar, maps as firstprivate, and the array maps as per the map
option specified, i.e., alloc in the example above. This necessitates that
the original array at A must exist on the host and have a known size, even
though it is never accessed there.
Y[i] = 0.5 * (X[i+1] + X[i]);
}
It may be enclosed within a target pragma or combined with it.
The clause num_teams(count) requests count teams. Each team has
the same number of threads. This number is limited by the GPU
architecture, but clause thread_limit(count) may request smaller teams.
There is no synchronization possible between two teams executing
on the device, except the implicit barrier at the end of the task. The
teams pragma, by itself, replicates the entire task to the master thread
of each team. The distribute pragma allows work sharing instead. The
distribute pragma must be followed by a for loop; it distributes the
iteration of the loop among the teams, quite like a for pragma does
among the threads of a single team. The clause dist_schedule may be
specified with the distribute pragma to control which iterations are
allocated to which team. Still, only the master thread of each team
gets that team’s share of iterations. The parallel for allows the master
thread of each team to further share its load with the threads of its
team. Finally, the simd pragma ensures that each thread uses SIMD
instructions for its share of the iterations, as sketched below.
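A sketch of this nesting, assuming a two-dimensional computation over rows × cols elements (A, f, rows, and cols are illustrative):

#pragma omp target teams distribute num_teams(8)    // Outer loop split across teams
for (int i = 0; i < rows; i++) {
    #pragma omp parallel for simd                   // Inner loop shared by each team's threads
    for (int j = 0; j < cols; j++)
        A[i*cols + j] = f(i, j);
}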
In this case, the outer loop is distributed among the teams – rather, among
the master thread of each team. Each master thread then encounters
the parallel for construct, which is shared by its team of threads.
A thread can query its rank within its team using the function
omp_get_thread_num described earlier. omp_get_team_num returns
the rank of the calling thread’s team.
CUDA
CUDA is a more mature GPU programming framework than
OpenMP. However, it focuses on lower-level constructs than OpenMP.
This finer program control is often able to extract higher performance.
Like OpenMP, CUDA uses the off-load model; device functions are
off-loaded to, and execute on, the specified GPU, identified by its
rank, i. e., device ID. CUDA has two main components.
The first is a C-like programming language, called CUDA. It
contains a small number of extensions to C/C++, but most of its
functionality is exposed through functions. CUDA programs require
a CUDA compiler, which may in turn leverage a C compiler for
translating the standard C/C++ parts. Device functions are cross-
compiled to be executed on the GPU. Host functions are compiled
to the CPU. Both parts of the program are stored in a common exe-
cutable, which includes instructions to load the device code on to the
device as required.
The other component is the CUDA runtime environment, which
allows CPU-executed code to interact with GPUs. It includes data
communication, GPU code transmission, execution setup, and thread
launch, etc. Like OpenMP runtime, CUDA runtime provides func-
tionality that implements CUDA constructs and exposes functions
that a program may use to query GPU information as well as con-
trol GPU behavior. Like MPI functions, CUDA functions return an
error code on any error and the constant cudaSuccess on success. We
will not check this returned code in our illustrations, but it is a good
practice to do so.
number of threads to create in each team. Each thread is passed the
function’s parameters by value, and each thread executes the function
body. The team of threads is called a block of threads in CUDA. The
block is further organized into groups of up to 32 threads; each
group is called a warp. Threads of a warp execute together in SIMD
fashion. It is useful to note that the thread terminology of CUDA
differs from that of OpenMP. The entire CUDA warp is equivalent to
one OpenMP device thread. Thus, in CUDA, the program directly
controls each SIMD lane. The threads of a warp may diverge, to
execute different code.
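A sketch of the kind of kernel and launch being described, using the thread_func, left, right, and result names that appear in the later listings (the element-wise operation and the sizes are assumptions):

__global__ void thread_func(float *left, float *right, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // This thread's element
    result[i] = left[i] + right[i];
}

// Host code: allocate memory accessible from both the host and the device
int num_teams = 32, team_size = 256;
float *left, *right, *result;
cudaMallocManaged(&left,   num_teams * team_size * sizeof(float));
cudaMallocManaged(&right,  num_teams * team_size * sizeof(float));
cudaMallocManaged(&result, num_teams * team_size * sizeof(float));
thread_func<<<num_teams, team_size>>>(left, right, result);  // Launch on the device
cudaDeviceSynchronize();   // Wait for the kernel to complete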
Variables left, right, and result in the listing above are declared
on the host, and point to memory allocated by the host code. Both
the pointer and the address it points to are visible on both the host
and the device. cudaMallocManaged is used instead of the standard
C/C++ malloc or new. cudaMallocManaged allows CUDA runtime
to ensure that the allocated address is efficiently accessible on the
device; memory allocated with malloc and new is also accessible on both
the host and the device. Thus, left, right, and result are truly shared between the host
and the device in the example above. This means that any concurrent
RA
6
dim3 team_size(4, 8, 16); // 4x8x16, z dimension is 4, x is 16
1.
ranks, execute in a warp. The serialized rank of a thread is:
6
of left and right, respectively, to the specified device. If the data is
not initially written on the host, no actual transfer takes place. Once
the data is pre-fetched to a device, access is local within a kernel
and hence does not incur a long latency. To prefetch to the host, the
1.
special device ID cudaCpuDeviceId must be used. The last parameter
(NULL) of the function cudaMemPrefetchAsync is a stream, which we
will discuss shortly.
There also exist lower-level interfaces, where the transfer is per-
formed explicitly by the host program, using one-sided communi-
cation. Refer to cudaMemcpy and cudaMemcpyAsync. Such explicit
transfer may also become necessary for host variables not allocated
FT
using cudaMallocManaged. For global or static variables, one may
declare them to be on device, similar to OpenMP declare target.
The variable dev_var and the function dev_func are both declared
to be device entities. Device functions may be called from other
device functions including the __global__ kernels. The __managed__
keyword is also available and indicates that a variable is shared
between the devices and the host.
__managed__ int dev_var;
D
Concurrent Kernels
Kernel launch is non-blocking on the host. The same CPU thread or
different threads may each launch multiple kernels on to multiple
GPU devices. However, each device may execute only one kernel
at a time by default. The next kernel to that device waits until the
previous completes. CUDA has an abstraction called streams that
middleware: the practice of parallel programming 223
6
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
1.
// Set up left, right, result, num_teams, team_size
thread_func1<<<32, 256, 0, stream1>>>();
thread_func<<<num_teams, team_size, 0, stream2>>>(left, right, result);
cudaStreamSynchronize(stream1); // Wait for events in stream1 to complete.
// May use result here
// Later, after streams are no more needed:
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
FT
Both kernels above (thread_func and thread_func1) may execute
concurrently, as they are in different streams. The third parameter
in the <<<>>> construct request an additional block of memory
private to each block and shared by all threads of the block. We will
discuss this shortly. cudaStreamSynchronize may be used on the host
to wait for only a specific stream. cudaDeviceSynchronize waits for all
outstanding execution on the device instead. Memory allocation and
transfers may also be associated with specific streams.
RA
In the listing above, the host thread first launches the kernel thread_func1
in stream1. This kernel may begin to execute immediately on the de-
D
vice. The host thread then associates the pre-fetch of memory areas
*left and *right to a stream2 before launching thread_func in that
stream. All three are non-blocking calls on the host. The pre-fetching
can occur concurrently with the execution of thread_func1 as they
are in different streams. The execution of the thread_func kernel
follows the pre-fetch on stream2, thus ensuring that the access to
224 an introduction to parallel programming
6
follows:
1.
cudaMemcpyHostToDevice indicates the direction in which the transfer
is to take place. This parameter is required for legacy reasons. Con-
temporary CUDA runtime is able to infer the direction of transfer
and cudaMemcpyDefault may be used instead.
Kernels may also be launched from within device code, meaning
any thread of a kernel may recursively launch a child kernel. Threads
FT
of the parent kernel may execute concurrently with the child thread
but wait for the execution of the child to complete before exiting
themselves.
CUDA Synchronization
CUDA supports atomic operations, memory fences, and execution
barriers. Additionally, warps execute in synchrony, except when
threads diverge due to scheduler’s decisions or due to conditional
RA
branch in the code. Note that no two thread of a warp may execute
different instructions in the same clock. When threads of a warp
diverge and start executing different parts of the code, they are
no more in lock-step. Rather, subsets diverge. Subsets take turn to
execute their instruction, leaving some lanes in the warp un-occupied.
For example, in
else
even_work();
6
atomicAdd(&var, value);
1.
respect to other devices and CPU, e.g., atomicAdd_system. Variants
also exist for atomic operations with respect to other threads of
the block, e.g., atomicAdd_block. The atomic operation is a powerful
synchronization primitive, but it also serializes threads and does not
scale well. GPU kernels commonly employ hundreds and thousands
of actively executing threads. Performance impact can be significant
if many of them perform an atomic operation on the same address at
FT
roughly the same time. Block level synchronization is likely to have
less slowdown. Warp level operations are even more efficient, and
several such synchronization primitives exist.
Relative to other atomic operation, compare and swap (atomic-
CAS) is more flexible and may be used to implement more complex
synchronization.
__syncthreads() is a block-wide barrier. There exist consensus
type variants as well, which perform a form of voting. For example,
RA
variable val local to each thread of the warp, using the tree reduction
algorithm shown in Section 3.2. After five iterations of the loop, the
thread in lane 0 contains the reduced value in its val.
6
mask 0xffffffff at all iterations. The right half of the threads could be
inactivated at each step, but changing the mask requires additional
instructions. The warp-wide barrier on line 2 before the loop ensures
that all warp threads are converged and start the loop in lock-step.
1.
There is no kernel-wide barrier in CUDA, but there exists an
evolving abstraction called cooperative groups, which is an arbitrary
group of threads. Threads within a group may barrier-synchronize,
even if they are not in the same block, as long as the entire group is
resident on the GPUs.
Recall that careful ordering of memory accesses is required if
one thread reads the value written by another. Memory fences are
FT
required due to weak consistency semantics in CUDA, just like
OpenMP and MPI one-sided communication. Warp-wide, block-wide
and system-wide memory fences are supported. For example, a
__threadfence() function call by any thread i ensures that its accesses
to device memory before the fence are all ordered before its accesses
after the fence. In particular, memory-writes by thread i before its
fence appear also to all other device threads to have completed before
any writes by thread i after that fence (see Figure 6.10). __thread-
RA
and the CPU. Note that unlike OpenMP, a cache flush is not implied
in CUDA memory fences. Some implementations of CUDA do not
provide complete cache coherence and variables must be declared
volatile to disable caching. This can lead to performance degradation.
Synchronization primitives __syncthreads and __syncwarp include
an implicit memory fence and (do not require the use of volatile).
middleware: the practice of parallel programming 227
6
1.
FT
available SMs for execution. Since the scratchpad is local to the SM,
Figure 6.11: Shared memory on GPU
SM
6
The last parameter of the launch construct, the stream, is missing in
this example and defaults to Stream 0. This dynamic allocation is
accessed as an extern within the kernel function.
1.
extern __shared__ int *shtemp;
shtemp[threadIdx.x] = some_array[base+threadIdx.x];
// Launch ensures that this is not out of bounds
coalescing.
The accesses to device memory and SM-local memory behave
differently from each other. Device memory is accessed using the
standard cache-hierarchy, and device memory atoms are contiguous
addresses. Thus, if threads of a warp access contiguous addresses,
the coalescing efficiency is good: either all the accesses can be satis-
middleware: the practice of parallel programming 229
fied directly from one or two cache-lines, or from one or two memory
atoms (which are brought into cache-lines).
Device memory atoms are aligned, meaning they begin at 32, 64,
or 128 byte boundaries. Variable addresses in CUDA already begin
at atom boundaries, courtesy of the compiler. Thus, indexes used in
a warp may coalesce well if they are also aligned. For example, if the
indexes used by a warp’s threads are contiguous and thread 0 of the
warp uses an address that is 128-byte aligned, coalescing is effective.
In other words, our working example is efficient because warp i starts
6
at offset i ⇥ 32 ⇥ sizeo f (int) for each array, and cumulatively accesses
32 ⇥ sizeo f (int) bytes. All these bytes belong to a single 128-byte
aligned atom (if each array begins at an aligned address).
1.
result[threadIdx.x] = left[threadIdx.x] + right[threadIdx.x];
A[blockIDx.x][threadIdx.x] *= B[blockIDx.x][threadIdx.x];
For any given row, the thread ID is used as the column number in
warps 0..31, 32..63, etc. They are aligned only if each row begins at
an aligned address. See cudaMallocPitch to allocate an array with
aligned rows.
SM-local scratchpad memory is a bit more elaborate and multi-
ported. Not only is it able to service a 128-byte contiguous block of
memory, but it can also service more complex patterns. To under-
D
there are bank conflicts, instead, the accesses must be serialized. The
following accesses are conflict-free.
6
// Operate on any part of shA
The loop iterates over columns of the matrix. A warp’s threads all
1.
store values in column i at iteration i. Array shA is stored row-major.
Assuming shA[0][0] resides in Bank 1, shA[0][1] 2 Bank 2, and so
on, as shown in Figure 6.12. Assuming 32 banks, shA[0][30] 2 Bank
31 and shA[1][0] 2 Bank 2. Thus, each column is distributed across
banks; column 0 is highlighted in the figure. This means that even
for a row-major ordered matrix, column order access is efficient.
FT
Figure 6.12: SM-memory bank address-
RA
ing
The example in Listing 6.39 also exhibits a common use of block
memory. It uses block-shared memory as a user-controlled cache to
make device memory accesses efficient. Suppose the device array M
needs to be accessed in a row-major order, which does not coalesce
well for row-major ordered matrices. The code above allows M to be
read in contiguous coalesced chunks into the faster shared memory.
The rows are written into columns in the shared memory, without
causing bank conflict, and thereafter yielding a row-major order. The
__syncthreads() function call ensures that all required parts of M are
brought in to the shared memory before the kernel starts to process
D
it.
False Sharing
We discussed how memory requests within a warp are coalesced for
efficient operation. It is possible to arrange that each warp’s accesses
are contained, for example, in a single cache-line. Sometimes, how-
middleware: the practice of parallel programming 231
ever, cache-lines can create a hazard, and such hazard is not limited
to GPU threads, but also CPU threads.
When two independently scheduled threads access different
memory locations, which happen to map to the same cache-line,
their accesses can interfere with each other, leading to significant
performance slow-down. This happens because, when thread i writes
variable v, which resides in some cache-line, the entire line is marked
‘dirty,’ in all caches. When thread j executing on another core reads
or writes variable w, which happens to map to the same dirty cache-
6
line, a cache-coherent memory system delays the access until the line
is ‘clean’ again. The line is cleaned by writing thread i’s modified
cache-line into main memory and re-reading the line into thread
1.
j’s cache. The two threads end up falsely sharing variable v and w
and impacting each other’s performance. For example, consider the
listing below.
Thread 0 Thread 1
for(int i=0; i<N; i++) for(int j=0; j<N; j++)
point[i].y += 0.1; s += point[i].nbr;
A cache-line can hold multiple Points. The two loops above should
be able to exploit locality in their reference to service several access
RA
6.6 Summary
6
programming. the chapel language. Int. J. High Perform.
Comput. Appl., 21(3):291—-312, August
2007. ISSN 1094-3420
Exercise
11
Philippe Charles, Christian Grothoff,
1.
Vijay Saraswat, Christopher Don-
awa, Allan Kielstra, Kemal Ebcioglu,
6.1. Use the SIMD construct of OpenMP for Jacobi iteration as fol- Christoph von Praun, and Vivek Sarkar.
lows. A is an n⇥n 2D array of floats. X10: An object-oriented approach to
non-uniform cluster computing. In
1 forall i,j, 1 < i,j < n-1 Proceedings of the 20th Annual ACM
SIGPLAN Conference on Object-Oriented
2 A[i][j] = 0.25 *
Programming, Systems, Languages, and
3 ( A[i][j+1] + A[i][j-1]+ A[i-1][j] + A[i+1][j]) Applications, OOPSLA ’05, page 519–538,
New York, NY, USA, 2005. Association
for Computing Machinery
FT
Note the apparent race condition in the code above. Make sure
that the old values of A are added on line 3 always, never the new
values.
12
Tarek El-Ghazawi and Lauren Smith.
Upc: Unified parallel c. In Proceedings
of the 2006 ACM/IEEE Conference on
Supercomputing, SC ’06, page 27–es,
New York, NY, USA, 2006. Association
6.2. Redo Exercise 6.1 with CUDA.
for Computing Machinery. ISBN
0769527000
6.3. Implement function 13
J. Nieplocha, R. J. Harrison, and R. J.
Littlefield. Global arrays: a portable
PrefixSum(Sum, A, n)
"shared-memory" programming model
for distributed memory computers.
RA
In Supercomputing ’94:Proceedings
to compute the prefix sum of A in Sum. Use OpenMP. A is an inte- of the 1994 ACM/IEEE Conference on
ger array in the address-space of the caller’s process. Assume up Supercomputing, pages 340–349, 1994
to 16 shared-memory processors are available. n is the number 14
Michael Voss, Rafael Asenjo, and
James Reinders. Pro TBB. Apress, 2019.
of elements in A, and may be between 1 and 230 . Test your per- ISBN 978-1-4842-4397-8
formance with values of n equaling, respectively, 210 , 215 , 220 , 225 , 15
Jeff Bezanson, Alan Edelman, Stefan
230 . Karpinski, and Viral B Shah. Julia: A
fresh approach to numerical computing.
6.4. Redo Exercise 6.3 with MPI, with 2-24 nodes. SIAM Review, 59(1):65–98, 2017. doi:
10.1137/141000671. URL https://epubs.
siam.org/doi/10.1137/141000671
6.5. Given a matrix A laid out in file File1 in row-major order, create
file File2, where the transpose of A is written in row-major order.
D
6
(e) Each processor group in Exercise 6.5d also shares a GPU.
Create tasks and map them to the devices. Be sure to consider the
memory availability of devices in task sizing. You may use CUDA
1.
for GPU, OpenMP for shared memory programming, and MPI for
message passing.
6.7. Given a matrix A laid out in file File1 in row-major order and B
laid out in file File2 in column-major order, write A ⇥ B in file File3
in row-major order. Run the program on different matrix sizes:
FT
210 ⇥ 210 to 240 ⇥ 240 on 1 to 1024 processors and analyze its scaling
behavior. Implement for all five scenarios in Exercise 6.5.
6
6.11. Given a list of 230 integer elements in an array Elements in the
address space of one node N0 , implement MPI-based Quicksort
using 8, 32, 128, 512, and 1024 nodes, respectively. The sorted list
1.
should appear at N0 on completion. The input array Elements does
not need to be saved. Analyze the profile to check which parts
take the most time and why. Analyze the efficiency and scaling.
6
Question: What do parallel algorithms
This chapter introduces some general principles of parallel algorithm look like?
design. We will consider a few case studies to illustrate broad ap-
Question: How do parallel algorithms
proaches to parallel algorithms. As already discussed in Chapter 5,
1.
differ from sequential algorithms?
the underlying goal for these algorithms is to pose the solution into
parcels of relatively independent computation, with occasional in-
teraction. In order to abstract the details of synchronization, we will
assume the PRAM or the BSP model to describe and analyze these
algorithms. It is a good time for the reminder that getting from, say,
a PRAM algorithm to one that is efficient on a particular architecture
requires refinement and careful design for a particular platform. This
FT
is particularly true when ‘constant time’ concurrent read and write
operations are assumed. Concurrent read and writes are particularly
inefficient for distributed-memory platforms, and are inefficient for
shared-memory platforms as well. It requires synchronization of
processors’ views of the shared memory, which can be expensive.
Recall that PRAM models focus mainly on the computational
aspect of algorithm, whereas practical algorithms also require close
attention to memory, communication, and synchronization overheads.
RA
PRAM algorithms may not always be practical, but they are easier
to design than those for more general models. In reality, PRAM
algorithms are only the first step towards more practical algorithms,
particularly on distributed-memory systems.
Parallel algorithm design often seeks to maximize parallelism
and minimize the time complexity. Even if the number of actually
available processors is limited, higher parallelism translates to higher
scalability in practice. Nonetheless, the work-time scheduling prin-
ciple (Section 3.5) indicates that low work complexity is paramount
for fast execution in practice. In general, if the best sequential com-
D
plexity of solving the given problem is, say, To (n), we would like the
parallel work complexity to be O( To (n)). It is a common algorithm
design pattern to assume up to To (n) processors and then try to
minimize the time complexity. With maximal parallelism, the target
time complexity using To (n) processors is O(1). This is not always
achievable, and there is often a trade-off between time and work
236 an introduction to parallel programming
6
algorithms, and very much like PRAM algorithms, it focuses on the
computational aspect of the solution. Directly applying this principle
to map a PRAM algorithm onto a limited number of processors
1.
does not always exhibit the best performance. For example, in the
context of the BSP model, communication overheads are lower if the
virtual processors that inter-communicate substantially are mapped
onto the same physical processor. For shared-memory machines,
different PRAM algorithms can lead to different synchronization
overheads. This can make an algorithm that is apparently faster on
paper actually slower in practice. Hence, even though the theoretical
algorithm may naturally suggest a task decomposition, it may need
FT
to be adjusted to account for the hardware architecture.
While the focus of this chapter is on parallel algorithmic style of
thinking and ab-initio design, it also presents a few cases of oppor-
tunistically finding inherently parallel steps in known sequential
algorithms.
We have already discussed the reduction algorithm in Section
1.6, which is an example of a parallel algorithm organized as a bi-
nary tree of computations. This is an oft-occurring paradigm, where
RA
6
This is an efficient sequential algorithm – it takes O(n) steps, and no
algorithm may take fewer steps asymptotically. However, each step
depends on the previous computation, precluding any meaningful
1.
parallelism.
Prefix-sum is a good example of problems where an efficient
sequential algorithm does not admit parallelism, but a fresh parallel
design affords significant parallelism. Trying to factor out and reuse
common computation is an important tool in sequential algorithm
design. That is precisely what causes the dependency, however.
Instead, a parallel algorithm is designed to subdivide a problem into
FT
independent parts, even if those parts repeat some computation.
For the prefix-sum problem, the main question is how to com-
pute s[i ] without the help of s[i 1]. An extreme way to break the
dependency is as follows.
6
Parallel Prefix-Sum: Method 1
We first break the dependency chain only between s[ n2 ] and s[ n2 + 1],
allowing s[ n2 + 1] to start to be computed before s[ n2 ] is available. This,
1.
in turn, breaks the dependency of all s[i ], i > n2 , on any s[i ], i n2 .
both halves are computed, the full prefix-sum for i > n2 remains to be
computed. However, now that the correct value of s[ n2 ] is known, the
partial sums can be completed in a single parallel step in O(1) time
with O(n) work as follows,
7.4).
By itself, the trick above does not lead to an improved complexity,
because the prefix-sum still needs to be computed for each half,
within which the dependencies remain unbroken. One sequential
problem has been subdivided into two sequential problems, each of a
smaller size. Using the divide and conquer paradigm, one can divide
parallel algorithms and techniques 239
6
s[ n2 ] can be accessed by all processors concurrently. This analysis
holds for CREW PRAM model and other models allowing concurrent
read. Exclusive read, as in EREW PRAM, or a broadcast of s[ n2 ] to all
1.
n
2 virtual processors, as in BSP, would require additional time. (See
Exercise 7.2.)
Once the two partial prefixes are known, they can derive the full sum
quickly from each other, as on line 6 above. This variant of top-down
interleaved decomposition has the same time and work complexity
as the previous block decomposition. (See Exercise 7.3.) However, it
does not suffer from the bottleneck of broadcasting s[ n2 ] to the entire
second half. Please note that the recursive structure of the overall
algorithm (as of all the algorithms in this section) is similar to that of
Method 1.
The first f orall sums the input values in pairs. The next step recur-
sively computes the prefix-sum on these pair-wise sums, which are
n
2 in number. This recursion will also proceed in the same bottom-up
fashion. The second f orall computes the final prefix-sum from the
prefix-sum of the pairs. The structure of the algorithm is depicted in
6
Figure 7.1. In Method 3, the dependencies are removed by reducing
the input set first (in parallel, of course). The recurrence relations for
1.
row of boxes denotes the state of the
array s after each step. The numbers
in the bottom row constitute the input
values. The first and last steps are each
fully parallel and completed in O(1) by
O(n) processors.
t(n) = t
⇣n⌘
FT + O (1)
⇣2 n ⌘
W (n) = W + O(n)
2
This yields the optimal work complexity of O(n), while retaining the
time complexity of O(log n). An unrolling of the recursive statement
shows that the structure of the solution is similar to the binary-tree
based computation of Section 1.6: a reduction going up the tree
RA
Figure 7.2(a) shows the steps of the upward reduction pass, and
Figure 7.2(b) shows the downward completion pass. Each level is
shown as a row of boxes, which depict the values in the array after
each step. In the upward pass, pairs are summed at each step, with
parallel algorithms and techniques 241
the number of such sums halving at each step. The sum so produced
at the last step of the upward pass is the sum of all values, which
is also the prefix-sum s[n]. In the downward pass, the prefix-sum
values are computed at each level from the prefix-sum evaluated
at the level above. The bottom-most level then computes the final
prefix-sum.
6
1.
(a) Upward pass (b) Downward pass Figure 7.2: Prefix-Sum algorithm in
two passes up and down a binary tree.
Having found an efficient algorithm to reduce the dependency The values in the dark font are the
in the prefix-sum, we will soon see that prefix-sum can, in turn, ones updated at that step. Hence, the
be used as a subroutine to break similar dependencies in other number of active processors at each step
are indicated by the number of values
problems. Prefix-sum is more generically called a Scan, which has no in the dark font in that row.
FT
connotation of summing. Exclusive scan is defined as a scan where the
ith element is not included in the ith result, i. e.,
i 1
si = Â dj.
j =0
1 i = j = 0
2 while(i < n and j < n)
3 if(list1[i] < list2[j])
4 output list1[i++]
5 else
6 output list2[i++]
7 while(i < n)
8 output list1[i++]
242 an introduction to parallel programming
9 while(j < n)
10 output list2[j++]
Let the output list be called list3. This algorithm takes O(n) steps.
It is inherently sequential because only after the result of the compar-
ison of the pair list1[i ] and list2[ j] is known at line 3 in an iteration,
that the pair to compare in the next iteration is determined.
6
We first consider breaking this dependency similarly to the previous
section. The standard binary subdivision of list1 and list2 into two
halves each, followed by the merger of each pair does not yield two
1.
independent sub-problems. However, for each half, it is easy to
determine the block of the other list that it needs to merge with, so
that the recursive sub-problems do become independent.
of elements in list that are less than x.2 Let us use the shorthand 2
Note that this usage of the term is
rank1 for Rank (m1, list2), the rank of m1 in list2. m1 here is the slightly different from that in Chapter 6.
Figure 7.3). Hence, the two smaller sets of elements are the smallest
rank1 + n2 elements of list3. This implies that the first part of list3
can be obtained by merging the first n2 elements of list1 with the first
rank1 elements of list2. The second part of list3 can be obtained by
merging the remaining elements of list1 with the remaining elements
of list2. These are two independent merge sub-problems. The lists
to merge need not have the same length any more. In the following
listing, we assume list to have n1 and n2 elements, respectively. The
merged list is produced in list3.
D
list3[0..ni/2+ranki-1] =
Merge listi[0..ni/2-1] with listj[0..ranki-1]
list3[n/2+rank..n1+n2] =
Merge listi[ni/2..ni] with listj[ranki..nj]
6
EREW PRAM model is:
✓ ◆
3n
t(n) t + log n
4
✓ ◆ ⇣n⌘
3n
1.
W (n) W +W + log n
4 4
6
positions of list1, and list2e comprises
the even positions of list2. Rank e 1 and
Rankee 2 are the rank lists with respect
to list1e and list2e .
1.
FT
demonstrates how to compute Rank1 and Rank2 from Rank e 1 and
Rank e 2. Rank e 1 and Rank e 2 may be computed recursively, or simi-
larly to the bottom-up variant of the parallel prefix-sum algorithm.
Say, Rank e 1[i ] = re is the rank of element list1[2i ] in list2e . The fig-
ure shows this element as x2i . This implies that re elements of list2e
are smaller than xi , meaning list2[0..2re 1] are smaller than xi and
list2[2re ] > xi . Hence, Rank ( xi , list2) is also 2re if xi < list2[2re 1]
and 2re + 1 otherwise. (Note that list2[2re 1] is not included in list2e ,
RA
and hence it was not compared in the recursive merging of list1e and
list2e .) Ranks of all elements of list1e in list2 and those of elements
of list2e in list1 can be computed in this manner in parallel with each
other, taking O(1) time under CREW PRAM model. Thus, the work
complexity to compute Ranklist(list1e , list2) and Ranklist(list2e , list1),
given Ranklist(list1e , list2e ) and Ranklist(list2e , list1e ) is O(n) . Con-
current read is required because, say, Rank( xi+1 , list2e ) may also be re .
In that case, list[2re 1] would be required in the computation of both
Rank( xi+1 , list2) and Rank( xi , list2).
We next compute the ranks of the odd-index elements of list1 and
list2. These are at list1[2i + 1] for index i of list1e , and similarly for
D
list2. (Note that 2i + 1 may reach beyond the end of list1; these edge
effects are easy to handle, but we ignore them here for simplicity of
description. One may assume the value • at such indexes.) Rank1[2i +
1], which is a shorthand for Rank(list1[2i + 1], list2), can be computed
from Rank1[2i ] and Rank1[2i + 1], which are known from the previous
step. Recall that list1 is sorted, and hence list1[2i ] < list1[2i + 1] <
parallel algorithms and techniques 245
list1[2i + 2]. Suppose Rank1[2i ] and Rank1[2i ] are equal; call them r.
This means Rank1[2i + 1] must also be r, since list2[r 1] < list1[2i ]
and list2[r ] > list1[2i + 2] and hence list2[r 1] < list1[2i + 1] <
list2[r ]. We may not always be so lucky though. For example, in
Figure 7.4, Rank ( x6 , list2) is 3, and Rank( x8 , list2) is much higher, say
n. To find Rank ( x7 , list2), we must find which elements in the range
y3 ..yn 1 are smaller than x7 . A binary search would find that index
but would take too long; we seek an O(1) algorithm.
Realize, however, that we already know the ranks of elements
6
Rank(y j , list1) for all even j, 3 j < n. These ranks are all either
7 or 8 in the example. In fact, we are looking for the index k such
that Rank(y j , list1) = 7 for j k and Rank (y j , list1) = 8 for j > k.
1.
In this example, k = 8. Thus k can be computed in O(1) time if
processor j for each even value of j checks if Rank(y j , list1) + 1 equals
Rank(y j+1 , list1). The processor – and there is exactly one – that finds
it true may now compute Rank(y j + 1, list1) as well as Rank ( x7 , list2)
in this example, and more generally Rank ( xr1 , list1), where r1 is
Rank(y j , list1). Rank( xr1 , list2) needs to be computed only if r1 is
even. This is detailed in Listing 7.11.
FT
Listing 7.11: Parallel merge 2: "Bottom Up?" dependency breaking
Merge(list1e , list2e ) // Create Rank1e and Rank2e
in parallel // First compute rank of even elements from Ranke
forall processor p, 0 <= p < n and p%2==0
if(list2[Rank1[p]+1] < list1[p])
Rank1[p] = 2*Rank1[p] + 1
else
Rank1[p] = 2*Rank1[p]
forall processor q, 0 <= q < n and q%2==0
RA
else
Rank2[Rank1[p]] = p+1
forall processor q, 0 <= q < n and q%2==0
if(Rank2[q] == Rank2[q+2])
Rank2[q+1] = Rank2[q]
else if(Rank2[q]+1 == Rank2[q+2]) and Rank2[q]%2 == 1
if(list1[Rank2[q]] > list2[q+1])
Rank1[Rank2[q]] = q+2
246 an introduction to parallel programming
else
Rank1[Rank2[q]] = q+1
The recurrence relations for the algorithm in Listing 7.11 for the
CREW PRAM model is:
⇣n⌘
t(n) t + O (1)
⇣2 n ⌘
W (n) W + O(n)
2
6
This implies t(n) = O(log n) and W (n) = O(n). Work is optimal, but
can time complexity be improved? Let us investigate.
1.
Parallel Merge: Method 3
Recall that the main task is to compute the ranks of every element of
list1 and list2 in each other. Each rank can be potentially computed
independently of the other ranks. One natural way to partition this
task is to subdivide one of the lists, say list1, into sublists list1m , 0
m < P, for a given P. Employing P = n processors, each processor
may complete its ‘merger’ in O(log n) time by performing a binary
FT
search for the singleton element of list1m in list2. Partitioning a
problem into sub-problems to solve is a common parallel algorithm
design technique.
A closer inspection indicates that Rank (list1m , list2) Rank (listm+1 , list2).
Separate binary searches for list1m and list1m+1 disregard this rela-
tionship, each proceeding independently of the other. On the other
hand, we do not want the search for list1m+1 to wait until that for
list1m is complete. On the other side, trying to reduce the time com-
plexity further by performing faster searches for list1m may require
parallel algorithms and techniques 247
6
x < list P [ p]. This condition is true for at most one value of p, given
that elements in list are unique. If the condition does not hold for
any processor p, it implies x > list P [ P 1], i.e., x lies in the last block.
1.
In O(1) time with O( P) work, we thus determine the block of list in
which x may lie. We recursively allow all P processors to find the
rank of x in that block next. This extends the sequential binary search
into a P-ary search.
if x > list[L+P*n_p]
with P processors: Search in range {L+p*n_p}..R
else
forall processor p in 1..P-1
if list[L+(p-1)*n_p] < x < list[L+p*n_p] // Test for equality to find x
with P processors: Search in range {L+(p-1)*n_p}..{L+p*n_p}
// x not in this processor’s block. Return.
log P
the efficiency with P processors is proportional to O( P ). Nonethe-
less, this algorithm scales up to P = n, and takes time O(1) with n
processors, which equals what brute-force search would take with n
processors.
The P-ary search algorithm above demonstrates one other algorith-
mic technique. It generalizes the binary-tree computation structure,
248 an introduction to parallel programming
6
every kth element of list1 into list1k , meaning list1k [i ] = list1[ik]. A
large value of k reduces the size of the recursive sub-problem. On the
other hand, a large k also leave a large number of ranks remaining to
1.
be computed after the sub-problem is solved.
p p
Suppose k = n. list1k and list2k are each of size n. As a
p
result, we can find the rank of each element of list1k in list2 using n
processors for each search:
p
Listing 7.14: Parallel Merge 4: n subdivision
p
// Rank n elements of list1 in list2
p
rootn = n
FT
forall processor p in 0..rootn-1
with rootn processors P-ary Search list1[p*rootn] in list2[0..n-1]
p
n processors can find Rank(list1k [i ], list2) for any i in O(1) time
p p
using O( n) work. Since there are n elements in list1k , n proces-
sors can compute all of Ranklist(list1k , list2) in O(1) time, with O(n)
work. We can similarly compute Ranklist(list2k , list1) in O(1) time,
with O(n) work. This seems good; except much work remains – we
p
RA
in list2 are known after the P-ary search. Call them ri and ri+1 . We
p p
know that ranks of all elements list1[ x ], where i n < x < (i + 1) n,
are also in the range ri ..ri+1 . This means that we can decompose the
p p
merger into smaller mergers: Merge list1[i n..(i + 1) n 1] with
list2[ri ..ri+1 1]. This sub-problem can be large if ri ⌧ ri+1 .
However, in that case, just as in parallel merge method 2, the
range list2[ri ..ri+1 ] contains elements from list2k whose rank in list1
p
are known. Those elements delineate blocks with no more than n
p
elements each. Moreover, their ranks in list1 are not more than n
6
p p
apart, as they all lie in the range (i n)..((i + 1) n 1). This ensures
that we may now independently merge pairs of blocks of list1 and
p
list2, respectively. The number of such pairs is at most 2 n as at
p
1.
least one block of each pair has n elements. Thus the recurrence
relation for complexity is:
p
t ( n ) = t ( n ) + O (1)
p p
W (n) = nW ( n) + O(n)
This means that t(n) = O(log log n) and W (n) = O(n log log n). This
W (n) is not optimal, even if the time complexity is now lower. A
FT
subtle point to note: each recursive sub-problems computes the ranks
only with respect to its block of elements. For example, in Figure 7.5,
the recursive sub-problem computes the rank of list2[k + 1], the ele-
ment shown as ⌅, in the part of list1 marked by $. If this computed
rank is srank, Rank(list2[k + 1], list1) is srank + Rank (list2[k], list1).
remains log log n). Recall that if Rank (list1k [i ], list2k ) is r, list1k [i ] =
list1[ik ] lies between list2k [r 1] and list2k [r ], i.e., between list2[(r
1)k] and list2[rk ]. There are only k 1 elements between list2[(r 1)k ]
and list2[rk], and hence a single processor can locate list1k [i ] in O(k)
steps. nk processors can, in parallel, compute the ranks of nk elements
of list1k in list2. In parallel with these processors, nk processors can
compute Ranklist(list2k , list1) in O(k) time.
Similar to Merge Method 3, we now have 2 nk pairs of lists to merge,
each with no more than k elements. With k = log log n, each of these
6
mergers can be completed by a single processor in O(log log n) time,
requiring O(n) total work.
Thus the total time complexity of the optimal merge algorithm is
1.
O(log log n), and its work complexity is O(n) on CREW PRAM. This
is work-time optimal. Time complexity of any work-optimal PRAM
algorithm to merge two sorted lists with n elements is W(log log n).
In fact, the lower bound to merge sorted lists on an EREW PRAM is
W(log n). 3 . 3
T. Hayashi, K. Nakano, and S. Olariu.
Work-time optimal k-merge algorithms
on the pram. IEEE Transactions on Parallel
7.3 Accelerated Cascading: Find Minima and Distributed Systems, 9(3):275–282,
1998
FT
This section demonstrates a technique called Accelerated Cascading,
which is designed to first reduce the depth of the computation tree,
leading to an algorithm with lower time complexity at the cost of
increased work complexity. That algorithm can then be combined
with a work-efficient algorithm, which may have a slightly higher
time complexity. It works by recursively partitioning the problem
into sub-problems. It is a generalization of the binary tree computa-
tion structure and partitioning, except the number of sub-problems is
RA
not two (or a fixed number), but a function of the problem size itself.
p
For example, one may partition a problem of size n equally into n
sub-problems at each level.
We will use accelerated cascading to solve the problem of finding
the minima of an unsorted list of values. (Assume these values are
comparable to each other.) The regular binary-tree structure works
well for this problem:
Listing 7.15 requires O(log n) time and O(n) work. Work complex-
parallel algorithms and techniques 251
6
// Find if list[i] is the minima
smallerthan[i] = false
forall processor p in 0..(n-1), p != i
if(list[i] > list[p]) // Found a smaller element
1.
smallerthan[i] = true
p
is O(n n), with the time remaining O(1). In the second step, the
p
minima of the n block minima can be computed by repeating the
same algorithm. The second step requires O(1) time and performs
O(n) work. Let us call this algorithm Minima1.
Minima1 computes the minima of a list containing n elements in
1
O(1) time with O(n1+ 2 ) work. Minima1 was derived by employing
the more work-expensive algorithm on smaller blocks of the data.
Can we reduce this further by re-applying the same idea? The answer
is yes. We call this general technique accelerated cascading.
6
p
Suppose we employ Minima1 on blocks of size n. The first step
1 1
requires n 2 parallel invocations of Minima1 on blocks of size n 2 each.
1
The second step finds the minima of the n 2 block-minima, again
1 3 1
1.
using Minima1. The resulting total work is n 2 n 4 = n1+ 4 . The time
taken is that in two invocations of Minima1. This is the Minima2
algorithm.
This could go on. After k successive operations, the work com-
1+ 1
plexity achieved is n 2k . But, can this really go on indefinitely? Let
us take a closer look. If we use Minima2 on the original problem,
p
we require running n instances of Minima1 in parallel, each on a
p
block of n elements.
p
n elements into
FT
pp Each instance ofpMinima1,
n blocks of size
p
in turn, divides its
n each. This looks like the
binary tree algorithm structure, except the number of sub-problems
created at each level is not a fixed 2. Rather, it is the square-root of
the size of the problem at that level. Each of those sub-problem’s size
is also the square-root of the levels’ problem size.
in O(1) time, we know the O(log log n) levels can each be computed
in O(1) given sufficient processors at each level. Note that there are
n
2 computation nodes at step 0 (leaf level) and 1 node at the root. In
n l
general, there are l computation nodes at level l, with 22 elements
22
processed per node. Given that (n2 ) work is required to find the
l l
minima of n items, (22 )2 work is required to find the minima of (22 )
parallel algorithms and techniques 253
6
quentially. One processor block add up to O(log log n) work. We next
n
apply the fast minima algorithm on the log log n block minima, taking
O(log log n) time and O(n) work. That is work-optimal and has a
1.
better time complexity than the algorithm with the basic binary tree
structure.
Solutions to List ranking in this section and Euler tour and Con-
nected components in the next sections demonstrate the parallel
FT
algorithmic technique known variously as Pointer Jumping or Recur-
sive Doubling. It is particularly useful for traversal of paths in lists
and graphs.
Such traversal starts at a “root node” and follows pointers until a
specific node, or the end of the path, is encountered. For example, to
find connected components in a graph, one may perform a breadth-
first or a depth-first search starting at some arbitrary node, labeling
all reached nodes with the starting node’s label. The main idea is to
start exploring paths from all nodes in parallel, later Short-circuiting
RA
rank = 0
while(current != NULL) {
current.rank = currentrank
current = current.next
currentrank = currentrank + 1
}
254 an introduction to parallel programming
6
// Find the rank of all n nodes. headnode is the first node of a linked list
forall processor p in 1..(n-1)
if(nodelist[p] == headnode)
nodelist[p].rank[p] = 0
1.
else
nodelist[p].rank[p] = 1
skipnext[p] = nodelist[p].next
for step = 0..log(n-1)
forall processor p in 0..n
if(skipnext[p] != NULL)
nodelist[skipnext[p]].rank = nodelist[skipnext[p]].rank + nodelist[p].rank
skipnext[p] = skipnext[skipnext[p]]
FT
skipnext is initially a copy of the next reference of each node. This
copy is required because we later modify this reference to short-
circuit certain nodes, and do not want to destroy the original linked-
list. There are log n steps in the main loop of Listing 7.19. At each
step, all processors update the currently estimated rank of its next
node with reference to its own rank’s current estimate. Then, each
processor short-circuits its next reference by ‘jumping’ it to its next
node’s next reference.
RA
6
For example, in breadth-first traversal, all children of a node may be
traversed in parallel with each other.
In this section, we consider a depth-first traversal, particularly of
1.
a binary tree. This traversal is also called an Euler tour of the binary
tree. It proceeds as follows:
i.e., its rank, may be computed in parallel with other node’s ranks.
We will determine the rank by turning the binary tree structure
into a veritable list, which encodes the traversal order. As Figure
6
initial values of rank to 0 for these nodes (see Listing 7.19).
1.
Let us next see how to use pointer jumping to derive a simple algo-
rithm to find connected components in an undirected graph. (The
basic idea also applies to directed graphs.) Let us assume that graph
G is given as a list of edges edgelist, where ith edge edgelist[i ] is a pair
(u, v), where u and v are integers identifying two vertices, respec-
tively. Let us call the number of edges m, and the number of vertices
n.
FT
The goal of the problem to find connected components is to as-
sign a label label [u] to each vertex u, which identifies its connected
component. If vertex u has a path to vertex v, they are in the same
connected component. If there is no such path, they are in different
components. In other words, for all edges (u, v), label [u] = label [v].
Further, no such edge (u, v) may exist that label [u] is different from
label [v]. Vertices in different components have different labels. (Can
you make labels be a component ID? See Exercise 7.19.)
RA
label[p] = p
Repeat until any label changes
forall processor p in 0..m
(u,v) = edgelist[p]
if(label[u] > label[v])
label[u] = label[v]
parallel algorithms and techniques 257
Note that processors for two edges incident on a vertex u may both
write two different values to label [u] in the same step. An arbitrary-
CRCW PRAM would allow any one of these writes to succeed. List-
ing 7.21 works under this model. It terminates with a correct labeling.
The relabeling stops only when all edges have the same label on both
its vertices. Since all vertices start with unique labels, and no edge
exists between any pair of vertices in two different connected compo-
nents, no such pair may have the same label. Also, if vertices u and
v in the same component have different labels, it means at least two
6
adjacent vertices on the path u to v have different labels. However, no
two vertices connected by an edge are allowed to have different labels
by the algorithm above.
1.
How many steps are required before labels converge? At each
edge where there is no convergence yet, the label of higher-labeled
vertex reduces by one. The initially smallest labeled vertex of each
component never changes its label. Call that vertex the root of the
component. The root’s label is taken by its immediate neighbors
first and then by their neighbors until it diffuses through the entire
component. This process might suggest that the root’s label reaches
the entire components in O(P ) steps , where P is the maximum
FT
length of the path from the root to any vertex in its component.
This is not strictly true because root’s label does not necessarily
reach its neighbor u in one step, since another edge’s processor may
succeed writing its value in label [u]. Note, however, that the label
can only reduce to a neighbor’s label. Hence, in at most d(u) steps,
label [u] changes to label [root], where d(u) is the degree of u. This is
true for any vertex v along the path from the root. Hence, the total
complexity is O(P + d( G )), where d( G ) is the degree of the graph.
RA
Since all m processors may be active for all steps, the total work
complexity is O(m(P + d( G ))).
Ignoring the effect of the degree, the progress of label along dif-
ferent paths from the root appears to be similar to list ranking. It
is reasonable to expect a similar pointer jumping would take time
logarithmic in the length of the path. The difference here is that the
graph is not a linear structure like a list, and we need to determine
which way to jump. The labels we generate impose a direction to
jump. Let the label-tree be formed by directed label-edge from vertex
u to vertex label [u]. This is a forest in general, and in the beginning,
D
each vertex is an isolated tree with the vertex’s label set to itself.
In the next algorithm, processors are associated with graph edges.
For simplicity, we associate processors for both edges (u, v) and
(v, u). Each processor attempts to merge two adjacent label-trees
corresponding to its associated edge (if certain conditions are met).
A pair of label trees T1 and T2 are said to be adjacent if there is an
258 an introduction to parallel programming
6
1 // Set label(u) == label(v), iff u and v are in the same component. Graph has m edges, n vertices
2 forall processor p in 0..n
3 label[p] = p
4 forall processor p in 0..m
1.
5 active[p] = true
6 while(there is an active processor) {
7 forall p in 0..m, active[p] == true
8 (u,v) = edgelist[p]
9 if inStar(u) and label[u] > label[v] // See Listing 7.23
10 label[label[u]] = label[v] // Hook star’s root to the smaller root
11 if inStar(u) and label[u] != label[v]
12 label[label[u]] = label[v] // Hook star’s root to the other if not hooked on line 10
13
14
15
16
if not inStar(u)
else
active[p] = false
FT
label[u] = label[label[v]] // Pointer jumping
17 }
The processor for edge (u, v) in the listing above first checks if u is
a part of a star. If it is a part of a star, it attempts to hook its parent,
i.e., its root, to the parent of neighbor v. v need not be in a star. The
RA
adjacent tree must not be a star at 11, because any star that is hooked
ceases to be a star. Further, any star S2 adjacent to S1 at line 9 could
not remain unhooked on line 11 because S2 did have at least one
adjacent star with a smaller root: S1.
If a star has no adjacent tree, it does not hook. In that case, that
star is one of the graph’s final connected components. See Figure
parallel algorithms and techniques 259
6
0 as the root.
1.
7.9 for an illustration. The vertex identifiers are shown in the ovals.
FT
Figure 7.9(a) shows the state of the algorithm at some step, when
there are three label-trees. The left-most tree is not a star, and the
other two are. Edges connect tree 2 to both tree 1 and tree 3; it is
adjacent to both trees. The processor associated with edge (7, 4)
attempts to hook tree 2 to tree 3, while the one associated with edge
(6, 3) attempts to hook it to tree 1. In arbitrary-CRCW PRAM, one of
the writes succeeds on line 9. Let’s say the second one. On that line,
tree 3 does not hook to tree 2 because tree 3 already has a smaller
RA
label. Instead, it hooks to tree 2 on line 11. Note that before that line,
tree 2 becomes a part of tree 1. There is a single tree remaining after
the two hooks, and it is not a star. A single step of pointer jumping
on line 14 turns the tree into a star, which is the final connected
component. The algorithm terminates in the next iteration.
How does the algorithm terminate, though? Since common write
is allowed, processors may set a shared variable anyactive4 to false 4
We dispense with the $ suffix for
at the beginning of every iteration on line 6. Every active processor shared variables in this chapter; vari-
ables are shared by default in PRAM
then sets anyactive to true at the end of the iteration. If no processor
sets anyactive, it remains false, and all processors terminate. The
other step that is not detailed in Listing 7.22 is how to determine if
D
4 star[u] = false
5 star[label[label[u]]] = false // u’s grandparent is also not a star
6 star[u] = star[label[u]] // If its parent was marked non-star, u is not star
6
may remain marked stars. However, if the root has even a single
grand-child, it is marked non-star on line 5. This is the evidence for
child-less children of the root to be marked star on line 6. Figure
7.9(a) demonstrates this. The processor for edge, say, (2, 1) marks
1.
node 2 as non-star first on line 4. This processor next marks the
grandparent of node 2, i. e., node 0, non-star on line 5. Finally, node 3
is marked non-star by the processor for edge (3, 0) on line 6.
The algorithm terminates in time O(log P ). The distance of each
node in a non-star tree to its root halves in each iteration (except that
of the root and its children). Once the tree becomes a star, it must
hook to another tree with a new root, and the distances continue
FT
to halve. Since all m processor may remain active until the end, the
work complexity is O(m log P ).
Sometimes it may be necessary to count the number of connected
components and assign contiguous identifiers instead of a root’s label.
This can be easily achieved by using a prefix-sum in O(log n) time
using O(n) work.
7.7
D
Pipelining: Merge-sort
Basic Merge-Sort
Recall from Section 7.2 that merge can be completed in O(log log n)
time with O(n) work on CREW PRAM. Merge-sorting a list of n com-
parable elements begins by merging n2 pairs of singletons, followed
by n4 pairs of 2-element lists, and so on until the last step merges 1
pair of n2 -element lists. It proceeds as follows:
6
Listing 7.25: Relabel Connected components contiguously
1 // Sequentially Merge-sort a list with n element
1.
2 for step = 0 to ceil(log(n))-1 // Assume n is a power of 2
3 numpair = 2log(n) step 1
4 listlen = 2step // Adjust last pair’s len if n is not a power of 2
5 for pair = 0 to numpair-1
6 p0 = pair*2*listlen
7 merge list[p0..p0+listlen-1], list[p0+listlen..p0+2*listlen-1]
On the other hand, it appears that the steps of the loop on line
2 must proceed sequentially. After all, the lists at level i are not
available until step i 1 is complete. That would imply that log n
D
ing that improves the parallelism, and hence the time complexity,
without increasing the work complexity.
Pipelining amounts to incrementally performing otherwise se-
quential steps by decomposing them into sub-steps. Sub-steps can be
performed on a part of the input, without waiting for the entire input.
Such pipeline is possible if the steps also produce the output incre-
mentally, a part at a time. Merging algorithms described in Section
7.2 satisfy this general requirement. Recall, for example, that Merge
Method 2 merges two sorted lists by first ranking the even elements
6
and then using the results to rank the odd elements. Thus it does
not need the values of the odd elements at the start. Consequently,
it produces the rank of the even elements first, and then the ranks
1.
of the odd elements later. Further, the entire algorithm is recursively
applied, as illustrated in Figure 7.11 (for level 3 of Figure 7.10). The
part of the list that is processed exponentially evolves to the entire
list.
the first and the middle elements are active. They are, respectively,
the even and the odd elements at that sub-step. For each merger of
two lists, a sub-step computes the ranks of each list’s active elements
with respect to the active elements of the other list. In sub-step j, 2 j
elements are active, doubling with each sub-step. The rank of all
active elements at sub-step j can be computed in O(1) from the ranks
computed in sub-step j 1 (see Merge Method 2).
That opens up the possibility of pipelining sub-steps. The newly
active members required at sub-step j of level i may be produced by
level i 1 any time before sub-step j of level i, and not necessarily
before its sub-step 0. The pipeline would be perfect if those active
D
6
and processor allocation must account for this. In the non-pipelined
version of the merge-sort algorithm, only one level is active at a time,
possibly simplifying processor allocation.
1.
Pipelined Merges
dren into its evolving list. A node becomes active and starts merging
two ticks after elements start to appear in its children’s lists. Three
ticks after the children complete their lists, the parent also completes
its merger and deactivates. The activation and deactivation happen
level by level, from the leaf level to the root.
Thus level i activates at tick 2i + 1, and completes at tick 3i. This
means that 3 ticks after level i is complete, level i + 1 also completes.
Thus there are O(log n) ticks. We will see that the algorithm takes a
constant time per tick. The incremental merger is similar to Merge
Method 2, except a more general sublist is processed at each tick.
We will refer to the final list produced by node x by L( x ). The two
D
6
Figure 7.12: Pipelined merge: ticks 3
to 11. Ticks 0-2 and 4 have no mergers
and are skipped. The state of the active
levels are shown. Inactive levels whose
1.
lists are used by active levels are also
shown, but greyed out. The sublisting
of the children is indicated with L1 , L2 ,
or L4 . Children’s elements selected
in the sublists are shown with a dark
outline. The elements added to a level
at successive ticks are shown as circles,
triangles, and squares, respectively. At
tick 9, both levels 3 and 4 are active.
Note that the state of a level’s lists
is shown at the end of each tick. For
FT example, level 4 processors at tick
9 merge L4 (3, 8), the sublists of the
children’s lists produced at tick 8, and
shown in the tick 8 figure.
k = 4 if t < 3i 1
D
k = 2 if t = 3i 1 (7.2)
k = 1 if t > 3i 1
6
the last element, we let e2 = •.
Two sorted lists list1 and list2 with n elements each may be
merged in O(1) time using O(n) work if Ranklist( X, list1) and
1.
Ranklist( X, list2) are given and X is a 4-cover for both list1 and
list2. We first “invert" X’s ranks by computing Ranklist(list1, X ) and
Ranklist(list2, X ). Listing 7.26 shows how to compute Ranklist(list, X ).
given Ranklist( X, list). Since we know r = Rank ( X [i ], list) for ele-
ments of X, we also know Rank(list[r ], X ). We only need to derive
the ranks of the other elements near list[r ] in list. An example is
shown in Figure 7.13, explaining the steps of Listing 7.26.
1
FT
Listing 7.26: c-cover merge
// Compute rankx = Ranklist(list, X) given xrank = Ranklist(X, list),
2 forall processor p in 0..(n-1) // n elements in list
3 rankx[p] = -1 // Unfilled value
4 forall processor p in 0..(|X|-1) // |X| is the number of elements in X
5 if(i == 0 or xrank[i-1] != xrank[i])
6 if (list[xrank[p]] == X[p]) // Do not include X[p] in rank of list[xrank[p]]
7 rankx[xrank[p]] = p
else // Included X[p] is rank of list[xrank[p]]
RA
9 rankx[xrank[p]] = p+1
10 if (p < n-1 and xrank[p+1] > xrank[p]+1) // Also rank the next element of list
11 rankx[xrank[p]+1] = p+1
12 forall processor p in 0..(n-1)
13 if (rankx[p] == -1) // Not filled yet
14 for i = p-1 to p-3 // Look up to 3 steps to the left for the first filled rank
15 if(xrank[i] != -1)
16 rankx[p] = rankx[i]
17 exitloop
18 if(i < 0) // No filled rank found (xrank[-1] is effectively 0)
19 xrank[p] = 0;
D
Like our prior assumption, the elements within each list remain
unique. However, X may have some elements that appear in list and
others that don’t. Since we define r = Rank( x, list) as the number
of elements in list that are strictly less than x, we must differentiate
between these two cases. If x == list[r ], all elements to the left of x
in X are less than list[r ], e.g., X [6] = 21 in Figure 7.13. On the other
In the other case, list[r + 1] is greater than x. list[r + 1] is also less than the element to the right of x in X, as long as that element has a rank different from that of x. Hence, Rank(list[r + 1], X) is one more than the index of x in X. Remember that two consecutive elements of X may have the same rank, for they have 0 elements of list between them. For example, the ranks of X[2] and X[3] are both 5. The listing lets the processor associated with the last of these equal elements of X set the inverse rank (on line 5).

The processor associated with x also sets the ranks of list[r] and list[r + 1] on line 11. All the elements to the right of list[r + 1] that do not get a reverse rank in the previous step also have ranks equal to that of list[r + 1]. For example, r = 1 for X[0]. list[r + 1] = list[2] has rank 1 (one more than the index of X[0]). The element to the right of X[0], namely X[1], has rank s = 4 in list, meaning list[s − 1] = list[3] is definitely less than X[1]. However, all such elements to the right of list[r + 1] until index s − 1 are greater than X[0]. Hence, they all must have the same rank as that of list[r + 1]. This is completed on line 16.

Figure 7.13: The elements at those positions of list are ranked in X at line 7 or 9 of Listing 7.26. These ranks are shown below list in bold font. The ranks computed on line 11 are shown in a light font. Finally, the ranks of the remaining elements of list are computed at line 16 and are shown above list.
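The rank-inversion step of Listing 7.26 can be simulated sequentially. The Python sketch below mirrors the per-processor logic under the stated assumptions (sorted, duplicate-free lists; X a 4-cover of list); the bounded left-scan of lines 12-19 is replaced by an equivalent carry-forward, and the helper name and sample data are illustrative, not from the text.

import bisect

def invert_ranks(lst, X, xrank):
    # rankx[j] = Rank(lst[j], X), given xrank[p] = Rank(X[p], lst)
    n, m = len(lst), len(X)
    rankx = [-1] * n
    for p in range(m):                        # one "processor" per element of X
        r = xrank[p]
        last_of_run = (p == m - 1 or xrank[p + 1] != xrank[p])
        if last_of_run and r < n:
            rankx[r] = p if lst[r] == X[p] else p + 1      # lines 5-9
        if r + 1 < n and (p == m - 1 or xrank[p + 1] > r + 1):
            rankx[r + 1] = p + 1                           # lines 10-11
    filled = 0                                # lines 12-19, as a carry-forward
    for j in range(n):
        if rankx[j] == -1:
            rankx[j] = filled
        else:
            filled = rankx[j]
    return rankx

lst = [2, 5, 8, 9, 12, 14, 17, 21, 25, 30]
X = [1, 7, 9, 16, 21, 28]                     # a 4-cover of lst (hypothetical data)
xrank = [bisect.bisect_left(lst, x) for x in X]
assert invert_ranks(lst, X, xrank) == [bisect.bisect_left(X, e) for e in lst]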
list1[3]. rankx2b = Rank(X[rank1 + 1], list2) = 5 implies that list2[rankx2b] ≥ list1[4]. Some elements between index rankx2a and rankx2b may be smaller than list1[3], but there may be at most c such elements; 3 in this example.
Hence, one processor allocated to position 3 of list1 may compute
its rank in O(c) time. If c is a constant, the work complexity is O(n).
This can be done in EREW PRAM with a bit of care. More than one
element of list1, e.g., list1[2] and list1[3], may have the same rank in
X: xrank1 = 1. The processors assigned to find their respective ranks in list2 both read Rank(X[1], list2). Since there are at most c such elements in list1, they may be serialized to eliminate any need for
concurrent read. This would require the processor assigned to every
position to determine its serial order. One way is to count the number
of elements to the left of its position in list1 that have the same rank
as its own. We skip those details here. The following is a simpler
CREW PRAM version.
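The book's listing for this CREW variant is not reproduced in this excerpt; the sketch below is one plausible rendering of the idea just described, with illustrative names. Each element of list1 starts from the (concurrently read) rank of a nearby cover element in list2 and scans forward at most O(c) positions.

import bisect

def crew_cover_merge_ranks(list1, list2, X):
    # Returns rank_in_2[j] = Rank(list1[j], list2); list2 is handled symmetrically.
    rank1_in_X = [bisect.bisect_left(X, e) for e in list1]   # assumed precomputed
    X_in_2 = [bisect.bisect_left(list2, x) for x in X]       # assumed precomputed
    n2 = len(list2)
    rank_in_2 = []
    for j, e in enumerate(list1):             # conceptually one processor per j
        r = rank1_in_X[j]
        pos = X_in_2[r - 1] if r > 0 else 0   # everything below X[r-1] is below e
        while pos < n2 and list2[pos] < e:    # at most O(c) steps if X covers list2
            pos += 1
        rank_in_2.append(pos)
    return rank_in_2

list1 = [3, 8, 14, 21]
list2 = [1, 5, 9, 13, 20, 25]
X = [4, 12, 22]                               # hypothetical cover of both lists
assert crew_cover_merge_ranks(list1, list2, X) == \
       [bisect.bisect_left(list2, e) for e in list1]

The merged position of list1[j] is then simply j plus its rank in list2.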
The listing above indicates that these ranks can be computed tran-
sitively: Ranklist( L1, L3) can be computed from Ranklist( L1, L2) and
Ranklist( L2, L3). Further, if L1 and L2 have a c-cover relationship,
and Ranklist(L(i, t), Lk(i.left, t)) and Ranklist(L(i, t), Lk(i.right, t)) are known before tick t.
Property Π2 ensures that Lk(i.left, t) and Lk(i.right, t) can be merged in O(1) time with O(|L(i, t + 1)|) work at tick t using the 4-cover merger. We need property Π1 to find the Ranklists of Π2. We will prove the following property, of which Properties Π1 and Π2 are corollaries:
Π∗: If L4(i, t) has m elements in some range l..r, L4(i, t + 1) has no more than 2m in that range. This would imply that if a and b are two consecutive elements of L4(i, t), meaning it has 2 elements in the range a..b, L4(i, t + 1) has no more than 4. That would guarantee that L4(i, t) is a 4-cover of L4(i, t + 1) for any i and t.
The statement above can be proven by induction on i. Note that once the lists at a node's children are complete, the next three ticks at the parent are similar to three steps of Merge Method 2. Every $2^2$th, then every $2^1$th, and finally every $2^0$th element of the children's lists is merged. Π∗ holds trivially in these cases. The following proof, hence, focuses on the ticks when the children's lists are not yet complete and sublists are formed of every 4th element. We refer
samples that bound the non-samples that may lie in the range a..b. Let s1 samples come from the left sublist L4(i.left, t − 1), and s2 from L4(i.right, t − 1); s1 + s2 ≤ 4m − 1. By the inductive hypothesis, no more than 2s1 samples exist in L4(i.left, t) in the range a..b, and no more than 2s2 samples in L4(i.right, t). Since L(i, t + 1) is formed by merging L4(i.left, t) and L4(i.right, t), it cannot contain more than 2(s1 + s2) ≤ 8m − 2 elements in the range a..b. This guarantees that its sublist L4(i, t + 1) may not contain more than 2m elements in the range a..b, and L4(i, t) is a 4-cover of L4(i, t + 1).
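For reference, the counting in this induction step can be restated as a short chain (same quantities as above, notation only):

\[
s_1 + s_2 \le 4m - 1
\;\Longrightarrow\;
\bigl|\,L(i,t+1)\cap a..b\,\bigr| \le 2s_1 + 2s_2 \le 8m - 2
\;\Longrightarrow\;
\bigl|\,L_4(i,t+1)\cap a..b\,\bigr| \le \Bigl\lceil \tfrac{8m-2}{4} \Bigr\rceil = 2m .
\]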
Thus, we can say that property Π1 holds, and Lk(i, t) is a 4-cover of Lk(i, t + 1) for all k in the pipelined merge progression. Since this applies to all levels, clearly Lk(i.left, t − 1) is a 4-cover of Lk(i.left, t). Further, considering that L(i, t) was formed at tick t − 1 by merging Lk(i.left, t − 1) and Lk(i.right, t − 1), we know that between any pair of consecutive elements of L(i, t) there are no more than two elements of Lk(i.left, t − 1) and no more than two elements of Lk(i.right, t − 1). For example, consider consecutive elements z and x in L(i, t) in Figure 7.15. There cannot be more than 2 elements in Lk(i.left, t − 1) in the range z..x. Were z and x both in
merge algorithm described earlier. Ranklist(L(i, t), Lk(i.left, t)) can be transitively computed from Ranklist(L(i, t), Lk(i.left, t − 1)) and Ranklist(Lk(i.left, t − 1), Lk(i.left, t)) in constant time with linear work, as described next. Ranklist(L(i, t), Lk(i.right, t)) can be similarly computed.
If L(i, t) is null, it means there were no elements in Lk(i.left, t − 1) and Lk(i.right, t − 1), and this is the first time a child's list contains k elements. Since k ≤ 4, this merger can be done at each such node in O(1) time with O(1) work. In the general case, when L(i, t) does exist, it was a merger of sublists Lk(i.left, t − 1) and Lk(i.right, t − 1), using X = L(i, t − 1) as the 4-cover. This means we computed Ranklist(L(i, t), Lk(i.left, t − 1)) and Ranklist(L(i, t), Lk(i.right, t − 1)). We store them in lists lrank and rrank, respectively, at node i. Also, since we use L(i, t − 1) as a cover for computing L(i, t), we compute Ranklist(L(i, t − 1), L(i, t)) (see the c-cover merger algorithm earlier). This can provide, in O(1) time with O(|Lk(i, t)|) work, ranks for their 4-covers: Ranklist(Lk(i, t − 1), Lk(i, t)). We store them in erank at node i.
In Listing 7.28, left.erank refers to the list erank of the left child and right.erank to that of the right child. Note that unlike Merge Method 2, the pipelined mergers cannot happen in-place. Multiple rank arrays are required. Some consolidation is possible, and the pipelined merger can be implemented on EREW PRAM. The 4-cover merge must be implemented in a way that lrank, rrank, and erank are read by all active processors first and then updated at the end. We skip a detailed presentation of that.
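The overall tick-by-tick structure, though not the O(1)-per-tick rank machinery, can be simulated directly. The Python sketch below builds a binary tree over the input, samples the children's lists with the factor k of Equation 7.2 at every tick, and merges the samples with an ordinary merge; after 3 log n ticks the root holds the sorted list. The function names and the power-of-two assumption are mine, not the text's.

def merge(u, v):
    out, i, j = [], 0, 0
    while i < len(u) and j < len(v):
        if u[i] <= v[j]:
            out.append(u[i]); i += 1
        else:
            out.append(v[j]); j += 1
    return out + u[i:] + v[j:]

def sample(lst, k):                  # the sublist L_k: every k-th element
    return lst[k - 1::k]

def k_of(level, t):                  # Equation 7.2
    if t < 3 * level - 1:
        return 4
    if t == 3 * level - 1:
        return 2
    return 1

def pipelined_merge_sort(a):         # assumes len(a) is a power of two
    n = len(a)
    levels = n.bit_length() - 1
    cur = [[[x] for x in a]]         # level 0: singleton leaf lists
    cur += [[[] for _ in range(n >> i)] for i in range(1, levels + 1)]
    for t in range(1, 3 * levels + 1):
        nxt = [cur[0]]               # leaves never change
        for i in range(1, levels + 1):
            k = k_of(i, t)
            nxt.append([merge(sample(cur[i - 1][2 * x], k),
                              sample(cur[i - 1][2 * x + 1], k))
                        for x in range(n >> i)])
        cur = nxt
    return cur[levels][0]            # the root's list after 3 log n ticks

assert pipelined_merge_sort([9, 2, 7, 4, 6, 1, 8, 3]) == [1, 2, 3, 4, 6, 7, 8, 9]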
The pipelined merge sort is thus completed after 3 log n ticks, and the root node has the sorted result. The work at each level is also easy to count. Note that a level deactivates when complete. A level is complete when it receives all the elements of the list to be sorted. The lowest active level, call it l, receives all n elements in a span of the next 3 ticks. This means that up to n/2 processors remain active at this level, for we merge two lists of
Among sequential sorting algorithms, Radix-sort is known to be particularly efficient for sorting small integers. The main idea of the algorithm is to divide each element, rather, the sorting key, into small parts. The algorithm iterates over the parts, the least significant part first. Each part takes a fixed number of values. For example, a 32-bit integer naturally consists of 32 1-bit parts.
Suppose, in general, there are d parts, each taking D values. For each part, the entire list is divided into D buckets, which can be accomplished sequentially in O(n) time for a list of size n. The sequential complexity to complete radix-sorting is O(dn). Radix-sort relies on the different parts being sorted in a strict sequence. Hence, the only step that may be parallelized is the sorting of n items into D buckets.
3 bsum = parallel prefix-sum of bit[i] using n processors // We count LSB as bit 0
4 forall processor p in 0..(n-1)
5   if(bit[i] of list[p] == 0)
6     rank[p] = p - bsum[p]
7   else
8     rank[p] = n - bsum[n-1] + bsum[p] - 1
9   list[rank[p]] = list[p]
The elements with bit[i] = 1 must be ranked after all those with bit[i] = 0. The total number of elements with bit[i] = 0 is n − bsum[n − 1]. This is shown on line 8. Since each prefix sum takes O(log n) time with O(n) work, the total time complexity for radix-sorting d-part keys is O(d log n), and the total work complexity is O(dn).
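A sequential Python sketch of the binary, least-significant-bit-first variant follows; the per-pass ranking rule matches lines 5-9 of the listing above, with the prefix sum computed serially here (in the parallel version it is the O(log n) step). Non-negative integer keys are assumed.

def lsb_radix_sort(keys, bits=32):
    a = list(keys)
    n = len(a)
    for i in range(bits):                          # one pass per 1-bit part
        bit = [(x >> i) & 1 for x in a]
        bsum, s = [], 0                            # inclusive prefix sum of bit
        for b in bit:
            s += b
            bsum.append(s)
        zeros = n - (bsum[-1] if n else 0)         # elements with bit i == 0
        out = [None] * n
        for p in range(n):                         # forall processor p
            if bit[p] == 0:
                rank = p - bsum[p]                 # zeros strictly to the left
            else:
                rank = zeros + bsum[p] - 1         # placed after all the zeros
            out[rank] = a[p]
        a = out
    return a

assert lsb_radix_sort([170, 45, 75, 90, 2, 24, 802, 66], bits=10) == \
       sorted([170, 45, 75, 90, 2, 24, 802, 66])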
Another version of Radix-sort iterates over the bits in the reverse order, most significant part first. For certain platforms, that version would be more suitable. In the most-significant-part-first scheme, the elements with bit[i] = 0 and bit[i] = 1 are recursively divided into two buckets. Thus, the first bit subdivides list into two buckets, the next bit further subdivides each bucket into two, and so on. More importantly, each bucket can be sorted independently of the other buckets. Once a bucket becomes small enough, it may be sorted sequentially, while other processors continue to subdivide buckets. We leave those details as an exercise.
9 pdest = index[n-1]
10 forall processor p in 0..(n-1)
11 if(side[p] == 1)
12 index[p] = index[p]-1
13 else
14 if(p == pivot)
15 index[p] = pdest
16 else if(p > pivot) // For large elements after the pivot position,
Partition for quick-sort amounts to asking, for each element of list1, whether it is less than the pivot (call them small) or greater than it (call them large). The small and large sets are then sorted independently of each other. The pivot, if it is a part of list1, goes in the middle. This is trivially parallelizable with O(1) time and O(n) work. All processors must agree on the pivot. This requires concurrent read capability to complete in O(1).
It is not sufficient to know which set an element belongs in. These sets must be formed as well. Quick-sort separates the two sets in-place. We can do the same in parallel in the PRAM model. We need to find non-conflicting indexes for elements to transfer to, such that all the elements smaller than the pivot get smaller indexes than the rest. One can simply determine the rank of each element within its set. Line 8 in Listing 7.30 computes the prefix-sum of the list side and stores it in index. The elements less than the pivot have a 1 in side. Thus, index[p] contains the number of small elements to the left of position p, plus 1 for itself. index thus provides a contiguous numbering of the small elements.
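Listing 7.30 is only partially visible in this excerpt, so the sketch below is one consistent way to realize the destination computation just described: a single prefix sum over side positions the small elements, and the large side is derived from it. Names are illustrative and distinct elements are assumed.

def partition_destinations(lst, pivot_pos):
    n = len(lst)
    pivot = lst[pivot_pos]
    side = [1 if x < pivot else 0 for x in lst]        # 1 marks a small element
    index, s = [], 0                                   # inclusive prefix sum of side
    for b in side:
        s += b
        index.append(s)
    n_small = index[-1]                                # the pivot's final position
    dest = [0] * n
    for p in range(n):                                 # forall processor p
        if side[p] == 1:
            dest[p] = index[p] - 1                     # rank among the small ones
        elif p == pivot_pos:
            dest[p] = n_small
        else:                                          # large: rank derived from the small side
            large_before = p - index[p] - (1 if p > pivot_pos else 0)
            dest[p] = n_small + 1 + large_before
    out = [None] * n
    for p in range(n):                                 # non-conflicting writes
        out[dest[p]] = lst[p]
    return out, dest

out, dest = partition_destinations([7, 2, 9, 4, 5, 1, 8], pivot_pos=4)
assert out == [2, 4, 1, 5, 7, 9, 8]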
After its comparison, each processor knows which subset its element lies in. Suppose we simply let it use that information to determine how to proceed. Depending on whether list1[p] is in list1s or list1l, if p can determine its subset's next pivot, it can test again. For example, if list1[p] is in list1s, and the pivot for list1s is vs, p needs to compare list1[p] with pivot vs next to determine which subset of list1s list1[p] lies in. It can go down the tree, comparing with a sequence of pivots. Thus, the initially small subset would be recursively divided into small-small and small-large subsets, and so on. If we can continue this process, labeling small as bit 0 and large as bit 1 at each test, the smallest element will see the sequence 0, 0, 0 . . . and the largest would see 1, 1, 1 . . ., with the bit for the first partition appearing first. The pivot itself is in neither set and may be assigned an end symbol indicating the termination of its bit sequence. Processor p terminates when list1[p] is chosen as the pivot. Questions remain:
• How does each process receive an appropriate pivot in each
round?
In principle, one may use any pivot. The ideal pivot for any set is its median. We generally rely on ‘good’ pivots by randomly choosing one of the elements in a set as its pivot. Pivots outside the range of values in a set are ‘bad,’ as they lead to up to n levels in the quicksort-tree. If we can find the minimum value m and the maximum value M within a set, (M + m)/2 may be a good choice. Section 7.3 explains how to find the minima in O(1) time with O(n²) work with the common write facility in CRCW PRAM. However, that algorithm requires the knowledge of the full set, as each element is compared with every other element of the set. Another possibility is to evolve a consensus. If processor p knows the subset (or subsubset, etc.) its
How to pre-arrange the location per subset? One way to designate a unique location for each subset is its position in the quick-sort tree itself. See Figure 7.17. We count the levels downward. The first partition is at level 0, which creates two sublists: one contains all elements in the left subtree of the parent (say, the small sublist) and the other those in the right subtree. Let parent[p], left[p], and right[p] contain the indexes, respectively, of the parent, the left child, and the right child for element list1[p]. parent[p] doubles as the pivot for element list1[p]. Initially, all elements use the same pivot, and parent[p] = root, one selected pivot. Once the sublists separate, parent[p] changes accordingly.

Margin note (figure): Processor 5, the parent processor, does not lie on either side and, hence, does not compete to write. The same process continues at other levels.
As described above, parent[ p] (i.e., pivot) is set by the arbitrary
write of CRCW-PRAM by processors of each sublist. Processors p
with elements smaller than list1[parent[p]] compete and write to left[parent[p]]. One succeeds. Processors with elements larger than
the pivot compete to become the right child. Again, one wins. All
the writers read back to check if they won writing. The winner is the
pivot. Losers need to continue the process. For simplicity, Listing
RA
7.31 lets the chosen pivot also continue to process until the final
termination. However, they are the only elements in their list – so
they continue to repeat the same computation.
6 while(! done)
7   forall processor p in 0..(n-1)
8     done = 1
9     if(list[p] < list[parent[p]])
10      leftchild[parent[p]] = p          // Write winner becomes the left child
11      if(leftchild[parent[p]] != p)     // Lost write. Retry with new parent.
12        parent[p] = leftchild[parent[p]]
13        done = 0
0. Thus, the total work complexity is O(n log n). Recall that concurrent writes incur a cost even in modern shared-memory platforms. Nonetheless, the algorithm described above is illustrative of a general application of leader election (which is a form of the consensus problem).
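The arbitrary-write contest can be simulated in ordinary Python by picking one winner per group each round. The sketch below loosely follows the scheme of Listing 7.31 (losers re-parent to the winner and retry), except that a chosen pivot simply stops instead of idling, and the CRCW write is modeled with a random choice. An in-order walk of the resulting tree reads off the sorted order; distinct elements are assumed and the names are illustrative.

import random

def crcw_quicksort_tree(lst, seed=0):
    rng = random.Random(seed)
    n = len(lst)
    root = rng.randrange(n)                    # first pivot, chosen arbitrarily
    parent = [root] * n
    left, right = [None] * n, [None] * n
    changed = True
    while changed:
        changed = False
        groups = {}                            # competitors per (pivot, side)
        for p in range(n):
            if p != parent[p]:                 # pivots no longer compete
                side = 'L' if lst[p] < lst[parent[p]] else 'R'
                groups.setdefault((parent[p], side), []).append(p)
        for (par, side), group in groups.items():
            winner = rng.choice(group)         # the arbitrary write that sticks
            if side == 'L':
                left[par] = winner
            else:
                right[par] = winner
            parent[winner] = winner            # winner is the sublist's new pivot
            for p in group:
                if p != winner:
                    parent[p] = winner         # losers retry under the winner
                    changed = True
    out = []
    def inorder(v):                            # the tree is a binary search tree
        if v is not None:
            inorder(left[v]); out.append(lst[v]); inorder(right[v])
    inorder(root)
    return out

assert crcw_quicksort_tree([5, 3, 9, 1, 7, 2, 8]) == [1, 2, 3, 5, 7, 8, 9]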
getting elements larger than the previous sublist. Then, each sublist
can be sorted by one of the P processors independently of others.
Both of these methods are facilitated by sample-sort.
Sample-sort is based on the creation of a smaller list of size O( P),
which is a cover for the initial list, not unlike the pipelined merge-
sort. The main idea is to be able to find P disjoint ranges of values
{Ri, i ∈ 0..(P − 1)} so that the upper limit of range i is less than the lower limit of range i + 1, and the total number of elements in each range is roughly equal. Once {Ri} is known, it is trivial to determine the range in which each element list[i] lies. Thus, in O(log P) time and O(n log P) work, we can compute P lists r[j], j ∈ 0..(P − 1). r[j][k] contains the range number that element k at processor j lies in.
Ranges are also called buckets.
Next, we collect all elements in bucket i at processor i. This is the
partition step of Quick-sort. Each bucket can then be independently
sorted. Alternatively, we can first sort the n/P elements at each processor, and then distribute one bucket per processor. The so collected
3 locally sort list[p] containing n/P elements available at processor p
4 sublist[p] = {list[p][i*n/(P*P)]} i in 1..(P-1)  // Take P-1 separators
5 slist = sort(sublist[p], p in 0..P-1) using P processors // All separators
6 range = {slist[i*(P-1)]} i in 1..(P-1)           // P-1 evenly sampled splitters
Listing 7.32 assumes that the input list is equally distributed among P processors; list[p] comprises the set of elements at processor p. Each processor initially sorts its set to find well-separated samples. Each chooses P − 1 separators. These P · (P − 1) total separators are again sorted, possibly using all P processors, in order to next find P − 1 globally well-separated splitters. These splitters are put in the list range[0..(P − 2)]. Two consecutive elements define a range. We may assume range[−1] = −∞ and range[P − 1] = +∞ as default splitters. Thus, there are P ranges. We also allow each range to be open at its upper end to ensure that there is no overlap among consecutive ranges.
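A sequential sketch of the whole scheme, with the 'processors' and their message exchanges modeled as plain lists, follows; the splitter selection loosely mirrors the listing fragment above, and the half-open bucket convention is one arbitrary but consistent choice.

import bisect

def sample_sort(data, P):
    # local sort of roughly n/P elements per processor
    chunks = [sorted(data[p::P]) for p in range(P)]
    # each processor contributes P-1 separators from its sorted chunk
    seps = []
    for c in chunks:
        step = max(1, len(c) // P)
        seps.extend(c[step - 1::step][:P - 1])
    seps.sort()                                    # the sort of all separators
    stride = max(1, len(seps) // P)
    splitters = seps[stride - 1::stride][:P - 1]   # P-1 evenly sampled splitters
    # every element is sent to the bucket (processor) of its range
    buckets = [[] for _ in range(P)]
    for c in chunks:
        for x in c:
            buckets[bisect.bisect_right(splitters, x)].append(x)
    # each processor sorts its bucket; concatenation is the result
    return [x for b in buckets for x in sorted(b)]

data = [27, 3, 88, 14, 56, 9, 71, 42, 5, 63, 31, 19]
assert sample_sort(data, P=3) == sorted(data)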
range is a 2n/P-cover of list, meaning there are no more than 2n/P
sublist takes O(P) steps at each processor. The parallel sort of P sorted lists, one per processor, and each of size P − 1, appears to be similar to the original problem and may be performed recursively. However, this step is not the bottleneck for the entire algorithm if P is small compared to n. Pair-wise merge, as in the merge-tree of Figure 7.10, suffices. The height of the tree is log P. At level i, P/2^{i+1} processors send 2^i(P − 1) elements each to their siblings, and those P/2^{i+1} siblings merge two lists of size 2^i(P − 1) each.
The computation time at level i is O(2^i P), and work is O(P²). For the BSP model, O(P/2^{i+1}) messages are sent at level i. Hence, the total time complexity of merging at line 5 is O(P²), and work complexity is O(P² log P). O(P) messages are sent. Sampling slist into range at one processor takes O(P) time.
Finally, to complete sample-sort, the list range is broadcast to all processors, requiring O(P) messages. After all the processors receive range, they form P buckets each, taking O((n/P) log P) time each. Then, each processor sends its elements of bucket i (if any) to processor i. In the BSP model, each processor sends a single message to possibly all P − 1 other processors, leading to the communication cost of
A common task for a weighted graph is to find its skeleton – the minimum spanning tree. In this context, weighted graphs have weights associated with their edges. The spanning tree of a graph is a tree that includes all its vertices and a subset of its edges. The tree must be connected and without any cycles by definition. (A cycle is a path that starts and ends at the same vertex without repeating any other vertex.) The weight of a spanning tree is the sum of the weights of all its edges. Many spanning trees may be formed in a graph. The minimum spanning tree (or MST) of a graph is one whose weight is no greater than that of any other. Figure 7.19 shows the edge-weights of an example graph. The MST is shown in solid edges. Non-MST edges are in dashed lines.
Prim's algorithm6 is a greedy sequential algorithm to compute the MST of a given weighted undirected graph G with n vertices and m edges. It incrementally builds the MST by adding one vertex and its connecting edge at a time, starting with an arbitrary vertex. At each step, it selects the edge with the least weight among those connecting any vertex v ∈ MST constructed so far with any vertex w ∈ G − MST.
6 R. C. Prim. Shortest connection networks and some generalizations. The Bell System Technical Journal, 36(6):1389–1401, 1957
In Listing 7.33, when vertex minv is added to MST, its neighbor in the MST is specified by Parent[minv]. In other words, (Parent[minv], minv) is the connecting edge. The first vertex added to the MST has a null Parent. An auxiliary array Cost[w] maintains the weight of the least-weight edge connecting vertex w to the evolving MST. Cost is not known in the beginning, and an overestimate is maintained. These estimates are lazily improved when a neighbor of w is included in MST. When (Parent[minv], minv) is added to MST, the weight of that edge, which is Cost[minv], is smaller than the weights of all other edges connecting any vertex in MST with any vertex in G − MST.
Inductively, if the current edges in MST are in the final MST, edge (Parent[minv], minv) must also be in the final MST. Otherwise, the path in the final MST from minv to Parent[minv] would have to go through another edge (w, v) where w ∈ G − MST and v ∈ MST. However, then the cost of that final MST could be reduced by replacing edge (w, v) with (Parent[minv], minv).
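For reference, a sequential Python sketch of Prim's algorithm with these lazily improved Cost estimates follows. It uses Python's heapq with lazy deletion in place of decrease-key (a stale heap entry is skipped on extraction); adjacency lists, vertex ids 0..n-1, and a connected graph are assumed, and the names are mine.

import heapq

def prim_mst(n, adj):
    # adj[u] = list of (v, weight) pairs of an undirected, connected graph
    INF = float('inf')
    cost = [INF] * n
    parent = [None] * n
    in_mst = [False] * n
    cost[0] = 0
    heap = [(0, 0)]                        # (Cost[v], v); start at vertex 0
    mst = []
    while heap:
        c, v = heapq.heappop(heap)
        if in_mst[v] or c > cost[v]:       # stale entry: skip (lazy deletion)
            continue
        in_mst[v] = True
        if parent[v] is not None:
            mst.append((parent[v], v))
        for w, wt in adj[v]:               # relax only edges incident on v
            if not in_mst[w] and wt < cost[w]:
                cost[w] = wt
                parent[w] = v
                heapq.heappush(heap, (wt, w))
    return mst

adj = [[(1, 1), (2, 3), (3, 4)], [(0, 1), (2, 2)],
       [(0, 3), (1, 2), (3, 1)], [(0, 4), (2, 1)]]
assert sorted(prim_mst(4, adj)) == [(0, 1), (1, 2), (2, 3)]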
3 Parent[v] = null
4 MST = null
5 GV = {Vertices in G}
6 for i in 1..n
7   minv = vertex with min Cost[v] ∀ v ∈ GV  // Break tie arbitrarily
8   GV = GV - minv
9   MST = MST ∪ (Parent[minv], minv)
Line 12 requires up to deg[minv] decrease-key operations in the priority queue, deg[minv] being the degree of vertex minv. Of course, these deg[minv] edges connected to minv are updated only once, when minv is added to MST. Fibonacci Heaps7 require O(1) amortized time per decrease. Relaxed Heaps8 can complete each decrease in O(1) time in the worst case. Both require O(log n) time for each extraction of the minimum-Cost vertex on line 7. This adds up to O(m + n log n) sequential time for the entire algorithm. Note that for dense graphs where, say, m > n log n, this time is bounded by O(m). For sparser graphs, n log n dominates.
7 Michael L. Fredman and Robert Endre Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM, 34(3):596–615, July 1987. ISSN 0004-5411. URL https://doi.org/10.1145/28869.28874
8 James R. Driscoll, Harold N. Gabow, Ruth Shrairman, and Robert E. Tarjan. Relaxed heaps: An alternative to Fibonacci heaps with applications to parallel computation. Commun. ACM, 31(11):1343–1354, November 1988
Let's see how Prim's algorithm admits parallelism. The initialization on line 1 can be completed on EREW PRAM in O(1) time and O(n) work. The loop on line 6 is inherently sequential and requires n iterations. Maybe, the priority queue operations can be parallelized: a
7 for i in 1..n
8 minv = ExtractMin (GV) // Break tie arbitrarily
9 inMST[minv] = true
10 Append (Parent[minv], minv) to MST
11 forall v in 1..n
12 if inMST[v] == false && Cost[v] > EdgeWeight[minv][v] // EdgeWeight stored in adjacency matrix
13 DecreaseKey(GV) for v from Cost[v] to EdgeWeight[minv][v]
14 Cost[v] = EdgeWeight[minv][v]
15 Parent[v] = minv
Listing 7.34 takes O(n log n) time and O(n²) work, given that all n processors are busy for all n iterations. That is not efficient, particularly for sparse graphs. Maybe, a more specialized data structure can help: a parallel priority queue based on Binomial Heaps9 allows O(1)-time, O(log n)-work extraction and decrease-key operations on CREW PRAM (see Section 7.11).
9 Gerth Stølting Brodal. Priority queues on parallel machines. Parallel Computing, 25(8):987–1011, 1999
The total time on line 8 is then O(n) with O(n log n) work. How-
ever, the inner loop on line 11 does not meet those bounds. We
need to restructure this loop to focus on the actual number of edges
incident on vertex minv, as in Listing 7.35. We assume that array
adj[v] (of size deg[v]) stores the identifier of every vertex w adjacent
to vertex v. Similarly, EdgeWeight[v][ j] stores the weight of edge
(v, adj[v][ j]).
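Listing 7.35 itself is not included in this excerpt; the stand-in below shows the restructured relax step over just the adjacency list of minv, using the adj/EdgeWeight layout defined above. The forall over deg(minv) entries is simulated with a plain loop, and pq_insert stands for whatever priority-queue insert the surrounding algorithm uses; all names are illustrative.

def relax_incident_edges(minv, adj, EdgeWeight, inMST, Cost, Parent, pq_insert):
    for j in range(len(adj[minv])):        # forall j in 0..deg(minv)-1, in parallel
        w = adj[minv][j]                   # the j-th neighbour of minv
        wt = EdgeWeight[minv][j]           # weight of edge (minv, adj[minv][j])
        if not inMST[w] and Cost[w] > wt:
            Cost[w] = wt
            Parent[w] = minv
            pq_insert(w, wt)               # stale keys are skipped on extraction

In the parallel version, each of the deg(minv) iterations is handled by its own processor, which is what brings the per-extraction work down to the number of incident edges.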
keys at its children. The parallel priority queue Q is represented as a forest of Binomial heaps with the constraint that there are between 1 and 3 binomial heaps of every rank r, r ≤ Qr, the rank of the priority queue. In particular, let ni(Q) be the number of trees with rank i in Q. ni(Q) ∈ {1, 2, 3} for i = 0..Qr. This ensures that Qr ≤ 1 + log n if n items are stored in the queue. We associate the rank of a Binomial heap also with its root, meaning the root of a rank r heap has rank r. Also, among a given set of nodes, the one with the minimum key is called the minimum node, in short. The forest Q has the following ordering property on the key values at its nodes.
Minimum Root: The minimum root of rank r in Q is smaller
h1.key < h2.key for this operation. This forms a new heap of rank r + 1 with root h. This linking is an O(1)-time operation by a single processor, assuming h2 can be appended to h1.child in O(1). In addition, roots h1 and h2 must be removed from Q.root[r] and h appended to Q.root[r + 1]. This can be accomplished by, say, using a linked list structure for root and child. In principle, this requires O(1)-time
1. One linking is done for the heaps of each rank, except that the minimum root of any rank r never participates in Link. This means that linking is only carried out if at least 3 roots of rank r exist. This ensures that the Minimum Root Property is not perturbed by Link and continues to hold.
The parallel linking and unlinking are specified in Listings 7.36 and 7.37.
Remove(Q.root[r], max)
Remove(Q.root[r], nextmax)
Add(Q.root[r+1], L)
Remove(Q.root[r], min)
Add(Q.root[r-1], L1)
Add(Q.root[r-1], L2)
To insert an element to Q, a new singleton node e is created with
the element, and e is inserted to Q.root[0] and Parallel Link of Listing
7.36 is invoked on Q. The minimum element is always listed in
Q.root[0] (which can be kept sorted by the key). To remove the node
with the minimum element, call it Q.root[0][min], we simply remove
the node from Q.root[0]. However, this may violate the Minimum
Root property. The erstwhile second smallest root need not be in a
rank 0 tree. In that case, the minimum root at rank 1 must be the new
smallest, as it is guaranteed to be smaller than the smallest roots at
higher ranks. One invocation of Parallel Unlink of Listing 7.37 can
bring the new minima to rank 0. This can lead to 4 roots of rank 0,
however. One invocation of Parallel Link (Listing 7.36) brings it down
to 2.
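The Link and Unlink primitives themselves are easy to express on a simplified node representation. The sketch below is sequential and only illustrates the rank bookkeeping; the parallel listings apply one such link or unlink per rank of the forest simultaneously. The class and names are mine, not the text's.

class BNode:
    def __init__(self, key):
        self.key = key
        self.rank = 0
        self.children = []          # child roots, in increasing rank order

def link(h1, h2):
    # Combine two rank-r binomial trees into one of rank r+1,
    # keeping the smaller key at the root.
    assert h1.rank == h2.rank
    if h2.key < h1.key:
        h1, h2 = h2, h1
    h1.children.append(h2)          # O(1) append, as assumed in the text
    h1.rank += 1
    return h1

def unlink(h):
    # Split a rank-r tree (r >= 1) into its root, now of rank r-1,
    # and its largest child subtree, also of rank r-1.
    assert h.rank >= 1
    child = h.children.pop()
    h.rank -= 1
    return h, child

t = link(BNode(5), BNode(9))        # a rank-1 tree rooted at key 5
r, c = unlink(t)                    # back to two rank-0 trees
assert (r.rank, c.rank, r.key, c.key) == (0, 0, 5, 9)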
log m is O(log n). As a result, the total work remains O(m log n). We assume m ≥ n − 1. Otherwise, a spanning tree of G does not exist.
2 Cost[v] = ∞
3 Parent[v] = null
4 inMST[v] = false
5 MST = null
6 GV = Build Priority Queue on Cost
7 for i in 1..n
8   repeat
9     minv = ExtractMin (GV)   // Break tie arbitrarily
10  until inMST[minv] is false
11  inMST[minv] = true
12  Append (Parent[minv], minv) to MST
13  forall j in 1..degree[minv], with w = adj[minv][j]
14    if inMST[w] == false && Cost[w] > EdgeWeight[minv][j]
15      Cost[w] = EdgeWeight[minv][j]
16      Parent[w] = minv
17      Insert(GV, (w, Cost[w]))
7.12 Summary
Division into two sub-problems at a time is quite standard in the
sequential domain. That is often valuable in parallel algorithms as
well and manifests as a binary computation tree. However, digging
a bit deeper into the work-scheduling principle, subdivision into
more than two problems at a time is useful in the parallel domain.
Furthermore, in parallel algorithms, it may not always pay to carry
out the recursion all the way to the top of the recursion tree, where
increasingly fewer processors are employed. Nonetheless, if the
total work remaining at the top of the tree is small, it matters little.
Indeed, we can exploit that fact by using a less work-efficient and
more time-efficient algorithm at the top levels.
Accelerated cascading is a powerful design pattern for parallel algorithms. It allows us to combine a divide and conquer solution with low time complexity but high work complexity with one that has a higher time complexity and a lower work complexity. Usually, the lower time complexity is obtained by subdividing the problem much more aggressively than into two each time. For example, if we subdivide a problem of size n into O(√n) subproblems each time, the height of the tree shrinks to doubly-log in n: log log n. If we subdivide into two (or a fixed number) at a time, the height is O(log n). However, the shrinkage in height can come at the cost of work complexity. By using the slower algorithm at the lower levels of the tree, we can quickly reduce the problem size. Employing the higher-work-complexity algorithm on the smaller problems then adds up to lower total work.
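The height claim corresponds to the following standard recurrence, stated here for completeness:

\[
T(n) = T\!\left(\sqrt{n}\right) + O(1)
\quad\Longrightarrow\quad
T(n) = O(\log\log n),
\qquad\text{since } n^{1/2^{k}} \le 2 \text{ as soon as } k \ge \log_2\log_2 n .
\]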
Pointer jumping is another handy tool for graph and list traversal,
where multiple processors can proceed with multiple traversals in
parallel, exploiting each other’s traversals. Pipelining is a useful tool
when an algorithm is divisible into parts that need to be performed
situations.
Exercise
7.1. Give an O(1) EREW PRAM algorithm to find the index of the
single 1 in BITS, a list of n bits. There is at most one 1 in BITS. If
no 1 exists in BITS, the output must be n.
7.2. Analyze the time and work complexity of the recursive dependency-
breaking parallel algorithm to compute the prefix-sum introduced
as Method 1 in Section 7.1.
7.3. Prove that the algorithm in Listing 7.5 computes the prefix sum.
Analyze its time and work complexity.
7.4. Modify the algorithm in Listing 7.7 to compute the exclusive prefix-sum. (Try not to first compute the prefix-sum and then derive the exclusive sum from it.)
7.5. Re-pose all three methods for parallel prefix-sum computation discussed in Section 7.1 under the BSP model. Analyze their time and work complexity, and compare their performance.
7.9. Given a list BIT of n 1-bit values, find the lowest index i such that BIT[i] = 1. If no bit is 1, the answer is n. Find an O(log n) time algorithm with O(n) work for EREW PRAM. Find an O(1) time and O(n) work algorithm for common-CRCW PRAM. (Hint: Subdivide BIT into blocks of size √n.)
an O(1) time algorithm with O(n²) work to compute ANSV on Common-CRCW PRAM. (Hint: Use Exercise 7.9.) Find an O(log n) time algorithm with O(n) work to compute ANSV for EREW PRAM.
7.11. Compute Prefix-minima M given input integer list D with
n elements (M[i ] as the minimum of all D [ j] among j < i) in
O(log log n) time using O(n log log n) work on Common-CRCW
PRAM. (Hint: Use Exercise 7.10 to devise accelerated cascading.)
of m edges E, where ith edge E[i ] is a pair of integers (u, v), u <
n, v < n indicating that vertex number u and vertex number v
have an edge between them. Given P PRAM processors, compute
the list RANK such that RANK[j] is the level of vertex number j
in breadth-first search of G starting at vertex 0. You may use any
PRAM model.
7.17. The selection problem is to find the kth smallest element in a list List of n unsorted elements. Devise a parallel selection algorithm taking O(log² n) time and O(n) work on CREW PRAM. (Hint: Consider recursively reducing the problem of selecting from n1 unordered items to a problem of selecting from no more than 3n1/4 items in O(log n1) time with O(n1) work.)