Optimizing Parallel Bitonic Sort
Abstract
Sorting is an important component of many applications, and
parallel sorting algorithms have been studied extensively in the last
three decades. One of the earliest parallel sorting algorithms is
Bitonic Sort, which is represented by a sorting network consisting
of multiple butterfly stages.
This paper studies bitonic sort on modern parallel machines, which are relatively coarse-grained and consist of only a modest number of nodes, so that many data elements must be mapped to each processor. In such a setting, optimizing the bitonic sort algorithm becomes a question of mapping the data elements to processing nodes (the data layout) such that communication is minimized. We developed a bitonic sort algorithm which
minimizes the number of communication steps and optimizes the
local computation. The resulting algorithm is faster than previous
implementations, as experimental results collected on a 64 node
Meiko CS-2 show.
1 Introduction
Sorting is a classic Computer Science problem which has received
much attention. As a parallel application, the problem is especially
interesting because it fundamentally requires communication as
well as computation [ABK95] and is challenging because of the
amount of communication it requires. Parallel sorting is one example of a parallel application for which the transition from a
theoretical model to an efficient implementation is not straightforward. Most of the research on parallel algorithm design in the 70s
and 80s has focused on fine-grain models of parallel computation, such as PRAM or network-based models, where the ratio of
memory to processors is relatively small [BDHM84, JaJ92, KR90,
Lei92, Rei93, Qui94]. Later research has shown, however, that
processor-to-processor communication is the most important bottleneck in parallel computing [ACS90, CKP+ 93, KRS90, PY88,
Val90a, Val90b, AISS95]. Thus efficient parallel algorithms are more likely to be achieved on coarse-grain parallel systems, and in most situations algorithms originally developed for PRAM-based models must be substantially redesigned.
One of the earliest parallel sorting algorithms is Bitonic Sort
[Bat68], which is represented by a sorting network consisting of
multiple butterfly stages of increasing size. The bitonic sorting
network was the first network capable of sorting n elements in
O(lg^2 n) time and, not surprisingly, bitonic sort has been studied
extensively on parallel network topologies such as the hypercube
and shuffle-exchange which provide an easy embedding of butterflies [Sto71]. Various properties of bitonic networks have been
investigated, e.g. [Knu73, HS82, BN89], and recent implementations and evaluations show that although bitonic sort is slow for
large data sets (compared for example with radix sort or sample
sort) it is more space-efficient and represents one of the fastest
alternatives for small data sets [CDMS94, BLM+ 91].
In order to achieve the O(lg^2 n) time bound, the algorithm assumes that each node of the bitonic sorting network is mapped onto
a separate processor and that connected processors can communicate in unit time. Therefore the network size grows proportionally
to the input size. Modern parallel machines, however, generally have a high communication overhead and are much coarser grained,
consisting of only a relatively small number of nodes. Thus many
data elements have to be mapped onto each processor. Under such
a setting optimizing a parallel algorithm becomes a question of
optimizing communication as well as computation.
We derive a new data layout which allows us to perform the
smallest possible number of data remaps. The basic idea is to
locally execute as many steps of the bitonic sorting network as
possible. We show that for the last lg P stages of the bitonic sorting network, which usually require communication, the maximum number of steps that can be executed locally is lg(N/P), where N is the data size and P is the number of processors. Our algorithm remaps the data such that it always executes lg(N/P) steps before remapping again, thus executing the smallest possible number of remap operations.
Compared with previous approaches our algorithm performs fewer communication steps and also transfers less data. Furthermore,
by taking advantage of the special format of the data input, we show
how to optimize the local computation on each node. We develop
an efficient implementation of our algorithm in Split-C [CDG+ 93]
and collect experimental results on a 64 node Meiko CS-2. We also
investigate the factors that influence communication in a remap-based parallel bitonic sort algorithm by analyzing the algorithm
under the framework of realistic models for parallel computation.
Finally, we compare our implementation of bitonic sort against
other parallel sorts.
2 Bitonic Sort
Bitonic sort is based on repeatedly merging two bitonic sequences
to form a larger bitonic sequence. The following basic definitions
were adapted from [KGGK94].
Definition 1 (Bitonic Sequence) A bitonic sequence is a sequence of values a_0, ..., a_{n-1} with the property that (1) there exists an index i, where 0 <= i <= n-1, such that a_0 through a_i is monotonically increasing and a_i through a_{n-1} is monotonically decreasing, or (2) there exists a cyclic shift of indices so that the first condition is satisfied.
[Figure: A 16-input bitonic sorting network built from bitonic merge (BM) blocks of increasing size (BM 2, BM 4, BM 8, BM 16), arranged in four merge stages; each stage consists of butterfly steps (Step 1, Step 2, Step 3, ...) and transforms a bitonic sequence into a sorted sequence.]
3 Optimizing Communication

As we saw from the cyclic-blocked implementation, a good data distribution can dramatically reduce the communication requirements.
Figure 4: Inside Remap and the corresponding absolute and relative address bit pattern. The absolute address specifies the row number of the bitonic sorting network; the relative address specifies the processor (the shaded part) and the local offset within the processor after the remap.

Figure 5: Start Remap and the corresponding absolute and relative address bit pattern.

Given the absolute address of a node, the relative address where the node is remapped is computed as presented in Figure 4 (for s >= lg n) and Figure 5 (for s < lg n).
[Table 1: Number of remaps, total volume, and number of messages for the blocked, cyclic-blocked, and smart data layouts.]
Lemma 2 The smart layout remaps the data such that it can execute exactly lg n steps locally.
The previous lemmas allow for a simple formulation of our parallel
bitonic sorting algorithm:
Algorithm 1 (Smart Layout Parallel Bitonic Sort)
The parallel bitonic sort algorithm for sorting N elements on P processors (n = N/P elements per processor) starts with a blocked data layout and executes the first lg n stages entirely locally. For the last lg P stages it periodically remaps to the smart data layout and executes lg n steps before remapping again.
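For illustration (our sketch, not the paper's implementation), the remap count of Algorithm 1 can be derived from the step structure of the network: stage lg n + k of the bitonic network has lg n + k steps, each remap allows lg n of them to run locally, and we assume batches may span stage boundaries:

```python
from math import log2, ceil

def smart_remap_count(N, P):
    """Count data remaps for the last lg P stages under the smart
    layout: every remap allows lg(N/P) network steps to be executed
    locally (Lemma 2), so the remap count is total steps / lg n."""
    lg_n = int(log2(N // P))   # steps executable locally per remap
    lg_P = int(log2(P))
    # Stage lg n + k of the bitonic sorting network has lg n + k steps.
    total_steps = sum(lg_n + k for k in range(1, lg_P + 1))
    return ceil(total_steps / lg_n)
```

For example, sorting N = 2^20 keys on P = 32 processors gives lg n = 15 and lg P = 5, so the last 5 stages contain 5*15 + 15 = 90 steps and the schedule needs 90/15 = 6 = lg P + 1 remaps.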
Clearly, the smallest number of communication steps is achieved
if we use a remapping strategy that performs the smallest number
of data remaps. Assuming that we don't replicate the data then, as
we showed previously, for the last lg P stages of the bitonic sorting
network lg n is the maximum number of steps that can be executed
locally. The following theorem summarizes our observations:
Theorem 1 Algorithm 1 uses the smallest possible number of data
remaps.
3.2 Communication Complexity Analysis
Our smart data layout minimizes the number of communication
steps, but the total communication time of the algorithm depends
on other factors as well. Analyzing the algorithm under a realistic model of parallel computation which captures the existing
overheads of modern hardware reveals that the important factors
that influence the communication time are: the total number of remaps, the total number of elements transferred (volume), and the total number of messages transferred.
We study three versions of the bitonic sort algorithm using three different data layouts: the blocked, the cyclic-blocked, and the smart data layout. We analyze the communication complexity of the algorithm with respect to the three metrics (the total volume and number of messages are considered per processor). The formulas for all three metrics are summarized in Table 1. For the smart data layout we considered the practical case when lg P(lg P + 1)/2 <= lg n, and in the case of the number of messages we use a lower bound. Observe that with respect to the total number of elements transferred and the number of remaps the smart data layout version is the best.
We refer the interested reader to [Ion96] for a more thorough analysis of how these three abstract metrics determine the actual
communication complexity under the LogP and LogGP models of
parallel computation.
4 Optimizing Computation
In this section we show how we optimize the local computation
by replacing the compare-exchange operations with very fast local
sorts.
The first a steps after the remap (within stage lg n + k): here the input on each processor consists of 2^b bitonic sequences of length 2^a. At the end of this phase, i.e. at the boundary between stages lg n + k and lg n + k + 1, these sequences are sorted. Furthermore, the data on each processor at the end of this phase consists of two sorted sequences, the first one increasing and the second one decreasing.
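Since each processor ends the phase with one increasing and one decreasing sorted run, the remaining compare-exchange work can be replaced by a single linear-time merge. A minimal sketch of such a merge (ours, not the paper's Split-C code):

```python
def merge_up_down(up, down):
    """Merge an ascending run `up` and a descending run `down`
    into one ascending sequence in O(n) time."""
    result = []
    i, j = 0, len(down) - 1          # scan `down` backwards (it descends)
    while i < len(up) and j >= 0:
        if up[i] <= down[j]:
            result.append(up[i]); i += 1
        else:
            result.append(down[j]); j -= 1
    result.extend(up[i:])            # at most one tail is non-empty
    result.extend(reversed(down[:j + 1]))
    return result
```

For example, `merge_up_down([1, 4, 6], [7, 5, 2])` returns `[1, 2, 4, 5, 6, 7]`, touching each element exactly once instead of running lg n compare-exchange steps over the whole block.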
For the first lg n stages, since the keys are in a specified range, we used radix sort, which takes O(n) time. For the last lg P stages, by using only bitonic merges we have reduced the computation complexity to O(n) for each stage. Since we have O(lg P) computation phases, the complexity of the local computation for the entire bitonic sort algorithm is O(n lg P).
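For illustration (our sketch; the paper's Split-C code may differ), an O(n) least-significant-digit radix sort over fixed-width keys using 8-bit digits:

```python
def radix_sort(keys, key_bits=32):
    """LSD radix sort with 8-bit digits: a constant number of O(n)
    passes for keys in a fixed range, as used for the first lg n
    stages of the algorithm."""
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            # Distribute by the current 8-bit digit; appending keeps
            # the pass stable, which is what makes LSD passes compose.
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for b in buckets for k in b]
    return keys
```

Because each pass is stable, sorting by successive digits from least to most significant yields a fully sorted sequence after key_bits/8 linear passes.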
5 Experimental Results
Figure 7: Total execution time (left) and execution time per key (right) for different implementations of the bitonic sort algorithm on 32
processors.
Figure 8: Execution time per key per processor for sample, radix and bitonic sort on 16 processors (left) and 32 processors (right).
Parallel sorting, and sample sort in particular, is also the subject of more recent studies [BHJ96, HJB96]. Notably, the authors' implementation of sample sort is invariant over the set of input distributions.
6 Related Work
Bitonic sort and sorting networks have received special attention
since Batcher [Bat68] showed that fine-grained parallel sorting
networks can sort in O(lg^2 n) time using O(n) processors. Since
then a lot of effort has been directed at fine-grain parallel sorting
algorithms (e.g. see [BDHM84, JaJ92, KR90, Rei93, AKS83,
Lei85, Col88]).
Many of these fine-grained algorithms are not optimal, however, when implemented under more realistic models of parallel computation. The latter make the realistic assumption that the data size N is much larger than the number of processors P. The goal then becomes to design a general-purpose parallel sort algorithm
that is the fastest in practice. One of the first important studies
of the performance of parallel sorting algorithms was conducted
by Blelloch, Leiserson et al. [BLM+91], which compared bitonic, radix, and sample sort on the CM-2. Several issues were emphasized, such as space, stability, portability, and simplicity.
These comparisons were augmented by a new study by Culler et
al. [CDMS94]. Column sort was included and a more general class
of machines was addressed by formalizing the algorithms under the
LogP model. All algorithms were implemented in Split-C making
them available to be ported and analyzed across a wide variety of
parallel architectures. The conclusion of this study was that an
optimized data layout across processors was a crucial factor in
achieving fast algorithms. Optimizing the local computation was also shown to be important.

7 Conclusions

We have shown how to optimize the local computation by replacing the compare-exchange operations with very fast local sorts. Furthermore, we have analyzed three fundamental metrics that influence the communication time of a parallel algorithm
(the number of remaps, the total number of transferred elements,
and the number of messages) and we have shown that the total
communication time is dependent upon all three metrics and minimizing just one of them is not sufficient to obtain a communication
optimal algorithm. Our experimental results have shown that our
implementation is much faster than any previous implementation
of parallel bitonic sort, and for a small number of processors or
small data sets our algorithm is faster than other parallel sorts such
as radix or sample sort.
Overall, we hope that our techniques will be further refined and applied to a larger class of algorithms. We feel that the applicability of our methods extends beyond parallel computing to memory-hierarchy models and numerical computations involving data sets under various layouts.
Acknowledgments

References

[ABK95]
[ACS90]
[AISS95]
[AKS83]
[Bat68] K. Batcher. Sorting Networks and their Applications. In Proceedings of the AFIPS Spring Joint Computing Conference, volume 32, 1968.
[BDHM84] D. Bitton, D. J. DeWitt, D. K. Hsiao, and J. Menon. A Taxonomy of Parallel Sorting. Technical Report TR84-601, Cornell University, Computer Science Department, April 1984.
[BHJ96] D. A. Bader, D. R. Helman, and J. JaJa. Practical Parallel Algorithms for Personalized Communication and Integer Sorting. ACM Journal of Experimental Algorithmics, 1(3), 1996.
[BW97]
[CDG+93]
[HJB96]
[HS82]
[Ion96] M. F. Ionescu. Optimizing Parallel Bitonic Sort. Master's thesis, also available as Technical Report TRCS96-14, Department of Computer Science, University of California, Santa Barbara, July 1996.
[JaJ92]
[Knu73]
[KR90]
[KRS90] C. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 1990.
[Lei85]
[Lei92] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[PY88]
[Qui94]
[Rei93] J. H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1993.
[SS95]
[Sto71]
[Val90a]
[Val90b] L. G. Valiant. Parallel Algorithms for Shared-Memory Machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science. Elsevier Science Publishers, 1990.