The Performance of A Selection of Sorting Algorithms On A General Purpose Parallel Computer
SUMMARY
In the past few years, there has been considerable interest in general purpose computational
models of parallel computation to enable independent development of hardware and software.
The BSP and related models represent an important step in this direction, providing a simple
view of a parallel machine and permitting the design and analysis of algorithms whose perfor-
mance can be predicted for real machines. In this paper we analyse the performance of three
sorting algorithms on a BSP-type architecture and show qualitative agreement between exper-
imental results from a simulator and theoretical performance equations. © 1998 John Wiley & Sons, Ltd.
1. INTRODUCTION
Parallel machines are in widespread use, but most applications use architecture-dependent
software which is not portable and quickly becomes obsolete with each new generation
of parallel machines. Some researchers[1–4] argue that a parallel model of computation is
required which separates hardware and software development so that portable software can
be developed for a range of different parallel architectures.
Parallel models of computation can broadly be classified into two categories: PRAM
(shared memory) based models and network (nonshared memory) models. The former
type has been extensively used for analysing the complexity of parallel algorithms. In this
idealised model processors operate completely synchronously and have access to a single
shared memory whose cells can be accessed in unit time. Although these assumptions ease
the design and analysis of parallel algorithms they usually result in algorithms which are
unsuitable for direct implementation on current parallel machines. Network models are
based on real parallel machines but, by including realistic costs for operations, they make
analysis more difficult. In particular, by including the topology of the architecture in the
model, algorithms map with different degrees of difficulty on to machines with different
topologies. The bulk-synchronous parallel (BSP) model developed by Valiant[4] attempts
to bridge the gap between these two types of model.
∗ Correspondence to: R. D. Dowsing, School of Information Systems, University of East Anglia, Norwich
NR4 7TJ, UK. (e-mail: rdd@sys.uea.ac.uk)
† Present Address: Informatics Department, Universidade Federale de Goias - IMF, Caixa Postal 131,
Campus II, CEP:74001-970, Goiania - GO - Brazil.
Contract grant sponsor: CNPq, Brazil; Contract grant number: 200721/91-7.
The BSP model – and related models – define a general purpose computational model in
which the programmer is presented with a two-level memory hierarchy. Each processor has
its own local memory and access to a common shared memory. Logically, shared memory
is a uniform address space, although physically it may be implemented across a range of
distributed memories. The model abstracts the topology of the machine by characterising the
communication network via two parameters (L and g)[4], which are related to the latency
and bandwidth, respectively. Global accesses are costed using these two parameters; the
lower their values, the lower the communication costs. In the limit this corresponds to the idealised
PRAM model, where global operations are assumed to take the same time as local operations. In
existing parallel machines, however, these values are considerably higher and dependent
on the access patterns to global memory. Valiant[5] has shown that by introducing some
random behaviour in the routing algorithm (two-phase randomised routing) it is possible
for real machines to maintain good bandwidth and latency. This enables real machines to
reflect a two-level memory hierarchy, thus hiding physical topology from the programmer.
or receptions, the reciprocal of g being the bandwidth per processor. The parameter g is
similar to that in the BSP model, in that it provides a measure of the efficiency of message
delivery. However, since there is no implicit synchronisation in the LogP model, the notion
of supersteps performing h-relations does not apply to this model. The model assumes that
the network has a finite capacity, i.e. each processor can have no more than L/g outstanding
messages in the network at any one time. Processors attempting to exceed this limit are
stalled until the message transfer can be initiated. This is in contrast to the BSP model,
where any balanced communication event can be done in gh time. The performance of
LogP algorithms can be quantified by summing all computation and communication costs
of the algorithm. Communications are costed in terms of primitive message events. For
example, the cost for reading a remote location is 2L + 4o. Two message transmissions are
required, one requesting and another sending the data. In each transmission each processor
involved in the operation spends o time units interacting with the network and the message
takes L time units to get to its destination. The cost of a write operation is the same,
although in this case the response is the acknowledgement required for sequential consistency.
This analysis assumes the data fits into a single transmission block. When dealing with a
block of n such basic data items, the cost becomes 2L + 4o + (n − 1)g, assuming o < g. This
is because, after the first transmission, subsequent transmissions have to wait g time units.
The LogP model encourages the use of balanced communication events so as to avoid a
processor being flooded with incoming messages – a situation where all but L/g of the
sending processors would stall.
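To make this costing concrete, the small C fragment below simply evaluates the LogP expressions quoted above for a single remote read and for a block read; the parameter values in main() are illustrative placeholders only, not figures from the paper or from any particular machine.

/* A minimal sketch (not from the paper): evaluating the LogP cost
   expressions quoted above.  The parameter values in main() are
   illustrative placeholders only. */
#include <stdio.h>

/* cost of reading one remote word: request plus reply, each paying
   the overhead o at sender and receiver and the latency L */
static double logp_read_word(double L, double o)
{
    return 2.0 * L + 4.0 * o;
}

/* cost of reading a block of n words, assuming o < g so that successive
   transmissions are spaced g time units apart */
static double logp_read_block(double L, double o, double g, int n)
{
    return 2.0 * L + 4.0 * o + (n - 1) * g;
}

int main(void)
{
    double L = 10.0, o = 2.0, g = 4.0;           /* placeholder values */
    printf("single word  : %.1f\n", logp_read_word(L, o));
    printf("64-word block: %.1f\n", logp_read_block(L, o, g, 64));
    return 0;
}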
The WPRAM model views the BSP and LogP models as architectural models. The
following description is restricted to this level of abstraction.∗ The WPRAM model attempts
to extend the BSP model to a more flexible form. One important difference is that barrier
synchronisation is supported at the message-passing level using point-to-point routing.
Message passing can also be used to implement any synchronisation operation. This makes
the WPRAM model closer to the LogP model. However, while the BSP and LogP models
are applicable to a broad range of machine classes, the WPRAM model has been designed
for a class of scalable distributed memory machines. This means that network latency should
increase at a logarithmic rate with respect to the number of processors, i.e. D = O(log(p)),
and that each processor should be able to send messages into the network at a constant
frequency, i.e. g = O(1). The global parameters D and g are similar to L and g defined in
the LogP model. However, instead of an upper bound given by a constant, the parameter D
represents a mean delay which increases logarithmically with the number of processors. It
is modelled by the mean network delay resulting from continuous random traffic and it thus
includes the effects of switch contention and contention for shared data at the destination
processor. The parameter g is the same as that of the LogP model, though no limit is
imposed on the total number of outstanding messages a processor can have in the network.
Because the network is capable of handling a constant maximum frequency of accesses per
processor, the analysis of WPRAM algorithms is facilitated since the programmer does not
have to be concerned with a network capacity limit, as is the case with LogP algorithms.
Besides the global parameters D and g, the WPRAM model defines a number of other
machine parameters. These parameters have been incorporated in a simulator[6] so that the
execution time of WPRAM programs can be obtained. This allows one to determine the
∗ At a higher level, the WPRAM model provides a number of operations which, by taking advantage of
the network properties, are guaranteed to be scalable. The use of these operations makes the WPRAM model
particularly suitable for the design and analysis of scalable algorithms.
3. WPRAM SIMULATOR
The WPRAM simulator mentioned earlier was used to obtain the experimental results
given in this paper. The simulator is based on the interaction of processes that are used
to represent both the nodes of the target machine and the user processes. Algorithms are
implemented using a programming interface and can be subsequently executed directly on
the simulator. This way the sequence of operations generated by the program drives the
simulator (execution-driven discrete-event simulation). The WPRAM target architecture
for the simulator is a distributed memory machine, which supports uniform global access by
the use of data randomisation;∗ see Figure 1. The simulator includes a detailed performance
model which costs operations based on measured performance figures for the T9000 transputer
processor and simulations of the C104 packet router[6]. Local operations modelled
by the simulator include arithmetic calculation, context switching, message handling and
local process management. Messages entering the network are assumed to be split up to
guarantee that no one message ties up a switch for long periods of time, while global data
are assumed to be randomised based on the unit of a cache line. Global operations are costed
based on the high-level parameters g and D. The value for g and the performance figures
mentioned above were obtained from the Esprit PUMA Project, while D was derived from
∗ The use of data randomisation, where data are distributed throughout the local memories, obviates the need
for randomised routing.
a simulation of the C104 router carried out at Inmos. Work on validating the simulator
results has been carried out at Leeds University, with a close match being found between
theoretical predictions and experimental results[7].
The simulator is written in C and provides a rich programming interface to execute
algorithms written in C. The programming interface consists of a set of library procedures
which support process management, shared data access and process synchronisation[6].
Only a small subset of these library calls were used in this research, including procedures
for process management (fork, join, my node and my index), read and write procedures
for data access and a procedure to barrier synchronise processes.
4. PERFORMANCE ANALYSIS
In analysing the performance of the algorithms, both the computational performance of the
individual processors and the communication performance of the interconnection network
have been taken into account. Local computations were modelled by run-time functions
resulting from fitting the known form of the functions to the observed times. Local sort
was modelled as ks n log n and local merge as km n. The values of ks and km were
both found to be 1.7 µs. As for global communications, only two parameters,
Ti and Tg , were used. These parameters capture two distinct situations that occur in the
global communications of the algorithms studied. Ti represents the time for read/write of
individual items from/to global memory in a situation where the issuing processor has
to wait for a response – either the data (read) or an acknowledgement (write). Tg is used
when several independent read/write operations are required; pipelining is used to improve
communication performance by overlapping computation and communication, implying a
cost of Tg n to read/write n elements. While Ti is dominated by send/receive overheads and
network delay, these costs are insignificant for block operations (modelled by Tg), where
processor bandwidth is the determining factor. For the machine under study the values of Ti
and Tg , for integer data items, were found to be 47µs and 5.7µs, respectively. In the analysis
a small number of processors is assumed so that the value of Ti can be taken as a constant. Also,
the synchronisation costs of the algorithms have been found to be small, compared to the
other communication costs, and were thus not included in the equations. Although these
assumptions restrict the analysis, they allow the derivation of simpler equations whose
results approximate well to those obtained experimentally.
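As a concrete sketch of this cost model (our C, not part of the WPRAM programming interface), the run-time functions and the measured constants quoted above can be written as follows; logarithms are assumed to be taken to base 2.

/* Sketch of the cost model described above; the constants are the measured
   values quoted in the text (microseconds, integer data items) and the
   function names are ours. */
#include <math.h>
#include <stdio.h>

#define KS 1.7          /* local sort:  KS * n * log2(n)        */
#define KM 1.7          /* local merge: KM * n                  */
#define TI 47.0         /* one individual global read or write  */
#define TG 5.7          /* per element, pipelined block access  */

static double t_sort(double n)  { return KS * n * log2(n); }
static double t_merge(double n) { return KM * n; }
static double t_item(void)      { return TI; }
static double t_block(double n) { return TG * n; }

int main(void)
{
    printf("local sort of 1024 items : %.0f us\n", t_sort(1024));
    printf("local merge of 1024 items: %.0f us\n", t_merge(1024));
    printf("block copy of 1024 items : %.0f us\n", t_block(1024));
    printf("one individual access    : %.0f us\n", t_item());
    return 0;
}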
The following sections describe the performance analysis of a selection of sorting
algorithms. The implementations make use of explicit memory management with the
processes not operating directly on global data, but, instead, copying the data to local
memory and then operating locally. This was necessary due to the high access times
incurred for global data access. It has been found that the ratio between global to local
access time is roughly 50:1. Measurements have also shown that, whenever the number
of accesses involved in a computation step is greater than half the elements involved in
this step, it is faster to copy all elements to local memory and do the computations locally.
For each algorithm, performance equations are derived based on the parameters previously
described. The performance of the algorithms is then analysed for different input sizes
and number of processors. The results obtained theoretically are then compared to those
obtained from the simulator.
5.1.1. Analysis
Computation: The computation time of the Parallel Mergesort algorithm can be calculated
by summing the contributions of the initial sorting phase with the log p merges of the
merging phase, giving
comp(n, p) = sort(n/p) + \sum_{i=0}^{\log p - 1} merge(n/2^i)    (1)
By considering that the n/p-element sequences are all sorted using a sequential version of
mergesort, i.e. sort(n/p) = ks n/p log(n/p), and that a linear merge algorithm is used in
the second phase, i.e. merge(n) = km n, this equation can be simplified to
comp(n, p) = k_s (n/p) \log(n/p) - 2 k_m (n/p) + 2 k_m n    (2)
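The simplification from (1) to (2) is just the geometric series over the merge steps:

\sum_{i=0}^{\log p - 1} k_m \frac{n}{2^i} = k_m n \left( 2 - \frac{2}{p} \right) = 2 k_m n - 2 k_m \frac{n}{p}

which, added to sort(n/p) = k_s (n/p) \log(n/p), gives equation (2).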
Each processor first sorts its n/p-element sequence sequentially in local memory. The even-numbered processors write their sequences to global
memory. Then each odd-numbered processor reads one of these sequences to its local
memory and does the merging with its own sequence. The second merge is performed by
processors 1 and 5. Finally, processor 1 reads the sequence written by processor 5, and does
the final merge, writing the result back to global memory.
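As an informal illustration of this merging schedule (not the authors' WPRAM code), the following C sketch performs the same tree of merges sequentially, with the global reads and writes modelled as plain memory copies; the block layout and all names are ours.

/* Informal sequential sketch of the merging schedule described above:
   p sorted blocks of n/p elements are combined in log2(p) rounds, one
   processor of every pair merging its block with its partner's. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* merge two sorted runs a[0..na-1] and b[0..nb-1] into out */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* data holds p already-sorted blocks of length n/p (p a power of two);
   after the call data[0..n-1] is fully sorted */
static void mergesort_schedule(int *data, int n, int p)
{
    int run = n / p;                            /* initial block length */
    int *tmp = malloc((size_t)n * sizeof *tmp);
    for (int stride = 1; stride < p; stride *= 2) {
        /* one round: disjoint pairs of blocks are merged "in parallel" */
        for (int left = 0; left + stride < p; left += 2 * stride) {
            int *a = data + (size_t)left * run;
            merge(a, stride * run, a + (size_t)stride * run, stride * run, tmp);
            memcpy(a, tmp, (size_t)(2 * stride) * run * sizeof *a);
        }
    }
    free(tmp);
}

int main(void)
{
    int data[8] = {3, 7, 1, 9, 2, 8, 4, 6};     /* four sorted blocks of two */
    mergesort_schedule(data, 8, 4);
    for (int i = 0; i < 8; i++) printf("%d ", data[i]);
    printf("\n");
    return 0;
}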
The communication time of this algorithm is given by the following equation:
comm(n, p) = read(n/p) + write(n/p) + \sum_{i=0}^{\log p - 1} [ read(n/2^{i+1}) + write(n/2^i) ]    (3)
Since the time to copy blocks of data to/from global memory is given by read(n) =
write(n) = Tg n, this equation can be simplified to
comm(n, p) = (3n - n/p) T_g    (4)
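The step from (3) to (4) again follows from the geometric series:

\sum_{i=0}^{\log p - 1} \frac{n}{2^{i+1}} = n - \frac{n}{p}, \qquad \sum_{i=0}^{\log p - 1} \frac{n}{2^i} = 2n - \frac{2n}{p}

so the total volume of data transferred is 2n/p + (n - n/p) + (2n - 2n/p) = 3n - n/p elements, each costed at T_g.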
The total execution time of the Parallel Mergesort algorithm is then the sum of (2) and (4):
T_{exec}(n, p) = k_s (n/p) \log(n/p) - 2 k_m (n/p) + 2 k_m n + (3n - n/p) T_g    (5)
or, substituting the values of the coefficients ks, km and Tg, the total execution time, in
microseconds, is given by
T_{exec}(n, p) = 20.5 n - 9.1 (n/p) + 1.7 (n/p) \log(n/p)    (6)
5.1.2. Results
Figure 4 shows the predicted and measured time to sort 1K, 5K and 10K data elements,
using 2, 4, 8, 16 and 32 processors. Each point in the graph represents the average of five
runs using data generated randomly.
The measured times correspond closely to the predicted times, all being within 2%
of them. As expected from equation (6), if n is kept constant, the execution
time decreases slightly but soon reaches a point where additional processors make little
difference. For a fixed n, as p increases, Texec tends to 20.5n (in agreement with Figure 4).
Increasing the number of processors reduces the time for the sequential sort, since the
sequence assigned to each processor is smaller, but requires more work in the merge phase.
Figure 5 illustrates the algorithm for eight processors; data sets are represented by letters
and processors by numbers. Each processor initially sorts n/p data items; at every
following stage each processor locates the n/p data items it is responsible for in the data sets being merged, and merges them.
For example, for a system of two processors, the data would initially be split into two data
sets and each processor would sequentially sort its own data set. After this each processor
would, in parallel, find the n/2 smallest data items and the n/2 largest data items in the
complete set. This is equivalent to partitioning the two data sets each into two segments
where the number of data items in the pair of segments containing the smaller values is
the same as the number of values in the pair of segments containing the larger values.
The two segments containing the smaller values are then merged, as are the two segments
containing the larger values. The resultant data sets are then concatenated to form the result.
The calculation of the partition position in a data set is complex; the algorithm is defined
in [15].
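For illustration only, one standard way of locating such a partition position is a binary search for the rank-k split of two sorted sequences, sketched below in C; this is not necessarily the formulation used in [15], and all names are ours.

/* Illustrative only: binary search for the partition position of two sorted
   sequences.  Returns ia such that the first ia elements of a together with
   the first k-ia elements of b are the k smallest elements overall. */
#include <stdio.h>

static int partition_rank(const int *a, int na, const int *b, int nb, int k)
{
    int lo = (k > nb) ? k - nb : 0;       /* fewest elements a can supply */
    int hi = (k < na) ? k : na;           /* most elements a can supply   */
    while (lo < hi) {
        int ia = lo + (hi - lo) / 2;      /* candidate count from a       */
        int ib = k - ia;                  /* matching count from b        */
        if (ib > 0 && b[ib - 1] > a[ia])
            lo = ia + 1;                  /* need more elements from a    */
        else
            hi = ia;                      /* ia elements from a suffice   */
    }
    return lo;
}

int main(void)
{
    int a[] = {1, 4, 6, 9}, b[] = {2, 3, 5, 7};
    int k = 4;                            /* find the 4 smallest elements */
    int ia = partition_rank(a, 4, b, 4, k);
    printf("take %d from a and %d from b\n", ia, k - ia);
    return 0;
}

A processor responsible for the i-th output block of n/p elements would call such a routine with k = i(n/p) and k = (i + 1)(n/p) to locate the boundaries of its two input segments.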
5.2.1. Analysis
Computation: As expected, the computation time of this algorithm does not have any linear
component in n. Initially, each processor sorts n/p elements and then they all participate
in each step of the merge phase, each one producing exactly 1/pth of the final merged data.
The computation time is given by
comp(n, p) = sort(n/p) + merge(n/p) \log p    (7)
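The simplified form of this expression, referred to below as equation (8), is not reproduced in this copy; substituting the run-time functions defined earlier gives, as a reconstruction,

comp(n, p) = k_s (n/p) \log(n/p) + k_m (n/p) \log p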
Communication: Figure 6 illustrates how such an implementation can be done. Each processor
starts by reading n/8 elements, sorting them and writing the result to global memory. The
merge phase has, in this case, three steps. In each step, the processors merge and write
n/8 elements. The partition’s boundaries are calculated using a binary search. The decision
whether to copy the entire segments to be searched to local memory or to read the elements
individually as they are required is dependent on the machine parameters and data input size.
For the machine (WPRAM simulator) and data input size (1K, 5K and 10K) considered, it
was verified experimentally that the first option is better only when the number of processors
is small (less than four). Hence it was decided to use the second alternative, where elements
are read individually. The search space size varies in each step of the merge phase; thus,
for the example considered, 2 log(n/8) elements are read in the first step, 2 log(n/4) in the second
and 2 log(n/2) in the final step. In each of these steps, once the partition’s boundaries have
been calculated, the corresponding n/8 elements are read, merged and written back. The
communication time of this algorithm is given by
comm(n, p) = read(n/p) + write(n/p) + \sum_{i=1}^{\log p} [ read(2 \log(n/2^i)) + read(n/p) + write(n/p) ]    (9)
The total execution time of the Balanced Mergesort algorithm is then found by adding (8)
to (10):
T_{exec}(n, p) = k_s (n/p) \log(n/p) + 2 T_g (n/p) + (k_m + 2)(n/p) \log p + (2 \log n - \log p - 1) T_i \log p    (11)
By substituting the values of the coefficients ks, km, Tg and Ti, equation (11) can be
simplified to
T_{exec}(n, p) = 11.4 (n/p) + 1.7 (n/p) \log(n/p) + 3.7 (n/p) \log p + 94 \log n \log p - 47 (\log p)^2 - 47 \log p    (12)
which gives the total execution time in microseconds.
5.2.2. Results
The predicted and measured times for the Balanced Mergesort algorithm are shown in Figure 7.
As with the Mergesort algorithm, the results were obtained for input sizes of 1K, 5K and
10K, using 2, 4, 8, 16 and 32 processors. As can be seen from the graph, the measured
values are all within 1% of the predicted values.
The Balanced Mergesort algorithm removes the bottleneck existing in the previous
algorithm by distributing the work more equally among the processors. Thus the execution
times are much lower than those obtained previously.
5.3.1. Analysis
Computation: The computation time of the PSRS algorithm is calculated by summing up
the contributions of each phase: sorting of the initial n/p elements, sorting of the samples,
calculation of the partitions and merging of the p subsequences:
comp(n, p) = sort(n/p) + sort(p^2) + (p - 1) \log(n/p) + merge(n/p) \log p    (13)
Communication: Figure 9 shows the action of each processor during the execution of the
PSRS algorithm. Initially every processor reads its n/8-element sequence and sorts it. The
regular samples are then selected and written to the global memory. Next, processor 1 reads
all the samples, sorts them and writes the splitters to the global memory. Every processor
then reads the splitters and uses them to find its partitions. The indexes delimiting the local
partitions are then written to the global memory. All these indexes are subsequently read
by each processor, and the n/8-element sequence, sorted previously, is written back to the
global memory. Knowing the indexes delimiting its partitions, processor i now keeps its ith
partition and reads p − 1 other partitions, one from each of the other processors.
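As a purely illustrative sketch (sequential C, with the WPRAM global reads and writes omitted and all names ours), the sampling and partitioning steps described above can be written as follows.

/* Illustrative sketch of the PSRS sampling steps described above. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* each processor picks p regular samples from its sorted block of len items */
static void pick_samples(const int *block, int len, int p, int *samples)
{
    for (int j = 0; j < p; j++)
        samples[j] = block[(long)j * len / p];
}

/* one processor sorts the p*p gathered samples and keeps every p-th one
   as a splitter, giving p-1 splitters in total */
static void pick_splitters(int *all_samples, int p, int *splitters)
{
    qsort(all_samples, (size_t)p * p, sizeof *all_samples, cmp_int);
    for (int j = 1; j < p; j++)
        splitters[j - 1] = all_samples[j * p];
}

/* each processor partitions its sorted block against the splitters:
   bounds[i] is the index of the first element of partition i (lower bound
   found by binary search); bounds[p] is implicitly len */
static void partition_block(const int *block, int len,
                            const int *splitters, int p, int *bounds)
{
    bounds[0] = 0;
    for (int i = 1; i < p; i++) {
        int lo = bounds[i - 1], hi = len;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (block[mid] < splitters[i - 1]) lo = mid + 1; else hi = mid;
        }
        bounds[i] = lo;
    }
}

int main(void)
{
    enum { P = 2, LEN = 4 };
    int blocks[P][LEN] = { {1, 3, 5, 7}, {2, 4, 6, 8} };  /* already sorted */
    int samples[P * P], splitters[P - 1], bounds[P];

    for (int i = 0; i < P; i++)
        pick_samples(blocks[i], LEN, P, samples + i * P);
    pick_splitters(samples, P, splitters);
    for (int i = 0; i < P; i++) {
        partition_block(blocks[i], LEN, splitters, P, bounds);
        printf("block %d: partition boundary at index %d\n", i, bounds[1]);
    }
    return 0;
}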
The communication time of the PSRS algorithm is given by
comm(n, p) = 2 read(n/p) + 2 write(n/p) + read(p) + 3 write(p) + 2 read(p^2)    (15)
5.3.2. Results
The predicted and measured times obtained for the PSRS algorithm are shown in Figure 10.
The same range of input sizes (1K, 5K, 10K) and numbers of processors (2, 4, 8, 16, 32)
were used as in the previous examples. From the graph it can be seen that the measured
and predicted times are within 5% of each other, the difference being more accentuated
when the number of processors is small. This is because equation (15) was obtained by
assuming that each processor reads a total of n/p elements in the distribution phase,
which is an overestimation. Although the exact number of elements read in this phase is
data dependent (the partitions depend on the input data), a better approximation could be
obtained with [(p − 1)/p] read(n/p), since roughly 1/p of the data are already in local
memory. However, this would further complicate the expression which gives the total
execution time – equation (18).
The behaviour of the curves obtained for the PSRS algorithm differs from the curves for
the previous algorithms in that they reach a minimum and then increase. This is because the
PSRS algorithm works best when the number of samples is much less than the number of
elements to be sorted. As the number of processors increases, the overheads of organising
the recombination of the sorted segments outweigh the advantages of performing the
sorting in parallel. However, for a relatively small number of processors, this algorithm
produces better results than the previous ones.
6. COMPARISON
The computation and communication contributions, together with the total execution time,
of the three algorithms, are shown in Figure 11 for the case of 1K elements. As can be seen,
the Parallel Mergesort has the worst performance and benefits least from an increase in the
number of processors. Because its merging steps have to deal with an increasing number of
elements, its communication costs remain quite high. The Balanced Mergesort
gives the best performance overall. By dividing the work to be done in the merging steps
more equally, the communications costs are substantially reduced. For fixed input data,
as the number of processors increases there is a decrease in the number of elements each
processor has to work with. This results in less time spent computing and moving data in
each step, but more steps are required to complete the work. Parallel sorting using regular
sampling has good performance but this is critically dependent on the number of processors.
When compared to the Balanced Mergesort, its best performance is slightly inferior but is
achieved using fewer processors. Thus, if there are not many processors available and their
number can be tailored to the size of the data set, regular sampling is a good technique
to be used. Otherwise, Balanced Mergesort is better since its performance improves more
uniformly with the number of processors.
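As a small illustration of using the performance equations directly, the following C fragment evaluates equations (6) and (12) over a range of processor counts for n = 1K. Logarithms are taken to base 2 and times are in microseconds; the PSRS expression, equation (18), is not reproduced in this copy and is therefore omitted.

/* Sketch: evaluating the predicted execution times, equations (6) and (12),
   over a range of processor counts (times in microseconds). */
#include <math.h>
#include <stdio.h>

static double t_mergesort(double n, double p)            /* equation (6)  */
{
    return 20.5 * n - 9.1 * n / p + 1.7 * (n / p) * log2(n / p);
}

static double t_balanced(double n, double p)              /* equation (12) */
{
    double lp = log2(p), ln = log2(n);
    return 11.4 * n / p + 1.7 * (n / p) * log2(n / p) + 3.7 * (n / p) * lp
         + 94.0 * ln * lp - 47.0 * lp * lp - 47.0 * lp;
}

int main(void)
{
    double n = 1024.0;                                     /* 1K elements */
    for (double p = 2; p <= 32; p *= 2)
        printf("p=%2.0f  mergesort %8.0f us   balanced %8.0f us\n",
               p, t_mergesort(n, p), t_balanced(n, p));
    return 0;
}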
Figure 11. Computation, communication and total execution times of the three sorting algorithms
7. CONCLUSIONS
In this paper we have shown how to predict the performance of some well-known algorithms
on a general purpose parallel architecture using simple analysis of the communication
and computation costs. By deriving performance equations for the algorithms we have
shown that similar results to those produced by the simulator, which includes a much
more detailed performance model, can be obtained. Such analysis can be helpful when
developing algorithms for real parallel machines, since approximate execution times can
be obtained without the step of coding and executing the algorithms. This advantage is
evident with the regular sampling algorithm, where the optimum number of processors can be
determined theoretically instead of by trial and error. However, the performance analysis
developed is only applicable when the communication requirements of the program are
statically predictable.
Of the sorting algorithms implemented, the Parallel Mergesort has the worst perfor-
mance because it does not fully utilise all the available processors. The Balanced Mergesort
overcomes this problem by making sure that each processor has the same amount of work
during the execution, and thus improves the performance. Parallel sorting using regular
sampling makes better use of the processors than Parallel Mergesort but relies on a single
processor to sort the regular samples which results in a bottleneck when the number of
processors is large. Its performance is thus critically dependent on the number of proces-
sors and, unless this number can be tailored to the size of the data set, Balanced Mergesort
produces better performance.
The BSP (automatic mode) and the WPRAM models provide a global shared memory
abstraction which provides a single address space and automatically maps data to processors
by the use of hash functions. However, when the cost of accessing global memory is high,
this facility cannot be utilised efficiently. To overcome this problem, data locality can be
exploited by copying global data to (unhashed) local memory and including synchronisation
points in the program so as to keep memory coherent, i.e. taking advantage of weak
coherency. This was the technique used in the sorting implementations described. However,
even though substantial improvements in performance can be obtained with this approach,
too much burden is placed on the programmer, who has to be concerned with memory
management in addition to algorithm design.
ACKNOWLEDGEMENTS
We would like to thank Jonathan Nash for the WPRAM simulator. One of the authors,
W. S. Martins, acknowledges the support of both the Depart. Estatística e Informática,
UFG, Goiânia, Brazil, and CNPq, Brazil, research grant number 200721/91-7.
REFERENCES
1. D. E. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian and
T. von Eicken, ‘LogP: Towards a realistic model of parallel computation’, Proceedings of the 4th
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPOPP, San
Diego, California, Vol. 28, ACM Press, May 1993, pp. 1–12.
2. W. F. McColl, ‘General purpose parallel computing’, in A. M. Gibbons and P. Spirakis (Eds.),
Lectures on Parallel Computation, Cambridge International Series on Parallel Computation,
Vol. 4, Cambridge University Press, Cambridge, UK, 1993, pp. 337–391.
3. J. M. Nash and P. Dew, ‘Parallel algorithm design on the XPRAM model’, Proceedings of the
2nd Abstract Machines Workshop, Leeds, 1993.
4. L. G. Valiant, ‘A bridging model for parallel computation’, Commun. ACM, 33, 103–111 (1990).
5. L. G. Valiant, ‘General purpose parallel architectures’, in J. van Leeuwen (Ed.), The Handbook
of Theoretical Computer Science, North Holland, 1990.
6. J. M. Nash, ‘A study of the XPRAM model for parallel computing’, PhD thesis, University of
Leeds, 1993.
7. J. M. Nash, M. E. Dyer and P. M. Dew, ‘Designing practical parallel algorithms for scalable
message passing machines’, Proceedings of the 1995 World Transputer Congress, September
1995, pp. 529–541.
8. S. G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, FL, 1985.
9. A. Borodin and J. Hopcroft, ‘Routing, merging and sorting on parallel models of computation’,
J. Comput. Syst. Sci., 30, 130–145 (1985).
10. R. Cole, ‘Parallel merge sort’, SIAM J. Comput., 17(4), 770–785 (1988).
11. F. P. Preparata, ‘New parallel sorting schemes’, IEEE Trans. Comput., C-27(7), 669–673 (1978).
12. B. Abali, F. Ozguner and A. Bataineh, ‘Balanced parallel sort on hypercube multiprocessors’,
IEEE Trans. Parallel Distrib. Syst., 4(5), 572–581 (1993).
13. Q. F. Stout, ‘Sorting, merging, selecting and filtering on tree and pyramid machines’, Proceedings
of the International Conference on Parallel Processing, IEEE, New York, August 1983, pp. 214–
221.
14. C. D. Thompson and H. T. Kung, ‘Sorting on a mesh-connected parallel computer’, Commun.
ACM, 20(4), 263–271 (1977).
15. R. S. Francis and I. D. Mathieson, ‘A benchmark parallel sort for shared memory multiproces-
sors’, IEEE Trans. Comput., 37(12), 1619–1626 (1988).
16. X. Li, P. Lu, J. Schaeffer, J. Shillington, P. S. Wong and H. Shi, ‘On the versatility of parallel
sorting by regular sampling’, Technical Report TR 91-06, March 1991, Department of Computing
Science, University of Alberta, Edmonton, Alberta, Canada.