Engineering In-place (Shared-memory) Sorting Algorithms

Published: 31 January 2022

Abstract

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage comes from the algorithms working in-place, i.e., they need no significant amount of space beyond the input array. Previously, working in-place often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach, taking dynamic load balancing and memory locality into account.
Our new comparison-based algorithm, In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. It also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings.
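To illustrate what a branchless decision tree does in a samplesort, here is a minimal C++ sketch. It is not taken from the authors' code: the type `BranchlessClassifier`, its members, and the use of `int` keys are assumptions made for the example. The sorted splitters are stored as an implicit binary search tree in breadth-first order, and each step computes the next node index arithmetically from the comparison result instead of taking an `if`/`else` path. Compilers typically turn this into conditional-move or set-on-condition instructions, although that is not guaranteed. The handling of many equal elements mentioned above is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration; not the IPS4o implementation.
struct BranchlessClassifier {
    std::size_t num_buckets;  // k, must be a power of two
    std::size_t log_buckets;  // log2(k)
    std::vector<int> tree;    // tree[1..k-1] in breadth-first order; tree[0] unused

    explicit BranchlessClassifier(const std::vector<int>& sorted_splitters)
        : num_buckets(sorted_splitters.size() + 1), log_buckets(0), tree(num_buckets) {
        assert(num_buckets >= 2 && (num_buckets & (num_buckets - 1)) == 0);
        while ((std::size_t{1} << log_buckets) < num_buckets) ++log_buckets;
        build(sorted_splitters, 1, 0, sorted_splitters.size());
    }

    // Place the median of splitters[lo, hi) at `node`, then recurse on both halves.
    void build(const std::vector<int>& s, std::size_t node, std::size_t lo, std::size_t hi) {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        tree[node] = s[mid];
        build(s, 2 * node, lo, mid);
        build(s, 2 * node + 1, mid + 1, hi);
    }

    // Descend log2(k) levels; the comparison result (0 or 1) selects the child.
    // Bucket j receives keys with splitter[j-1] < key <= splitter[j].
    std::size_t classify(int key) const {
        std::size_t i = 1;
        for (std::size_t level = 0; level < log_buckets; ++level) {
            i = 2 * i + static_cast<std::size_t>(tree[i] < key);
        }
        return i - num_buckets;
    }
};
```

With the splitters 10, 20, 30 this sketch yields four buckets: keys up to 10, keys in (10, 20], keys in (20, 30], and keys above 30.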
Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm.
Claims to have the “best” sorting algorithm, in some sense, can be found in many papers, and they cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This study confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms, which gives one reason to prefer comparison-based algorithms for robust general-purpose sorting.
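The blockwise in-place data distribution named in the abstract can be pictured with the following sequential C++ sketch. It is a simplified illustration under stated assumptions, not the authors' algorithm: the names `BlockwiseResult` and `classify_into_blocks`, the `int` key type, and the per-block bucket record are all hypothetical. Each element is appended to a small buffer for its bucket, and a full buffer is written back as one block into the part of the input array that has already been scanned. Because a flush happens only when at least one full block of elements is buffered, the write position never overtakes the read position, so nothing unread is overwritten. Moving the blocks to their final bucket positions and handling the partially filled buffers are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for the sketch.
struct BlockwiseResult {
    std::vector<std::size_t> block_bucket;    // bucket id of each flushed block (illustration only)
    std::vector<std::vector<int>> leftovers;  // partially filled buffer of each bucket
};

// `classify` maps a key to a bucket id in [0, num_buckets).
// Afterwards, data[0 .. block_bucket.size() * block_size) consists of blocks
// whose elements all belong to the same bucket.
template <typename Classify>
BlockwiseResult classify_into_blocks(std::vector<int>& data, std::size_t num_buckets,
                                     std::size_t block_size, Classify classify) {
    BlockwiseResult result;
    result.leftovers.assign(num_buckets, {});
    for (auto& buffer : result.leftovers) buffer.reserve(block_size);

    std::size_t write = 0;  // next free slot in the already-scanned prefix
    for (std::size_t read = 0; read < data.size(); ++read) {
        const int key = data[read];
        const std::size_t bucket = classify(key);
        std::vector<int>& buffer = result.leftovers[bucket];
        buffer.push_back(key);
        if (buffer.size() == block_size) {
            // At least block_size scanned elements are buffered, so the
            // target block lies entirely within the scanned prefix.
            assert(write + block_size <= read + 1);
            std::copy(buffer.begin(), buffer.end(),
                      data.begin() + static_cast<std::ptrdiff_t>(write));
            write += block_size;
            result.block_bucket.push_back(bucket);
            buffer.clear();
        }
    }
    return result;
}
```

The buffers take space proportional to the number of buckets times the block size, independent of the input length. The `block_bucket` vector is kept only to make the result easy to inspect and grows with the number of blocks, so an actual in-place implementation would need to avoid it.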


Published In

ACM Transactions on Parallel Computing, Volume 9, Issue 1
March 2022
168 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3505221

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 January 2022
Accepted: 01 November 2021
Revised: 01 October 2021
Received: 01 October 2020
Published in TOPC Volume 9, Issue 1

Author Tags

  1. In-place algorithm
  2. branch prediction

Qualifiers

  • Research-article
  • Refereed

Cited By

  • (2024) Sorting on Byte-Addressable Storage: The Resurgence of Tree Structure. Proceedings of the VLDB Endowment 17:6 (1487-1500). DOI: 10.14778/3648160.3648185. Online publication date: 3-May-2024
  • (2024) Parallel Iterative Mistake Minimization (IMM) clustering algorithm for shared-memory systems. Proceedings of the 53rd International Conference on Parallel Processing (1-10). DOI: 10.1145/3673038.3673057. Online publication date: 12-Aug-2024
  • (2024) Parallel Integer Sort: Theory and Practice (Abstract). Proceedings of the 2024 ACM Workshop on Highlights of Parallel Computing (13-14). DOI: 10.1145/3670684.3673403. Online publication date: 17-Jun-2024
  • (2024) Grafite: Taming Adversarial Queries with Optimal Range Filters. Proceedings of the ACM on Management of Data 2:1 (1-23). DOI: 10.1145/3639258. Online publication date: 26-Mar-2024
  • (2024) A new approach to Mergesort algorithm: Divide smart and conquer. Future Generation Computer Systems 157 (330-343). DOI: 10.1016/j.future.2024.03.049. Online publication date: Aug-2024
  • (2024) Split-bucket partition (SBP): a novel execution model for top-K and selection algorithms on GPUs. The Journal of Supercomputing 80:11 (15122-15160). DOI: 10.1007/s11227-024-06031-x. Online publication date: 29-Mar-2024
  • (2024) Refinement of Parallel Algorithms Down to LLVM: Applied to Practically Efficient Parallel Sorting. Journal of Automated Reasoning 68:3. DOI: 10.1007/s10817-024-09701-w. Online publication date: 19-Jun-2024
  • (2023) Optimizing Search and Sort Algorithms: Harnessing Parallel Programming for Efficient Processing of Large Datasets. 2023 2nd International Conference on Automation, Computing and Renewable Systems (ICACRS) (1439-1449). DOI: 10.1109/ICACRS58579.2023.10405268. Online publication date: 11-Dec-2023
  • (2023) Parallel Block-InsertionSort. 2023 IEEE CHILEAN Conference on Electrical, Electronics Engineering, Information and Communication Technologies (CHILECON) (1-7). DOI: 10.1109/CHILECON60335.2023.10418645. Online publication date: 5-Dec-2023
  • (2023) Parallel Multi-Deque Partition Dual-Deque Merge sorting algorithm using OpenMP. Scientific Reports 13:1. DOI: 10.1038/s41598-023-33583-4. Online publication date: 19-Apr-2023
