Article

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Authors: Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007

Pages 348 - 357

https://doi.org/10.1145/1248377.1248436

Published: 09 June 2007

Abstract

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Three implementations of sorting with two different SIMD machineries (x86-64's SSE2 and G5's AltiVec) demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately-sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
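The in-register step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `sort4x4_groups`, the choice of `float` elements, the block size of 16, and the use of SSE `min`/`max` intrinsics are all assumptions made for illustration. The sketch loads 16 floats into four vector registers, transposes them so that each register holds one element of every 4-element group, runs a 5-comparator sorting network across the registers, and stores the result, leaving four sorted runs of four elements. Merging those runs is left out, and NaN inputs are assumed absent. A scalar fallback applies the same network when SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* __m128, _mm_min_ps, _mm_max_ps, _MM_TRANSPOSE4_PS */
#endif

#if !defined(__SSE__)
/* Scalar compare-exchange used only by the fallback path. */
static void cmpxchg_scalar(float *lo, float *hi)
{
    if (*lo > *hi) { float t = *lo; *lo = *hi; *hi = t; }
}
#endif

/* Hypothetical helper: sorts each contiguous group of 4 floats inside a
   16-element block in ascending order. Assumes the input holds no NaNs. */
static void sort4x4_groups(float *a)
{
    assert(a != NULL);
#if defined(__SSE__)
    /* Load the block: one group of 4 per register. */
    __m128 r0 = _mm_loadu_ps(a + 0);
    __m128 r1 = _mm_loadu_ps(a + 4);
    __m128 r2 = _mm_loadu_ps(a + 8);
    __m128 r3 = _mm_loadu_ps(a + 12);
    __m128 t;

    /* After the transpose, register ri holds element i of every group,
       so lane j of r0..r3 is exactly group j. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Branch-free compare-exchange on all four lanes at once. */
#define CMPXCHG(lo, hi) \
    do { t = _mm_min_ps(lo, hi); hi = _mm_max_ps(lo, hi); lo = t; } while (0)

    /* Standard 5-comparator sorting network for 4 inputs. */
    CMPXCHG(r0, r1); CMPXCHG(r2, r3);
    CMPXCHG(r0, r2); CMPXCHG(r1, r3);
    CMPXCHG(r1, r2);
#undef CMPXCHG

    /* Transpose back so each register is one sorted group, then store. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(a + 0, r0);
    _mm_storeu_ps(a + 4, r1);
    _mm_storeu_ps(a + 8, r2);
    _mm_storeu_ps(a + 12, r3);
#else
    /* Portable fallback: the same network, one group at a time. */
    for (int g = 0; g < 4; ++g) {
        float *p = a + 4 * g;
        cmpxchg_scalar(&p[0], &p[1]); cmpxchg_scalar(&p[2], &p[3]);
        cmpxchg_scalar(&p[0], &p[2]); cmpxchg_scalar(&p[1], &p[3]);
        cmpxchg_scalar(&p[1], &p[2]);
    }
#endif
}
```

In a recursive sort, a routine of this kind would plausibly be called once a subarray shrinks to the threshold size, in place of further recursion or an insertion-sort tail. Each compare-exchange in the SIMD path handles four independent comparisons with no data-dependent branches.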

References

[1] K. E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computing Conference, pages 307--314, 1968.

[2] L. Bishop, D. Eberly, T. Whitted, M. Finch, and M. Shantz. Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46--53, 1998.

[3] D. Bitton, D. J. DeWitt, D. K. Hsiao, and J. Menon. A taxonomy of parallel sorting. Computing Surveys, 16(3):287--318, September 1984.

[4] J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Principles and Practice of Parallel Programming PPoPP, pages 206--216, Las Vegas, Nevada, 1997.

[5] M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation PLDI, pages 169--180, Atlanta, GA, June 1999.

[6] Intel. Intel 64 and IA-32 architectures software developer's manual, volume 1: Basic architecture. http://www.intel.com/design/processor/manuals/253665.pdf, 2007.

[7] D. W. Jones. An empirical comparison of priority-queue and event-set implementations. Commun. ACM, 29(4):300--311, 1986.

[8] D. E. Knuth. The Art of Computer Programming, Vol. 3 - Sorting and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1973.

[9] A. LaMarca and R. E. Ladner. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics, 1:4, 1996.

[10] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1997.

[11] X. Li, M. J. Garzarán, and D. Padua. A dynamically tuned sorting library. In Code Generation and Optimization CGO, pages 111--122, Palo Alto, CA, 2004.

[12] X. Li, M. J. Garzarán, and D. Padua. Optimizing sorting with genetic algorithms. In Code Generation and Optimization CGO, pages 99--110, San Jose, CA, March 2005.

[13] D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming Language Design and Implementation PLDI, pages 132--143, 2006.

[14] A. Ranade, S. Kothari, and R. Udupa. Register efficient mergesorting. In High Performance Computing -- HiPC, volume 1970 of LNCS, pages 96--103. Springer, 2000.

[15] G. Ren, P. Wu, and D. Padua. Optimizing data permutations for SIMD devices. In Programming Language Design and Implementation PLDI, pages 118--131, 2006.

[16] X. Shen and C. Ding. Adaptive data partition for sorting using probability distribution. In ICPP '04: Proceedings of the 2004 International Conference on Parallel Processing (ICPP'04), pages 250--257, Washington, DC, USA, 2004. IEEE Computer Society.

[17] H. J. Siegel. The universality of various types of SIMD machine interconnection networks. In Proceedings of the 4th Annual Symposium on Computer Architecture, pages 23--25, Silver Spring, MD, March 1977. ACM SIGARCH/IEEE-CS.

[18] S. A. A. Touati. Register saturation in instruction level parallelism. International Journal of Parallel Programming, 33(4):393--449, 2005.

[19] R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.

[20] R. Wickremesinghe, L. Arge, J. S. Chase, and J. S. Vitter. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7:9, 2002.

[21] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Programming Language Design and Implementation PLDI, pages 298--308, Snowbird, Utah, June 2001.

Index Terms

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
      2. Single instruction, multiple data

Published In

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007

376 pages

ISBN: 9781595936677

DOI: 10.1145/1248377

General Chair: Phillip B. Gibbons, Intel Research, USA

Program Chair: Christian Scheideler, Technische Universität München, Germany

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SPAA07: 19th ACM Symposium on Parallelism in Algorithms and Architectures

June 9 - 11, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics

Total Citations: 24

Total Downloads: 949

Downloads (Last 12 months): 14

Downloads (Last 6 weeks): 1

Cited By

Bingmann T, Dinklage P, Fischer J, Kurpicz F, Ohlebusch E, Sanders P (2023). Scalable Text Index Construction. Algorithms for Big Data, pages 252-284. Online publication date: 18-Jan-2023.
https://doi.org/10.1007/978-3-031-21534-6_14

Kristo A, Vaidya K, Çetintemel U, Misra S, Kraska T, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). The Case for a Learned Sorting Algorithm. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1001-1016. Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3389752

Bingmann T, Marianczuk J, Sanders P (2020). Engineering faster sorters for small sets of items. Software: Practice and Experience, 51:5, pages 965-1004. Online publication date: 2-Nov-2020.
https://doi.org/10.1002/spe.2922

Brankovic S, Markovic A, Simic D, Rikalo A (2019). Improving performance of sorting small arrays on MIPS CPUs using bitonic sort and SIMD instructions. 2019 27th Telecommunications Forum (TELFOR), pages 1-4. Online publication date: Nov-2019.
https://doi.org/10.1109/TELFOR48224.2019.8971325

Huang X, Liu Z, Li J (2019). Array sort: an adaptive sorting algorithm on multi-thread. The Journal of Engineering, 2019:5, pages 3455-3459. Online publication date: 1-May-2019.
https://doi.org/10.1049/joe.2018.5154

Léonardon M, Cassagne A, Leroux C, Jégo C, Hamelin L, Savaria Y (2019). Fast and Flexible Software Polar List Decoders. Journal of Signal Processing Systems, 91:8, pages 937-952. Online publication date: 1-Aug-2019.
https://dl.acm.org/doi/10.1007/s11265-018-1430-3

Hou K, Wang H, Feng W (2018). A Framework for the Automatic Vectorization of Parallel Sort on x86-Based Processors. IEEE Transactions on Parallel and Distributed Systems, 29:5, pages 958-972. Online publication date: 1-May-2018.
https://doi.org/10.1109/TPDS.2018.2789903

Bundala D, Codish M, Cruz-Filipe L, Schneider-Kamp P, Závodný J (2017). Optimal-depth sorting networks. Journal of Computer and System Sciences, 84:C, pages 185-204. Online publication date: 1-Mar-2017.
https://dl.acm.org/doi/10.1016/j.jcss.2016.09.004

Li X, Chen C, Luo Y, Chen M (2017). Optimization Scheme Based on Parallel Computing Technology. Parallel Architecture, Algorithm and Programming, pages 504-513. Online publication date: 6-Oct-2017.
https://doi.org/10.1007/978-981-10-6442-5_48

Codish M, Cruz-Filipe L, Nebel M, Schneider-Kamp P (2016). Optimizing sorting algorithms by using sorting networks. Formal Aspects of Computing, 29:3, pages 559-579. Online publication date: 4-Nov-2016.
https://doi.org/10.1007/s00165-016-0401-3
