DOI:10.1145/1248377.1248436

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Published: 09 June 2007

    Abstract

    Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions that manipulate the data stored in them. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted drops to a set threshold, the data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Three implementations of sorting on two different SIMD machineries, x86-64's SSE2 and the G5's AltiVec, demonstrate that this idea delivers significant speed improvements. These improvements are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. Using a similar technique, the wall-clock performance of d-heaps is improved by up to 39%.
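The threshold idea described in the abstract can be sketched in portable C. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `hybrid_sort`, `sort_tail`, `cmpx`, and the cutoff `TAIL_THRESHOLD` are hypothetical, the quicksort partition is a textbook Lomuto scheme chosen for brevity, and scalar min/max operations stand in for the SSE2 or AltiVec vector instructions the paper targets.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TAIL_THRESHOLD 4 /* hypothetical cutoff, not the paper's tuned value */

/* Branch-free compare-exchange: afterwards *lo <= *hi. */
static void cmpx(int *lo, int *hi) {
    int a = *lo, b = *hi;
    *lo = a < b ? a : b; /* min */
    *hi = a < b ? b : a; /* max */
}

/* Sort at most 4 ints with a fixed 5-comparator sorting network.
   The local array r plays the role of the registers: load, sort, store. */
static void sort_tail(int *v, size_t n) {
    assert(n <= TAIL_THRESHOLD);
    int r[4] = { INT_MAX, INT_MAX, INT_MAX, INT_MAX }; /* pad with sentinels */
    for (size_t i = 0; i < n; i++) r[i] = v[i];
    cmpx(&r[0], &r[1]); cmpx(&r[2], &r[3]); /* layer 1: independent pairs */
    cmpx(&r[0], &r[2]); cmpx(&r[1], &r[3]); /* layer 2: independent pairs */
    cmpx(&r[1], &r[2]);                     /* layer 3 */
    for (size_t i = 0; i < n; i++) v[i] = r[i];
}

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition around the middle element; returns the pivot's final index. */
static size_t partition(int *v, size_t n) {
    swap_int(&v[n / 2], &v[n - 1]);
    int pivot = v[n - 1];
    size_t store = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (v[i] < pivot) swap_int(&v[i], &v[store++]);
    swap_int(&v[store], &v[n - 1]);
    return store;
}

/* Quicksort that hands sub-arrays at or below the threshold to the network. */
void hybrid_sort(int *v, size_t n) {
    if (n <= TAIL_THRESHOLD) { sort_tail(v, n); return; }
    size_t p = partition(v, n);
    hybrid_sort(v, p);
    hybrid_sort(v + p + 1, n - p - 1);
}
```

Calling `hybrid_sort(a, n)` sorts `a` in ascending order. The compare-exchanges within each layer of the network touch disjoint elements and contain no data-dependent branches, which is the property that lets a processor overlap them. A SIMD version would presumably replace each layer with vector min/max instructions operating on whole registers, which this scalar sketch does not attempt.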

    References

    [1]
    K. E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computing Conference, pages 307--314, 1968.
    [2]
    L. Bishop, D. Eberly, T. Whitted, M. Finch, and M. Shantz. Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46--53, 1998.
    [3]
    D. Bitton, D. J. DeWitt, D. K. Hsiao, and J. Menon. A taxonomy of parallel sorting. Computing Surveys, 16(3):287--318, September 1984.
    [4]
    J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Principles and Practice of Parallel Programming PPoPP, pages 206--216, Las Vegas, Nevada, 1997.
    [5]
    M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation PLDI, pages 169--180, Atlanta, GA, June 1999.
    [6]
    Intel. Intel 64 and IA-32 architectures software developer's manual volume 1: Basic architecture. http://www.intel.com/design/processor/manuals/253665.pdf, 2007.
    [7]
    Douglas W. Jones. An empirical comparison of priority-queue and event-set implementations. Commun. ACM, 29(4):300--311, 1986.
    [8]
    Donald Ervin Knuth. The Art of Computer Programming, Vol. 3 - Sorting and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1973.
    [9]
    A. LaMarca and R. E. Ladner. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics, 1:4, 1996.
    [10]
    A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1997.
    [11]
    X. Li, M. J. Garzarán, and D. Padua. A dynamically tuned sorting library. In Code Generation and Optimization CGO, pages 111--122, Palo Alto, CA, 2004.
    [12]
    X. Li, M. J. Garzarán, and D. Padua. Optimizing sorting with genetic algorithms. In Code Generation and Optimization CGO, pages 99--110, San Jose, CA, March 2005.
    [13]
    D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming language design and implementation PLDI, pages 132--143, 2006.
    [14]
    A. Ranade, S. Kothari, and R. Udupa. Register efficient mergesorting. In High Performance Computing -- HiPC, volume 1970 of LNCS, pages 96--103. Springer, 2000.
    [15]
    Gang Ren, Peng Wu, and David Padua. Optimizing data permutations for SIMD devices. In Programming language design and implementation PLDI, pages 118--131, 2006.
    [16]
    Xipeng Shen and Chen Ding. Adaptive data partition for sorting using probability distribution. In ICPP '04: Proceedings of the 2004 International Conference on Parallel Processing (ICPP'04), pages 250--257, Washington, DC, USA, 2004. IEEE Computer Society.
    [17]
    H. J. Siegel. The universality of various types of SIMD machine interconnection networks. In Proceedings of the 4th Annual Symposium on Computer Architecture, pages 23--25, Silver Spring, MD, March 1977. ACM SIGARCH/IEEE-CS.
    [18]
    S. A. A. Touati. Register saturation in instruction level parallelism. International Journal of Parallel Programming, 33(4):393--449, 2005.
    [19]
    R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.
    [20]
    R. Wickremesinghe, L. Arge, J. S. Chase, and J. S. Vitter. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7:9, 2002.
    [21]
    J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Programming Language Design and Implementation PLDI, pages 298--308, Snowbird, Utah, June 2001.

        Published In

        SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
        June 2007
        376 pages
        ISBN:9781595936677
        DOI:10.1145/1248377
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. SIMD
        2. instruction-level parallelism
        3. quicksort
        4. sorting
        5. sorting networks
        6. vectorization

        Qualifiers

        • Article

        Conference

        SPAA07

        Acceptance Rates

        Overall Acceptance Rate 447 of 1,461 submissions, 31%

        Cited By

        • (2023) Scalable Text Index Construction. Algorithms for Big Data, pages 252--284. DOI:10.1007/978-3-031-21534-6_14. Online publication date: 18-Jan-2023
        • (2020) The Case for a Learned Sorting Algorithm. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1001--1016. DOI:10.1145/3318464.3389752. Online publication date: 11-Jun-2020
        • (2020) Engineering faster sorters for small sets of items. Software: Practice and Experience, 51(5):965--1004. DOI:10.1002/spe.2922. Online publication date: 2-Nov-2020
        • (2019) Improving performance of sorting small arrays on MIPS CPUs using bitonic sort and SIMD instructions. 2019 27th Telecommunications Forum (TELFOR), pages 1--4. DOI:10.1109/TELFOR48224.2019.8971325. Online publication date: Nov-2019
        • (2019) Array sort: an adaptive sorting algorithm on multi-thread. The Journal of Engineering, 2019(5):3455--3459. DOI:10.1049/joe.2018.5154. Online publication date: 1-May-2019
        • (2019) Fast and Flexible Software Polar List Decoders. Journal of Signal Processing Systems, 91(8):937--952. DOI:10.1007/s11265-018-1430-3. Online publication date: 1-Aug-2019
        • (2018) A Framework for the Automatic Vectorization of Parallel Sort on x86-Based Processors. IEEE Transactions on Parallel and Distributed Systems, 29(5):958--972. DOI:10.1109/TPDS.2018.2789903. Online publication date: 1-May-2018
        • (2017) Optimal-depth sorting networks. Journal of Computer and System Sciences, 84(C):185--204. DOI:10.1016/j.jcss.2016.09.004. Online publication date: 1-Mar-2017
        • (2017) Optimization Scheme Based on Parallel Computing Technology. Parallel Architecture, Algorithm and Programming, pages 504--513. DOI:10.1007/978-981-10-6442-5_48. Online publication date: 6-Oct-2017
        • (2016) Optimizing sorting algorithms by using sorting networks. Formal Aspects of Computing, 29(3):559--579. DOI:10.1007/s00165-016-0401-3. Online publication date: 4-Nov-2016
