Streaming Sorting Networks

Authors:

Marcela Zuluaga,

Markus Püschel

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 21, Issue 4

Article No.: 55, Pages 1 - 30

https://doi.org/10.1145/2854150

Published: 27 May 2016

Abstract

Sorting is a fundamental problem in computer science and has been studied extensively. Thus, a large variety of sorting methods exist for both software and hardware implementations. For the latter, there is a trade-off between the throughput achieved and the cost (i.e., the logic and storage invested to sort n elements). Two popular solutions are bitonic sorting networks with O(n log² n) logic and storage, which sort n elements per cycle, and linear sorters with O(n) logic and storage, which sort n elements per n cycles. In this article, we present new hardware structures that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog. With the new networks, we achieve novel and improved cost-performance trade-offs. For example, assuming that n is a power of two and w is any divisor of n, one class of these networks can sort in n/w cycles with O(w log² n) logic and O(n log² n) storage; the other class that we present sorts in (n log² n)/w cycles with O(w) logic and O(n) storage. We carefully analyze the performance of these networks and their cost at three levels of abstraction: (1) asymptotically, (2) exactly in terms of the number of basic elements needed, and (3) in terms of the resources required by the actual circuit when mapped to a field-programmable gate array. The accompanying hardware generator allows us to explore the entire design space, identify the Pareto-optimal solutions, and show superior cost-performance trade-offs compared to prior work.
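
To make the baseline in the abstract concrete, the following is a minimal software sketch of a standard (fully parallel) bitonic sorting network, together with the leading-order cost terms quoted above. It is not the authors' generator and does not model the streaming architectures; the function names `bitonic_stages`, `apply_network`, and `leading_order_costs` are hypothetical, and the cost function simply evaluates the asymptotic expressions from the abstract with all constant factors dropped (an assumption made for illustration only).

```python
def bitonic_stages(n):
    """Stages of a bitonic sorting network for n = 2^t inputs.

    Each stage is a list of (lo, hi) wire pairs; a comparator puts the
    smaller value on wire lo and the larger value on wire hi.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = []
    k = 2
    while k <= n:                # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # comparator distance inside this merge step
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner) if ascending else (partner, i))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages, values):
    """Run the comparator stages on a copy of values and return the result."""
    out = list(values)
    for stage in stages:
        for lo, hi in stage:
            if out[lo] > out[hi]:
                out[lo], out[hi] = out[hi], out[lo]
    return out


def leading_order_costs(n, w):
    """Leading-order terms from the abstract, constants dropped (assumption)."""
    if n < 2 or n & (n - 1) or w < 1 or n % w:
        raise ValueError("n must be a power of two and w a divisor of n")
    t = n.bit_length() - 1       # log2(n) for a power of two
    return {
        "class 1": {"cycles": n // w, "logic": w * t * t, "storage": n * t * t},
        "class 2": {"cycles": n * t * t // w, "logic": w, "storage": n},
    }


stages = bitonic_stages(8)
sorted_values = apply_network(stages, [5, 2, 7, 1, 8, 3, 6, 4])
costs = leading_order_costs(1024, 4)
```

For n = 8 the sketch produces 6 stages of 4 comparators each, in line with the O(n log² n) comparator count of the bitonic network. The numbers returned by `leading_order_costs` are only orders of magnitude for comparing the two classes at a given n and w; the exact element counts and FPGA resources are what the article itself analyzes.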

Index Terms

Streaming Sorting Networks
1. Hardware
  1. Electronic design automation
    1. Hardware description languages and compilation
    2. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Logic circuits
    2. Reconfigurable logic and FPGAs
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 21, Issue 4

September 2016

423 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/2939671

Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 27 May 2016

Accepted: 01 November 2015

Received: 01 October 2015

Published in TODAES Volume 21, Issue 4

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total citations: 33
Total downloads: 433 (last 12 months: 37; last 6 weeks: 6)

Reflects downloads up to 13 Jan 2025
