Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3173162.3173168acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article
Public Access

Automatic Hierarchical Parallelization of Linear Recurrences

Published: 19 March 2018 Publication History
  • Get Citation Alerts
  • Abstract

    Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.

    References

    [1]
    Alg3: https://github.com/andmax/gpufilter/, accessed 8/8/2017.
    [2]
    G.E. Blelloch. "Scans as Primitive Parallel Operations." IEEE Transactions on Computers, 38(11):1526--1538. 1989.
    [3]
    G.E. Blelloch. "Prefix Sums and Their Applications." In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.
    [4]
    G. Chaurasia, J. Ragan-Kelley, S. Paris, G. Drettakis, and F. Durand. "Compiling High Performance Recursive Filters." In Proceedings of the 7th Conference on High-Performance Graphics, pp. 85--94. 2015.
    [5]
    CUB: https://nvlabs.github.io/cub/, accessed 8/8/2017.
    [6]
    Y. Dotsenko, N.K. Govindaraju, P.P. Sloan, C. Boyd, and J. Manferdelli. "Fast Scan Algorithms on Graphics Processors." In Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 205--213. 2008.
    [7]
    J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra. "Fast Summed-Area Table Generation and its Applications." Computer Graphics Forum, 24(3):547--555. 2005.
    [8]
    W.D. Hillis and G.L. Steele. "Data Parallel Algorithms." Communications of the ACM, 29(12): 1170--1183. 1986.
    [9]
    R.M. Karp, R.E. Miller, and S. Winograd. "The Organization of Computations for Uniform Recurrence equations." Journal of the ACM, 14:3, pp. 563--590. 1967.
    [10]
    P.M. Kogge and H.S. Stone. "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations." IEEE Transactions on Computers, 22(8):786--793. 1973.
    [11]
    S. Maleki, A. Yang, and M. Burtscher. "Higher-Order and Tuple-Based Massively-Parallel Prefix Sums." In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 539--552. 2016.
    [12]
    B. Merry. "A Performance Comparison of Sort and Scan Libraries for GPUs." World Scientific Publishing Company. 2014.
    [13]
    D. Merrill and M. Garland. "Single-Pass Parallel Prefix Scan with Decoupled Look-back." NVIDIA Technical Report NVR-2016-002. 2016.
    [14]
    n-nacci numbers: https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers, accessed 8/8/2017.
    [15]
    D. Nehab, A. Maximo, R.S. Lima, and H. Hoppe. "GPU-Efficient Recursive Filtering and Summed-Area Tables." In Proceedings of the SIGGRAPH Asia Conference, pp. 176:1--176:12. 2011.
    [16]
    A.V. Oppenheim and R.W. Schafer. "Discrete-Time Signal Processing." 3rd Edition. Prentice Hall. 2009.
    [17]
    Rec: https://github.com/mit-gfx/recfilter, accessed 8/8/2017.
    [18]
    SAM: http://cs.txstate.edu/~burtscher/research/SAM/, accessed 8/8/2017.
    [19]
    S. Sengupta, A.E. Lefohn, and J.D. Owens. "A Work-Efficient Step-Efficient Prefix Sum Algorithm." In Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, pp. 26--27. 2006.
    [20]
    S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. "Scan Primitives for GPU Computing." In Proceedings of Graphics Hardware, pp. 97--106. 2007.
    [21]
    S. Sengupta, M. Harris, and M. Garland. "Efficient Parallel Scan Algorithms for GPUs." NVIDIA. 2008 - gpucomputing.net.
    [22]
    S.W. Smith. "Digital Signal Processing: A Practical Guide for Engineers and Scientists." Newnes, 2002. ISBN 0--7506--7444-X.
    [23]
    H.S. Stone. "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations." Journal of the ACM, 20(1):27--38. 1973.
    [24]
    W. Sung and S. Mitra. "Efficient Multi-Processor Implementation of Recursive Digital Filters." In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 11:257--260. 1986.
    [25]
    W. Thies, M. Karczmarek, and S.P. Amarasinghe. "StreamIt: A Language for Streaming Applications." In Proceedings of the 11th International Conference on Compiler Construction, pp. 179--196. 2002.

    Cited By

    View all
    • (2023)cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End PerformanceProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607048(1-13)Online publication date: 12-Nov-2023
    • (2021)GPU efficient 1D and 3D recursive filteringDigital Signal Processing10.1016/j.dsp.2021.103076114(103076)Online publication date: Jul-2021
    • (2020)Scaling out speculative execution of finite-state machines with parallel mergeProceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3332466.3374524(160-172)Online publication date: 19-Feb-2020
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2018
    827 pages
    ISBN:9781450349116
    DOI:10.1145/3173162
    • cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 53, Issue 2
      ASPLOS '18
      February 2018
      809 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/3296957
      Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 March 2018

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. automatic parallelization
    2. code optimization
    3. linear recurrences
    4. prefix sums
    5. recursive filters

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    ASPLOS '18

    Acceptance Rates

    ASPLOS '18 Paper Acceptance Rate 56 of 319 submissions, 18%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)106
    • Downloads (Last 6 weeks)14

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End PerformanceProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607048(1-13)Online publication date: 12-Nov-2023
    • (2021)GPU efficient 1D and 3D recursive filteringDigital Signal Processing10.1016/j.dsp.2021.103076114(103076)Online publication date: Jul-2021
    • (2020)Scaling out speculative execution of finite-state machines with parallel mergeProceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3332466.3374524(160-172)Online publication date: 19-Feb-2020
    • (2020)An efficient parallel strategy for high-cost prefix operationThe Journal of Supercomputing10.1007/s11227-020-03473-xOnline publication date: 5-Nov-2020
    • (2019)Enabling prefix sum parallelism pattern for recurrences with principled function reconstructionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307354(17-28)Online publication date: 16-Feb-2019
    • (2021)Parallel Tiled Code for Computing General Linear Recurrence EquationsElectronics10.3390/electronics1017205010:17(2050)Online publication date: 25-Aug-2021
    • (2021)BiPartProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3437801.3441611(161-174)Online publication date: 17-Feb-2021

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media