Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Published: 16 June 2013 Publication History

Abstract

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

References

[1]
A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010.
[2]
J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009.
[3]
M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.
[4]
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004.
[5]
J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007.
[6]
CoreImage. Apple CoreImage programming guide, 2006.
[7]
J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009.
[8]
C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.
[9]
K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006.
[10]
M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005.
[11]
M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.
[12]
P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.
[13]
J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012.
[14]
IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.
[15]
S. Kamil, C. Chan, L. Oliker, J. Shalf, and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.
[16]
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007.
[17]
J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.
[18]
R. Moore. Interval Analysis. 1966.
[19]
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010.
[20]
OpenMP. OpenMP. http://openmp.org/.
[21]
S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009.
[22]
S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011.
[23]
PixelBender. Adobe PixelBender reference, 2010.
[24]
H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991.
[25]
M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.
[26]
J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012.
[27]
M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994.
[28]
Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011.
[29]
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002.
[30]
P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989.
[31]
X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.

Cited By

View all
  • (2025)CG-Kit: Code Generation Toolkit for performant and maintainable variants of source code applied to Flash-X hydrodynamics simulationsFuture Generation Computer Systems10.1016/j.future.2024.107511163(107511)Online publication date: Feb-2025
  • (2024)Towards Full Stack Adaptivity in Permissioned BlockchainsProceedings of the VLDB Endowment10.14778/3641204.364121617:5(1073-1080)Online publication date: 1-Jan-2024
  • (2024)FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673076(1022-1031)Online publication date: 12-Aug-2024
  • Show More Cited By

Index Terms

  1. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 48, Issue 6
    PLDI '13
    June 2013
    515 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2499370
    Issue’s Table of Contents
    • cover image ACM Conferences
      PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2013
      546 pages
      ISBN:9781450320146
      DOI:10.1145/2491956
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 June 2013
    Published in SIGPLAN Volume 48, Issue 6

    Check for updates

    Author Tags

    1. autotuning
    2. compiler
    3. domain specific language
    4. gpu
    5. image processing
    6. locality
    7. optimization
    8. parallelism
    9. redundant computation
    10. vectorization

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)599
    • Downloads (Last 6 weeks)93
    Reflects downloads up to 10 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)CG-Kit: Code Generation Toolkit for performant and maintainable variants of source code applied to Flash-X hydrodynamics simulationsFuture Generation Computer Systems10.1016/j.future.2024.107511163(107511)Online publication date: Feb-2025
    • (2024)Towards Full Stack Adaptivity in Permissioned BlockchainsProceedings of the VLDB Endowment10.14778/3641204.364121617:5(1073-1080)Online publication date: 1-Jan-2024
    • (2024)FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673076(1022-1031)Online publication date: 12-Aug-2024
    • (2024)Optimization of Convolution Operators' Backpropagation for Domestic Accelerators: Convolution Operators: Convolution kernel operation Backpropagation optimization to Enhance Performance and Efficiency in Convolutional Neural NetworkProceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering10.1145/3672758.3672817(358-363)Online publication date: 26-Jan-2024
    • (2024)IMESH: A DSL for Mesh ProcessingACM Transactions on Graphics10.1145/366218143:5(1-17)Online publication date: 25-Jun-2024
    • (2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
    • (2024)A Verified Compiler for a Functional Tensor LanguageProceedings of the ACM on Programming Languages10.1145/36563908:PLDI(320-342)Online publication date: 20-Jun-2024
    • (2024)BOOM: Use your Desktop to Accurately Predict the Performance of Large Deep Neural NetworksProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676950(284-296)Online publication date: 14-Oct-2024
    • (2024)Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN DeploymentProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656631(498-510)Online publication date: 30-May-2024
    • (2024)Scheduling for Cyber-Physical Systems with Heterogeneous Processing Units under Real-World ConstraintsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656625(298-311)Online publication date: 30-May-2024
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media