research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Authors:

Jonathan Ragan-Kelley,

Connelly Barnes,

Saman Amarasinghe

ACM SIGPLAN Notices, Volume 48, Issue 6

Pages 519 - 530

https://doi.org/10.1145/2499370.2462176

Published: 16 June 2013

Abstract

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
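The tension described above can be made concrete with a toy example. The following is a minimal Python sketch, not Halide code and not taken from the paper: it assumes a hypothetical one-dimensional pipeline of two 3-tap box filters and evaluates it three ways, counting how many first-stage values are computed and how many must be held at once. The function names (`stage1`, `breadth_first`, `fully_fused`, `tiled`) are illustrative only.

```python
def stage1(inp, i):
    """First stage: 3-tap box filter over the input at position i."""
    return (inp[i] + inp[i + 1] + inp[i + 2]) / 3.0


def breadth_first(inp):
    """Compute all of stage 1, then all of stage 2.

    No recomputation, but the whole intermediate is stored (poor locality).
    Returns (output, stage-1 evaluations, peak intermediate size).
    """
    n_mid = len(inp) - 2
    mid = [stage1(inp, i) for i in range(n_mid)]
    out = [(mid[i] + mid[i + 1] + mid[i + 2]) / 3.0 for i in range(n_mid - 2)]
    return out, n_mid, n_mid


def fully_fused(inp):
    """Recompute the three stage-1 values needed by every output.

    Minimal intermediate storage and fully independent outputs,
    at the price of roughly 3x redundant stage-1 work.
    """
    n_out = len(inp) - 4
    out, evals = [], 0
    for i in range(n_out):
        a, b, c = stage1(inp, i), stage1(inp, i + 1), stage1(inp, i + 2)
        evals += 3
        out.append((a + b + c) / 3.0)
    return out, evals, 3


def tiled(inp, tile):
    """Compute stage 1 per tile of outputs, including a 2-element halo.

    Tiles are independent (so they could run in parallel), the intermediate
    is tile-sized, and only the halo is recomputed between neighbours.
    """
    n_out = len(inp) - 4
    out, evals, peak = [], 0, 0
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        mid = [stage1(inp, i) for i in range(start, stop + 2)]
        evals += len(mid)
        peak = max(peak, len(mid))
        out.extend((mid[j] + mid[j + 1] + mid[j + 2]) / 3.0
                   for j in range(stop - start))
    return out, evals, peak
```

All three strategies produce the same output. For a 20-sample input and a tile size of 4, they perform 18, 48, and 24 first-stage evaluations while holding 18, 3, and 6 intermediate values respectively, which is the three-way tradeoff in miniature.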

We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
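The idea of searching over schedules for a fixed algorithm can be sketched in a few lines. The Python below is only an illustration under stated assumptions: the schedule is reduced to a hypothetical pair (tile size, parallel flag), the cost model and its constants (`N_OUT`, `CACHE_ELEMS`, `CORES`) are invented for the example, and plain random sampling stands in for whatever stochastic search the authors actually use. A real autotuner would time compiled code rather than evaluate a formula.

```python
import math
import random
from itertools import product

# Hypothetical problem and machine parameters (assumptions, not from the paper).
N_OUT = 1024       # number of output samples
CACHE_ELEMS = 64   # intermediate values that fit in fast memory
CORES = 4          # available parallel workers


def cost(schedule):
    """Made-up cost of a (tile_size, parallel) schedule for a two-stage stencil."""
    tile, parallel = schedule
    n_tiles = math.ceil(N_OUT / tile)
    work = n_tiles * (tile + 2)                      # tile plus 2-element halo
    miss_penalty = 4.0 if tile + 2 > CACHE_ELEMS else 1.0
    speedup = min(CORES, n_tiles) if parallel else 1
    return work * miss_penalty / speedup


# The schedule space: power-of-two tile sizes, with or without parallelism.
SPACE = list(product([2 ** k for k in range(11)], [False, True]))


def random_search(space, cost_fn, samples, seed=0):
    """Sample schedules uniformly at random and keep the cheapest one seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(samples):
        candidate = rng.choice(space)
        if best is None or cost_fn(candidate) < cost_fn(best):
            best = candidate
    return best
```

Calling `random_search(SPACE, cost, samples=50)` returns one (tile size, parallel flag) pair from `SPACE`. The algorithm being scheduled never appears in the search loop, which reflects the separation between algorithm and schedule that the abstract describes.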

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM SIGPLAN Notices Volume 48, Issue 6

PLDI '13

June 2013

515 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2499370

PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2013
546 pages
ISBN:9781450320146
DOI:10.1145/2491956
General Chair: Hans-J. Boehm, HP Labs
Program Chair: Cormac Flanagan, University of California, Santa Cruz

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
