DOI: 10.1145/2790060.2790063

Compiling high performance recursive filters

Published: 07 August 2015

Abstract

Infinite impulse response (IIR) filters, also called recursive filters, are essential for image processing because they turn expensive large-footprint convolutions into operations that have a constant cost per pixel regardless of kernel size. However, their recursive nature constrains the order in which pixels can be computed, severely limiting both parallelism within a filter and memory locality across multiple filters. Prior research has developed algorithms that can compute IIR filters on image tiles. Using a divide-and-recombine strategy inspired by parallel prefix sum, they expose greater parallelism and exploit producer-consumer locality in pipelines of IIR filters over multi-dimensional images. While the principles are simple, it is hard, given a recursive filter, to derive a corresponding tile-parallel algorithm, and even harder to implement and debug it.
We show that parallel and locality-aware implementations of IIR filter pipelines can be obtained through program transformations, which we mechanize through a domain-specific compiler. We show that the composition of a small set of transformations suffices to cover the space of possible strategies. We also demonstrate that the tiled implementations can be automatically scheduled in hardware-specific manners using a small set of generic heuristics. The programmer specifies the basic recursive filters, and the choice of transformation requires only a few lines of code. Our compiler then generates high-performance implementations that are an order of magnitude faster than standard GPU implementations, and that outperform hand-tuned tiled implementations of specialized algorithms, which require orders of magnitude more programming effort: a few lines of code with our compiler instead of a few thousand lines per pipeline.
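
To make the divide-and-recombine idea from the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's compiler or its generated code. It assumes the simplest possible case, a 1D first-order causal filter y[i] = x[i] + a * y[i-1] with zero initial condition, and the function names iir_sequential and iir_tiled are made up for this illustration. Each tile is filtered independently, a short pass carries the last output across tile boundaries, and a final pass adds the missing contribution to every sample.

```python
def iir_sequential(x, a):
    """Reference filter: y[i] = x[i] + a * y[i-1], with y[-1] = 0."""
    y = []
    prev = 0.0
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y


def iir_tiled(x, a, tile):
    """Same filter computed in three phases over tiles of length `tile` (> 0)."""
    tiles = [x[s:s + tile] for s in range(0, len(x), tile)]

    # Phase 1: filter every tile as if nothing preceded it.
    # The tiles are independent, so this phase could run in parallel.
    local = [iir_sequential(t, a) for t in tiles]

    # Phase 2: sequential pass over tile boundaries only.
    # carries[j] is the true filter output just before tile j starts.
    carries = []
    carry = 0.0
    for part in local:
        carries.append(carry)
        carry = part[-1] + (a ** len(part)) * carry

    # Phase 3: fix up each tile. Sample k of a tile is missing the term
    # a**(k+1) * carry; every sample can be corrected independently.
    out = []
    for part, c in zip(local, carries):
        out.extend(p + (a ** (k + 1)) * c for k, p in enumerate(part))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
reference = iir_sequential(signal, 0.5)
tiled = iir_tiled(signal, 0.5, tile=3)
```

With a = 1 the filter reduces to a running sum, which is the parallel prefix sum case the abstract mentions. Higher-order filters, anticausal passes, and 2D images need more bookkeeping than this sketch shows.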

Supplementary Material

ZIP File (p85-chaurasia.zip)

Published In

HPG '15: Proceedings of the 7th Conference on High-Performance Graphics
August 2015
112 pages
ISBN: 9781450337076
DOI: 10.1145/2790060
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU computation
  2. IIR filter
  3. compiler
  4. domain-specific language
  5. high performance
  6. image processing
  7. memory locality
  8. parallelism

Qualifiers

  • Research-article

Conference

HPG '15: High Performance Graphics
August 7 - 9, 2015
Los Angeles, California

Acceptance Rates

Overall Acceptance Rate 15 of 44 submissions, 34%

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 10 Oct 2024

Cited By

  • (2024) Compiling Recurrences over Dense and Sparse Arrays. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (250-275). DOI: 10.1145/3649820. Online publication date: 29-Apr-2024
  • (2024) SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide. IEEE Access 12 (7563-7583). DOI: 10.1109/ACCESS.2023.3345660. Online publication date: 2024
  • (2022) High-speed recursive-separable image processing filters. Computer Optics 46:4 (659-665). DOI: 10.18287/2412-6179-CO-1063. Online publication date: Aug-2022
  • (2021) Recursive filter based GPU algorithms in a Data Assimilation scenario. Journal of Computational Science 53 (101339). DOI: 10.1016/j.jocs.2021.101339. Online publication date: Jul-2021
  • (2021) GPU Efficient 1D and 3D recursive filtering. Digital Signal Processing (103076). DOI: 10.1016/j.dsp.2021.103076. Online publication date: Apr-2021
  • (2021) GPU-CUDA Implementation of the Third Order Gaussian Recursive Filter. SN Computer Science 3:1. DOI: 10.1007/s42979-021-00960-7. Online publication date: 19-Nov-2021
  • (2020) Accelerated Gaussian Convolution in a Data Assimilation Scenario. Computational Science – ICCS 2020 (199-211). DOI: 10.1007/978-3-030-50433-5_16. Online publication date: 15-Jun-2020
  • (2019) Accelerating reduction and scan using tensor core units. Proceedings of the ACM International Conference on Supercomputing (46-57). DOI: 10.1145/3330345.3331057. Online publication date: 26-Jun-2019
  • (2019) A Gaussian Recursive Filter Parallel Implementation with Overlapping. 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (641-648). DOI: 10.1109/SITIS.2019.00105. Online publication date: Nov-2019
  • (2018) Efficient Computational Scheduling of Box and Gaussian FIR Filtering for CPU Microarchitecture. 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (875-879). DOI: 10.23919/APSIPA.2018.8659674. Online publication date: Nov-2018
