research-article
Open access

RIPL: A Parallel Image Processing Language for FPGAs

Published: 14 March 2018

Abstract

Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because the implementation of FPGA architectures at the register transfer level is time consuming and error prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without the use of hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks.
This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs.
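To make the idea of a higher-order skeleton concrete, the following is a minimal sketch in plain Python, not RIPL syntax. The names `map_pixel`, `stencil`, and `box_blur` are hypothetical, and clamping window coordinates at the image border is an assumption made for illustration. The point is only that the traversal pattern is fixed by the skeleton while the per-pixel or per-window function is supplied by the programmer.

```python
from typing import Callable, List

Image = List[List[int]]
Window = List[List[int]]


def map_pixel(f: Callable[[int], int], image: Image) -> Image:
    """Pointwise skeleton: apply f independently to every pixel."""
    return [[f(p) for p in row] for row in image]


def stencil(kernel: Callable[[Window], int], image: Image, radius: int = 1) -> Image:
    """Stencil skeleton: apply kernel to the (2r+1)x(2r+1) window around each pixel.

    Border handling (clamping coordinates to the image edge) is an assumption
    of this sketch, not something taken from the article.
    """
    height, width = len(image), len(image[0])

    def clamp(v: int, limit: int) -> int:
        return max(0, min(v, limit - 1))

    out: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                [image[clamp(y + dy, height)][clamp(x + dx, width)]
                 for dx in range(-radius, radius + 1)]
                for dy in range(-radius, radius + 1)
            ]
            row.append(kernel(window))
        out.append(row)
    return out


def box_blur(window: Window) -> int:
    """Example kernel: integer mean of the window."""
    flat = [p for r in window for p in r]
    return sum(flat) // len(flat)


image = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
doubled = map_pixel(lambda p: p * 2, image)
blurred = stencil(box_blur, image)
```

Because each output pixel depends only on a fixed-size neighbourhood, a compiler that knows the skeleton can derive data rates and buffering statically, which is the property the dataflow compilation scheme relies on.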
RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100 MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to 28 FPS for optimized C++. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to 1.1 FPS for optimized C++ on 64 CPU cores, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
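The frame rates quoted above imply the following throughput ratios. This is a small arithmetic sketch using only the figures stated in the abstract; the helper name `speedup` is hypothetical, and the mean shift comparison is across different platforms (FPGA versus 64 CPU cores), so the ratio is indicative only.

```python
def speedup(ripl_fps: float, baseline_fps: float) -> float:
    """Ratio of RIPL throughput to baseline throughput."""
    return ripl_fps / baseline_fps


# Figures taken from the abstract.
saliency_ratio = speedup(71, 28)     # visual saliency: RIPL vs optimized C++
mean_shift_ratio = speedup(7, 1.1)   # mean shift: RIPL vs C++ on 64 CPU cores

print(f"visual saliency: {saliency_ratio:.2f}x")    # prints "visual saliency: 2.54x"
print(f"mean shift:      {mean_shift_ratio:.2f}x")  # prints "mean shift:      6.36x"
```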

Published In

ACM Transactions on Reconfigurable Technology and Systems  Volume 11, Issue 1
Special Section on FCCM 2016 and Regular Papers
March 2018
183 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3178391
Editor: Steve Wilton
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2018
Accepted: 01 December 2017
Revised: 01 November 2017
Received: 01 February 2017
Published in TRETS Volume 11, Issue 1

Author Tags

  1. Cyclo static dataflow
  2. Dataflow
  3. Domain specific languages
  4. FPGA
  5. Hardware accelerators
  6. High level synthesis
  7. Image processing
  8. OpenCV
  9. Parallel processing
  10. RIPL
  11. Semantics

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • EPSRC
