Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs

Published: 09 September 2023

Abstract

Traditional High-Level Synthesis (HLS) enables rapid prototyping of hardware accelerators without coding in Hardware Description Languages (HDLs). However, this approach does not cope well with mapping large applications, such as entire deep neural networks, onto a single Field-Programmable Gate Array (FPGA) device: it produces designs that are inefficient or that do not fit onto the FPGA due to resource constraints.
This work proposes to shrink generated designs through coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system aim to produce valid design points and to remove redundant hardware. These optimizations make it feasible to fit entire neural networks onto FPGAs and yield performance competitive with running specialized kernels for each layer.
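
The central idea, folding repeated operators in a functional IR into a single, time-multiplexed hardware instance, can be illustrated with a toy rewrite. The following Scala sketch is a hypothetical illustration: the Expr IR, the Shared node, and the share pass are invented for this example and are not the paper's actual implementation. Repeated applications of the same function collapse into one Shared node, modelling one hardware module reused across layers rather than one module per call site.

// Hypothetical toy IR; not the paper's actual representation.
sealed trait Expr
case class Input(name: String) extends Expr
// Apply: each occurrence would normally become its own hardware instance.
case class Apply(fun: String, arg: Expr) extends Expr
// Shared: a single instance of `fun`, time-multiplexed over several inputs.
case class Shared(fun: String, args: Seq[Expr]) extends Expr

object ShareRewrite {
  // Fold repeated applications of the same function into one Shared node,
  // modelling the removal of redundant hardware through function sharing.
  def share(pipeline: Seq[Expr]): Seq[Expr] = {
    val apps = pipeline.collect { case a: Apply => a }
    val rest = pipeline.filterNot(apps.contains)
    val rewritten = apps.groupBy(_.fun).toSeq.map {
      case (_, Seq(only)) => only                       // used once: keep as-is
      case (f, uses)      => Shared(f, uses.map(_.arg)) // used n times: share
    }
    rewritten ++ rest
  }

  def main(args: Array[String]): Unit = {
    // Three layers, two of which reuse the same convolution operator.
    val net: Seq[Expr] = Seq(
      Apply("conv3x3", Input("x0")),
      Apply("conv3x3", Input("x1")),
      Apply("relu",    Input("x2")))
    share(net).foreach(println)
    // Prints Shared(conv3x3,List(Input(x0), Input(x1))) and Apply(relu,Input(x2)).
  }
}

In a real flow, a Shared node would then be lowered to one HLS kernel plus a small scheduler that multiplexes its inputs; automating that kind of coarse-grained control is what the compiler passes described above provide.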


Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s (Special Issue: ESWEEK 2023), October 2023, 1394 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3614235
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2023
Accepted: 13 July 2023
Revised: 02 June 2023
Received: 23 March 2023
Published in TECS Volume 22, Issue 5s

Author Tags

  1. High-level synthesis
  2. neural networks
  3. functional IRs
  4. rewrite rules

Qualifiers

  • Research-article
