research-article

Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture

Published: 01 January 2021

Abstract

Deep neural networks (DNNs) have been shown to outperform conventional machine learning algorithms across a wide range of applications, e.g., image recognition, object detection, robotics, and natural language processing. However, the high computational complexity of DNNs often necessitates extremely fast and efficient hardware, and the problem worsens as neural networks grow exponentially in size. As a result, customized hardware accelerators have been developed to accelerate DNN processing without sacrificing model accuracy. However, previous accelerator design studies have not fully considered the characteristics of the target applications, which may lead to sub-optimal architecture designs. Conversely, new DNN models have been developed for better accuracy, but their compatibility with the underlying hardware accelerator is often overlooked. In this article, we propose an application-driven framework for architectural design space exploration of DNN accelerators. The framework is based on a hardware analytical model of individual DNN operations and models the accelerator design task as a multi-dimensional optimization problem. We demonstrate that it can be used effectively in application-driven accelerator architecture design: we use the framework to optimize the accelerator configurations for eight representative DNNs and select the configuration with the highest geometric mean performance. The geometric mean performance improvement of the selected DNN configuration relative to the architectural configuration optimized only for each individual DNN ranges from 12.0 to 117.9 percent. Given a target DNN, the framework can generate efficient accelerator design solutions with optimized performance and area. Furthermore, we explore the opportunity to use the framework for accelerator configuration optimization under simultaneous diverse DNN applications.
The framework is also capable of improving neural network models to best fit the underlying hardware resources. We demonstrate that it can be used to analyze the relationship between the operations of the target DNNs and the corresponding accelerator configurations, based on which the DNNs can be tuned for better processing efficiency on the given accelerator without sacrificing accuracy.
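
The selection step described in the abstract (choosing the configuration with the highest geometric mean performance across several target DNNs) can be illustrated with a minimal sketch. The configuration names, DNN names, and throughput values below are hypothetical placeholders, not data from the article, and the helper functions are illustrative rather than part of the authors' framework.

```python
from math import prod

# Hypothetical performance figures (higher is better), for illustration only.
# Outer keys: candidate accelerator configurations; inner keys: target DNNs.
PERF = {
    "config_a": {"dnn_1": 100.0, "dnn_2": 10.0, "dnn_3": 40.0},
    "config_b": {"dnn_1": 60.0, "dnn_2": 30.0, "dnn_3": 50.0},
    "config_c": {"dnn_1": 20.0, "dnn_2": 45.0, "dnn_3": 45.0},
}


def geometric_mean(values):
    """Return the geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return prod(values) ** (1.0 / len(values))


def select_by_geomean(perf):
    """Score each configuration by its geometric mean performance over all DNNs
    and return the name of the best one together with all scores."""
    scores = {name: geometric_mean(per_dnn.values()) for name, per_dnn in perf.items()}
    best = max(scores, key=scores.get)
    return best, scores


def improvement_percent(selected_score, baseline_score):
    """Relative improvement of the selected configuration over a baseline, in percent."""
    return (selected_score / baseline_score - 1.0) * 100.0


if __name__ == "__main__":
    best, scores = select_by_geomean(PERF)
    print("selected configuration:", best)
    for name, score in scores.items():
        if name != best:
            gain = improvement_percent(scores[best], score)
            print(f"improvement over {name}: {gain:.1f}%")
```

In this toy data, config_a is the fastest on dnn_1 alone, yet it scores poorly on dnn_2, so a balanced configuration ends up with the higher geometric mean. The geometric mean is a natural choice here because it penalizes a configuration that performs very badly on any single network.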



Published In

IEEE Transactions on Computers, Volume 70, Issue 1
Jan. 2021
12 pages

Publisher

IEEE Computer Society, United States
