DOI: 10.1145/3287624.3287641

TNPU: an efficient accelerator architecture for training convolutional neural networks

Published: 21 January 2019

Abstract

Training large-scale convolutional neural networks (CNNs) is an extremely computation- and memory-intensive task that requires massive computational resources and long training time. Recently, many accelerator solutions have been proposed to improve the performance and efficiency of CNNs. Existing approaches mainly focus on the inference phase of CNNs and can hardly address the new challenges posed by CNN training: the diversity of resource requirements and the bidirectional data dependency between convolutional layers (CVLs) and fully-connected layers (FCLs). To overcome this problem, this paper presents a new accelerator architecture for CNN training, called TNPU, which leverages the complementary resource requirements of CVLs and FCLs. Unlike prior approaches that optimize CVLs and FCLs separately, we orchestrate the computation of CVLs and FCLs in a single computing unit so that they run concurrently, keeping both computing and memory resources highly utilized and thereby boosting performance. We also propose a simplified out-of-order scheduling mechanism to address the bidirectional data dependency in CNN training. Experiments show that TNPU achieves speedups of 1.5x and 1.3x, with average energy reductions of 35.7% and 24.1%, over comparably provisioned state-of-the-art accelerators (DNPU and DaDianNao), respectively.
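The key ideas above (co-scheduling compute-bound CVL work with memory-bound FCL work in one computing unit, and a simplified out-of-order issue mechanism that respects the bidirectional data dependencies of training) can be pictured with a small scheduling sketch. The Python below is a hypothetical illustration only, not the TNPU design: the Task and ooo_schedule names are invented here, and the model reduces each layer's forward or backward pass to a task with explicit dependencies.

```python
# Hypothetical sketch (not from the paper): out-of-order co-scheduling of
# compute-bound convolutional-layer (CVL) tasks with memory-bound
# fully-connected-layer (FCL) tasks under explicit data dependencies.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kind: str                               # "CVL" or "FCL"
    deps: set = field(default_factory=set)  # names of tasks that must finish first

def ooo_schedule(tasks):
    """Each step, issue ready tasks out of program order, preferring to pair
    one CVL task with one FCL task so compute and memory bandwidth stay busy."""
    done, order = set(), []
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if t.deps <= done]
        if not ready:
            raise RuntimeError("dependency cycle among tasks")
        step = []
        for kind in ("CVL", "FCL"):          # try to co-issue one task of each kind
            pick = next((t for t in ready if t.kind == kind), None)
            if pick is not None:
                step.append(pick)
        for t in step:
            done.add(t.name)
            del pending[t.name]
        order.append([t.name for t in step])
    return order

# Toy workload: two inputs each run a CVL stage then an FCL stage. While input
# B's CVL is running, input A's FCL is already ready, so they are co-issued.
tasks = [
    Task("cvl_A", "CVL"),
    Task("fcl_A", "FCL", {"cvl_A"}),
    Task("cvl_B", "CVL"),
    Task("fcl_B", "FCL", {"cvl_B"}),
]
print(ooo_schedule(tasks))   # [['cvl_A'], ['cvl_B', 'fcl_A'], ['fcl_B']]
```

On this toy workload the second step co-issues input B's CVL with input A's FCL, which is the kind of overlap that keeps both compute and memory resources busy.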

References

[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. 2016. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In Proceedings of the 43rd ISCA.
[2] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th ASPLOS.
[3] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of MICRO.
[4] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits (2017).
[5] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd ISCA.
[6] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proceedings of the 43rd ISCA.
[7] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning Both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems. 1135--1143.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
[9] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. 2018. CCR: A Concise Convolution Rule for Sparse Neural Network Accelerators. In 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE). 189--194.
[10] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. 2018. SmartShuttle: Optimizing Off-chip Memory Accesses for Deep Learning Accelerators. In 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE). 343--348.
[11] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In Proceedings of the 23rd HPCA.
[12] Angshuman Parashar et al. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 44th ISCA.
[13] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of FPGA.
[14] D. Shin, J. Lee, J. Lee, J. Lee, and Hoi-Jun Yoo. 2017. An Energy-Efficient Deep Learning Processor with Heterogeneous Multi-core Architecture for Convolutional Neural Networks and Recurrent Neural Networks. In 2017 IEEE Symposium on Low-Power and High-Speed Chips (COOL CHIPS). 1--2.
[15] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of FPGA.
[16] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of FPGA.
[17] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An Accelerator for Sparse Neural Networks. In Proceedings of MICRO.


Information & Contributors

      Published In

      ASPDAC '19: Proceedings of the 24th Asia and South Pacific Design Automation Conference
      January 2019
      794 pages
      ISBN:9781450360074
      DOI:10.1145/3287624
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • IEICE ESS: Institute of Electronics, Information and Communication Engineers, Engineering Sciences Society
      • IEEE CAS
      • IEEE CEDA
      • IPSJ SIG-SLDM: Information Processing Society of Japan, SIG System LSI Design Methodology

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 January 2019

      Author Tags

      1. CNN training
      2. accelerator architecture
      3. convolutional neural networks

      Qualifiers

      • Research-article

      Conference

      ASPDAC '19

      Acceptance Rates

      Overall Acceptance Rate 466 of 1,454 submissions, 32%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

• Downloads (Last 12 months): 59
• Downloads (Last 6 weeks): 3
      Reflects downloads up to 11 Feb 2025

      Citations

      Cited By

• (2024) WinTA: An Efficient Reconfigurable CNN Training Accelerator With Decomposition Winograd. IEEE Transactions on Circuits and Systems I: Regular Papers, 71(2), 634-645. DOI: 10.1109/TCSI.2023.3338471. Online publication date: Feb 2024.
• (2023) An Overview of Energy-Efficient DNN Training Processors. In On-Chip Training NPU - Algorithm, Architecture and SoC Design, 183-210. DOI: 10.1007/978-3-031-34237-0_8. Online publication date: 29 May 2023.
• (2022) THETA: A High-Efficiency Training Accelerator for DNNs With Triple-Side Sparsity Exploration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30(8), 1034-1046. DOI: 10.1109/TVLSI.2022.3175582. Online publication date: Aug 2022.
• (2022) H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11), 4782-4796. DOI: 10.1109/TCAD.2021.3138347. Online publication date: Nov 2022.
• (2022) Energy-Efficient DNN Training Processors on Micro-AI Systems. IEEE Open Journal of the Solid-State Circuits Society, 2, 259-275. DOI: 10.1109/OJSSCS.2022.3219034. Online publication date: 2022.
• (2021) LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1-8. DOI: 10.1109/ICCAD51958.2021.9643567. Online publication date: 1 Nov 2021.
• (2020) A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS45731.2020.9181242. Online publication date: Oct 2020.
• (2020) TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS45731.2020.9181001. Online publication date: Oct 2020.
• (2020) Minimizing Off-Chip Memory Access for Deep Convolutional Neural Network Training. In Parallel Architectures, Algorithms and Programming, 479-491. DOI: 10.1007/978-981-15-2767-8_42. Online publication date: 26 Jan 2020.
