
A Small-Footprint Accelerator for Large-Scale Neural Networks

Published: 22 May 2015
  • Abstract

    Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope.
    Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy.
    We show that it is possible to design an accelerator with high throughput, capable of performing 452 GOP/s of key NN operations (such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster and reduces total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
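
    To make the throughput figure concrete: the GOP/s count tallies the elementary multiplications and additions that dominate CNN/DNN layers. Below is a minimal, illustrative NumPy sketch of a fully connected layer and its operation count; the layer sizes and function names are assumptions chosen for illustration, not the accelerator's actual datapath.

```python
import numpy as np

# Illustrative sketch (not the accelerator's implementation) of the per-layer
# computation whose multiplies and adds a GOP/s figure counts: each output
# neuron is a weighted sum of its inputs followed by an activation function.
def fully_connected_layer(inputs, weights, bias):
    # One synaptic-weight multiplication per (input, output) pair and
    # one addition per product accumulated into a neuron's output.
    pre_activation = weights @ inputs + bias
    return np.tanh(pre_activation)  # sigmoid-like activation

# Layer sizes below are assumptions for the example only.
num_inputs, num_outputs = 256, 128
x = np.random.rand(num_inputs).astype(np.float32)
W = np.random.rand(num_outputs, num_inputs).astype(np.float32)
b = np.zeros(num_outputs, dtype=np.float32)

y = fully_connected_layer(x, W, b)
# Roughly num_outputs * num_inputs multiplications plus the same number of
# additions per layer; these are the "key NN operations" counted above.
ops = 2 * num_outputs * num_inputs
print(f"{ops} key NN operations for one layer, output shape {y.shape}")
```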



      Published In

      ACM Transactions on Computer Systems, Volume 33, Issue 2
      June 2015, 86 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/2785582

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 May 2015
      Accepted: 01 December 2014
      Revised: 01 November 2014
      Received: 01 September 2014
      Published in TOCS Volume 33, Issue 2


      Author Tags

      1. Hardware accelerator
      2. convolutional neural network
      3. deep learning
      4. deep neural network

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • NSF of China
      • Strategic Priority Research Program of the CAS
      • 973 Program of China
      • French ANR MHANN and NEMESIS
      • International Collaboration Key Program of the CAS
      • Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
      • 10,000 and 1,000 talent programs
      • Google Faculty Research Award
