
A Small-Footprint Accelerator for Large-Scale Neural Networks

Published: 22 May 2015
  • Abstract

    Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope.
    Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy.
    We show that it is possible to design an accelerator with high throughput, capable of performing 452 GOP/s of key NN operations (such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster and reduces total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
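
    To make the throughput figure concrete: the GOP/s count tallies the elementary multiplications and additions that dominate CNN/DNN layers. Below is a minimal, illustrative NumPy sketch of a fully connected layer and its operation count; the layer sizes and function names are assumptions chosen for illustration, not the accelerator's actual datapath.

```python
import numpy as np

# Illustrative sketch (not the accelerator's implementation) of the per-layer
# computation whose multiplies and adds a GOP/s figure counts: each output
# neuron is a weighted sum of its inputs followed by an activation function.
def fully_connected_layer(inputs, weights, bias):
    # One synaptic-weight multiplication per (input, output) pair and
    # one addition per product accumulated into a neuron's output.
    pre_activation = weights @ inputs + bias
    return np.tanh(pre_activation)  # sigmoid-like activation

# Layer sizes below are assumptions for the example only.
num_inputs, num_outputs = 256, 128
x = np.random.rand(num_inputs).astype(np.float32)
W = np.random.rand(num_outputs, num_inputs).astype(np.float32)
b = np.zeros(num_outputs, dtype=np.float32)

y = fully_connected_layer(x, W, b)
# Roughly num_outputs * num_inputs multiplications plus the same number of
# additions per layer; these are the "key NN operations" counted above.
ops = 2 * num_outputs * num_inputs
print(f"{ops} key NN operations for one layer, output shape {y.shape}")
```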



      Published In

      ACM Transactions on Computer Systems, Volume 33, Issue 2
      June 2015, 86 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/2785582

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 May 2015
      Accepted: 01 December 2014
      Revised: 01 November 2014
      Received: 01 September 2014
      Published in TOCS Volume 33, Issue 2


      Author Tags

      1. Hardware accelerator
      2. convolutional neural network
      3. deep learning
      4. deep neural network

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • NSF of China
      • Strategic Priority Research Program of the CAS
      • 973 Program of China
      • French ANR MHANN and NEMESIS
      • International Collaboration Key Program of the CAS
      • Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
      • 10,000 and 1,000 talent programs
      • Google Faculty Research Award
