DianNao family: energy-efficient hardware accelerators for machine learning

Published: 28 October 2016

Abstract

Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems (from embedded systems to data centers). As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.
While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can void its throughput, energy, or cost advantages (an Amdahl's law effect). Memory should therefore be a first-order concern in accelerator design, just as it is in processor design, rather than an element factored in only as a second step. In this article, we introduce a series of hardware accelerators (i.e., the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) can achieve a speedup of 450.65x over a GPU and reduce energy consumption by 150.31x on average.
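
As a rough illustration of the Amdahl's law effect mentioned above (the fraction f and speedup factor s below are hypothetical, not taken from the article): if a fraction f of execution time goes to computation that an accelerator speeds up by a factor s, while memory-transfer time is left untouched, the overall speedup is bounded by

\[ S_{\mathrm{overall}} = \frac{1}{(1 - f) + f/s} \]

For example, with f = 0.8 and s = 100, the whole-application speedup is only 1/(0.2 + 0.8/100) ≈ 4.8x, no matter how efficient the computational primitives are. This bound is why memory transfers must be treated as a first-order concern in accelerator design.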

Published In

Communications of the ACM, Volume 59, Issue 11 (November 2016), 118 pages
ISSN: 0001-0782; EISSN: 1557-7317
DOI: 10.1145/3013530
Editor: Moshe Y. Vardi
Publisher: Association for Computing Machinery, New York, NY, United States
