DianNao family: energy-efficient hardware accelerators for machine learning

Published: 28 October 2016

Abstract

Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems (from embedded systems to data centers). As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.
While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can void its throughput, energy, or cost advantages (an Amdahl's law effect). Memory should therefore be a first-order concern in accelerator design, just as it is in processor design, rather than an element factored in only as a second step. In this article, we introduce a series of hardware accelerators (i.e., the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) can achieve a speedup of 450.65x over a GPU and reduce energy consumption by 150.31x on average.
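
As a rough illustration of the Amdahl's law effect mentioned above (the fraction f and speedup factor s below are hypothetical, not taken from the article): if a fraction f of execution time goes to computation that an accelerator speeds up by a factor s, while memory-transfer time is left untouched, the overall speedup is bounded by

\[ S_{\mathrm{overall}} = \frac{1}{(1 - f) + f/s} \]

For example, with f = 0.8 and s = 100, the whole-application speedup is only 1/(0.2 + 0.8/100) ≈ 4.8x, no matter how efficient the computational primitives are. This bound is why memory transfers must be treated as a first-order concern in accelerator design.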

Published In

Communications of the ACM, Volume 59, Issue 11 (November 2016), 118 pages
ISSN: 0001-0782; EISSN: 1557-7317
DOI: 10.1145/3013530
Editor: Moshe Y. Vardi
Publisher: Association for Computing Machinery, New York, NY, United States
