DOI: 10.1145/3394885.3431554
Research article

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

Published: 29 January 2021

Abstract

ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication (MVM) operations with low latency and energy consumption. However, these crossbars require analog-to-digital converters (ADCs), which constitute a significant fraction of the cost of MVM operations. The ADC overhead can be mitigated via partial sum quantization. However, prior quantization flows for DNN inference accelerators do not consider partial sum quantization, which is not highly relevant to traditional digital architectures. To address this issue, we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators in which weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer. We also propose an automated quantization flow powered by deep reinforcement learning to search for the best quantization configuration in the large design space. Our evaluation shows that the proposed mixed precision quantization scheme and quantization flow reduce inference latency and energy consumption by up to 3.89x and 4.84x, respectively, while losing only 1.18% in DNN inference accuracy.
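To make the partial sum quantization idea concrete, the sketch below simulates a crossbar-style MVM in which weights, inputs, and the per-crossbar partial sums (the values an ADC would digitize) are each quantized to independent bit widths. This is a minimal illustration under assumptions, not the paper's implementation: the uniform symmetric quantizer, the 128-row crossbar size, and all function names are made up for the example.

    # Minimal sketch (not the paper's code): how per-layer bit widths for weights,
    # inputs, and crossbar partial sums could interact in a ReRAM-style MVM.
    import numpy as np

    def quantize(x, bits):
        """Uniform symmetric quantization to the given bit width (assumed scheme)."""
        if bits >= 32:
            return x
        scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
        return np.round(x / scale) * scale

    def crossbar_mvm(weights, inputs, w_bits, in_bits, psum_bits, xbar_rows=128):
        """Matrix-vector product with weights/inputs quantized up front and the
        partial sum of every crossbar segment re-quantized (emulating a low-bit ADC)."""
        w_q = quantize(weights, w_bits)
        x_q = quantize(inputs, in_bits)
        out = np.zeros(w_q.shape[1])
        # Split the input dimension across crossbars of xbar_rows rows each;
        # each segment's partial sum passes through the ADC before accumulation.
        for start in range(0, w_q.shape[0], xbar_rows):
            psum = x_q[start:start + xbar_rows] @ w_q[start:start + xbar_rows]
            out += quantize(psum, psum_bits)
        return out

    rng = np.random.default_rng(0)
    W, x = rng.standard_normal((512, 64)), rng.standard_normal(512)
    ref = x @ W
    for cfg in [(8, 8, 8), (4, 4, 6), (4, 4, 3)]:  # (w_bits, in_bits, psum_bits)
        err = np.linalg.norm(crossbar_mvm(W, x, *cfg) - ref) / np.linalg.norm(ref)
        print(cfg, f"relative error = {err:.4f}")

Sweeping the partial-sum width in this toy setting shows the accuracy-versus-ADC-cost trade-off that a per-layer search, such as the reinforcement-learning flow described in the abstract, has to navigate.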



Information

Published In

ASPDAC '21: Proceedings of the 26th Asia and South Pacific Design Automation Conference
January 2021
930 pages
ISBN:9781450379991
DOI:10.1145/3394885
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2021


Author Tags

  1. DNN inference accelerators
  2. Mixed precision quantization
  3. ReRAM

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASPDAC '21

Acceptance Rates

ASPDAC '21 Paper Acceptance Rate: 111 of 368 submissions (30%)
Overall Acceptance Rate: 466 of 1,454 submissions (32%)

Article Metrics

  • Downloads (Last 12 months): 140
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 02 Feb 2025

Cited By
  • (2024) Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision. ACM Transactions on Embedded Computing Systems, 24(1):1-100. DOI: 10.1145/3701728. Online publication date: 24-Oct-2024.
  • (2024) FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):3852-3863. DOI: 10.1109/TCAD.2024.3439488. Online publication date: Nov-2024.
  • (2024) CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(1):189-202. DOI: 10.1109/TCAD.2023.3298705. Online publication date: Jan-2024.
  • (2024) PyAIM: Pynq-Based Scalable Analog In-Memory Computing Prototyping Platform. 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pages 174-178. DOI: 10.1109/AICAS59952.2024.10595868. Online publication date: 22-Apr-2024.
  • (2024) Neural architecture search for in-memory computing-based deep learning accelerators. Nature Reviews Electrical Engineering, 1(6):374-390. DOI: 10.1038/s44287-024-00052-7. Online publication date: 20-May-2024.
  • (2024) A memristive all-inclusive hypernetwork for parallel analog deployment of full search space architectures. Neural Networks, 175:106312. DOI: 10.1016/j.neunet.2024.106312. Online publication date: Jul-2024.
  • (2023) Enabling Neuromorphic Computing for Artificial Intelligence with Hardware-Software Co-Design. Neuromorphic Computing. DOI: 10.5772/intechopen.111963. Online publication date: 15-Nov-2023.
  • (2023) APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators. IEEE Transactions on Parallel and Distributed Systems, 34(9):2498-2511. DOI: 10.1109/TPDS.2023.3290010. Online publication date: Sep-2023.
  • (2023) Towards Efficient In-Memory Computing Hardware for Quantized Neural Networks: State-of-the-Art, Open Challenges and Perspectives. IEEE Transactions on Nanotechnology, 22:377-386. DOI: 10.1109/TNANO.2023.3293026. Online publication date: 1-Jan-2023.
  • (2023) E-UPQ: Energy-Aware Unified Pruning-Quantization Framework for CIM Architecture. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 13(1):21-32. DOI: 10.1109/JETCAS.2023.3242761. Online publication date: Mar-2023.