DOI: 10.1145/3394885.3431554
Research article

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

Published: 29 January 2021

Abstract

ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication (MVM) operations with low latency and energy consumption. However, these crossbars require analog-to-digital converters (ADCs), which constitute a significant fraction of the cost of MVM operations. The ADC overhead can be mitigated via partial sum quantization. However, prior quantization flows for DNN inference accelerators do not consider partial sum quantization, which is not highly relevant to traditional digital architectures. To address this issue, we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators in which weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer. We also propose an automated quantization flow powered by deep reinforcement learning to search for the best quantization configuration in the large design space. Our evaluation shows that the proposed mixed precision quantization scheme and quantization flow reduce inference latency and energy consumption by up to 3.89x and 4.84x, respectively, while losing only 1.18% in DNN inference accuracy.
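To make the partial sum quantization idea concrete, the sketch below simulates a crossbar-style MVM in which weights, inputs, and the per-crossbar partial sums (the values an ADC would digitize) are each quantized to independent bit widths. This is a minimal illustration under assumptions, not the paper's implementation: the uniform symmetric quantizer, the 128-row crossbar size, and all function names are made up for the example.

    # Minimal sketch (not the paper's code): how per-layer bit widths for weights,
    # inputs, and crossbar partial sums could interact in a ReRAM-style MVM.
    import numpy as np

    def quantize(x, bits):
        """Uniform symmetric quantization to the given bit width (assumed scheme)."""
        if bits >= 32:
            return x
        scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
        return np.round(x / scale) * scale

    def crossbar_mvm(weights, inputs, w_bits, in_bits, psum_bits, xbar_rows=128):
        """Matrix-vector product with weights/inputs quantized up front and the
        partial sum of every crossbar segment re-quantized (emulating a low-bit ADC)."""
        w_q = quantize(weights, w_bits)
        x_q = quantize(inputs, in_bits)
        out = np.zeros(w_q.shape[1])
        # Split the input dimension across crossbars of xbar_rows rows each;
        # each segment's partial sum passes through the ADC before accumulation.
        for start in range(0, w_q.shape[0], xbar_rows):
            psum = x_q[start:start + xbar_rows] @ w_q[start:start + xbar_rows]
            out += quantize(psum, psum_bits)
        return out

    rng = np.random.default_rng(0)
    W, x = rng.standard_normal((512, 64)), rng.standard_normal(512)
    ref = x @ W
    for cfg in [(8, 8, 8), (4, 4, 6), (4, 4, 3)]:  # (w_bits, in_bits, psum_bits)
        err = np.linalg.norm(crossbar_mvm(W, x, *cfg) - ref) / np.linalg.norm(ref)
        print(cfg, f"relative error = {err:.4f}")

Sweeping the partial-sum width in this toy setting shows the accuracy-versus-ADC-cost trade-off that a per-layer search, such as the reinforcement-learning flow described in the abstract, has to navigate.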



Information

Published In

ASPDAC '21: Proceedings of the 26th Asia and South Pacific Design Automation Conference
January 2021
930 pages
ISBN:9781450379991
DOI:10.1145/3394885
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2021


Author Tags

  1. DNN inference accelerators
  2. Mixed precision quantization
  3. ReRAM

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASPDAC '21

Acceptance Rates

ASPDAC '21 Paper Acceptance Rate: 111 of 368 submissions (30%)
Overall Acceptance Rate: 466 of 1,454 submissions (32%)

Article Metrics

  • Downloads (Last 12 months): 140
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 02 Feb 2025

Cited By
  • (2024) Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision. ACM Transactions on Embedded Computing Systems, 24(1):1-100. DOI: 10.1145/3701728. Online publication date: 24-Oct-2024.
  • (2024) FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):3852-3863. DOI: 10.1109/TCAD.2024.3439488. Online publication date: Nov-2024.
  • (2024) CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(1):189-202. DOI: 10.1109/TCAD.2023.3298705. Online publication date: Jan-2024.
  • (2024) PyAIM: Pynq-Based Scalable Analog In-Memory Computing Prototyping Platform. 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pages 174-178. DOI: 10.1109/AICAS59952.2024.10595868. Online publication date: 22-Apr-2024.
  • (2024) Neural architecture search for in-memory computing-based deep learning accelerators. Nature Reviews Electrical Engineering, 1(6):374-390. DOI: 10.1038/s44287-024-00052-7. Online publication date: 20-May-2024.
  • (2024) A memristive all-inclusive hypernetwork for parallel analog deployment of full search space architectures. Neural Networks, 175:106312. DOI: 10.1016/j.neunet.2024.106312. Online publication date: Jul-2024.
  • (2023) Enabling Neuromorphic Computing for Artificial Intelligence with Hardware-Software Co-Design. Neuromorphic Computing. DOI: 10.5772/intechopen.111963. Online publication date: 15-Nov-2023.
  • (2023) APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators. IEEE Transactions on Parallel and Distributed Systems, 34(9):2498-2511. DOI: 10.1109/TPDS.2023.3290010. Online publication date: Sep-2023.
  • (2023) Towards Efficient In-Memory Computing Hardware for Quantized Neural Networks: State-of-the-Art, Open Challenges and Perspectives. IEEE Transactions on Nanotechnology, 22:377-386. DOI: 10.1109/TNANO.2023.3293026. Online publication date: 1-Jan-2023.
  • (2023) E-UPQ: Energy-Aware Unified Pruning-Quantization Framework for CIM Architecture. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 13(1):21-32. DOI: 10.1109/JETCAS.2023.3242761. Online publication date: Mar-2023.