Low-bit Quantization of Neural Networks for Efficient Inference

Choukroun, Yoni; Kravchik, Eli; Yang, Fan; Kisilev, Pavel

arXiv:1902.06822 (cs)

[Submitted on 18 Feb 2019 (v1), last revised 25 Mar 2019 (this version, v2)]

Abstract: Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements, which makes real-time implementation on resource-limited hardware a challenging task. One popular approach to address this challenge is to perform low-bit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in accuracy, and often requires retraining the network or resorting to higher bit precision. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit precision inference without the need for full network retraining. The main contributions of our approach are the optimization of the constrained MSE problem at each layer of the network, the hardware-aware partitioning of the network parameters, and the use of multiple low-precision quantized tensors for poorly approximated layers. The proposed approach allows 4-bit integer (INT4) quantization for deployment of pretrained models on limited hardware resources. Multiple experiments on various network architectures show that the suggested method yields state-of-the-art results with minimal loss of task accuracy.
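To make the MMSE formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes symmetric linear quantization of a flat list of weights, picks the scale by a simple grid search over the reconstruction MSE (the abstract does not say which solver the paper uses), and reads "multiple low precision quantized tensors" as quantizing the residual of a first INT4 approximation with a second INT4 tensor. All function names (`quantize`, `mmse_scale`, `dual_quantize`) and the `steps` parameter are hypothetical.

```python
import random


def quantize(values, scale, bits=4):
    """Symmetric linear quantization: round v/scale, clip to [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [scale * q for q in codes]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mmse_scale(values, bits=4, steps=200):
    """Grid-search the scale that minimizes the reconstruction MSE.

    Assumption: a plain grid search stands in for whatever optimizer
    the paper applies to its constrained MSE problem.
    """
    qmax = 2 ** (bits - 1) - 1
    max_scale = max(abs(v) for v in values) / qmax
    if max_scale == 0.0:
        return 1.0, 0.0  # all-zero input is represented exactly
    best_scale, best_err = max_scale, float("inf")
    for i in range(1, steps + 1):
        scale = max_scale * (i / steps)  # i == steps gives max-abs scaling
        err = mse(values, dequantize(quantize(values, scale, bits), scale))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


def dual_quantize(values, bits=4):
    """Approximate values as the sum of two low-bit tensors.

    Assumption: the second tensor quantizes the residual of the first.
    """
    s1, _ = mmse_scale(values, bits)
    first = dequantize(quantize(values, s1, bits), s1)
    residual = [v - a for v, a in zip(values, first)]
    s2, _ = mmse_scale(residual, bits)
    second = dequantize(quantize(residual, s2, bits), s2)
    return [a + b for a, b in zip(first, second)]


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
    scale, err = mmse_scale(weights)
    print("MMSE scale:", scale, "MSE:", err)
    print("dual-tensor MSE:", mse(weights, dual_quantize(weights)))
```

Because the search grid includes the max-abs scale, the selected scale can never reconstruct worse than naive max-abs scaling. On bell-shaped weight distributions it typically clips a few outliers in exchange for a finer step size.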

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as:	arXiv:1902.06822 [cs.LG]
	(or arXiv:1902.06822v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1902.06822

Submission history

From: Yoni Choukroun
[v1] Mon, 18 Feb 2019 22:28:34 UTC (1,605 KB)
[v2] Mon, 25 Mar 2019 08:12:15 UTC (1,535 KB)
