Abstract
The 32-bit floating-point format is commonly used to develop and train deep neural networks. Performing training and inference in numerical formats optimized for deep learning can bring substantial gains in performance and energy efficiency, but training and inference with low-bit neural networks remain challenging. In this study, we propose a sorting-based method that preserves accuracy in numerical formats with a small number of bits. We evaluated the method on convolutional neural networks, including AlexNet. With our method, an 11-bit format matches the accuracy of the IEEE 32-bit format on our convolutional neural network, and a 10-bit format matches it on AlexNet. These results suggest that the sorting method is a promising approach for low-precision computation.
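To illustrate the principle behind sorted accumulation, in low-precision floating point, adding operands in ascending order of magnitude loses less information than adding them in an arbitrary order, the following Python sketch simulates half-precision (float16) accumulation with NumPy. It is a minimal illustration of the idea under these assumptions, not the exact algorithm, bit widths, or networks evaluated in the paper.

import numpy as np

def sum_low_precision(values):
    # Accumulate entirely in IEEE half precision (float16);
    # every partial sum is rounded back to 16 bits.
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return float(acc)

# One large operand followed by many small ones: a hard case for naive accumulation.
values = [256.0] + [0.1] * 1000          # exact sum = 356.0

naive = sum_low_precision(values)                         # original order
sorted_sum = sum_low_precision(sorted(values, key=abs))   # ascending magnitude

print("exact :", 356.0)
print("naive :", naive)        # the 0.1 terms are absorbed by the large partial sum
print("sorted:", sorted_sum)   # small terms accumulate before meeting the large one

With the large value placed first, each small contribution falls below half a unit in the last place of the running sum and is rounded away entirely; summing the same operands smallest-first recovers most of the exact result.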
Data Availability
Available on request
Funding
The authors received no financial support from any organization for the submitted work.
Author information
Contributions
All authors contributed equally to the writing of the manuscript. AD and JKK developed the mathematics and models under the supervision of Dr. D.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Ethical Approval
This paper does not contain any studies involving human or animal subjects.
Consent to participate
All authors consent to participate.
Consent to publish
All authors consent to the publication of this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dehghanpour, A., Khodamoradi Kordestani, J. & Dehyadegari, M. Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers. Neural Process Lett 55, 12061–12078 (2023). https://doi.org/10.1007/s11063-023-11409-8