Abstract
The 32-bit floating-point format is commonly used to develop and train deep neural networks. Performing training and inference in numerical formats optimized for deep learning can bring substantial gains in performance and energy efficiency, but training and inference with low-bit neural networks remain challenging. In this study, we propose a sorting-based method that preserves accuracy in numerical formats with a small number of bits. We evaluated the method on convolutional neural networks, including AlexNet. With our method, an 11-bit format matches the accuracy of the IEEE 32-bit format on our convolutional neural network, and a 10-bit format matches it on AlexNet. These results suggest that the sorting method is a promising approach for low-precision computation.
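To illustrate the principle behind sorted accumulation, in low-precision floating point, adding operands in ascending order of magnitude loses less information than adding them in an arbitrary order, the following Python sketch simulates half-precision (float16) accumulation with NumPy. It is a minimal illustration of the idea under these assumptions, not the exact algorithm, bit widths, or networks evaluated in the paper.

import numpy as np

def sum_low_precision(values):
    # Accumulate entirely in IEEE half precision (float16);
    # every partial sum is rounded back to 16 bits.
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return float(acc)

# One large operand followed by many small ones: a hard case for naive accumulation.
values = [256.0] + [0.1] * 1000          # exact sum = 356.0

naive = sum_low_precision(values)                         # original order
sorted_sum = sum_low_precision(sorted(values, key=abs))   # ascending magnitude

print("exact :", 356.0)
print("naive :", naive)        # the 0.1 terms are absorbed by the large partial sum
print("sorted:", sorted_sum)   # small terms accumulate before meeting the large one

With the large value placed first, each small contribution falls below half a unit in the last place of the running sum and is rounded away entirely; summing the same operands smallest-first recovers most of the exact result.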
Data Availability
Available on request
Funding
The authors received no financial support from any organization for the submitted work.
Author information
Contributions
All authors contributed equally to the writing of the manuscript. AD and JKK developed the mathematics and models under the supervision of Dr. D.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Ethical Approval
This paper does not contain any studies involving human or animal subjects.
Consent to participate
All authors consent to participate.
Consent to publish
All authors consent to the publication of this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dehghanpour, A., Khodamoradi Kordestani, J. & Dehyadegari, M. Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers. Neural Process Lett 55, 12061–12078 (2023). https://doi.org/10.1007/s11063-023-11409-8