
Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers

Published in: Neural Processing Letters


Abstract

A 32-bit floating-point format is commonly used to develop and train deep neural networks. Performing training and inference in number formats tailored to deep learning can yield substantial performance and energy-efficiency gains, but training and running inference with low-bit neural networks remains a significant challenge. In this study, we propose a sorting method that preserves accuracy in numerical formats with a low number of bits. We evaluated the method on convolutional neural networks, including AlexNet. With our method, an 11-bit format matches the accuracy of the IEEE 32-bit format on our convolutional neural network, and a 10-bit format matches it on AlexNet. These results suggest that the sorting method is a promising approach for low-precision computation.
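The abstract does not spell out the sorting step itself; the sketch below only illustrates the general principle of sorted low-precision accumulation and is not the authors' implementation. It assumes summands are ordered by ascending magnitude before being accumulated in a low-bit format, with NumPy's float16 standing in for the paper's low-bit formats; the function names naive_low_precision_sum and sorted_low_precision_sum are illustrative.

```python
import numpy as np


def naive_low_precision_sum(values, dtype=np.float16):
    """Accumulate values in input order, rounding to `dtype` after every addition."""
    acc = dtype(0.0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc


def sorted_low_precision_sum(values, dtype=np.float16):
    """Sort summands by ascending magnitude, then accumulate in `dtype`.

    Adding the small-magnitude terms first keeps the running sum closer in
    scale to the next addend, which limits the rounding error introduced by
    a format with few mantissa bits.
    """
    ordered = sorted(values, key=abs)
    acc = dtype(0.0)
    for v in ordered:
        acc = dtype(acc + dtype(v))
    return acc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A dot-product-like workload with terms of mixed magnitude.
    values = rng.normal(size=4096) * rng.choice([1e-3, 1.0], size=4096)
    print("float64 reference:", float(np.sum(values)))
    print("float16, unsorted:", float(naive_low_precision_sum(values)))
    print("float16, sorted  :", float(sorted_low_precision_sum(values)))
```

Comparing the sorted and unsorted float16 accumulations against the float64 reference shows how ordering alone can recover accuracy lost to a short mantissa, which is the effect the paper exploits at the level of network computations.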




Data Availability

Available on request


Funding

The authors received no financial support from any organization for the submitted work.

Author information


Contributions

All of the authors contributed equally to the writing of the manuscript; AD and JKK developed the mathematics and models under the supervision of Dr. D.

Corresponding author

Correspondence to Masoud Dehyadegari.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical Approval

This paper has no studies involving human or animal subjects.

Consent to participate

All authors consent to participate.

Consent to publish

All authors consent to the publication of this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dehghanpour, A., Khodamoradi Kordestani, J. & Dehyadegari, M. Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers. Neural Process Lett 55, 12061–12078 (2023). https://doi.org/10.1007/s11063-023-11409-8
