Abstract
This paper proposes a multiple-precision and mixed-precision floating-point fused multiply-accumulate (FMA) unit driven by the practical requirements of high-performance computing (HPC) and artificial intelligence (AI) applications. In addition to the double-precision and single-precision formats used in HPC, the unit supports three low-precision formats dedicated to deep learning: TensorFloat-32, BFloat16, and half-precision. The proposed FMA architecture executes one double-precision operation, two parallel single-precision operations, or four parallel half-precision operations per clock cycle. It also supports mixed-precision FMA operations, in which the products of lower-precision multiplications are accumulated into a higher-precision addend: each cycle it performs either one mixed-precision operation with single-precision multiplication and double-precision addition, or two parallel mixed-precision operations with low-precision (TensorFloat-32, BFloat16, or half-precision) multiplication and single-precision addition. The design combines segmentation and hardware-reuse techniques to trade off performance, in terms of throughput and latency, against area and power. The proposed unit occupies only 17.0% more area than a standard double-precision FMA implementation while supporting all of these multiple-precision and mixed-precision operations. Compared with the state-of-the-art multiple-precision FMA design, it supports more precision formats, including TensorFloat-32 and BFloat16, with lower hardware overhead.
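To make the mixed-precision accumulation concrete, the following is a minimal software sketch, not the proposed hardware datapath: two half-precision operands are multiplied and the product is accumulated into a single-precision addend, then compared against a pure half-precision FMA and a double-precision reference. NumPy's float16/float32/float64 types stand in for the hardware formats; TensorFloat-32 and BFloat16 are omitted because NumPy has no native types for them.

import numpy as np

# Illustrative sketch only; this mimics the low-precision-multiply /
# higher-precision-add behaviour described in the abstract in software.
a = np.float16(0.1)        # half-precision multiplicand
b = np.float16(3.0)        # half-precision multiplier
c = np.float32(1000.0)     # single-precision addend

# Mixed-precision path: promote the FP16 product to FP32 before accumulating.
mixed = np.float32(a) * np.float32(b) + c

# Pure half-precision FMA for comparison (accumulation also in FP16).
pure_fp16 = a * b + np.float16(c)

# Double-precision reference result.
reference = np.float64(a) * np.float64(b) + np.float64(c)

print("mixed-precision result:", mixed)       # close to the FP64 reference
print("pure FP16 result      :", pure_fp16)   # visibly less accurate
print("FP64 reference        :", reference)

Accumulating the half-precision product into a wider addend preserves accuracy that a pure half-precision accumulation loses, which is the motivation for the mixed-precision paths in the proposed unit.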
Acknowledgements
This work is supported in part by NSFC (No. 61872374, 62090023, 62172430), NSFHN (No. 2022JJ10064, 2021JJ10052) and NKRDP (No. 2021YFB0300300).
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Tan, H., Yan, R., Yang, L., Huang, L., Xiao, L., Yang, Q. (2023). Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9