research-article

Design of Synthesis-time Vectorized Arithmetic Hardware for Tapered Floating-point Addition and Subtraction

Authors:

Ashish Reddy Bommana,

Susheel Ujwal Siddamshetty,

Dhilleswararao Pudi,

Arvind Thumatti K. R.,

Srinivas Boppu,

M Sabarimalai Manikandan,

Linga Reddy CenkeramaddiAuthors Info & Claims

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3

Article No.: 32, Pages 1 - 35

https://doi.org/10.1145/3567423

Published: 22 March 2023 Publication History

Abstract

Energy efficiency has become the new performance criterion in this era of pervasive embedded computing; thus, accelerator-rich multi-processor system-on-chips are commonly used in embedded computing hardware. Once computationally intensive machine learning applications gained much traction, they are now deployed in many application domains due to abundant and cheaply available computational capacity. In addition, there is a growing trend toward developing hardware accelerators for machine learning applications for embedded edge devices where performance and energy efficiency are critical. Although these hardware accelerators frequently use floating-point operations for accuracy, reduced-width floating-point formats are also used to reduce hardware complexity; thus, power consumption while maintaining accuracy. Vectorization concepts can also be used to improve performance, energy efficiency, and memory bandwidth. We propose the design of a vectorized floating-point adder/subtractor that supports arbitrary length floating-point formats with varying exponent and mantissa widths in this article. In comparison to existing designs in the literature, the proposed design is 2.57× area- and 1.56× power-efficient, and it supports true vectorization with no restrictions on exponent and mantissa widths.

References

[1]

Aqil M. Azmi and Fabrizio Lombardi. 1989. On a tapered floating point system. In Proceedings of the 9th Symposium on Computer Arithmetic. IEEE, 2–9.

[2]

David H. Bailey. 2005. High-precision floating-point arithmetic in scientific computation. Comput. Sci. Eng. 7, 3 (2005), 54–61.

Digital Library

[3]

Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Mehran Nekuii, Oguz H. Elibol, and Hanlin Tang. 2020. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks. arXiv preprint arXiv:2001.05674 (2020).

[4]

Florent De Dinechin, Bogdan Pasca, and E. Normale. 2011. Custom arithmetic datapath design for FPGAs using the FloPoCo core generator. IEEE Des. Test Comput. 28, 4 (2011), 18–27.

Digital Library

[5]

Jean-Pierre Deschamps, Gery J. A. Bioul, and Gustavo D. Sutter. 2006. Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems. John Wiley & Sons.

[6]

Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2013. Power challenges may end the multicore era. Commun. ACM 56, 2 (2013).

Digital Library

[7]

Fernando D. Franke, Diego S. Picada, and André L. Aita. 2001. A Logarithmic Shifter for a Floating-Point Adder. In Proceeding of the Student Forum on Microelectronics, SBMicro.

[8]

David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1 (1991), 5–48.

Digital Library

[9]

Neha Gupta. 2021. Introduction to hardware accelerator systems for artificial intelligence and machine learning. In Advances in Computers. Vol. 122. Elsevier, 1–21.

[10]

Libo Huang, Li Shen, Kui Dai, and Zhiying Wang. 2007. A new architecture for multiple-precision floating-point multiply-add fused unit design. In Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH’07). IEEE, 69–76.

Digital Library

[11]

Christopher Inacio and Denise Ombres. 1996. The DSP decision: Fixed point or floating?IEEE Spect. 33, 9 (1996), 72–74.

Digital Library

[12]

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.

Digital Library

[13]

Himanshu Kaul, Mark Anders, Sanu Mathew, Steven Hsu, Amit Agarwal, Farhana Sheikh, Ram Krishnamurthy, and Shekhar Borkar. 2012. A 1.45 GHz 52-to-162GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS. In Proceedings of the IEEE International Solid-state Circuits Conference. IEEE, 182–184.

[14]

Stefan Mach. 2021. Floating-point Architectures for Energy-efficient Transprecision Computing. Ph. D. Dissertation. ETH Zurich.

[15]

Stefan Mach. 2021. FPnew - New Floating-Point Unit with Transprecision Capabilities. Retrieved from https://github.com/pulp-platform/fpnew.

[16]

Stefan Mach, Fabian Schuiki, Florian Zaruba, and Luca Benini. 2020. FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing. IEEE Trans. Very Large Scale Integ. Syst. 29, 4 (2020), 774–787.

[17]

A. Cristiano, I. Malossi, Michael Schaffner, Anca Molnos, Luca Gammaitoni, Giuseppe Tagliavini, Andrew Emerson, Andrés Tomás, Dimitrios S. Nikolopoulos, Eric Flamand, and Norbert Wehn. 2018. The transprecision computing paradigm: Concept, design, and applications. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE). 1105–1110. DOI:.

[18]

Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul. 2019. Mixed precision training with 8-bit floating point. arXiv preprint arXiv:1905.12334 (2019).

[19]

Robert Morris. 1971. Tapered floating point: A new floating-point representation. IEEE Trans. Comput. 100, 12 (1971), 1578–1579.

Digital Library

[20]

Jean-Michel Muller, Nicolas Brisebarre, Florent De Dinechin, Claude-Pierre Jeannerod, Vincent Lefevre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, Serge Torres, et al. 2018. Handbook of Floating-point Arithmetic. Vol. 1. Springer.

[21]

Alberto Nannarelli. 2019. Tunable floating-point adder. IEEE Trans. Comput. 68, 10 (2019), 1553–1560.

[22]

Michael L. Overton. 2001. Numerical Computing with IEEE Floating Point Arithmetic. SIAM.

[23]

Hyunbin Park and Shiho Kim. 2021. Hardware accelerator systems for artificial intelligence and machine learning. In Advances in Computers. Vol. 122. Elsevier, 51–95.

[24]

Stefan Payer, Cedric Lichtenau, Michael Klein, Kerstin Schelm, Petra Leber, Nicol Hofmann, and Tina Babinsky. 2020. SIMD multi format floating-point unit on the IBM z15 (TM). In Proceedings of the IEEE 27th Symposium on Computer Arithmetic (ARITH). IEEE, 125–128.

[25]

Tayyar Rzayev, Saber Moradi, David H. Albonesi, and Rajit Manchar. 2017. DeepRecon: Dynamically reconfigurable architecture for accelerating deep neural networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). IEEE, 116–124.

[26]

Michael B. Taylor. 2013. A landscape of the new dark silicon design regime. IEEE Micro 33, 5 (2013), 8–19.

Digital Library

[27]

Jonathan Ying Fai Tong, David Nagle, and Rob A. Rutenbar. 2000. Reducing power by optimizing the necessary precision/range of floating-point arithmetic. IEEE Trans. Very Large Scale Integ. Syst. 8, 3 (2000), 273–286.

Digital Library

[28]

Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7686–7695.

Digital Library

[29]

Shaolin Xie, Scott Davidson, Ikuo Magaki, Moein Khazraee, Luis Vega, Lu Zhang, and Michael B. Taylor. 2018. Extreme datacenter specialization for planet-scale computing: ASIC clouds. ACM SIGOPS Oper. Syst. Rev. 52, 1 (2018), 96–108.

Digital Library

[30]

Xilinx. 2020. Floating-Point Operator, LogiCORE IP Product Guide, Version 7.1. Retrieved from https://www.xilinx.com/products/intellectual-property/floating_pt.html.

[31]

Pedram Zamirai, Jian Zhang, Christopher R. Aberger, and Christopher De Sa. 2020. Revisiting BFloat16 training. arXiv preprint arXiv:2010.06192 (2020).

[32]

Hao Zhang, Dongdong Chen, and Seok-Bum Ko. 2019. New flexible multiple-precision multiply-accumulate unit for deep neural network training and inference. IEEE Trans. Comput. 69, 1 (2019), 26–38.

Digital Library

Index Terms

Design of Synthesis-time Vectorized Arithmetic Hardware for Tapered Floating-point Addition and Subtraction
1. Hardware
  1. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific processors

Recommendations

Hardware Designs for Decimal Floating-Point Addition and Related Operations

Decimal arithmetic is often used in commercial, financial, and Internet-based applications. Due to the growing importance of decimal floating-point (DFP) arithmetic, the IEEE 754 Draft Standard for Floating-Point Arithmetic (IEEE P754) includes ...
The IBM zEnterprise-196 Decimal Floating-Point Accelerator
ARITH '11: Proceedings of the 2011 IEEE 20th Symposium on Computer Arithmetic

Decimal floating-point Arithmetic is widely used in commercial computing applications, such as financial transactions, where rounding errors prevent the use of binary floating-point operations. The revised IEEE Standard for Floating-Point Arithmetic (...
Synthesis of Floating-Point Addition Clusters on FPGAs Using Carry-Save Arithmetic
FPL '10: Proceedings of the 2010 International Conference on Field Programmable Logic and Applications

A new method to synthesize clusters of floating-point addition operations on FPGAs is presented. Similar to Altera’s floating-point data path compiler, it performs normalization once, at the output of the cluster operation. All significands in the ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems

ACM Transactions on Design Automation of Electronic Systems Volume 28, Issue 3

May 2023

456 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/3587887

Editor:
X. Sharon Hu
University of Notre Dame, USA

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 22 March 2023

Online AM: 08 October 2022

Accepted: 23 September 2022

Revised: 04 July 2022

Received: 16 March 2022

Published in TODAES Volume 28, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
311
Total Downloads

Downloads (Last 12 months)124
Downloads (Last 6 weeks)15

Reflects downloads up to 10 Aug 2024

Other Metrics

View Author Metrics

Citations

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Full Text

View this article in Full Text.

HTML Format

View this article in HTML Format.

Media

Figures

Other

Tables

View full text|Download PDF

View Issue’s Table of Contents