The Unified Accumulator Architecture: A Configurable, Portable, and Extensible Floating-Point Accumulator

Published: 20 May 2016

Abstract

Applications accelerated by field-programmable gate arrays (FPGAs) often require pipelined floating-point accumulators with a variety of different trade-offs. Although previous work has introduced numerous floating-point accumulation architectures, few cores are available for public use, which forces designers to use fixed-point implementations or vendor-provided cores that are not portable and are often not optimized for the desired set of trade-offs. In this article, we combine and extend previous floating-point accumulator architectures into a configurable, open-source core, referred to as the unified accumulator architecture (UAA), which enables designers to choose between different trade-offs for different applications. UAA is portable across FPGAs and allows designers to specialize the underlying adder core to take advantage of device-specific optimizations. By providing an extensible, open-source implementation, we hope for the research community to extend the provided core with new architectures and optimizations.
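The core difficulty motivating architectures like the UAA is that a pipelined floating-point adder with latency L cannot naively accumulate one input per cycle: feeding the adder's own output back creates a read-after-write hazard that stalls the pipeline. A common workaround in the reduction-circuit literature is to keep L independent partial sums in flight and combine them at the end. The sketch below is a purely illustrative software model of that interleaving idea, not the UAA implementation itself; the `latency` parameter and function name are assumptions for the example.

```python
# Illustrative model (not the UAA core): accumulating a stream through a
# pipelined adder of depth `latency`. Each of the `latency` partial sums
# only sees every latency-th input, so a new addition can issue every
# cycle without waiting for the previous result.

def pipelined_accumulate(values, latency=4):
    """Interleaved partial-sum accumulation, modeled in software."""
    partials = [0.0] * latency
    for i, v in enumerate(values):
        # Round-robin across partial sums: no feedback dependency
        # between consecutive cycles.
        partials[i % latency] += v
    # Final reduction of the surviving partial sums (in hardware,
    # a small adder tree or a drain phase).
    total = 0.0
    for p in partials:
        total += p
    return total

print(pipelined_accumulate([1.0, 2.0, 3.0, 4.0, 5.0]))  # 15.0
```

Note that interleaving reorders the additions, so the result can differ slightly from a strict sequential sum in general floating-point arithmetic; managing that trade-off between throughput, area, and accuracy is exactly the design space the UAA exposes as configuration options.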


Cited By

  • (2024) Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), 297--303. DOI: 10.1109/FPL64840.2024.00048
  • (2023) Using FPGA Devices to Accelerate Tree-Based Genetic Programming: A Preliminary Exploration with Recent Technologies. In Genetic Programming, 182--197. DOI: 10.1007/978-3-031-29573-7_12
  • (2021) A High-Speed Floating-Point Multiply-Accumulator Based on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, 10, 1782--1789. DOI: 10.1109/TVLSI.2021.3105268
  • (2020) A Tag Based Random Order Vector Reduction Circuit. IEEE Access 8, 41502--41515. DOI: 10.1109/ACCESS.2020.2976764

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 9, Issue 3: Special Issue on Reconfigurable Components with Source Code. September 2016, 128 pages.
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/2940351
Editor: Steve Wilton

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 20 May 2016
    Accepted: 01 July 2015
    Revised: 01 April 2015
    Received: 01 October 2014
    Published in TRETS Volume 9, Issue 3

    Author Tags

    1. FPGA
    2. floating-point accumulation
    3. reduction circuits

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Science Foundation
