poster

Computation reuse in domain-specific optimization of signal recognition

Authors:

Yi Wang

FPGA '09: Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

Page 281

https://doi.org/10.1145/1508128.1508190

Published: 22 February 2009

Abstract

Domain-specific optimizations that exploit specific arithmetic and representation formats have been shown to achieve significant performance/area gains in FPGA hardware designs. In this work, we describe an approach to domain-specific optimization that goes beyond this representation level. We perform a joint optimization from both a high-level mathematical abstraction and a hardware implementation point of view. We focus on a signal recognition system that distinguishes between spoken digits. We construct transform matrices from Walsh wavelet packets in conjunction with a BestBasis algorithm. The resulting transform matrices exhibit a rich algebraic structure and contain significant overlap across rows, which creates substantial opportunities for computation reuse in the dot-product operation of the transform matrix applied to the signal vector. We have developed an algorithm that identifies this computation reuse and schedules the row computations across multiple computation units to significantly reduce the overall amount of computation.
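To make the idea of cross-row reuse concrete, the following is a minimal Python sketch. It is not the algorithm developed in this work. It assumes a matrix with small integer entries (a 4×4 Walsh-Hadamard matrix is used as a stand-in for a transform matrix), splits each row into fixed-width blocks, and reuses a partial dot product whenever the same block, or its negation, recurs over the same columns in another row. The names `canonical` and `matvec_with_reuse` and the block width are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def canonical(seg: Sequence[int]) -> Tuple[Tuple[int, ...], int]:
    """Return (segment, sign) with the first nonzero entry made positive,
    so that a segment and its negation share one cache entry."""
    for c in seg:
        if c != 0:
            if c < 0:
                return tuple(-v for v in seg), -1
            break
    return tuple(seg), 1


def matvec_with_reuse(
    matrix: Sequence[Sequence[int]], signal: Sequence[int], block: int
) -> Tuple[List[int], int, int]:
    """Compute matrix @ signal block by block, caching partial dot products.

    Returns (result, computed, total), where `computed` is the number of
    block dot products actually evaluated and `total` is the number a
    no-reuse baseline would evaluate.
    """
    n = len(signal)
    cache: Dict[Tuple[int, Tuple[int, ...]], int] = {}
    computed = 0
    total = 0
    result: List[int] = []
    for row in matrix:
        acc = 0
        for start in range(0, n, block):
            seg, sign = canonical(row[start:start + block])
            key = (start, seg)  # same coefficients over the same columns
            total += 1
            if key not in cache:
                xs = signal[start:start + block]
                cache[key] = sum(c * x for c, x in zip(seg, xs))
                computed += 1
            acc += sign * cache[key]
        result.append(acc)
    return result, computed, total


# Stand-in transform: 4x4 Walsh-Hadamard matrix (illustrative assumption).
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
result, computed, total = matvec_with_reuse(H4, [3, 1, 4, 1], block=2)
print(result, computed, total)  # [9, 5, -1, -1] 4 8
```

In this toy case, half of the block dot products are served from the cache. A hardware design would additionally need to decide which unit computes each shared block and when, which is the scheduling problem the abstract refers to and which this sketch does not attempt.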

We have implemented a custom-built dot-product multiplication unit targeting a Virtex-II-Pro FPGA device that exploits computation reuse. A baseline dot-product multiplication unit, without reuse, exhibits a maximum clock rate of 199.3 MHz while utilizing only 2% of the device capacity. The optimized system that exploits reuse also includes a computation scheduler and attains a respectable clock rate of 196 MHz while using 8,183 (57%) slices of the FPGA device. For a single pipelined dot-product unit, the FPGA hardware implementation reduces the amount of computation for an individual matrix by as much as 6.35×, and by 2× on average, over the baseline implementation. Although it is larger in area than the baseline, the implementation that exploits reuse still achieves a 2× computation reduction when compared to 3 concurrently-executing simpler accumulation units with the same aggregate FPGA design area.
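The figures above distinguish a best-case reduction for an individual matrix from an average reduction. As a hedged illustration of how such factors relate to operation counts, the sketch below computes both from per-matrix counts. The counts are made-up placeholders, not measurements from this work, and the abstract does not state which averaging convention was used, so two common ones are shown. The name `reduction_factor` is an assumption.

```python
from statistics import mean


def reduction_factor(baseline_ops: int, reuse_ops: int) -> float:
    """Ratio of baseline operations to operations remaining after reuse."""
    if reuse_ops <= 0:
        raise ValueError("reuse_ops must be positive")
    return baseline_ops / reuse_ops


# Hypothetical (baseline, with-reuse) operation counts for three matrices.
counts = [(1024, 512), (1024, 256), (1024, 1024)]

factors = [reduction_factor(b, r) for b, r in counts]
best = max(factors)                        # best case for a single matrix
mean_of_ratios = mean(factors)             # average of per-matrix factors
ratio_of_totals = (
    sum(b for b, _ in counts) / sum(r for _, r in counts)
)                                          # reduction over the whole workload

print(best, mean_of_ratios, ratio_of_totals)
```

The two averages generally differ, so a reported "average" reduction is only comparable across designs when the convention is known.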

While the results in this paper reflect the opportunities of a specific signal processing problem, this work highlights the concept of exploiting computation reuse derived from a higher-level abstract representation at a mathematical and hardware level. As such, we believe this approach can also be leveraged in other signal recognition problems with specific well-characterized computational structures and signal dictionaries.

Index Terms

Computation reuse in domain-specific optimization of signal recognition
1. Hardware
  1. Communication hardware, interfaces and storage
    1. Signal processing systems
  2. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific processors

Published In

FPGA '09: Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

February 2009

302 pages

ISBN: 9781605584102

DOI: 10.1145/1508128

General Chair: Paul Chow, University of Toronto, Canada

Program Chair: Peter Cheung, Imperial College London, UK

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

FPGA '09

FPGA '09: ACM/SIGDA International Symposium on Field Programmable Gate Arrays

February 22 - 24, 2009

Monterey, California, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%
