Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Authors:

Yi Wang

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 4, Issue 2

Article No.: 17, Pages 1 - 26

https://doi.org/10.1145/1968502.1968508

Published: 01 May 2011

Abstract

Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.
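The computation reuse described in the abstract can be illustrated with a minimal Python sketch. It is not the article's algorithm: it assumes a hypothetical scheme in which each sparse row is cut into fixed-size chunks of nonzeros, and a partial dot product is cached whenever the same chunk (same columns, same coefficients) recurs in a later row. The names `matvec_with_reuse`, `Term`, `SparseRow`, and the `chunk` parameter are illustrative only.

```python
from typing import Dict, List, Tuple

Term = Tuple[int, float]   # (column index, coefficient); hypothetical encoding
SparseRow = List[Term]     # nonzeros of one matrix row, in column order


def matvec_with_reuse(rows: List[SparseRow], x: List[float], chunk: int = 2):
    """Compute y = A @ x, caching partial sums of repeated row chunks.

    Returns (y, ops_naive, ops_reuse), where the op counts are the number of
    multiply-accumulate terms evaluated without and with the cache. The
    additions that combine cached chunk sums are not counted.
    """
    cache: Dict[Tuple[Term, ...], float] = {}
    y: List[float] = []
    ops_naive = 0
    ops_reuse = 0
    for row in rows:
        total = 0.0
        for start in range(0, len(row), chunk):
            key = tuple(row[start:start + chunk])
            ops_naive += len(key)
            if key not in cache:
                # First occurrence of this chunk: compute and remember it.
                cache[key] = sum(c * x[j] for j, c in key)
                ops_reuse += len(key)
            total += cache[key]
        y.append(total)
    return y, ops_naive, ops_reuse


# Toy rows with +1/-1 coefficients that overlap heavily (assumed data).
rows = [
    [(0, 1.0), (1, 1.0), (2, 1.0), (3, -1.0)],
    [(0, 1.0), (1, 1.0), (2, -1.0), (3, 1.0)],
    [(0, 1.0), (1, 1.0), (2, 1.0), (3, -1.0)],
]
x = [1.0, 2.0, 3.0, 4.0]
y, ops_naive, ops_reuse = matvec_with_reuse(rows, x)
print(y, ops_naive, ops_reuse)
```

On this toy input the cache halves the number of multiply-accumulate terms. The figure depends entirely on the assumed data and chunk size, so it should not be read as reproducing the reductions reported in the abstract.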


Reviews

Reviewer: Vivek Venugopal

Demertzi et al. present a data optimization algorithm to identify the reuse and scheduling of computation blocks in signal processing applications. The authors demonstrate their algorithm implementation using Xilinx Spartan-3 (xc3s1500), Virtex-II Pro (xc2vp30), and Virtex-5 (xc5vlx110) devices. The paper describes a signal recognition and processing application with modified mathematical, software, and hardware optimizations for implementing it. It uses mostly fast Fourier transform (FFT) algorithms for signal recognition, which require floating-point arithmetic and significant usage of memory processor bandwidth, power, and energy. The authors use the Walsh-Hadamard transform combined with the BestBasis algorithm to reduce the number of arithmetic operations. They use a common subexpression elimination step to reduce unnecessary data accesses and the memory footprint. The computation reuse algorithm is a greedy algorithm that starts with longer patterns in order to reuse large computation blocks. The authors use a training set of ten 4096 × 4096 matrices for their experiments, and evaluate their algorithm on the basis of computation reduction, the impact of variations in the reuse algorithm, the impact of parallelization, the impact on storage requirements, and an area/speedup comparison. The authors also compare the field-programmable gate array (FPGA) implementations with a software implementation running on an Intel Core quad-core central processing unit (CPU) clocked at 2.33 GHz, and find the FPGA implementation to be 10.25 times faster. This paper is a good starting point for researchers who want to implement signal recognition processing algorithms that use fewer computational resources.
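To make the arithmetic argument in the review concrete, here is a minimal Python sketch of the textbook fast Walsh-Hadamard transform. Its butterflies use only additions and subtractions, with no multiplications. This is the standard unnormalized transform in natural (Hadamard) ordering, not the Walsh wavelet packet and BestBasis pipeline of the paper; the function name `fwht` is an assumption made for illustration.

```python
from typing import List


def fwht(signal: List[float]) -> List[float]:
    """Unnormalized fast Walsh-Hadamard transform, natural (Hadamard) order.

    Uses n * log2(n) additions/subtractions and no multiplications.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(signal)
    h = 1
    while h < n:
        # Combine blocks of size h into blocks of size 2h with butterflies.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return out


x = [1.0, 0.0, 1.0, 0.0]
coeffs = fwht(x)
print(coeffs)        # [2.0, 2.0, 0.0, 0.0]
print(fwht(coeffs))  # applying the transform twice scales the input by n
```

Because the unnormalized transform is its own inverse up to a factor of n, applying `fwht` twice returns the input multiplied by its length, which gives a quick sanity check.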

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 4, Issue 2

May 2011

216 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/1968502

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2011

Accepted: 01 December 2009

Revised: 01 September 2009

Received: 01 February 2009

Published in TRETS Volume 4, Issue 2

Qualifiers

Research-article
Research
Refereed

