Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Published: 01 May 2011 Publication History

Abstract

Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.

References

[1]
Aho, A., Lam, M., Sethi, R., and Ullman, J. 2006. Compilers: Principles, Techniques and Toos 2nd Ed. Addison-Wesley, Reading, MA.
[2]
Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H., Eds. 2000. Templates for the Soution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
[3]
Beauchamp, M., Hauck, S., Underwood, K., and Hemmert, S. 2006. Embedded floating-point units in FPGAs. In Proceedings of the ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA’06). ACM Press, New York, 12--20.
[4]
Böhm, W., Draper, B., Najjar, W., Hammes, J., Rinker, R., Chawathe, M., and Ross, C. 2001. One-Step compilation of image processing applications to FPGAs. In Proceedings of the the 9th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’01). IEEE Computer Society Press, Los Alamitos, CA, 209--218.
[5]
Bouganis, C.-S., Park, S.-B., Constantinides, G., and Cheung, P. 2009. Synthesis and optimization of 2D filter designs for heterogeneous FPGAs. ACM Trans. Reconfig. Technol. Syst. 1, 4, 1--28.
[6]
Callanan, O., Gregg, D., Nisbet, A., and Peardon, M. 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’06). 1--6.
[7]
Cohen, A., DeVore, R. A., and Hochmuth, R. 2000. Restricted nonlinear approximation. Construct. Approx. 16, 1, 85--113.
[8]
Coifman, R. and Wickerhauser, M. 1992. Entropy-Based algorithms for best basis selection. IEEE Trans. Inf. Theory 38, 2, 713--718.
[9]
Constantinides, G., Cheung, P., and Luk, W. 2003. Synthesis of saturation arithmetic architectures. ACM Trans. Des. Autom. Electron. Syst. 8, 3, 334--354.
[10]
d’Alberto, P., Milder, P., Sandryhaila, A., Franchetti, F., Hoe, J., Moura, J., Püschel, M., and Johnson, J. 2007. Generating FPGA accelerated DFT libraries. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’07). IEEE Computer Society Press, Los Alamitos, CA, 173--184.
[11]
deLorimier, M. and DeHon, A. 2005. Floating-Point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York.
[12]
DeVore, R. A., Jawerth, B., and Popov, V. 1992. Compression of wavelet decompositions. Amer. J. Math. 114, 737--785.
[13]
Donoho, D. and Elad, M. 2003. Optimal sparse representations in general (nonorthogonal) dictionaries via ℓ<sup>1</sup> minimization. Proc. Nat. Acad. Sci. 100, 2197--2202.
[14]
Dou, Y., Vassiliadis, S., Kuzmanov, G., and Gaydadjiev, G. 2005. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 86--95.
[15]
Ellis, D. 2003. Recoded digits archive at Columbia University.
[16]
Fahmy, S., Cheung, P., and Luk, W. 2005. Novel FPGA-based implementation of median and weighted median filters for image processing. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’05). 142--147.
[17]
Frigo, M. 1999. A fast fourier transform compiler. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI’99). ACM Press, New York.
[18]
Mallat, S. and Zhang, Z. 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41, 12, 3397--3415.
[19]
Marcus, G. and Nolazco-FIores, J. A. 2005. An FPGA-based coprocessor for the SPHINX speech recognition system: Early experiences. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig’05). IEEE Computer Society, 27.
[20]
Mathew, B., Davis, A., and Fang, Z. 2003. A low-power accelerator for the SPHINX 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’03).
[21]
Melnikoff, S., Quigley, S., and Russell, M. 2002. Implementing a simple continuous speech recognition system on an FPGA. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02).
[22]
Milder, P., Franchetti, F., Hoe, J., and Pschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings Design Automation Conference (DAC). 385--390.
[23]
Ortigosa, E., Ortigosa, P., Canas, A., Ros, E., Agis, R., and Ortega, J. 2003. FPGA implementation of multi-layer perceptrons for speech recognition. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL’03). Lecture Notes in Computer Science, vol. 2778. Springer.
[24]
Park, I.-C. and Kang, H.-J. 2001. Digital filter synthesis based on minimal signed digit representation. In Proceedings of the 38th Conference on Design Automation (DAC’01). ACM Press, New York, 468--473.
[25]
Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers. 40--44.
[26]
Proakis, J. G. and Manolakis, D. K. 1996. Digital Signal Processing: Principles, Algorithms and Applications 3rd Ed. Prentice-Hall, NJ.
[27]
Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE (Special Issue on Program Generation, Optimization, and Adaptation) 93, 2, 232--275.
[28]
Saito, N. and Coifman, R. 1994. Local discriminant bases. In SPIE Math. Imag. Wavelet Appl. Signal Image Process. (SPIE). vol. 2303.
[29]
Temam, O. and Jalby, W. 1992. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing (SC’92). IEEE Computer Society Press, 578--587.
[30]
Thiele, C. M. and Villemoes, L. F. 1996. A fast algorithm for adapted time frequency tilings. Appl. Comput. Harmon. Anal. 3, 91--99.
[31]
Tropp, J. A. 2003. Greed is good: Algorithmic results for sparse approximation. Tech. rep. 2003--04, Texas Institute for Computational and Applied Mathematics.
[32]
Underwood, K. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 12th ACM International Symposium on Field Programmable Gate Arrays (FPGA’04). ACM Press, New York, 171--180.
[33]
Vuduc, R. and Moon, H.-J. 2005. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the International Conference on High Performance Computing and Communcations (HPCC). Lecture Notes in Computer Science, vol. 3726. Springer, 807--816.
[34]
Woods, R., McAllister, J., and Lightbody, G. 2008. FPGA-Based Implementation of Signal Processing Systems. John Wiley & Sons.
[35]
Xilinx Corp. 2007. Virtex-II Pro<sup>TM</sup> platform FPGAs: Complete data sheet.
[36]
Xilinx Corp. 2008. Virtex-3<sup>TM</sup> platform FPGAs: Complete data sheet.
[37]
Xilinx Corp. 2009. Virtex-5<sup>TM</sup> platform FPGAs: Complete data sheet.
[38]
Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 63--74.

Recommendations

Reviews

Vivek Venugopal

Demertzi et al. present a data optimization algorithm to identify the reuse and scheduling of computation blocks in signal processing applications. The authors demonstrate their algorithm implementation using Xilinx Spartan-3 (xc3s1500), Virtex-II Pro (xc2vp30), and Virtex-5 (xc5vlx110) devices. The paper describes a signal recognition and processing application with modified mathematical, software, and hardware optimizations for implementing it. It uses mostly fast Fourier transform (FFT) algorithms for signal recognition, which require floating-point arithmetic and significant usage of memory processor bandwidth, power, and energy. The authors use the Walsh-Hadamard transform combined with the BestBasis algorithm to reduce the number of arithmetic operations. They use a common subexpression elimination step to reduce the unnecessary data accesses and the memory footprint. The computation reuse algorithm is a greedy algorithm, and starts with longer patterns in order to reuse large computation blocks. The authors use a training set of ten 4096 x 4096 matrices for their experiments, and evaluate their algorithm on the basis of computation reduction, impact of using variations in the reuse algorithm, impact of parallelization, impact on storage requirements, and area/speedup comparison. The authors also compare the field-programmable gate array (FPGA) implementations with a software implementation targeted on an Intel Core quad core central processing unit (CPU), clocked at 2.33 GHz, and find the FPGA implementation to be 10.25 times faster. This paper is a good starting point for researchers who want to implement signal recognition processing algorithms that use less computational resources. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Reconfigurable Technology and Systems
ACM Transactions on Reconfigurable Technology and Systems  Volume 4, Issue 2
May 2011
216 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/1968502
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2011
Accepted: 01 December 2009
Revised: 01 September 2009
Received: 01 February 2009
Published in TRETS Volume 4, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Configurable architectures
  2. FPGA
  3. computation reuse
  4. signal recognition

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 274
    Total Downloads
  • Downloads (Last 12 months)4
  • Downloads (Last 6 weeks)1
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media