Code size reduction technique and implementation for software-pipelined DSP applications

Published: 01 November 2003

Abstract

Software pipelining is widely used to exploit the instruction-level parallelism of loops, but it also significantly expands code size. For embedded systems with very limited on-chip memory, code size becomes one of the most important optimization concerns. This paper presents the theoretical foundation of code size reduction for software-pipelined loops, based on the retiming concept. We propose a general Code-size REDuction technique (CRED) for various kinds of processors. Our CRED algorithms integrate code size reduction with software pipelining. The experimental results show the effectiveness of the CRED technique both in reducing code size and in exploring the trade-off space between code size and performance.
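
The abstract does not describe how CRED works internally, so the sketch below is only an illustration of the problem it addresses, not the paper's algorithm. It assumes a hypothetical two-stage loop body and shows (1) the plain loop, (2) a software-pipelined version whose prologue and epilogue duplicate parts of the body, and (3) a kernel-only version in which guard conditions stand in for the prologue and epilogue. Guarding stages this way is one well-known approach to shrinking pipelined loops; whether it matches CRED is an assumption here. All function names are made up for this example.

```python
def original(x):
    """Reference loop: two dependent operations per iteration."""
    y = []
    for v in x:
        a = v * 2        # stage 1
        y.append(a + 1)  # stage 2 (depends on stage 1)
    return y


def pipelined(x):
    """Software-pipelined form: stage 1 of iteration i+1 overlaps stage 2 of i."""
    n = len(x)
    if n == 0:
        return []
    y = []
    a = x[0] * 2                 # prologue: stage 1 of iteration 0
    for i in range(n - 1):       # kernel: both stages, from different iterations
        y.append(a + 1)          # stage 2 of iteration i
        a = x[i + 1] * 2         # stage 1 of iteration i + 1
    y.append(a + 1)              # epilogue: stage 2 of the last iteration
    return y


def guarded_kernel(x):
    """Kernel-only form: guards replace the separate prologue and epilogue."""
    n = len(x)
    y = []
    a = None
    for i in range(n + 1):       # one extra trip to drain the pipeline
        if i >= 1:               # stage 2 is inactive on the first trip
            y.append(a + 1)
        if i < n:                # stage 1 is inactive on the last trip
            a = x[i] * 2
    return y
```

In this toy case the pipelined version contains four stage instances (one in the prologue, two in the kernel, one in the epilogue) against two in the original loop, while the guarded version returns to two stage instances at the cost of one extra loop trip and two conditions. On real hardware the two kernel stages would issue in parallel; the sequential Python only models the ordering.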



Published In

ACM Transactions on Embedded Computing Systems  Volume 2, Issue 4
November 2003
165 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/950162

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. DSP processors
  2. Retiming
  3. scheduling
  4. software pipelining

Qualifiers

  • Article
