research-article

Optimizing scheduling and intercluster connection for application-specific DSP processors

Published: 01 November 2009

Abstract

Signal processing applications exhibit high instruction-level parallelism (ILP) and have real-time performance requirements. Embedded, application-specific multicluster architectures are desirable for providing the large computation power these applications need. As technology moves to the deep-submicron level, it becomes more important, and more challenging, to design an efficient intercluster connection network that satisfies rapidly growing intercluster data-transfer needs under power and cost constraints. This paper addresses the automatic generation of intercluster connection networks with partially connected buses. An application-specific approach is proposed that determines, in polynomial time, the minimum number of partially connected buses required for a given schedule without performance degradation. The intercluster connection topology is then generated with this minimum number of partially connected buses so as to minimize the number of bus segments. Further, a scheduling algorithm is presented that, under a schedule-length constraint, minimizes the intercluster communication needs of the given application and reduces the minimum number of partially connected buses required in the intercluster connection network. Experimental results indicate an average reduction of up to 50.6% in the minimum number of required buses and an average reduction of 64.5% in bus segments, compared with commonly used intercluster-communication-aware scheduling techniques and the as-soon-as-possible (ASAP) data-transfer scheme.
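To make the bus-count question concrete, the Python sketch below uses a simplified model that is an assumption made here, not the paper's formulation: each intercluster transfer is a hypothetical `(earliest, latest)` pair of control steps within which it can occur without lengthening the schedule, and each transfer occupies one bus for exactly one step. Under ASAP, every transfer fires at its earliest step, so the bus count is the peak number of transfers in any step. Alternatively, earliest-deadline-first (EDF) assignment, which is optimal for unit-length jobs with integer release times and deadlines on identical parallel resources, finds the smallest bus count that keeps every transfer inside its window. The function names are illustrative, and the sketch ignores partial connectivity and bus segments entirely.

```python
import heapq
from collections import Counter
from typing import List, Tuple

# Hypothetical model: a transfer is (earliest, latest) control steps, inclusive,
# and uses one bus for one step. This is NOT the paper's algorithm.
Transfer = Tuple[int, int]


def asap_buses(transfers: List[Transfer]) -> int:
    """Buses needed if every transfer fires at its earliest step (ASAP)."""
    if not transfers:
        return 0
    counts = Counter(earliest for earliest, _ in transfers)
    return max(counts.values())


def edf_feasible(transfers: List[Transfer], k: int) -> bool:
    """Return True if k buses can carry all transfers within their windows.

    At each control step, up to k pending transfers with the earliest
    'latest' step are sent; a transfer still pending past its latest
    step means k buses are not enough.
    """
    if not transfers:
        return True
    if k <= 0:
        return False
    by_release = sorted(transfers)
    pending: List[int] = []  # min-heap of latest steps of released, unsent transfers
    i = 0
    t = by_release[0][0]
    last = max(latest for _, latest in transfers)
    while t <= last:
        # Release every transfer whose window opens at step t.
        while i < len(by_release) and by_release[i][0] == t:
            heapq.heappush(pending, by_release[i][1])
            i += 1
        if pending and pending[0] < t:
            return False  # some transfer missed its latest step
        # Use up to k buses on the most urgent pending transfers.
        for _ in range(min(k, len(pending))):
            heapq.heappop(pending)
        t += 1
    return not pending


def min_buses(transfers: List[Transfer]) -> int:
    """Smallest k for which EDF succeeds; ASAP's count is always an upper bound."""
    k = 0 if not transfers else 1
    while not edf_feasible(transfers, k):
        k += 1
    return k


if __name__ == "__main__":
    example = [(0, 2), (0, 0), (0, 1), (1, 1), (2, 3)]
    print("ASAP buses:", asap_buses(example))  # 3: three transfers at step 0
    print("Minimum buses:", min_buses(example))  # 2: flexible transfers deferred
```

In this toy example, ASAP needs three buses because three transfers are produced at step 0, while deferring the two transfers that still have slack lowers the requirement to two without moving any transfer past its latest step. Choosing which clusters each bus physically connects to, and hence the bus-segment count, is a separate problem that this sketch does not address.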


Cited By

  • (2012) "WCET-aware re-scheduling register allocation for real-time embedded systems with clustered VLIW architecture," ACM SIGPLAN Notices, vol. 47, no. 5, pp. 31-40. DOI: 10.1145/2345141.2248424. Online publication date: 12-Jun-2012.
  • (2012) "WCET-aware re-scheduling register allocation for real-time embedded systems with clustered VLIW architecture," Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems, pp. 31-40. DOI: 10.1145/2248418.2248424. Online publication date: 12-Jun-2012.
  • (2010) "Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation," Proceedings of the 47th Design Automation Conference, pp. 350-355. DOI: 10.1145/1837274.1837363. Online publication date: 13-Jun-2010.


Published In

IEEE Transactions on Signal Processing, Volume 57, Issue 11
November 2009
434 pages

Publisher

IEEE Press

Publication History

Published: 01 November 2009
Accepted: 06 May 2009
Received: 06 November 2008

Author Tags

  1. Architecture
  2. clustered processors
  3. data path synthesis
  4. intercluster connection network

