research-article

Optimizing scheduling and intercluster connection for application-specific DSP processors

Authors:

Chun Jason Xue,

Edwin Hsing-Mean Sha

IEEE Transactions on Signal Processing, Volume 57, Issue 11

Pages 4538 - 4547

https://doi.org/10.1109/TSP.2009.2024870

Published: 01 November 2009

Abstract

Signal processing applications have high instruction-level parallelism (ILP) and real-time performance requirements. Embedded, application-specific multicluster architectures are desirable because they provide the large computation power that these applications need. As technology moves to the deep submicron level, it becomes more important and more challenging to design an efficient intercluster connection network that satisfies rapidly growing intercluster data transfer needs under power and cost constraints. This paper addresses the automatic generation of an intercluster connection network with partially connected buses. An application-specific approach is proposed to determine, in polynomial time, the minimum number of partially connected buses required for a given schedule without performance degradation. The intercluster connection topology is then generated with that minimum number of partially connected buses so as to minimize the number of connection bus segments. Further, a scheduling algorithm is presented that minimizes the intercluster communication needs of the given application and reduces the minimum number of partially connected buses required in the intercluster connection network under a schedule length constraint. Experimental results indicate that an average reduction of up to 50.6% in the minimum number of required buses and an average reduction of 64.5% in bus segments can be achieved compared to commonly used intercluster-communication-aware scheduling techniques and the as-soon-as-possible (ASAP) data transfer scheme.
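To illustrate why the timing of data transfers affects the number of buses, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a simplified model that the abstract does not specify: every transfer takes one control step on one bus, every bus can serve any pair of clusters (so partial connectivity and bus segments are ignored), and each transfer carries a hypothetical `(ready, deadline)` window of inclusive control steps. The names `asap_bus_demand`, `feasible`, and `min_buses` are illustrative only. The slack-aware count uses a standard earliest-deadline-first check, swapped in here purely for illustration.

```python
import heapq
from collections import Counter


def asap_bus_demand(transfers):
    """Buses needed when every transfer is sent at its ready step (ASAP)."""
    if not transfers:
        return 0
    per_step = Counter(ready for ready, _ in transfers)
    return max(per_step.values())


def feasible(transfers, buses):
    """Earliest-deadline-first check with `buses` unit-length slots per step.

    Assumes buses >= 1 and ready <= deadline for every transfer.
    """
    pending = sorted(transfers)  # ordered by ready step
    heap = []                    # deadlines of transfers that are ready
    i, n = 0, len(pending)
    step = pending[0][0] if pending else 0
    while i < n or heap:
        if not heap and pending[i][0] > step:
            step = pending[i][0]  # skip idle steps
        while i < n and pending[i][0] <= step:
            heapq.heappush(heap, pending[i][1])
            i += 1
        for _ in range(min(buses, len(heap))):
            if heapq.heappop(heap) < step:
                return False      # this transfer missed its deadline
        step += 1
    return True


def min_buses(transfers):
    """Smallest bus count for which all transfer windows can be met."""
    for k in range(1, len(transfers) + 1):
        if feasible(transfers, k):
            return k
    return 0  # no transfers, no buses


# Hypothetical windows: three transfers with slack, one that is tight.
transfers = [(0, 2), (0, 2), (0, 2), (1, 1)]
print(asap_bus_demand(transfers), min_buses(transfers))
```

In this toy instance, sending everything as soon as possible crowds three transfers into step 0, whereas spreading them across their slack windows lets two buses suffice. The paper's method additionally decides which clusters each bus connects, which this sketch does not model.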

Cited By

Huang Y, Zhao M, Xue C (2012). WCET-aware re-scheduling register allocation for real-time embedded systems with clustered VLIW architecture. ACM SIGPLAN Notices 47:5 (31-40). doi:10.1145/2345141.2248424. Online publication date: 12-Jun-2012.
https://dl.acm.org/doi/10.1145/2345141.2248424
Huang Y, Zhao M, Xue C, Wilhelm R, Falk H, Yi W (2012). WCET-aware re-scheduling register allocation for real-time embedded systems with clustered VLIW architecture. Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems (31-40). doi:10.1145/2248418.2248424. Online publication date: 12-Jun-2012.
https://dl.acm.org/doi/10.1145/2248418.2248424
Hu J, Xue C, Tseng W, He Y, Qiu M, Sha E, Sapatnekar S (2010). Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation. Proceedings of the 47th Design Automation Conference (350-355). doi:10.1145/1837274.1837363. Online publication date: 13-Jun-2010.
https://dl.acm.org/doi/10.1145/1837274.1837363

Published In

IEEE Transactions on Signal Processing Volume 57, Issue 11

November 2009

434 pages

ISSN: 1053-587X

Copyright © 2009.

Publisher

IEEE Press

Publication History

Published: 01 November 2009

Accepted: 06 May 2009

Received: 06 November 2008
