Processor coupling: integrating compile time and runtime scheduling for parallelism

Authors:

Stephen W. Keckler,

William J. Dally

ISCA '92: Proceedings of the 19th annual international symposium on Computer architecture

Pages 202 - 213

https://doi.org/10.1145/139669.139728

Published: 01 April 1992

Abstract

The technology to implement a single-chip node composed of four high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread parallelism by using compile-time and runtime scheduling. The compiler statically schedules individual threads to discover available intra-thread instruction-level parallelism. The runtime scheduling mechanism interleaves threads, exploiting inter-thread parallelism to maintain high ALU utilization. ALUs are assigned to threads on a cycle-by-cycle basis, and several threads can be active concurrently. We provide simulation results demonstrating that, on four simple numerical benchmarks, processor coupling achieves better performance than purely statically scheduled or multiprocessor machine organizations. We examine how performance is affected by restricted communication between ALUs and by long memory latencies. We also present an implementation and feasibility study of a processor-coupled node.
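The cycle-by-cycle assignment of ALUs to threads described in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator: the names `run_coupled`, `Word`, and `Thread` are hypothetical, and the arbitration rule (fixed priority, lowest thread id wins), the single-cycle operation latency, and the absence of data dependencies and memory stalls are all simplifying assumptions. Each thread is a list of statically scheduled wide instructions, standing in for the compiler's output, and every cycle each ALU takes a pending operation from the highest-priority thread that has one bound to it.

```python
from typing import Dict, List

# One statically scheduled wide instruction: ALU index -> operation name.
# (Hypothetical representation; stands in for the compiler's schedule.)
Word = Dict[int, str]
Thread = List[Word]


def run_coupled(threads: List[Thread], num_alus: int = 4) -> List[List[str]]:
    """Interleave threads on shared ALUs; return one row of ALU slots per cycle."""
    for thread in threads:
        for word in thread:
            if any(alu < 0 or alu >= num_alus for alu in word):
                raise ValueError("operation bound to a nonexistent ALU")

    pc = [0] * len(threads)                               # current word per thread
    pending = [dict(t[0]) if t else {} for t in threads]  # unissued ops of that word
    trace: List[List[str]] = []

    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        row = []
        for alu in range(num_alus):
            slot = "-"                                    # idle unless a thread claims it
            for tid in range(len(threads)):               # assumed policy: lowest id wins
                if pc[tid] < len(threads[tid]) and alu in pending[tid]:
                    slot = f"t{tid}:{pending[tid].pop(alu)}"
                    break
            row.append(slot)
        trace.append(row)

        # A thread moves to its next word once every op of the current word has issued.
        for tid in range(len(threads)):
            if pc[tid] < len(threads[tid]) and not pending[tid]:
                pc[tid] += 1
                if pc[tid] < len(threads[tid]):
                    pending[tid] = dict(threads[tid][pc[tid]])
    return trace


t0 = [{0: "add", 1: "mul"}, {0: "sub"}]
t1 = [{2: "ld", 3: "add"}, {1: "st"}]
for cycle, row in enumerate(run_coupled([t0, t1])):
    print(cycle, row)
```

In this example the two threads finish in 2 cycles when interleaved, against 4 cycles if each ran alone back to back, because the second thread fills ALU slots the first leaves idle. When two threads compete for the same ALU, the lower-priority thread simply waits a cycle, which is the toy analogue of runtime arbitration.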


Published In

ISCA '92: Proceedings of the 19th annual international symposium on Computer architecture

May 1992

439 pages

ISBN:0897915097

DOI:10.1145/139669

Chairman:
Allan Gottlieb
New York Univ., New York, NY

ACM SIGARCH Computer Architecture News Volume 20, Issue 2
Special Issue: Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92)
May 1992
429 pages
ISSN:0163-5964
DOI:10.1145/146628
Editor:
Allan Gottlieb
New York Univ., New York, NY

Copyright © 1992 Authors.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ISCA92

ISCA92: International Conference on Computer Architecture

May 19 - 21, 1992

Queensland, Australia

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 85
Total Downloads: 615
Downloads (Last 12 months): 101
Downloads (Last 6 weeks): 14

Reflects downloads up to 10 Oct 2024

Cited By

Bonasu A, Karmunchi S, Wang N (2020). Design of Efficient Dynamic Scheduling of RISC Processor Instructions. 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0236-0240. Online publication date: 4-Nov-2020.
https://doi.org/10.1109/IEMCON51383.2020.9284902

Hoogerbrugge J, Terechko A (2011). A multithreaded multicore system for embedded media processing. Transactions on high-performance embedded architectures and compilers III, pp. 154-173. Online publication date: 1-Jan-2011.
https://dl.acm.org/doi/10.5555/1980776.1980787

Hoogerbrugge J, Terechko A (2011). A Multithreaded Multicore System for Embedded Media Processing. Proceedings of the 2011 conference on Transactions on High-Performance Embedded Architectures and Compilers III - Volume 6590, pp. 154-173. Online publication date: 1-Jan-2011.
https://dl.acm.org/doi/10.1007/978-3-642-19448-1_9

Baniwal R, Pandey K (2010). Recent Trends in Superscalar Architecture to Exploit More Instruction Level Parallelism. Information and Communication Technologies, pp. 660-665. Online publication date: 2010.
https://doi.org/10.1007/978-3-642-15766-0_116

Shen Z, He H, Sun Y (2009). Simultaneous Multithreading VLIW DSP Architecture with Dynamic Dispatch Mechanism. Proceedings of the 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, pp. 505-512. Online publication date: 27-Aug-2009.
https://dl.acm.org/doi/10.1109/DSD.2009.128

Ozer E, Conte T (2005). High-Performance and Low-Cost Dual-Thread VLIW Processor Using Weld Architecture Paradigm. IEEE Transactions on Parallel and Distributed Systems 16:12, pp. 1132-1142. Online publication date: 1-Dec-2005.
https://dl.acm.org/doi/10.1109/TPDS.2005.150

Wan J, Chen S (2005). Reducing Conflicts in SMT VLIW Processor for Higher Throughput. Proceedings of the Second International Conference on Embedded Software and Systems, pp. 5-11. Online publication date: 16-Dec-2005.
https://dl.acm.org/doi/10.1109/ICESS.2005.80

Balasubramonian R, Muralimanohar N, Ramani K, Venkatachalapathy V (2005). Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pp. 28-39. Online publication date: 12-Feb-2005.
https://dl.acm.org/doi/10.1109/HPCA.2005.21

Chiou D, Ang B, Greiner R, Arvind, Hoe J, Beckerle M, Hicks J, Boughton A (2005). StarT-NG: Delivering seamless parallel computing. EURO-PAR '95 Parallel Processing, pp. 101-116. Online publication date: 9-Jun-2005.
https://doi.org/10.1007/BFb0020458

Chou Y, Siewiorek D, Shen J (2005). A realistic study on multithreaded superscalar processor design. Euro-Par'97 Parallel Processing, pp. 1092-1101. Online publication date: 26-Sep-2005.
https://doi.org/10.1007/BFb0002858
