Article

Evaluating kilo-instruction multiprocessors

Authors: Marco Galluzzi, Ramón Beivide, Valentin Puente, José-Ángel Gregorio, Adrian Cristal, Mateo Valero

WMPI '04: Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture

Pages 72 - 79

https://doi.org/10.1145/1054943.1054953

Published: 20 June 2004

Abstract

The ever-increasing gap between processor and memory speeds has a very negative impact on performance. One possible solution to this problem is the Kilo-instruction processor, a recently proposed architecture able to hide large memory latencies by keeping thousands of instructions in flight. Current multiprocessor systems must also deal with this increasing memory latency while facing another source of latency: communication among processors. In this paper, we propose using Kilo-instruction processors as computing nodes for small-scale CC-NUMA multiprocessors, and we evaluate the resulting systems, which we call Kilo-instruction Multiprocessors. Such systems appear to achieve very good performance while showing two interesting behaviours. First, the large number of in-flight instructions allows the system to hide not only the latencies of memory accesses but also the inherent communication latencies involved in remote memory accesses. Second, the significant pressure imposed by so many in-flight instructions translates into very high contention in the interconnection network, which indicates that more effort is needed in designing routers capable of managing high traffic levels.

Cited By

Ceze L, Tuck J, Torrellas J, Cascaval C (2019). Bulk Disambiguation of Speculative Threads in Multiprocessors. ACM SIGARCH Computer Architecture News 34:2 (227-238). DOI: 10.1145/1150019.1136506. Online publication date: 1-Jul-2019
https://dl.acm.org/doi/10.1145/1150019.1136506
Ceze L, Tuck J, Torrellas J, Cascaval C (2006). Bulk Disambiguation of Speculative Threads in Multiprocessors. Proceedings of the 33rd annual international symposium on Computer Architecture (227-238). DOI: 10.1109/ISCA.2006.13. Online publication date: 17-Jun-2006
https://dl.acm.org/doi/10.1109/ISCA.2006.13

Index Terms

Evaluating kilo-instruction multiprocessors
1. Computer systems organization
  1. Architectures

Published In

WMPI '04: Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture

June 2004

146 pages

ISBN:159593040X

DOI:10.1145/1054943

Conference Chairs: John Carter (University of Utah), Lixin Zhang (IBM Austin Research Lab)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
