SMTp: An Architecture for Next-generation Scalable Multi-threading

Authors:

Mainak Chaudhuri,

Mark Heinrich

ACM SIGARCH Computer Architecture News, Volume 32, Issue 2

Page 124

https://doi.org/10.1145/1028176.1006712

Published: 02 March 2004

Abstract

We introduce the SMTp architecture: an SMT processor augmented with a coherence protocol thread context that, together with a standard integrated memory controller, can enable the design of (among other possibilities) scalable cache-coherent hardware distributed shared memory (DSM) machines from commodity nodes. We describe the minor changes needed to a conventional out-of-order multi-threaded core to realize SMTp, discussing issues related to both deadlock avoidance and performance. We then compare SMTp performance to that of various conventional DSM machines with normal SMT processors, both with and without integrated memory controllers. On configurations from 1 to 32 nodes, with 1 to 4 application threads per node, we find that SMTp delivers performance comparable to, and sometimes better than, machines with more complex integrated DSM-specific memory controllers. Our results also show that the protocol thread has extremely low pipeline overhead. Given the simplicity and the flexibility of the SMTp mechanism, we argue that next-generation multi-threaded processors with integrated memory controllers should adopt this mechanism as a way of building less complex high-performance DSM multiprocessors.

Published In

ACM SIGARCH Computer Architecture News Volume 32, Issue 2

ISCA 2004

March 2004

373 pages

ISSN: 0163-5964

DOI: 10.1145/1028176

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture
June 2004
373 pages
ISBN: 0769521436

Copyright © 2004 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 10
Total Downloads: 1,013

Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 03 Feb 2025

Citations

Cited By

Zhu L (2015). Processing Recommender Top-N Queries in Relational Databases. Journal of Software 10:2 (162-171). Online publication date: Feb-2015.
https://doi.org/10.17706/jsw.10.2.162-171

Tatas K, Siozios K, Soudris D, Jantsch A (2013). Middleware Memory Management in NoC. Designing 2D and 3D Network-on-Chip Architectures (191-208). Online publication date: 9-Oct-2013.
https://doi.org/10.1007/978-1-4614-4274-5_8

Fensch C, Cintra M (2008). An OS-based alternative to full hardware coherence on tiled CMPs. 2008 IEEE 14th International Symposium on High Performance Computer Architecture (355-366). Online publication date: Feb-2008.
https://doi.org/10.1109/HPCA.2008.4658652

Zeffer H, Radovic Z, Hagersten E (2006). Exploiting locality: a flexible DSM approach. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium (10 pp.). Online publication date: 2006.
https://doi.org/10.1109/IPDPS.2006.1639273

Ku W, Chou S, Chu J, Kang C, Chen T, Guo J (2006). Collaborative Multithreading: An Open Scalable Processor Architecture for Embedded Multimedia Applications. 2006 IEEE International Conference on Multimedia and Expo (25-28). Online publication date: Dec-2006.
https://doi.org/10.1109/ICME.2006.262505

Chen X, Lu Z, Jantsch A, Chen S, De Micheli G, Al-Hashimi B, Mueller W, Macii E (2010). Supporting distributed shared memory on multi-core network-on-chips using a dual microcoded controller. Proceedings of the Conference on Design, Automation and Test in Europe (39-44). Online publication date: 8-Mar-2010.
https://dl.acm.org/doi/10.5555/1870926.1870939

Zeffer H, Hagersten E, Verastegui B (2007). A case for low-complexity MP architectures. Proceedings of the 2007 ACM/IEEE conference on Supercomputing (1-12). Online publication date: 16-Nov-2007.
https://dl.acm.org/doi/10.1145/1362622.1362648

Chu J, Ku W, Chou S, Chen T, Guo J, Levitan S (2007). An embedded coherent-multithreading multimedia processor and its programming model. Proceedings of the 44th annual Design Automation Conference (652-657). Online publication date: 4-Jun-2007.
https://dl.acm.org/doi/10.1145/1278480.1278646

Zeffer H, Radović Z, Hagersten E (2006). Exploiting locality. Proceedings of the 20th international conference on Parallel and distributed processing (33-33). Online publication date: 25-Apr-2006.
https://dl.acm.org/doi/10.5555/1898953.1898969

Zeffer H, Radović Z, Karlsson M, Hagersten E, Egan G, Muraoka Y (2006). TMA. Proceedings of the 20th annual international conference on Supercomputing (259-268). Online publication date: 28-Jun-2006.
https://dl.acm.org/doi/10.1145/1183401.1183438
