SMTp: An Architecture for Next-generation Scalable Multi-threading

Published: 02 March 2004

Abstract

We introduce the SMTp architecture: an SMT processor augmented with a coherence protocol thread context that, together with a standard integrated memory controller, can enable the design of (among other possibilities) scalable cache-coherent hardware distributed shared memory (DSM) machines from commodity nodes. We describe the minor changes needed to a conventional out-of-order multi-threaded core to realize SMTp, discussing issues related to both deadlock avoidance and performance. We then compare SMTp performance to that of various conventional DSM machines with normal SMT processors both with and without integrated memory controllers. On configurations from 1 to 32 nodes, with 1 to 4 application threads per node, we find that SMTp delivers performance comparable to, and sometimes better than, machines with more complex integrated DSM-specific memory controllers. Our results also show that the protocol thread has extremely low pipeline overhead. Given the simplicity and the flexibility of the SMTp mechanism, we argue that next-generation multi-threaded processors with integrated memory controllers should adopt this mechanism as a way of building less complex high-performance DSM multiprocessors.

Published In

ACM SIGARCH Computer Architecture News, Volume 32, Issue 2
ISCA 2004
March 2004
373 pages
ISSN: 0163-5964
DOI: 10.1145/1028176

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture
June 2004
373 pages
ISBN: 0769521436

Publisher

Association for Computing Machinery

New York, NY, United States
