research-article
Open access

SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores

Published: 29 May 2020
  • Abstract

    This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Unit (GPU)-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution and the programming convenience of homogeneous multi-thread processors. SIMT-X leverages the existing Single Instruction Multiple Data (SIMD) back-end to provide CPU/GPU-like processing on a single core with minimal overhead. We demonstrate that although SIMT-X uses a restricted form of Out-of-Order (OoO) execution, the microarchitecture successfully captures a majority of the benefits of aggressive OoO execution using at most two concurrent register mappings per architectural register, while addressing issues of partial dependencies and supporting a general-purpose Instruction Set Architecture (ISA).
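
As a rough illustration of the SIMT execution style the abstract refers to, the sketch below runs several threads of the same tiny program in lockstep, using an active mask to handle a divergent branch. It is a generic software model under stated assumptions, not the SIMT-X microarchitecture: the function name `simt_execute`, the example program, and the instruction-count accounting are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def simt_execute(xs: List[int]) -> Tuple[List[int], int]:
    """Run `y = x * 2 if x is even else x + 1` for every thread in lockstep.

    Returns the per-thread results and the number of instructions issued.
    Hypothetical model: each list element stands for one thread (lane).
    """
    n = len(xs)
    ys = [0] * n
    issued = 0

    # One compare instruction is issued once and evaluated on every lane.
    mask = [x % 2 == 0 for x in xs]
    issued += 1

    # Taken path: issued once if at least one lane is active.
    if any(mask):
        for lane in range(n):
            if mask[lane]:
                ys[lane] = xs[lane] * 2
        issued += 1

    # Not-taken path: issued once for the remaining lanes.
    if not all(mask):
        for lane in range(n):
            if not mask[lane]:
                ys[lane] = xs[lane] + 1
        issued += 1

    return ys, issued
```

In this toy model, four threads with inputs `[1, 2, 3, 4]` need three instruction issues in total, whereas running each thread independently would issue two instructions per thread. The saving shrinks when threads diverge heavily, which is one reason divergence handling matters in SIMT designs.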




    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 17, Issue 2
    June 2020
    169 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3403597
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 May 2020
    Online AM: 07 May 2020
    Accepted: 01 March 2020
    Revised: 01 March 2020
    Received: 01 September 2019
    Published in TACO Volume 17, Issue 2


    Author Tags

    1. SIMT
    2. computer architecture
    3. hardware
    4. microarchitecture
    5. multi-threading
    6. out-of-order

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Infrastructure for Exploring SIMT Architecture in General-Purpose Processors. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 316--318. DOI: 10.1109/ISPASS61541.2024.00043. Online publication date: 5-May-2024
    • (2023) The Effects of Different Process Parameters of PLA+ on Tensile Strengths in 3D Printer Produced by Fused Deposition Modeling (Birleştirme Yığma Modellemesiyle Üretilen 3B Yazıcıda PLA+’ın Farklı Proses Parametrelerinin Çekme Dayanımları Üzerindeki Etkileri). El-Cezeri Fen ve Mühendislik Dergisi. DOI: 10.31202/ecjse.1179492. Online publication date: 21-Jan-2023
    • (2023) Artificial Intelligence Accelerators. Artificial Intelligence and Hardware Accelerators, 1--52. DOI: 10.1007/978-3-031-22170-5_1. Online publication date: 16-Mar-2023
    • (2020) Supporting Irregularity in Throughput-Oriented Computing by SIMT-SIMD Integration. 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 31--35. DOI: 10.1109/IA351965.2020.00010. Online publication date: Dec-2020
