research-article
Open access

Revisiting Clustered Microarchitecture for Future Superscalar Cores: A Case for Wide Issue Clusters

Published: 31 August 2015

Abstract

During the past 10 years, the clock frequency of high-end superscalar processors has not increased. Performance keeps growing mainly by integrating more cores on the same chip and by introducing new instruction set extensions. However, this benefits only some applications and requires rewriting and/or recompiling these applications. A more general way to accelerate applications is to increase the IPC, the number of instructions executed per cycle. Although the focus of academic microarchitecture research has moved away from IPC techniques, the IPC of commercial processors has continuously improved during these years.
We argue that some of the benefits of technology scaling should be used to raise the IPC of future superscalar cores further. Starting from microarchitecture parameters similar to recent commercial high-end cores, we show that an effective way to increase the IPC is to allow the out-of-order engine to issue more micro-ops per cycle. But this must be done without impacting the clock cycle. We propose combining two techniques: clustering and register write specialization. Past research on clustered microarchitectures focused on narrow issue clusters, as the emphasis at that time was on allowing high clock frequencies.
Instead, in this study, we consider wide issue clusters, with the goal of increasing the IPC under a constant clock frequency. We show that on a wide issue dual cluster, a very simple steering policy that sends 64 consecutive instructions to the same cluster, the next 64 instructions to the other cluster, and so forth, permits tolerating an intercluster delay of three cycles. We also propose a method for decreasing the energy cost of sending results from one cluster to the other cluster.
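The steering policy described above can be illustrated with a minimal sketch. It assumes instructions are numbered in program order starting from 0. The function name `steer` and its parameters `chunk` and `clusters` are hypothetical, introduced only for illustration, and are not taken from the paper.

```python
def steer(seq_index: int, chunk: int = 64, clusters: int = 2) -> int:
    """Return the cluster index for the instruction at position seq_index.

    Assumption: seq_index counts instructions in program order from 0.
    Consecutive groups of `chunk` instructions go to the same cluster,
    and successive groups rotate through the clusters.
    """
    return (seq_index // chunk) % clusters


if __name__ == "__main__":
    # Show the cluster chosen around each group boundary.
    for i in (0, 63, 64, 127, 128, 255):
        print(i, steer(i))
```

With the defaults, instructions 0 to 63 go to cluster 0, instructions 64 to 127 go to cluster 1, and instruction 128 returns to cluster 0. In this sketch, a value crosses clusters only when its producer and consumer map to different cluster indices.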

Supplementary Material

TACO1203-28 (taco1203-28.pdf)
Slide deck associated with this paper


Cited By

  • (2023) Toward Practical 128-Bit General Purpose Microarchitectures. IEEE Computer Architecture Letters 22:2 (81-84). DOI: 10.1109/LCA.2023.3287762. Online publication date: 1-Jul-2023
  • (2019) Recycling Data Slack in Out-of-Order Cores. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (545-557). DOI: 10.1109/HPCA.2019.00065. Online publication date: Feb-2019

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 3
    October 2015
    168 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2818748
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 August 2015
    Accepted: 01 June 2015
    Revised: 01 June 2015
    Received: 01 April 2015
    Published in TACO Volume 12, Issue 3

    Author Tags

    1. Clustered microarchitecture
    2. instruction-level parallelism
    3. steering policy
    4. superscalar core

    Qualifiers

    • Research-article
    • Research
    • Refereed
