Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

Authors:

Koushik Chakraborty,

Philip M. Wells,

Gurindar S. Sohi

ACM SIGPLAN Notices, Volume 41, Issue 11

Pages 283 - 292

https://doi.org/10.1145/1168918.1168893

Published: 20 October 2006

Abstract

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes.

When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.
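The user/OS separation described in the abstract can be illustrated with a toy model. The sketch below is not either of the two assignment policies the paper evaluates; it is a minimal illustration under stated assumptions: every name in it (`USER_BLOCKS`, `OS_BLOCKS`, `baseline_core`, `csp_core`, `footprints`) is hypothetical, the block counts are invented, and each thread is assumed to simply alternate user and OS fragments. It counts how many distinct instruction blocks each core touches when threads are pinned to cores versus when fragments are routed by kind.

```python
from collections import defaultdict

# Hypothetical instruction-block IDs (assumed for illustration only):
# user fragments touch blocks 0-99, OS fragments touch blocks 100-199.
USER_BLOCKS = frozenset(range(0, 100))
OS_BLOCKS = frozenset(range(100, 200))


def thread_trace(num_fragments):
    """Assumed behavior: a thread alternates user and OS fragments."""
    return ["user" if i % 2 == 0 else "os" for i in range(num_fragments)]


def baseline_core(thread_id, kind, num_cores):
    """Conventional assignment: a thread runs all its fragments on one core."""
    return thread_id % num_cores


def csp_core(thread_id, kind, num_cores):
    """Toy spreading rule: the first half of the cores run user fragments,
    the remaining cores run OS fragments."""
    half = num_cores // 2
    if kind == "user":
        return thread_id % half
    return half + thread_id % (num_cores - half)


def footprints(policy, num_threads, num_cores, num_fragments=10):
    """Return {core: number of distinct instruction blocks it executed}."""
    touched = defaultdict(set)
    for thread_id in range(num_threads):
        for kind in thread_trace(num_fragments):
            core = policy(thread_id, kind, num_cores)
            touched[core] |= USER_BLOCKS if kind == "user" else OS_BLOCKS
    return {core: len(touched[core]) for core in sorted(touched)}


if __name__ == "__main__":
    print("baseline:", footprints(baseline_core, num_threads=4, num_cores=4))
    print("spread:  ", footprints(csp_core, num_threads=4, num_cores=4))
```

In this toy setting every core sees the full set of blocks under the pinned assignment, while each core sees only one kind of block under the spreading rule. The model ignores migration cost and data sharing, so it says nothing about the miss-rate reductions reported above.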

Index Terms

Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information

Published In

ACM SIGPLAN Notices Volume 41, Issue 11

Proceedings of the 2006 ASPLOS Conference

November 2006

425 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1168918

ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN:1595934510
DOI:10.1145/1168857
General Chair: John Paul Shen (Intel Corp.)
Program Chair: Margaret R. Martonosi (Princeton University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 October 2006

Published in SIGPLAN Volume 41, Issue 11

Bibliometrics

Total Citations: 96
Total Downloads: 1,172
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 2

Reflects downloads up to 31 Dec 2024
