Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

Published: 20 October 2006

Abstract

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes.

When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.

Published In

ACM SIGOPS Operating Systems Review  Volume 40, Issue 5
Proceedings of the 2006 ASPLOS Conference
December 2006
425 pages
ISSN:0163-5980
DOI:10.1145/1168917
  • ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
    October 2006
    440 pages
    ISBN: 1595934510
    DOI: 10.1145/1168857
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cache locality
  2. dynamic specialization

Qualifiers

  • Article

Cited By

  • (2019) Asynchronous Abstract Machines. Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers, pp. 19-26. DOI: 10.1145/3322789.3328744. Online publication date: 17-Jun-2019
  • (2013) Bibliography. Multicore Technology, pp. 409-450. DOI: 10.1201/b15268-20. Online publication date: 18-Jul-2013
  • (2019) Toward Verifying Nonlinear Integer Arithmetic. Journal of the ACM 66:3, pp. 1-30. DOI: 10.1145/3319396. Online publication date: 14-Jun-2019
  • (2019) Uniform Sampling Through the Lovász Local Lemma. Journal of the ACM 66:3, pp. 1-31. DOI: 10.1145/3310131. Online publication date: 12-Apr-2019
  • (2019) Parallel Bayesian Search with No Coordination. Journal of the ACM 66:3, pp. 1-28. DOI: 10.1145/3304111. Online publication date: 5-Apr-2019
  • (2019) Near-optimal Linear Decision Trees for k-SUM and Related Problems. Journal of the ACM 66:3, pp. 1-18. DOI: 10.1145/3285953. Online publication date: 12-Apr-2019
  • (2017) Databases on Modern Hardware: How to Stop Underutilization and Love Multicores. Synthesis Lectures on Data Management 9:1, pp. 1-113. DOI: 10.2200/S00774ED1V01Y201704DTM045. Online publication date: 14-Aug-2017
  • (2017) Schedtask. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 612-624. DOI: 10.1145/3123939.3123984. Online publication date: 14-Oct-2017
  • (2017) Exploring Energy-Efficient Cache Design in Emerging Mobile Platforms. ACM Transactions on Design Automation of Electronic Systems 22:4, pp. 1-20. DOI: 10.1145/2843940. Online publication date: 20-Jul-2017
