research-article

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Published: 23 June 2013

    Abstract

    Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.
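    The abstract does not spell out how the dynamic algorithm chooses partition sizes. As a rough illustration only, and not the authors' method, the sketch below shows one plausible shape for such a controller: shrink the foreground application's share of LLC ways while its measured miss rate stays close to the full-cache baseline, then hand the remaining ways to the background application. The function names, the 12-way cache, the 5% tolerance, and the rule that the background always keeps at least one way are all assumptions made for this example.

```python
from typing import Callable, Tuple


def choose_foreground_ways(
    miss_rate_at: Callable[[int], float],
    total_ways: int = 12,      # hypothetical LLC associativity
    min_ways: int = 1,
    tolerance: float = 0.05,   # hypothetical allowed relative miss-rate increase
) -> int:
    """Return the smallest foreground allocation that stays within tolerance.

    miss_rate_at(w) is a hypothetical probe that reports the foreground
    miss rate when the foreground owns w ways (for example, from
    performance counters sampled over a short interval).
    """
    baseline = miss_rate_at(total_ways)
    # Assumption: the background application always keeps at least one way.
    ways = total_ways - 1
    while ways > min_ways:
        candidate = miss_rate_at(ways - 1)
        if candidate > baseline * (1.0 + tolerance):
            break  # one fewer way would hurt the foreground too much
        ways -= 1
    return ways


def way_masks(fg_ways: int, total_ways: int = 12) -> Tuple[int, int]:
    """Build disjoint, contiguous way bitmasks for foreground and background."""
    if not 1 <= fg_ways < total_ways:
        raise ValueError("foreground must get at least 1 and fewer than all ways")
    fg_mask = (1 << fg_ways) - 1
    bg_mask = ((1 << total_ways) - 1) & ~fg_mask
    return fg_mask, bg_mask


if __name__ == "__main__":
    # Synthetic miss-rate curve: flat down to 4 ways, then rising sharply.
    def curve(w: int) -> float:
        return 10.0 if w >= 4 else 10.0 + (4 - w) * 5.0

    fg = choose_foreground_ways(curve)
    print(fg, [bin(m) for m in way_masks(fg)])
```

    A real controller would re-run this search periodically, because the foreground application's cache needs change from phase to phase; the synthetic curve above only stands in for live measurements.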

    Published In

    ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
    ISCA '13
    June 2013
    666 pages
    ISSN:0163-5964
    DOI:10.1145/2508148

    ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
    June 2013
    686 pages
    ISBN:9781450320795
    DOI:10.1145/2485922
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States
