DOI: 10.5555/3026877.3026935

History-based harvesting of spare cycles and storage in large-scale datacenters

Published: 02 November 2016

    Abstract

    An effective way to increase utilization and reduce costs in datacenters is to co-locate their latency-critical services and batch workloads. In this paper, we describe systems that harvest spare compute cycles and storage space for co-location purposes. The main challenge is minimizing the performance impact on the services, while accounting for their utilization and management patterns. To overcome this challenge, we propose techniques for giving the services priority over the resources, and leveraging historical information about them. Based on this information, we schedule related batch tasks on servers that exhibit similar patterns and will likely have enough available resources for the tasks' durations, and place data replicas at servers that exhibit diverse patterns. We characterize the dynamics of how services are utilized and managed in ten large-scale production datacenters. Using real experiments and simulations, we show that our techniques eliminate data loss and unavailability in many scenarios, while protecting the co-located services and improving batch job execution time.
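The two placement ideas in the abstract (related batch tasks on servers with similar patterns, data replicas on servers with diverse patterns) can be illustrated with a small sketch. This is not the paper's algorithm: the pattern classes, the thresholds, the use of peak historical utilization as the headroom estimate, and every function name below are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import mean


def pattern_class(history, busy_threshold=0.5, swing_threshold=0.3):
    """Label a server from its historical CPU utilization samples (0.0 to 1.0).

    Both thresholds are hypothetical; a real system would derive classes
    from much richer utilization and management history.
    """
    level = "busy" if mean(history) >= busy_threshold else "idle"
    shape = "variable" if max(history) - min(history) >= swing_threshold else "steady"
    return f"{level}-{shape}"


def group_by_pattern(histories):
    """Map each pattern class to the (sorted) servers that exhibit it."""
    groups = defaultdict(list)
    for server in sorted(histories):
        groups[pattern_class(histories[server])].append(server)
    return dict(groups)


def pick_task_servers(histories, cpu_needed, count):
    """Pick `count` servers from ONE pattern class for related batch tasks.

    A server qualifies if its historical peak leaves at least `cpu_needed`
    spare, a crude stand-in for "will likely have enough available resources".
    """
    for _, servers in sorted(group_by_pattern(histories).items()):
        fits = [s for s in servers if 1.0 - max(histories[s]) >= cpu_needed]
        if len(fits) >= count:
            return fits[:count]
    return []  # no single class can host all the tasks


def pick_replica_servers(histories, replicas):
    """Pick servers round-robin ACROSS pattern classes for data replicas."""
    queues = [list(servers) for _, servers in sorted(group_by_pattern(histories).items())]
    chosen = []
    while len(chosen) < replicas and any(queues):
        for queue in queues:
            if queue and len(chosen) < replicas:
                chosen.append(queue.pop(0))
    return chosen


if __name__ == "__main__":
    # Made-up utilization samples for four hypothetical servers.
    histories = {
        "s1": [0.1, 0.2, 0.1, 0.2],
        "s2": [0.2, 0.1, 0.2, 0.1],
        "s3": [0.1, 0.7, 0.1, 0.7],
        "s4": [0.8, 0.9, 0.8, 0.9],
    }
    print(pick_task_servers(histories, cpu_needed=0.5, count=2))
    print(pick_replica_servers(histories, replicas=3))
```

In this toy setup the tasks land together on servers of one class, so they tend to gain or lose capacity at the same time, while the replicas are spread over different classes, so a single utilization or management event is less likely to affect all copies at once.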

    Published In

    OSDI'16: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation
    November 2016
    786 pages
    ISBN:9781931971331

    Sponsors

    • VMware
    • NetApp
    • Google Inc.
    • Microsoft
    • Facebook

    Publisher

    USENIX Association

    United States

    Qualifiers

    • Article

    Cited By

    • (2023) Understanding and Optimizing Workloads for Unified Resource Management in Large Cloud Platforms. Proceedings of the Eighteenth European Conference on Computer Systems, pp. 416-432. DOI: 10.1145/3552326.3587437. Online publication date: 8-May-2023
    • (2022) CoSpot. Proceedings of the 13th Symposium on Cloud Computing, pp. 540-556. DOI: 10.1145/3542929.3563499. Online publication date: 7-Nov-2022
    • (2022) Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and Opportunities. Proceedings of the 2022 International Conference on Management of Data, pp. 2465-2473. DOI: 10.1145/3514221.3522566. Online publication date: 10-Jun-2022
    • (2019) Scheduling Beyond CPUs for HPC. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 97-108. DOI: 10.1145/3307681.3325401. Online publication date: 17-Jun-2019
    • (2019) Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-15. DOI: 10.1145/3302424.3303973. Online publication date: 25-Mar-2019
    • (2019) Resource Deflation. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-17. DOI: 10.1145/3302424.3303945. Online publication date: 25-Mar-2019
    • (2018) Pocket. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 427-444. DOI: 10.5555/3291168.3291200. Online publication date: 8-Oct-2018
    • (2018) SmoothOperator. ACM SIGPLAN Notices 53:2, pp. 535-548. DOI: 10.1145/3296957.3173190. Online publication date: 19-Mar-2018
    • (2018) Application-Agnostic Batch Workload Management in Cloud Environments. Proceedings of the ACM Symposium on Cloud Computing, p. 504. DOI: 10.1145/3267809.3275446. Online publication date: 11-Oct-2018
    • (2018) Kairos. Proceedings of the ACM Symposium on Cloud Computing, pp. 135-148. DOI: 10.1145/3267809.3267838. Online publication date: 11-Oct-2018
