History-based harvesting of spare cycles and storage in large-scale datacenters

Authors:

Giovanni Matteo Fumarola,

Marcus Fontoura,

Ricardo Bianchini

OSDI'16: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation

Pages 755 - 770

Published: 02 November 2016

Abstract

An effective way to increase utilization and reduce costs in datacenters is to co-locate their latency-critical services and batch workloads. In this paper, we describe systems that harvest spare compute cycles and storage space for co-location purposes. The main challenge is minimizing the performance impact on the services, while accounting for their utilization and management patterns. To overcome this challenge, we propose techniques for giving the services priority over the resources, and leveraging historical information about them. Based on this information, we schedule related batch tasks on servers that exhibit similar patterns and will likely have enough available resources for the tasks' durations, and place data replicas at servers that exhibit diverse patterns. We characterize the dynamics of how services are utilized and managed in ten large-scale production datacenters. Using real experiments and simulations, we show that our techniques eliminate data loss and unavailability in many scenarios, while protecting the co-located services and improving batch job execution time.
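
The two placement ideas in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the algorithms, so the pattern-class labels, the function names `servers_with_headroom` and `pick_replica_servers`, and the peak-based headroom rule are all assumptions made for illustration. The first function keeps only servers whose recorded utilization history suggests enough spare capacity for a task; the second spreads replicas across servers with different utilization-pattern classes.

```python
from collections import defaultdict


def servers_with_headroom(history, demand, capacity=1.0):
    """Return servers whose historical peak utilization leaves room for `demand`.

    `history` maps a server name to a list of past utilization samples in [0, 1].
    Using the historical peak is an assumed, conservative stand-in for a real predictor.
    """
    return sorted(
        server for server, samples in history.items()
        if max(samples) + demand <= capacity
    )


def pick_replica_servers(server_class, n_replicas=3):
    """Choose replica locations from as many different pattern classes as possible.

    `server_class` maps a server name to a hypothetical pattern label
    (for example "periodic" or "constant").
    """
    by_class = defaultdict(list)
    for server, cls in sorted(server_class.items()):
        by_class[cls].append(server)

    classes = sorted(by_class)
    chosen = []
    # Round-robin over classes: one server per class before reusing any class.
    while len(chosen) < n_replicas and any(by_class[c] for c in classes):
        for cls in classes:
            if by_class[cls] and len(chosen) < n_replicas:
                chosen.append(by_class[cls].pop(0))
    return chosen


if __name__ == "__main__":
    history = {"s1": [0.2, 0.5], "s2": [0.7, 0.9], "s3": [0.1, 0.3]}
    print(servers_with_headroom(history, demand=0.4))

    classes = {"s1": "periodic", "s2": "periodic", "s3": "constant", "s4": "unpredictable"}
    print(pick_replica_servers(classes, n_replicas=3))
```

The round-robin in `pick_replica_servers` falls back to reusing a class only when there are fewer classes than replicas, which matches the stated goal of placing replicas at servers that exhibit diverse patterns.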

Published In

OSDI'16: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation

November 2016

786 pages

ISBN: 9781931971331

Program Chairs:

Kimberly Keeton (Hewlett Packard Labs),

Timothy Roscoe (ETH Zurich)

Sponsors

VMware
NetApp
Google Inc.
Microsoft
Facebook

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

USENIX Association

United States
