DOI: 10.1145/3342195.3387524
Research article · Open access

Autopilot: workload autoscaling at Google

Published: 17 April 2020

Abstract

In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack (the difference between the limit and the actual resource usage) while minimizing the risk that a task is killed with an out-of-memory (OOM) error or that its performance is degraded by CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to strike this balance. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
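
To make the slack-versus-risk tradeoff concrete, the following is a minimal sketch of a vertical recommender that sets a task's limit just above a high percentile of recent usage. It is a sketch under assumptions of ours, not taken from the paper: the class name SlidingWindowRecommender, the window size, the percentile, and the safety margin are all hypothetical illustrations of the limit-above-recent-usage idea, not Autopilot's actual algorithm or tuning.

    # Hypothetical sketch (not Autopilot's implementation): recommend a
    # per-task limit at a high percentile of recent usage plus headroom,
    # trading slack (limit - usage) against OOM/throttling risk.
    from collections import deque

    class SlidingWindowRecommender:
        def __init__(self, window=2016, percentile=0.98, margin=1.10):
            # window: recent usage samples kept (e.g. 7 days at 5-minute
            # resolution); percentile and margin are assumed knobs that
            # trade slack against the risk of under-provisioning.
            self.samples = deque(maxlen=window)
            self.percentile = percentile
            self.margin = margin

        def observe(self, usage):
            # Record one usage sample (CPU cores, or bytes of RAM).
            self.samples.append(usage)

        def recommended_limit(self):
            # Limit = chosen percentile of the window, plus safety margin.
            if not self.samples:
                return None
            ordered = sorted(self.samples)
            idx = min(int(self.percentile * len(ordered)), len(ordered) - 1)
            return ordered[idx] * self.margin

    # Example: a task that mostly uses ~4 GiB with occasional 5 GiB spikes
    # gets a ~5.5 GiB recommendation, instead of a hand-picked 8 GiB limit.
    rec = SlidingWindowRecommender()
    for usage in [4.0] * 95 + [5.0] * 5:
        rec.observe(usage)
    print(rec.recommended_limit())  # ~5.5

The paper's recommenders are considerably richer (per-resource policies and learned per-job parameters on top of windowed usage statistics), but this limit-slightly-above-recent-usage structure is what lets an autoscaler cut slack roughly in half relative to hand-set limits while keeping OOM risk low.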



Published In

EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems
April 2020
ISBN: 9781450368827
DOI: 10.1145/3342195
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Conference

EuroSys '20: Fifteenth EuroSys Conference 2020
April 27-30, 2020
Heraklion, Greece

Acceptance Rates

EuroSys '20 paper acceptance rate: 43 of 234 submissions (18%)
Overall acceptance rate: 241 of 1,308 submissions (18%)


