DOI: 10.1145/3342195.3387524
Research article · Open access

Autopilot: workload autoscaling at Google

Published: 17 April 2020

Abstract

In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack (the difference between the limit and the actual resource usage) while minimizing the risk that a task is killed with an out-of-memory (OOM) error or that its performance is degraded by CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to strike this balance. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
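
To make the slack-versus-risk tradeoff concrete, the following is a minimal sketch of a vertical recommender that sets a task's limit just above a high percentile of recent usage. It is a sketch under assumptions of ours, not taken from the paper: the class name SlidingWindowRecommender, the window size, the percentile, and the safety margin are all hypothetical illustrations of the limit-above-recent-usage idea, not Autopilot's actual algorithm or tuning.

    # Hypothetical sketch (not Autopilot's implementation): recommend a
    # per-task limit at a high percentile of recent usage plus headroom,
    # trading slack (limit - usage) against OOM/throttling risk.
    from collections import deque

    class SlidingWindowRecommender:
        def __init__(self, window=2016, percentile=0.98, margin=1.10):
            # window: recent usage samples kept (e.g. 7 days at 5-minute
            # resolution); percentile and margin are assumed knobs that
            # trade slack against the risk of under-provisioning.
            self.samples = deque(maxlen=window)
            self.percentile = percentile
            self.margin = margin

        def observe(self, usage):
            # Record one usage sample (CPU cores, or bytes of RAM).
            self.samples.append(usage)

        def recommended_limit(self):
            # Limit = chosen percentile of the window, plus safety margin.
            if not self.samples:
                return None
            ordered = sorted(self.samples)
            idx = min(int(self.percentile * len(ordered)), len(ordered) - 1)
            return ordered[idx] * self.margin

    # Example: a task that mostly uses ~4 GiB with occasional 5 GiB spikes
    # gets a ~5.5 GiB recommendation, instead of a hand-picked 8 GiB limit.
    rec = SlidingWindowRecommender()
    for usage in [4.0] * 95 + [5.0] * 5:
        rec.observe(usage)
    print(rec.recommended_limit())  # ~5.5

The paper's recommenders are considerably richer (per-resource policies and learned per-job parameters on top of windowed usage statistics), but this limit-slightly-above-recent-usage structure is what lets an autoscaler cut slack roughly in half relative to hand-set limits while keeping OOM risk low.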



Published In

EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems
April 2020
ISBN: 9781450368827
DOI: 10.1145/3342195
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Conference

EuroSys '20: Fifteenth EuroSys Conference 2020
April 27-30, 2020
Heraklion, Greece

Acceptance Rates

EuroSys '20 paper acceptance rate: 43 of 234 submissions (18%)
Overall acceptance rate: 241 of 1,308 submissions (18%)


