DC-DRF: Adaptive Multi-Resource Sharing at Public Cloud Scale

Authors:

Stavros Volos

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing

Pages 374 - 385

https://doi.org/10.1145/3267809.3267848

Published: 11 October 2018

Abstract

Public cloud datacenters implement a distributed computing environment built for economy at scale, with hundreds of thousands of compute and storage servers and a large population of predominantly small customers, often densely packed onto a single compute server. Several recent contributions have investigated how equitable sharing and differentiated services can be achieved in this multi-resource environment using the Extended Dominant Resource Fairness (EDRF) algorithm. However, we find that EDRF requires prohibitive execution time when employed at datacenter scale because of its iterative nature and polynomial time complexity; its closed-form expression does not alter its asymptotic complexity.

In response, we propose Deadline-Constrained DRF, or DC-DRF, an adaptive approximation of EDRF designed to support centralized multi-resource allocation at datacenter scale in bounded time. The approximation introduces error, which can be reduced using a high-performance implementation that draws on parallelization techniques from the field of High-Performance Computing and on vector arithmetic instructions available in modern server processors. We evaluate DC-DRF at scales that exceed those previously reported by several orders of magnitude, calculating resource allocations for one million predominantly small tenants and one hundred thousand resources in seconds. Our parallel implementation preserves the properties of EDRF up to a small error, and empirical results show that the error introduced by approximation is insignificant for practical purposes.
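The abstract does not spell out either algorithm, so the following is only an illustrative sketch and not the authors' implementation. It assumes the classic progressive-filling view of DRF, in which every active tenant's dominant share rises at the same rate until some resource runs out, and it adds a hypothetical tolerance `eps` that treats a resource as exhausted once its remaining fraction drops to `eps` or below, trading accuracy for fewer rounds. The function name `drf_water_fill`, the `eps` parameter, and the example demands are assumptions made for illustration; tenant weights and the EDRF extensions are omitted.

```python
def drf_water_fill(demands, capacities, eps=0.0):
    """Illustrative progressive-filling DRF (not the paper's code).

    demands[t][r]: amount of resource r tenant t needs per unit of work.
    capacities[r]: total amount of resource r.
    eps: hypothetical tolerance; a resource counts as exhausted when its
         remaining fraction is <= eps (eps=0 gives plain water-filling).
    Assumes every tenant demands a positive amount of at least one resource.
    Returns (dominant share per tenant, number of rounds executed).
    """
    n_t, n_r = len(demands), len(capacities)

    # norm[t][r]: amount of r consumed by t per unit of dominant share.
    norm = []
    for d in demands:
        dominant = max(d[r] / capacities[r] for r in range(n_r))
        norm.append([d[r] / dominant for r in range(n_r)])

    share = [0.0] * n_t
    remaining = [float(c) for c in capacities]
    active = set(range(n_t))
    rounds = 0
    tol = 1e-12  # guards against floating-point residue when eps == 0

    while active:
        rounds += 1
        # Consumption of each resource per unit increase in dominant share.
        load = [sum(norm[t][r] for t in active) for r in range(n_r)]
        # Largest equal increase that no resource can refuse.
        step = min(remaining[r] / load[r] for r in range(n_r) if load[r] > 0)
        for t in active:
            share[t] += step
        for r in range(n_r):
            remaining[r] -= step * load[r]
        # Freeze every tenant that needs a resource now deemed exhausted.
        exhausted = [r for r in range(n_r)
                     if remaining[r] <= max(eps, tol) * capacities[r]]
        active = {t for t in active
                  if not any(norm[t][r] > 0 for r in exhausted)}

    return share, rounds


if __name__ == "__main__":
    # Two tenants sharing 9 CPUs and 18 GB of memory.
    print(drf_water_fill([[1, 4], [3, 1]], [9, 18]))
    # Same tenants on disjoint bottlenecks, exact versus coarse tolerance.
    print(drf_water_fill([[1, 0], [1, 0], [0, 1]], [10, 10]))
    print(drf_water_fill([[1, 0], [1, 0], [0, 1]], [10, 10], eps=0.5))
```

In this sketch the number of rounds is bounded by the number of resources, since each round exhausts at least one of them. A larger `eps` can retire several nearly exhausted resources in the same round, which shortens the loop at the cost of leaving some capacity unallocated.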

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing

October 2018

546 pages

ISBN: 9781450360111

DOI: 10.1145/3267809

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SoCC '18: ACM Symposium on Cloud Computing

October 11 - 13, 2018

Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
