
DC-DRF: Adaptive Multi-Resource Sharing at Public Cloud Scale

Published: 11 October 2018

Abstract

    Public cloud datacenters implement a distributed computing environment built for economy at scale, with hundreds of thousands of compute and storage servers and a large population of predominantly small customers often densely packed to a compute server. Several recent contributions have investigated how equitable sharing and differentiated services can be achieved in this multi-resource environment, using the Extended Dominant Resource Fairness (EDRF) algorithm. However, we find that EDRF requires prohibitive execution time when employed at datacenter scale due to its iterative nature and polynomial time complexity; its closed-form expression does not alter its asymptotic complexity.
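    The iterative nature mentioned above can be illustrated with a minimal progressive-filling sketch for plain DRF over divisible resources. This is not the paper's EDRF implementation; the function name `drf_progressive_filling`, its signature, and the `eps` tolerance are assumptions made for illustration. Each round raises the dominant share of every still-active tenant equally until some resource saturates, then freezes the tenants that use it, so the number of rounds can grow with the number of resources and each round touches every tenant-resource pair.

    ```python
    def drf_progressive_filling(demands, capacities, eps=1e-12):
        """Illustrative DRF sketch (hypothetical helper, not the paper's code).

        demands[i][r]  : amount of resource r tenant i needs per unit of work
        capacities[r]  : total amount of resource r
        Assumes every tenant demands a positive amount of at least one resource.
        Returns (dominant_shares, rounds).
        """
        n, m = len(demands), len(capacities)

        # Normalise each tenant so that one unit of growth equals one unit
        # of dominant share (its largest capacity fraction becomes 1).
        norm = []
        for d in demands:
            fractions = [d[r] / capacities[r] for r in range(m)]
            dominant = max(fractions)
            norm.append([f / dominant for f in fractions])

        share = [0.0] * n      # dominant share granted so far
        free = [1.0] * m       # remaining fraction of each resource
        active = set(range(n))
        rounds = 0

        while active:
            rounds += 1
            # Capacity fraction consumed per unit of equal share growth.
            rate = [sum(norm[i][r] for i in active) for r in range(m)]
            # Largest equal increase before some resource runs out.
            step = min(free[r] / rate[r] for r in range(m) if rate[r] > 0)
            for i in active:
                share[i] += step
            for r in range(m):
                free[r] -= step * rate[r]
            # Freeze every tenant that uses a saturated resource.
            saturated = {r for r in range(m) if free[r] <= eps}
            active = {i for i in active
                      if not any(norm[i][r] > 0 for r in saturated)}

        return share, rounds


    if __name__ == "__main__":
        # Two tenants sharing 9 CPUs and 18 GB of memory.
        shares, rounds = drf_progressive_filling([[1, 4], [3, 1]], [9, 18])
        print(shares, rounds)
    ```

    With 9 CPUs, 18 GB of memory, and per-task demands of (1 CPU, 4 GB) and (3 CPUs, 1 GB), both tenants end with a dominant share of two thirds after a single round, which matches the two-tenant example in the original DRF paper (Ghodsi et al., NSDI 2011).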
    In response, we propose Deadline-Constrained DRF, or DC-DRF, an adaptive approximation of EDRF designed to support centralized multi-resource allocation at datacenter scale in bounded time. The approximation introduces error which can be reduced using a high-performance implementation, drawing on parallelization techniques from the field of High-Performance Computing and vector arithmetic instructions available in modern server processors. We evaluate DC-DRF at scales that exceed those previously reported by several orders of magnitude, calculating resource allocations for one million predominantly small tenants and one hundred thousand resources, in seconds. Our parallel implementation preserves the properties of EDRF up to a small error, and empirical results show that the error introduced by approximation is insignificant for practical purposes.
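    The trade-off between bounded running time and approximation error can be sketched as follows. The abstract does not describe the DC-DRF mechanism, so this is only a stand-in: a hypothetical `max_rounds` budget plays the role of the deadline, and the function name `bounded_filling` and its inputs are assumptions. Stopping early always yields a feasible allocation, and the error shows up as capacity left unused.

    ```python
    def bounded_filling(norm_demands, max_rounds, eps=1e-12):
        """Progressive filling cut off after max_rounds (illustrative only).

        norm_demands[i][r]: fraction of resource r that tenant i consumes
        per unit of its dominant share (the largest entry per tenant is 1).
        Returns (shares, unused_fraction_per_resource, rounds_used).
        """
        n, m = len(norm_demands), len(norm_demands[0])
        share = [0.0] * n
        free = [1.0] * m
        active = set(range(n))
        rounds = 0

        # Hypothetical stand-in for a deadline: stop when the budget is spent.
        while active and rounds < max_rounds:
            rounds += 1
            rate = [sum(norm_demands[i][r] for i in active) for r in range(m)]
            step = min(free[r] / rate[r] for r in range(m) if rate[r] > 0)
            for i in active:
                share[i] += step
            for r in range(m):
                free[r] -= step * rate[r]
            saturated = {r for r in range(m) if free[r] <= eps}
            active = {i for i in active
                      if not any(norm_demands[i][r] > 0 for r in saturated)}

        return share, free, rounds


    if __name__ == "__main__":
        # Tenants 0 and 1 contend for resource 0; tenant 2 uses only resource 1.
        demands = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
        print(bounded_filling(demands, max_rounds=10))  # runs to completion
        print(bounded_filling(demands, max_rounds=1))   # stops after one round
    ```

    In this toy input the full run needs two rounds and gives tenant 2 all of resource 1, while the one-round run leaves half of resource 1 unallocated. That unused capacity is the kind of error a deadline-bounded approximation has to keep small.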


    Published In

    SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
    October 2018
    546 pages
    ISBN:9781450360111
    DOI:10.1145/3267809
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '18: ACM Symposium on Cloud Computing
    October 11 - 13, 2018
    Carlsbad, CA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 19
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 09 Aug 2024

    Cited By

    • (2023) Network SLO-aware container scheduling in Kubernetes. The Journal of Supercomputing 79:10 (11478-11494). DOI: 10.1007/s11227-023-05122-5. Online publication date: 28-Feb-2023
    • (2022) PECS: A Pareto-efficient and Envy-free Cloud Resource Scheduler. 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC) (147-152). DOI: 10.1109/IPCCC55026.2022.9894320. Online publication date: 11-Nov-2022
    • (2022) Workload-aware storage policies for cloud object storage. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2022.01.026. Online publication date: Feb-2022
    • (2022) DepCon: Achieving Network SLO for High Performance Clouds. Euro-Par 2021: Parallel Processing Workshops (339-351). DOI: 10.1007/978-3-031-06156-1_27. Online publication date: 9-Jun-2022
    • (2021) Stateful DRF: Considering the Past in a Multi-Resource Allocation. IEEE Transactions on Computers 70:7 (1094-1105). DOI: 10.1109/TC.2020.3006007. Online publication date: 1-Jul-2021
    • (2020) Mass: Workload-Aware Storage Policy for OpenStack Swift. Proceedings of the 49th International Conference on Parallel Processing (1-11). DOI: 10.1145/3404397.3404427. Online publication date: 17-Aug-2020
    • (2020) A Survey and Classification of Software-Defined Storage Systems. ACM Computing Surveys 53:3 (1-38). DOI: 10.1145/3385896. Online publication date: 28-May-2020
