research-article

Scaling a Declarative Cluster Manager Architecture with Query Optimization Techniques

Published: 01 June 2023

    Abstract

    Cluster managers play a crucial role in data centers by distributing workloads among infrastructure resources. Declarative Cluster Management (DCM) is a new cluster management architecture that enables users to express placement policies declaratively using SQL-like queries. This paper presents our experiences in scaling this architecture from moderate-sized enterprise clusters (10^2 to 10^3 nodes) to hyperscale clusters (10^4 nodes) via query optimization techniques. First, we formally specify the syntax and semantics of DCM's declarative language, C-SQL, a SQL variant used to express constraint optimization problems. We showcase how constraints on the desired state of the cluster system can be succinctly represented as C-SQL programs, and how query optimization techniques like incremental view maintenance and predicate pushdown can enhance the execution of C-SQL programs. We evaluate the effectiveness of our optimizations through a case study of building Kubernetes schedulers using C-SQL. Our optimizations yielded an almost 3000× speedup in database latency and reduced the size of optimization problems to as little as 1/300 of the original, without affecting the quality of the scheduling solutions.
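As a rough illustration of the incremental view maintenance idea named in the abstract, the sketch below keeps a per-node CPU total up to date by applying row-level deltas instead of rescanning all pod rows. It is not the paper's implementation and does not use C-SQL or DCM; the names `recompute_load` and `IncrementalLoadView`, and the `(pod, node, cpu)` row layout, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def recompute_load(pods):
    """Baseline: recompute SUM(cpu) GROUP BY node by scanning every row."""
    load = defaultdict(int)
    for _pod, node, cpu in pods:
        load[node] += cpu
    return dict(load)


class IncrementalLoadView:
    """Hypothetical materialized view of SUM(cpu) GROUP BY node.

    Each change to the base table is applied as a delta, so the cost of an
    update does not depend on how many rows the base table holds.
    """

    def __init__(self):
        self._sum = defaultdict(int)
        self._count = defaultdict(int)

    def apply(self, node, cpu, weight):
        # weight is +1 for an inserted row and -1 for a deleted row.
        self._sum[node] += weight * cpu
        self._count[node] += weight
        if self._count[node] == 0:
            # No rows remain for this node, so the group leaves the view.
            del self._sum[node]
            del self._count[node]

    def snapshot(self):
        return dict(self._sum)


pods = [("p1", "n1", 2), ("p2", "n1", 3), ("p3", "n2", 1)]
view = IncrementalLoadView()
for _pod, node, cpu in pods:
    view.apply(node, cpu, +1)

# Deleting pod p3 touches one group only; the view still matches a full rescan.
view.apply("n2", 1, -1)
pods.pop()
print(view.snapshot() == recompute_load(pods))
```

The row count is tracked alongside the sum so that a group disappears only when its last row is deleted, which keeps the view identical to a full recomputation even when some rows contribute zero CPU.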


    Published In

    Proceedings of the VLDB Endowment  Volume 16, Issue 10
    June 2023
    295 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment
