1 Introduction
Many organizations have adopted virtualization technologies to increase the cost efficiency of their IT infrastructure. Virtualization allows physical infrastructure resources to be logically divided and shared among different applications. Traditionally, this was achieved via the use of
virtual machines (VMs). A VM enables an application to be packaged along with its dependencies and the
operating system (OS) it relies on, into a single unit. However, since the OS is included in the VM, this results in increased VM sizes as well as slow startup times. To address these issues, an alternative lightweight virtualization technique leveraging Linux containers can be used. Containers allow the application and its dependencies to be packaged into a single deployable unit, while the OS kernel is shared with other containers. As a result, containers can be much smaller in size and can be deployed faster [
14]. Docker [
29] is an open source project that provides an implementation of Linux containers.
To facilitate the automated deployment and management of such containers at scale, organizations rely on the use of container orchestration frameworks. Examples of well-known orchestration frameworks include Kubernetes [
64], Docker Swarm [
30], and Apache Mesos [
6]. Apache Mesos, developed by the University of California at Berkeley, aims at managing compute clusters by abstracting their resources from the machines on which they run. Mesos adopts a two-level scheduling mechanism, where task scheduling is delegated to the underlying client framework (e.g., Hadoop), whereas Mesos itself remains responsible for distributing resource offers across those client frameworks [
90]. In the case of Mesos, a task is the basic scheduling unit; it can refer either to a single command or to a container to be executed by a Mesos agent. Docker Swarm Mode allows multiple hosts running Docker applications to be grouped into a single cluster.
It is integrated with the Docker Engine
Command Line Interface (CLI), therefore allowing it to be easily managed by existing Docker users. It also supports scaling, load-balancing, and rolling updates [
30]. Kubernetes was first released by Google in 2014 and its subsequent developments were made under the
Cloud-Native Computing Foundation (CNCF) umbrella. It is shipped with a rich feature set, including self-healing, horizontal scaling, automated roll-outs and roll-backs, in addition to having an extensible design. Reference [
90] provides a system classification for job scheduling in the aforementioned frameworks. Classification criteria include:
•
Node selection: Whether the scheduler considers all or a subset of cluster nodes while making scheduling decisions.
•
Preemption: Whether the corresponding scheduler supports preemption of low-priority tasks in favor of higher-priority ones.
•
Rescheduling: Whether the scheduler can deal with rescheduling, especially in cases of preemption and failure.
•
Placement constraints: Such constraints allow the scheduler to be fine-tuned to satisfy the needs of workload owners and cluster administrators. It is worth noting that this feature is not applicable in the case of Apache Mesos, since it adopts a two-level scheduling architecture where the application framework implements the placement logic [
90].
•
Scheduling strategies: This refers to the approach used to split the load among the different cluster nodes.
Table
1 provides an updated view of the classification in Reference [
90]. We omit Apache Mesos, since multiple criteria are not applicable to it due to its different architecture. As can be seen from Table
1, despite a few similarities between the schedulers in Kubernetes and Docker Swarm Mode, Kubernetes offers more flexibility in its scheduler's configuration options.
Among the three orchestration frameworks, the focus of this article is on Kubernetes in particular. In fact, Kubernetes keeps witnessing increased adoption rates, as stated in the Cloud Native Survey 2020 [
21]. More specifically, the survey shows that the use of Kubernetes in production environments has increased to 83%, compared to 78% in 2019. The Sysdig 2022 Cloud-Native Security and Usage Report [
103] indicates that 96% of Sysdig’s clients (spanning multiple industries and organizational sizes) use Kubernetes for orchestration. In addition, major cloud providers currently offer managed Kubernetes services, such as Amazon Elastic Kubernetes Service, Google Kubernetes Engine, Azure Kubernetes Service, and Alibaba Cloud Container Service for Kubernetes. Finally, a performance evaluation conducted in Reference [
2] demonstrates that Kubernetes outperforms Docker Swarm and Mesos in complex application scenarios.
This survey focuses specifically on custom scheduling in Kubernetes, which is conducted to meet user requirements that cannot be satisfied by the default Kubernetes scheduler. The topic of scheduling in Kubernetes is of paramount importance, since improper scheduling decisions may contribute to degraded performance for workload owners, which may violate their
Service Level Agreements (SLAs) with Kubernetes service providers. Additionally, non-optimal scheduling can affect the cluster resource utilization, thus resulting in increased costs associated with cluster operations. The importance of this topic has therefore led to an increase in the number of related contributions, both in academic papers and in industry-driven open source projects. In fact, we have been able to identify 65 relevant contributions that were made between 2017 and 2021 and that we included in our survey. Given such a rich set of contributions, it is important to have a study that classifies them by identifying the situations where such custom schedulers serve best and what options Kubernetes offers to implement such customizations. Such a study should also highlight the open issues that may prevent Kubernetes-based custom schedulers from being implemented or evaluated properly. This would be beneficial for multiple stakeholders involved in the Kubernetes community, including Kubernetes Scheduling
Special Interest Group (SIG) [
66], Kubernetes cluster administrators and workload owners, as well as large-scale service providers having Kubernetes offerings.
At the time of writing, we did not find any survey that addresses this specific topic. Related surveys focus instead on container orchestration in general. For example, References [
17,
83,
90,
108] discuss different container orchestration frameworks, including Kubernetes. Such surveys mostly focus on classifying the surveyed frameworks based on their architectures and features. Even though they consider Kubernetes, these works do not dive deeply into the possible approaches that could be used to customize its scheduler. Additionally, there exist other surveys that address the broad topic of scheduling, not necessarily based on Kubernetes. Such surveys may focus on scheduling in the cloud [
10] and its related issues [
98], scheduling in fog computing environments [
45], or big data applications in data center networks [
112]. In contrast, the scope of our survey is specifically targeted towards contributions aiming at the design and implementation of a custom Kubernetes scheduler, in addition to those that present a novel scheduling approach and validate it on top of a Kubernetes cluster. Table
2 positions our survey with regard to the aforementioned related surveys.
The contributions of this survey can be summarized as follows:
•
We provide a study that can be used as a guide containing all the relevant contributions addressing custom scheduling in Kubernetes. This would allow readers looking to dive into this topic to discover the different facets characterizing the subject.
•
We highlight the different objectives that have been addressed in the surveyed custom Kubernetes schedulers and the different approaches that could be taken to achieve those objectives. This would allow an easy identification of the lessons learned from related literature for readers targeting specific objectives.
•
We also highlight custom schedulers that were made open source, either by academics or by major industrial players, therefore facilitating adoption and improvements upon such schedulers.
The remainder of this article is organized as follows: In Section
2, we provide the necessary background on Kubernetes and its scheduler, along with an overview of how the related works in this area have evolved over time. In Section
3, we outline the methodology we followed for classifying the surveyed contributions both in terms of their general characteristics and the specific objectives they address. In Section
4, we detail the different works belonging to each considered objective, while highlighting the main aspects that they have in common as well as their differences. In Section
5, we identify the main aspects that are currently missing from the custom Kubernetes scheduling literature. In Section
6, we conclude our work.
4 Custom Kubernetes Schedulers: Objective-oriented Analysis
In this section, we summarize the different contributions addressing each one of the objectives defined in Section
3.1. If a paper addresses multiple objectives, then we classify it based on the most relevant one. Additionally, we dedicate a subsection (see Section
4.10) to those papers that make specific contributions that do not fit well within the identified objectives. We explain the different approaches taken by the contributions to achieve the listed objectives as well as the corresponding results, if any. Each subsection includes a comparative table of the reviewed works, based on the criteria presented in the classification of Section
3.2.
4.1 Interference and Colocation
The works discussed in this section propose different solutions to prevent the scheduler from co-locating workloads that contend for the same resource type and therefore have a high risk of interfering with each other.
As an example, in Reference [
75], the authors propose that application developers create labels for their applications based on their expected resource usage. Example application labels include high CPU, low CPU, high disk I/O, and low disk I/O. The scheduler is able to take these labels into account by calculating a penalty for each node using the labels of the applications that are already running on it. As shown in their evaluation results based on a real cluster, the proposed scheduler successfully places applications that consume a high amount of a given resource away from each other.
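To make this concrete, the following minimal Go sketch computes such a penalty; the label values (high-cpu, high-disk-io) and the unit penalty are illustrative assumptions rather than the exact scheme of Reference [75].

```go
// A minimal sketch of label-based penalty scoring, assuming hypothetical
// label values such as "high-cpu" and "high-disk-io".
package main

import "fmt"

// penalty counts how many already-running applications on a node share a
// "high usage" label with the pod being scheduled; nodes with a lower
// penalty are preferred.
func penalty(podLabels []string, nodeAppLabels [][]string) int {
	p := 0
	for _, apps := range nodeAppLabels {
		for _, l := range apps {
			for _, pl := range podLabels {
				if l == pl && (l == "high-cpu" || l == "high-disk-io") {
					p++ // co-locating two heavy consumers of the same resource
				}
			}
		}
	}
	return p
}

func main() {
	pod := []string{"high-cpu"}
	nodeA := [][]string{{"high-cpu"}, {"low-disk-io"}} // one heavy CPU app running
	nodeB := [][]string{{"low-cpu"}, {"high-disk-io"}} // no CPU contention
	fmt.Println("node A penalty:", penalty(pod, nodeA)) // 1
	fmt.Println("node B penalty:", penalty(pod, nodeB)) // 0
}
```

Nodes with lower penalties would then be preferred during scoring.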
The work in Reference [
11] addresses the problem of interference among co-located ML jobs. More specifically, they consider distributed ML jobs based on the worker-
parameter server (PS) architecture. The authors propose
Harmony as a generalized scheduling solution based on deep reinforcement learning, in contrast to existing approaches that rely on detailed workload profiles and interference models. Scheduling decisions are made based on the requested number of workers and PSs and their corresponding resource demands, the amount of available resources, as well as the placement matrix of concurrent jobs. A GPU-based testbed has been set up to evaluate the performance of the proposed approach. The obtained results have shown that Harmony was able to reduce job completion times compared to the considered baselines.
Reference [
40] addresses the colocation problem from a network perspective. More specifically, this problem occurs when workloads having a light network footprint suffer from a performance degradation when colocated with those that have heavy traffic generation patterns. To prevent this from happening, the authors propose to tag workloads having a heavy network footprint, so the scheduler uses this tag to place these workloads away from the light ones. This network footprint is learned based on statistics collected from Linux connection tracking utility conntrack [
23]. The authors evaluate the excess flow-processing overhead introduced by the proposed solution and find it to be negligible.
Since coarse metrics such as those collected by Prometheus [
85] or the
Elasticsearch, Fluentd, and Kibana (EFK) tool suite are not representative of the real cluster state, the work in Reference [
109] relies on the use of low-level metrics retrieved at the level of micro-architecture events. These metrics include the
instructions per cycle (IPC) and the read/write traffic on a socket. Such metrics are used to score nodes, thus reducing the interference affecting co-located workloads that share low-level resources such as caches. This, in turn, results in improved application performance compared to the default scheduler.
Similarly, in Reference [
67], the authors use low-level telemetry data retrieved from the CPU micro-architectural components (e.g., Level 3 caches) to influence the scheduler’s behavior. In particular, nodes not having enough available memory bandwidth
are discarded in the filtering process. Then, the feasible nodes are scored based on their available memory bandwidth, their memory latency, as well as their CPU utilization. The authors’ goal was to achieve predictable function performance in
Function as a Service (FaaS) environments. Indeed, the evaluation results obtained from a four-node testbed show that the proposed custom scheduler achieves acceptable function execution times when the number of requests per second made to the functions is less than 7.
The work in Reference [
118] specifically deals with DL workloads that share GPU resources. Since this sharing mechanism may result in decreased workload performance, the authors propose to proactively predict the GPU utilization of each workload before running it. The predicted value is used by a modified version of the scheduler that uses First Fit Decreasing bin-packing. Evaluations conducted in a lab cluster show that this proactive approach results in a lower makespan and a better GPU utilization, compared to the default scheduler and a reactive profiling approach.
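The packing step can be illustrated with a self-contained Go sketch of First Fit Decreasing over predicted GPU utilizations; the utilization values are assumed, and the scheduler of Reference [118] naturally operates on richer workload descriptors.

```go
// A minimal sketch of First Fit Decreasing (FFD) bin-packing over predicted
// per-workload GPU utilizations.
package main

import (
	"fmt"
	"sort"
)

func ffd(predicted []float64, gpuCapacity float64) [][]float64 {
	sort.Sort(sort.Reverse(sort.Float64Slice(predicted))) // decreasing order
	var gpus [][]float64 // each entry is the set of workloads packed on one GPU
	var free []float64   // remaining capacity per GPU
	for _, u := range predicted {
		placed := false
		for i := range gpus { // first fit: the first GPU with enough headroom
			if free[i] >= u {
				gpus[i] = append(gpus[i], u)
				free[i] -= u
				placed = true
				break
			}
		}
		if !placed { // no GPU fits: open a new one
			gpus = append(gpus, []float64{u})
			free = append(free, gpuCapacity-u)
		}
	}
	return gpus
}

func main() {
	// Predicted GPU utilizations (fractions of one GPU) for six DL workloads.
	fmt.Println(ffd([]float64{0.5, 0.7, 0.2, 0.4, 0.1, 0.6}, 1.0))
}
```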
A similar problem is addressed in Reference [
55]. However, in this case, the authors propose to predict the level of interference between the workload currently being scheduled and the ones that are already running in the cluster nodes. This interference level corresponds to the ratio between the execution time of the application when it is co-executed with another application and when it is executed in isolation. The scheduler then finds the pair with the lowest level of interference and selects the corresponding node. The evaluation results show reductions in the average job completion time as well as the makespan.
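At its core, the selection step is a minimum search over predicted interference ratios, as the following hedged Go sketch illustrates; the node names and ratios are assumed values.

```go
// A minimal sketch of the pair-selection logic described above, with assumed
// interference ratios (co-located runtime divided by isolated runtime).
package main

import "fmt"

func main() {
	// interference[node] = predicted ratio for the incoming workload when
	// co-executed with what already runs on that node; 1.0 means no slowdown.
	interference := map[string]float64{"node-1": 1.8, "node-2": 1.1, "node-3": 1.4}
	best, bestRatio := "", 0.0
	for n, r := range interference {
		if best == "" || r < bestRatio {
			best, bestRatio = n, r
		}
	}
	fmt.Printf("schedule on %s (ratio %.1f)\n", best, bestRatio)
}
```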
The authors in Reference [
110] propose a hybrid shared-state scheduling framework that addresses the limitations of centralized and distributed cluster schedulers. More specifically, while centralized schedulers lead to high latencies especially in cases of interference and preemption, distributed schedulers lack full observability of the cluster state and may therefore lead to scheduling conflicts. To mitigate these limitations, the proposed scheduler inherits functionalities from both types of schedulers, in addition to using a shared-state mechanism where copies of the cluster state are present within each application-level scheduler. Additional features of the proposed scheduler include opportunistic scheduling of low priority pods when there are enough resources and correction of scheduling decisions to mitigate co-location interference. The authors have not provided a detailed implementation of their proposed scheduler, but they envision it to be an extension or a replacement of the default scheduler.
Table
3 provides a summary of the aforementioned works based on the criteria listed in Section
3.2.
Discussion: To summarize, we highlight the main approaches that have been used to add interference-awareness to the Kubernetes scheduler:
•
Evaluating the interference level of the current workload with the ones that are currently running on the cluster nodes. This has been achieved in different ways, such as:
—
Labeling/tagging the workloads based on their resource usage (high/low). While this is a simple approach that could work well for well-known application types, using predefined labels could fail to capture some specific application attributes. As a result, it may be better to combine this approach with a clustering algorithm that will categorize applications into different clusters based on their attributes. The resulting cluster identifiers could then be used as labels that the scheduler can leverage while making its decisions.
—
Profiling workloads in an offline manner, storing the resulting metrics, and then using those metrics to predict interference levels. Since predictions occur proactively, this could help save scheduling time, especially for time-sensitive workloads. However, a considerable amount of cluster resources could be wasted while collecting the metrics needed to make predictions.
—
Using a deep reinforcement learning (DRL) agent that observes the set of co-located workloads and decides on a placement that will minimize the interference levels based on an appropriate reward mechanism. In this case, it is important to train the DRL agent using available past experience data (e.g., obtained from logs), instead of relying on an online approach, where there is a risk of making bad decisions, especially during the early stages of the learning process.
•
Using low-level metrics that measure current per-node interference levels as inputs for scoring the nodes. This approach can be useful for discarding or providing low scores to nodes having high interference levels.
•
Correcting scheduling decisions that led to high interference levels. As rescheduling is not envisioned to be part of the main functionalities of the Kubernetes scheduler, such a corrective approach should be implemented in a different component, such as the descheduler [
28].
4.2 Lack of Support for Network QoS
Since guaranteeing bandwidth is important especially for workloads that involve transfers of large amounts of data over the network, a number of works have emerged to provide support for this feature in the Kubernetes scheduler.
For instance, in Reference [
116], the authors propose a “network bandwidth management system” called
NBWguard. Since Kubernetes does not identify the network as a resource (similar to CPU and memory resources), the authors propose the addition of such an extended resource. This would allow users to specify their requests and limits in terms of network bandwidth, which will impact their QoS level. NBWguard leverages the capabilities offered by Linux-based network management tools (e.g.,
tc [
104] and
iptables [
49]) to throttle traffic. Evaluations based on a real cluster have shown that NBWguard effectively takes into account the pods’ QoS when allocating bandwidth. In addition, it ensures that pod traffic does not exceed the specified values.
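While NBWguard's enforcement pipeline is described above only at a high level, the kind of tc invocation such a system builds on can be sketched in Go as follows; the interface name (eth0) and the 100 Mbit/s cap are assumptions, and a per-pod deployment would typically shape the pod's virtual interface instead.

```go
// A hedged sketch of bandwidth throttling with Linux tc invoked from Go.
// Running this requires root privileges on a Linux host.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Attach a token-bucket filter that caps egress on eth0 at 100 Mbit/s.
	cmd := exec.Command("tc", "qdisc", "add", "dev", "eth0", "root",
		"tbf", "rate", "100mbit", "burst", "32kbit", "latency", "400ms")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("tc failed: %v: %s", err, out)
	}
}
```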
A similar problem is addressed in Reference [
73], however with a specific focus on
Remote Direct Memory Access (RDMA) networking. RDMA refers to the mechanism allowing “network adapters to directly transfer data with no CPU involvement,” which enables higher throughput and lower latencies. In this context, the authors propose the following features, which are currently not fully supported in Kubernetes. First, they allow setting requests and limits for RDMA bandwidth in the pod specification. They also propose a scheduler extender that considers the amount of currently used bandwidth. Finally, their proposed approach allows multiple RDMA interfaces to be allocated to a single pod. A subsequent work [
42] shows that the proposed approach successfully meets the bandwidth requirements of the applications. This is achieved at the cost of a slightly increased latency. In addition, the bandwidth limits are also enforced by the scheduler.
The authors in Reference [
92] consider a fog computing environment where latency-sensitive and data-intensive smart city applications are prevalent. The requirements of these applications cannot be effectively met by the default scheduler, since it does not take up-to-date network status information into account. As a result, the authors present a
network-aware scheduler (NAS), where the fog computing infrastructure nodes are labeled based on the round trip times between them. In addition, the available bandwidth is checked against the value requested in the pod definition to filter out unsuitable nodes. The experimental evaluation has shown that NAS was able to reduce network latency by 80% compared to the Kubernetes scheduler. However, it results in an increased scheduling time, since it uses the scheduler extender mechanism, where a call to an external process needs to be performed in each scheduling decision.
The different characteristics of the aforementioned contributions are described in Table
4.
Discussion: As can be noted, a common aspect of the aforementioned approaches is that they add support for specifying requests and limits for network bandwidth in the pod specification. This can be performed by (i) adding network as an extended resource as done in Reference [
116], (ii) using annotations [
42], or (iii) using labels in the pod specification [
92].
To manage lower-level network operations, the aforementioned mechanisms are often combined with the use of existing network management tools (Linux’s
traffic control (tc) [
104] in Reference [
116]) or the implementation of a custom plugin according to the
Container Networking Interface (CNI) (as in References [
42,
73]).
4.3 Topology-awareness
Most of the papers reviewed in this subsection deal with fog/edge computing scenarios, where the cluster nodes are distributed across different locations. As a result, the topology of the nodes should be taken into account to minimize excessive communication costs between them.
In Reference [
53], the authors consider the problem of limited fog node resources, which may be insufficient to host a multi-container pod. As a result, they propose to split such pods into their constituent containers and distribute them across multiple fog nodes, while taking into account the inter-node communication costs. To do so, the authors propose to sort the pod queue by decreasing order of degree of communication between containers. Then, if no feasible node is found, the pod is split such that some nodes are able to host its individual containers. Finally, a topology-aware node scoring is performed, taking into account the distance between the different fog nodes. The authors have proposed to implement these different steps as scheduler plugins; however, no specific implementation was provided.
The authors in Reference [
79] propose to extend the operation of Kubernetes over a
wide area network (WAN), specifically to focus on application deployment at the edge. The authors modify the Kubernetes scheduler to take into account the
autonomous system (AS) path of the
Border Gateway Protocol (BGP), by prioritizing shorter AS paths. The experimental results show that the proposed approach allows the application to obtain shorter access times, which is needed in an edge environment.
In Reference [
34], authors address the need to ensure low-latency replica placements in a fog computing environment, while aiming to minimize the load imbalance by reducing the number of application replicas with a low resource utilization. To do so, the authors propose
Hona as a scheduler for Kubernetes, taking into account the amount of available resources, the inter-node latencies, as well as the traffic sources and their corresponding traffic volumes. This is achieved by leveraging a “random search heuristic” and a heuristic that estimates network latencies using Vivaldi coordinates [
26]. It was shown that these heuristics are able to find acceptable solutions in a short time.
In Reference [
80], the authors consider the problem of increased latencies that occur when users are served from distant fog nodes. In fact, the scheduling mechanism in Kubernetes does not add new replicas to meet increasing demands at certain fog node locations. As a result, the authors propose to tag the nodes with their locations and allow users to specify their requested deployment location in the pod specification, possibly according to different weights. Nodes are then scored such that the ones where the application is most requested are given a higher priority. The evaluation results show that this approach makes it possible to cope with increasing client requests in certain locations, therefore resulting in minimized latencies for the affected users.
The authors in Reference [
32] target industrial automation applications aiming at minimizing the application latency. To achieve that, they formulate the application placement problem in a Kubernetes-based fog computing environment as a cost-minimizing optimization problem, defining the application latency as a cost measure. They propose an approximate solution that takes into account locality constraints, the links between fog nodes, the links between application components, and their associated data demands. It was shown that implementing this algorithm based on native Kubernetes features (such as priority classes and pod affinities) achieves the best results.
The authors in Reference [
43] address the need to support latency-sensitive edge applications using scheduling mechanisms that are aware of the topology of the edge nodes. To this end, they propose to perform periodic delay measurements between the different edge nodes. These measurements are then used to label nodes. Then, when a pod definition specifies a delay constraint, the proposed scheduler checks the node labels and selects the node that can satisfy this delay.
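A minimal Go sketch of this label-based delay filtering is given below; the label key edge.example.com/rtt-ms and the measured values are hypothetical, as Reference [43] does not prescribe a naming scheme.

```go
// A minimal sketch of filtering nodes by a hypothetical delay label.
package main

import (
	"fmt"
	"strconv"
)

// feasible keeps only the nodes whose measured delay label satisfies the
// pod's delay constraint (both in milliseconds).
func feasible(nodes map[string]map[string]string, maxDelayMs int) []string {
	var out []string
	for name, labels := range nodes {
		d, err := strconv.Atoi(labels["edge.example.com/rtt-ms"])
		if err == nil && d <= maxDelayMs {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	nodes := map[string]map[string]string{
		"edge-1": {"edge.example.com/rtt-ms": "12"},
		"edge-2": {"edge.example.com/rtt-ms": "48"},
	}
	fmt.Println(feasible(nodes, 20)) // only edge-1 satisfies a 20 ms bound
}
```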
The authors in Reference [
91] present
ge-kube as an extension to Kubernetes, specifically tailored for geo-distributed deployments. The first contribution of this paper is the use of reinforcement learning to determine the appropriate number of replicas of an application. They also propose to greedily place application instances such that the deployment time and the amount of allocated resources are minimized. To do so, their proposal considers the network delays between nodes in addition to the amount of available resources in those nodes. Experiments conducted using a geo-distributed Redis [
88] cluster have shown that ge-kube results in a 3× increase in the number of operations per second, compared to the default Kubernetes scheduler.
Apart from fog/edge computing, the problem of topology-awareness is also relevant in other scenarios, such as the orchestration of VNFs as well as ML/DL clusters, as detailed next.
For example, the authors in Reference [
93] propose an architecture to emulate virtual network scenarios based on the
Network Function Virtualization (NFV) standard by the
European Telecommunications Standards Institute (ETSI). The proposed architecture,
Megalos, contains a scheduling component that makes VNF scheduling decisions and informs the Kubernetes control plane node about the node that should be selected to host the VNF. Its objective is to reduce unnecessary network traffic exchange among the nodes. This is achieved by organizing the VNFs into an undirected graph and dividing it into balanced partitions. The scheduler can also leverage information extracted from the VNF configuration files to fine-tune its decisions. Testbed evaluations have shown that Megalos can achieve reduced startup times for the different considered network scenarios.
Optimus [
84] is among the early efforts towards proposing a scheduler for Kubernetes-based DL clusters. Its first goal is to improve the cluster’s resource efficiency by leveraging dynamic resource allocation instead of relying on predefined resource requests. This dynamic allocation is determined based on the predicted progress of the training job and the cluster load. In addition, Optimus aims to improve the training time for DL jobs by placing tasks in a way that minimizes the data transfers between parameter servers and workers during training. Evaluation results have shown that Optimus significantly improves the job completion time and the makespan, in addition to achieving a higher CPU utilization.
Authors in Reference [
39] propose
Gatekeeper for AI (GAI) as a scheduler specifically tailored to improve the performance of ML training jobs. To facilitate this task, it organizes the cluster nodes into an in-memory resource tree based on the network conditions between them. It also leverages an aggregated priority vector taking into account multiple characteristics of the ML job (such as its runtime, its type, and the number of preemptions it has already experienced). Experiments, based mostly on a simulator, show that GAI achieves a 28% increase in the scheduling throughput and a 21% increase in the training convergence speed, compared to the default Kubernetes scheduler.
The
HiveD scheduler [
119] developed by Microsoft stems from the observation that the current resource reservation in multi-tenant GPU clusters is based on the number of requested GPUs (i.e., the GPU quota) and not on their topology. Since this may result in a performance issue for multi-GPU jobs, the authors propose a new abstraction called
cell that captures the different levels of affinity for a group of GPUs. This allows each tenant to have virtual clusters based on the cell structure. Extensive simulation results have been provided. In particular, it has been shown that HiveD reduces the queuing delays when the cluster’s load is high.
Table
5 summarizes the most relevant characteristics of the contributions dealing with topology-awareness.
Discussion: The aforementioned works show a set of commonly used approaches to deal with topology awareness, as listed below:
•
Labeling nodes based on delay measurements to other nodes in the cluster [
43,
80]. Such measurements are taken periodically. As a result, special attention has to be paid to the periodicity with which such measurements are made to ensure a good balance between obtaining up-to-date measurement values and reducing the overhead that may result from the measurement process.
•
Labeling nodes based on their locations and allowing users to request a particular location in their pod specification [
80]. This implies the existence of an efficient mechanism for identifying the different fog node locations and the users’ awareness of those locations.
•
The use of heuristics (see References [
32,
34,
91]), since it may not be feasible to find the optimal solution to the placement optimization problem in a reasonable time. Such heuristics may lead to sub-optimal placements, but they do so in an acceptable timeframe, which is beneficial in fog/edge environments.
•
The use of graph/tree structures to model the node connections [
39], the connections between the different application components [
93], or both [
32]. Such a modeling can help identify which application components need to be placed on the same node or on nearby nodes to avoid excessive communication costs.
•
Leveraging the Kubernetes affinity concept can be useful for implementing topology-awareness, either by placing pods with high communication patterns together (i.e., pod affinity as done in Reference [
32]) or by using node affinities to direct a pod towards a specific node [
80]. Since this approach is native to Kubernetes, no additional components need to be developed or modified.
4.4 No Support for Co-scheduling
Co-scheduling (or gang scheduling) refers to the ability to schedule a group of pods at once, as opposed to the default Kubernetes behavior that schedules pods one-by-one.
Among the works targeting this objective, we cite Reference [
35], where the authors focus on serverless workloads that rely on the cloud provider to handle the server-side operations (e.g., the number of servers and the amount of resources allocated to them). For such workloads, the pod-by-pod scheduling approach used in Kubernetes is not efficient, since increased delays occur due to repeated traversals of the set of nodes in each pod scheduling attempt, even though the pods to be scheduled have the same characteristics. Consequently, the authors propose to simultaneously schedule a group of pods that share the same image and resource requirements. Preliminary simulation results demonstrate that this scheduling strategy results in reduced pod start-up times.
As for non-academic contributions, Palantir Technologies has implemented a scheduler extender called
k8s-spark-scheduler [
51] to support gang scheduling for Spark applications hosted on Kubernetes. As introduced in Section
3.1, Spark uses two types of pods on Kubernetes, which are
driver pods and
executor pods. k8s-spark-scheduler first ensures that the cluster nodes have enough resources to host the executor pods, then it proceeds to scheduling the driver pods.
In addition, there is a co-scheduling plugin [
24] that is maintained by the Kubernetes scheduling
Special Interest Group (SIG) and is currently in beta status.
This plugin is not part of the default Kubernetes installation. However, it can be built, configured, and activated separately. It defines the
PodGroup concept, which can be used in the pod specification to indicate the group that the pod belongs to. This plugin modifies the sorting behavior by comparing priorities of pod groups. When such priorities are equal, the pod groups are compared based on their creation time.
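The resulting queue ordering reduces to a comparison function, sketched below in Go with simplified stand-in types for the plugin's actual PodGroup objects.

```go
// A minimal sketch of the queue-sorting rule described above: compare pod
// group priorities first, then fall back to group creation time.
package main

import (
	"fmt"
	"time"
)

type podGroup struct {
	priority  int32
	createdAt time.Time
}

// less returns true if group a should be dequeued before group b.
func less(a, b podGroup) bool {
	if a.priority != b.priority {
		return a.priority > b.priority // higher priority first
	}
	return a.createdAt.Before(b.createdAt) // older group first on ties
}

func main() {
	now := time.Now()
	a := podGroup{priority: 10, createdAt: now}
	b := podGroup{priority: 10, createdAt: now.Add(-time.Minute)}
	fmt.Println(less(a, b)) // false: b was created earlier, so b goes first
}
```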
A comparison of the aforementioned works is provided in Table
6.
Discussion: Supporting co-scheduling requires decisions to be made at the level of pod groups instead of individual pods. This may require modifications that go beyond the scheduler itself, such as the definition of a custom resource like the PodGroup in the co-scheduling plugin [
24]. In addition, since this plugin modifies the sorting behavior according to the scheduling framework, it is important to ensure that the use of this plugin has no impact on other types of workloads that do not use it. This is due to the fact that only one queue sort plugin is allowed to be enabled at a given time [
96].
4.5 No Support for Batch Scheduling
This section highlights the works that add batch scheduling capabilities to the Kubernetes scheduler. For example, the authors in Reference [
12] consider a multi-tenant cloud environment, where it is important to ensure fairness among the different tenants. Since kube-batch [
59] only considers the
“Dominant Resource Fairness (DRF)” [
41] as a fairness policy, the authors propose a scheduler named
KubeSphere, where two other policies are added: demand-aware and demand-DRF-aware. While the former prioritizes users with higher resource demands, the latter takes into account both the user’s demands and their dominant resource share to avoid resource starvation. KubeSphere adopts a multi-tenant task queue to ensure fairness. Evaluations using a real Kubernetes cluster show that KubeSphere allows users to experience shorter waiting times, on average.
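For context, the baseline DRF rule that kube-batch implements can be sketched in a few lines of Go: a tenant's dominant share is its largest per-resource share, and the next allocation goes to the tenant with the smallest dominant share. The tenant names and shares below are assumed values.

```go
// A minimal sketch of Dominant Resource Fairness (DRF) ordering.
package main

import (
	"fmt"
	"sort"
)

type tenant struct {
	name     string
	cpuShare float64 // fraction of the cluster's CPU already allocated
	memShare float64 // fraction of the cluster's memory already allocated
}

// dominantShare is the larger of a tenant's per-resource shares.
func dominantShare(t tenant) float64 {
	if t.cpuShare > t.memShare {
		return t.cpuShare
	}
	return t.memShare
}

func main() {
	tenants := []tenant{
		{"alice", 0.30, 0.10},
		{"bob", 0.05, 0.20},
	}
	sort.Slice(tenants, func(i, j int) bool {
		return dominantShare(tenants[i]) < dominantShare(tenants[j])
	})
	fmt.Println("next allocation goes to:", tenants[0].name) // bob (0.20 < 0.30)
}
```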
The authors in Reference [
113] propose two schedulers to place heterogeneous tasks in a computer science lab cluster. The first one is a batch scheduler to be used in busy hours. This scheduler relies on a queue sorting mechanism based on the tasks’ resource requirements (CPU, memory, IO) to avoid resource fragmentation. The second proposed scheduler is tailored for GPU-intensive tasks and it is mainly responsible for dynamically adjusting the priorities of such tasks based on their waiting and estimated running times. Evaluations indicate that the batch scheduler is able to improve the resource utilization in the cluster, while the dynamic scheduler reduces the task waiting time compared to the default scheduler.
As for non-academic contributions related to batch scheduling, we can cite
kube-batch [
59] and
Apache Yunikorn [
9]. In fact, kube-batch is a project developed by Kubernetes scheduling SIG with the goal of supporting AI/ML, big data and
High Performance Computing (HPC) applications with batch scheduling capabilities. Such capabilities include support for gang scheduling and different job priority classes. kube-batch is used in other well-known Kubernetes-based tools such as Kubeflow [
62] and Volcano [
111]. Apache Yunikorn is an open source scheduler that supports batch scheduling, not only on top of Kubernetes, but also with YARN [
5]. It can support a mixture of workloads, including stateless batch workloads and stateful services. This is achieved using a rich feature set, including hierarchical resource queues, multi-policy job queuing, in addition to ensuring fairness. Evaluations based on Kubemark [
63] have shown that for clusters having thousands of nodes, Yunikorn is able to improve the scheduling throughput and the fairness between queues compared to the default scheduler.
Discussion: As can be noted from the aforementioned contributions, a common feature needed to support batch scheduling is the customization of the queue sorting behavior. This can be done by prioritizing users’ workloads based on their resource requirements [
12,
113] or by organizing the queue into a hierarchical structure and using multiple policies for managing it [
9]. As can be seen in Table
7, the works reviewed in this section do not leverage the scheduling framework. However, if that was the case, then special attention has to be paid to ensure that the modified queuing behavior is suitable for all envisioned workload types through a single queue sort plugin. It is also worth noting that providing support for batch scheduling is a complex task that requires significant modifications to the scheduling behavior (e.g., queuing, fairness, co-scheduling), which may justify the absence of implementation approaches based on the scheduler extender mechanism or the scheduling framework, as shown in Table
7.
4.6 No Support for Data Locality Awareness
Supporting awareness of data locality is important for workloads that require frequent and fast access to nodes storing the data they need. The following contributions propose custom schedulers that support this feature:
Skippy has been presented in Reference [
87] to support scheduling requirements in edge computing environments. These requirements impose the need to consider the proximity between nodes and the bandwidth available between them, in addition to awareness of the locality of the data that edge applications need to access. To this end, Skippy leverages a graph representing the network bandwidth as well as a “storage index” that matches specific data items with the nodes that store the actual data. Furthermore, Skippy is shipped with multiple edge-friendly priority functions, whose weights are determined automatically to address the requested operational objectives. Skippy’s evaluation is based on a simulator fed with real profiling data. The results have shown that the quality of the placements made by Skippy exceeds that obtained by the default scheduler, at the cost of a reduced scheduling throughput caused by the added priority functions.
As for open source software contributions in this area, we list
Stork [
102] and
StorageOS [
101]. Stork is an open source project that supports storage-aware scheduling. It is well-suited for stateful applications such as databases, queues, and key-value stores. It relies on the scheduler extender mechanism, which allows Stork to filter out nodes where the Stork storage driver is not running or is in an error state. Then, it scores nodes based on the performance associated with accessing persistent storage from that node. StorageOS uses the scheduler extender mechanism to add data-awareness to the Kubernetes scheduler. More specifically, it ensures that pods get scheduled on the nodes where their data resides. To this end, nodes are given different scores based on their volume types (master, replica, none, unhealthy). Based on these scores, faster I/O operations can be achieved.
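The role-based scoring used by StorageOS can be illustrated with the following hedged Go sketch; the roles follow the description above, while the numeric scores are assumptions chosen for illustration.

```go
// A minimal sketch of volume-role-based node scoring.
package main

import "fmt"

func volumeScore(role string) int {
	switch role {
	case "master": // the node holds the primary copy of the volume
		return 100
	case "replica": // a synchronized replica is locally available
		return 50
	case "none": // data must be fetched over the network
		return 10
	default: // "unhealthy" or unknown: avoid the node
		return 0
	}
}

func main() {
	for _, r := range []string{"master", "replica", "none", "unhealthy"} {
		fmt.Printf("%-10s -> %d\n", r, volumeScore(r))
	}
}
```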
The differences and commonalities between the aforementioned contributions are outlined in Table
8.
Discussion: The previous works show that awareness of data locality can be achieved (i) via filtering, if it is a hard requirement, or (ii) via scoring, if it is a soft requirement. As pointed out in Section
3.1, Kubernetes does offer a set of scheduler plugins that deal with data volumes. For instance, the filtering plugin
VolumeZone checks whether the specified zone requirements can be met. If such plugins are not enough to satisfy the user’s expectations in terms of awareness of data locality, then it is possible to alter the filtering/scoring behavior of the Kubernetes scheduler, while implementing any additional customizations that may be needed to interface with the storage drivers used by the volume provider.
4.7 Lack of Real Load Awareness
The focus of this section is to highlight contributions proposing custom schedulers that take the actual load of the cluster nodes (i.e., their actual usage level) into account while making scheduling decisions.
Example contributions include References [
19,
82], where the authors consider an edge computing environment and emphasize the need to support scheduling based on up-to-date node resource usage. Real-time monitoring information, such as load, temperature, and liveness, is collected from the devices and used to calculate the node scores. The results have shown faster scheduling times compared to the default scheduler, while the node temperatures were similar, which implies that the proposed scheduler does not compromise the health of the nodes.
The authors in Reference [
71] propose a modified
particle swarm optimization (PSO) algorithm to determine the scheduling decisions. The algorithm takes into account the nodes’ utilization rates in terms of CPU and memory, the workloads’ usage characteristics, as well as any affinities that may be defined towards certain nodes. A comparison of the proposed algorithm with the default Kubernetes scheduler revealed a 20% improvement in the nodes’ resource usage.
To avoid wasting I/O resources, the authors in Reference [
70] specifically focus on real I/O load awareness. In more detail, they use Prometheus [
85] to collect metrics related to the I/O load and the CPU usage. Those metrics are used in a scheduler extender mechanism where the post score behavior is customized. In fact, two scoring functions are introduced, which are
BalancedDiskIOPriority (BDI) and
BalancedCpuDiskIOPriority (BCDI). The evaluation results indicate a more balanced disk I/O utilization throughout the cluster, whereas the use of BCDI results in balancing both the disk I/O and the CPU usage.
Besides the aforementioned academic contributions, we also mention the load-aware scheduler plugins that were developed by IBM and referred to as
Trimaran [
107]. These plugins were developed with the aim of increasing the cluster’s resource utilization by making the scheduler aware of the mismatch that may exist between the current resource allocation and the current resource utilization. Trimaran consists of two scoring plugins:
TargetLoadPacking and
LoadVariationRiskBalancing. TargetLoadPacking scores nodes based on their actual resource usage while maintaining a predefined resource usage level across all nodes. LoadVariationRiskBalancing scores nodes based on the mean and standard deviation of their resource usage. By default, Trimaran uses Kubernetes MetricsServer [
77] as a metric provider, but other providers such as Prometheus [
85] or SignalFx [
99] could also be used.
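To illustrate the second plugin's idea, the following Go sketch derives a node score from the mean and standard deviation of its utilization; this is one plausible formulation, and Trimaran's actual formula may differ.

```go
// A hedged sketch of mean/variance-based node scoring in the spirit of
// LoadVariationRiskBalancing.
package main

import (
	"fmt"
	"math"
)

// score favors nodes whose expected utilization (mean plus one standard
// deviation, capped at 100%) is low, i.e., nodes with low and stable load.
func score(meanUtil, stddevUtil float64) int64 {
	risk := math.Min(meanUtil+stddevUtil, 1.0)
	return int64(math.Round((1.0 - risk) * 100))
}

func main() {
	fmt.Println(score(0.40, 0.05)) // steadily loaded node: 55
	fmt.Println(score(0.40, 0.35)) // bursty node with the same mean: 25
}
```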
Intel also proposes a
telemetry-aware scheduler (TAS) [
105] to support scheduling based on up-to-date telemetry information. The scheduler leverages the extender mechanism and takes into account features such as the node’s power usage, the amount of free RAM, and the CPU temperature. TAS can act at the filtering, prioritization (i.e., scoring), and descheduling levels and can be used to offer resilience and self-healing for NFV scenarios [
16].
Table
9 compares the aforementioned works based on our classification criteria.
Discussion: To achieve real load awareness, (near) real-time resource usage metrics need to be available via external systems that are not natively shipped with Kubernetes. In fact, some authors develop their own metric collection mechanisms. For example, References [
19,
82] employ the lightweight
Constrained Application Protocol (CoAP) to retrieve runtime data from worker nodes at the edge. In contrast, other contributions rely on the functionalities of existing metric servers, such as Prometheus in Reference [
70] or metrics-server in Reference [
107]. It is also worth noting that load awareness can combine multiple resource dimensions such as CPU, memory, and disk I/O, depending on the users’ objectives and the type of bottleneck resources in the cluster.
4.8 GPU Sharing
GAIA [
100] is among the first schedulers proposed for GPU sharing. It organizes the GPU cluster into a tree structure based on the communication costs between the different GPUs in addition to their current allocation status (partially or fully allocated). The scheduler uses this tree to determine the best placement for applications requesting a fraction of a GPU, an entire GPU, or more than one GPU. In particular, fractional sharing is made possible by using the device plugin mechanism in Kubernetes to divide a GPU into a set of virtual GPUs. The experiments performed to evaluate GAIA were conducted on Tencent’s container cloud, and they show a 10% increase in the resource utilization of the GPU cluster in addition to reduced training times compared to kube-scheduler.
The work in Reference [
47] uses the scheduler extender mechanism to discard nodes that do not have a GPU with enough graphic memory as requested by the workload. To this end, they define two extended resources that can be handled similarly to the CPU and memory. These resources are the amount of available graphic memory
RESOURCE-MEM and the number of GPU cards on a node
RESOURCE-COUNT. In addition, they use the NVIDIA Device Plugin [
81] to advertise the GPU cards to the kubelet. Since this was a conceptual proposal, the paper provides neither implementation details nor evaluation results.
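To illustrate the proposal, a hypothetical container requesting both extended resources could be constructed with client-go types as follows; the names example.com/gpu-mem and example.com/gpu-count are stand-ins for the paper's RESOURCE-MEM and RESOURCE-COUNT.

```go
// A hedged sketch of requesting hypothetical GPU extended resources,
// using client-go types.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	limits := v1.ResourceList{
		// Extended resources must be requested in integer amounts.
		"example.com/gpu-mem":   resource.MustParse("4Gi"), // graphic memory
		"example.com/gpu-count": resource.MustParse("1"),   // GPU cards
	}
	container := v1.Container{
		Name:  "trainer",
		Image: "example/dl-job:latest",
		Resources: v1.ResourceRequirements{
			Limits:   limits,
			Requests: limits, // for extended resources, requests must equal limits
		},
	}
	fmt.Printf("%+v\n", container.Resources.Limits)
}
```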
In Reference [
117], the authors present
KubeShare as a framework for GPU sharing in Kubernetes. More specifically, they defined a custom kind of resource called
SharePod, allowing a custom shared device called
virtual GPU (vGPU) to be attached to it. In addition, they developed two custom controllers called
KubeShare-Sched and
KubeShare-DevMgr. The former is responsible for mapping the vGPUs to containers based on the current resource status, the container’s requirements, and affinity constraints. The latter is in charge of managing the vGPU pool. Experiments conducted on a cloud-hosted cluster have shown a significant improvement in the GPU utilization while adding little overhead (less than 10%) due to the performed customizations.
A different perspective is taken in Reference [
33], where the authors propose a custom scheduler for applications that can be executed in both CPUs and GPUs. The main idea is that it allows a pod to be assigned to a CPU node instead of waiting for a GPU node to become available. The assignment decisions are determined based on the node status (idle/running), the expected completion time of the application on a specific node, the pod priority and its category, and the node type. Different experiments have been conducted to compare the proposed scheduler to the default Kubernetes scheduler, and it was shown that the application’s running time was greatly improved (up to 64%).
The authors in Reference [
54] consider an edge computing environment and address the need for efficient use of the GPU resources attached to an edge server. To do so, they create an extended resource to represent the GPU in the same way as a CPU. This allows the scheduler to filter nodes based on this resource. In addition, they include the GPU performance in the node score, more specifically in the NodeAffinity priority function. The experimental results show that the proposed GPU sharing mechanism makes it possible to increase the number of pods that can be placed onto a GPU-based edge server without impacting pod performance. Table
10 summarizes the different contributions supporting GPU sharing in Kubernetes schedulers.
Discussion: Supporting the GPU sharing functionality is not the responsibility of the Kubernetes scheduler on its own. In fact, it requires leveraging different extension mechanisms offered by Kubernetes, including:
•
Using extended resources to represent the GPU in a similar way to the CPU and memory, i.e., allowing requests and allocations of integer amounts of this resource.
•
Using the device plugin mechanism to advertise virtual GPU resources.
•
Using custom resources to define a custom kind of API object supporting GPU sharing.
With such extensions in place, the scheduler can satisfy the user-specified GPU requirements by filtering or scoring the nodes appropriately.
4.9 Environmental Impact (O3)
There exist a few contributions that aim to alter the scheduling behavior such that it results in reduced negative impacts on the environment. These are described next.
The authors of the
KEIDS paper [
52] consider Kubernetes clusters consisting of both edge and cloud nodes, which could have renewable energy sources, e.g., wind or solar. They formulate the scheduling problem as a multi-objective optimization problem, where the objectives are to minimize energy consumption and to reduce interference. To this end, they take into account the carbon footprint of different pod-node mappings, the associated energy consumption, the types of containers used in pods (e.g., CPU- or network-intensive), in addition to constraints related to the job deadline, its required resources, and number of replicas. The simulation results demonstrate that KEIDS effectively reduces the number of active nodes, therefore reducing the overall energy consumption by 14%, in addition to reducing interference and carbon footprint levels by 47% and 31%, respectively.
The proposal in Reference [
89] starts from the observation that there is a tradeoff between the need to reduce a cluster’s energy consumption and the use of high-performance hardware nodes. As a result, they propose a custom scheduler, called
HEATS. It takes as input a user-defined performance-energy tradeoff for the submitted workload. This value indicates the amount of performance reduction the user is willing to accept in order to contribute to energy savings. In addition to this tradeoff value, the scheduler takes into account the current usage in the cluster and the predicted energy consumption of the workload as well as its predicted runtime performance. The evaluation results have shown that the proposed HEATS scheduler outperformed kube-scheduler in terms of energy savings.
In Reference [
106], the authors emphasize that to achieve energy efficiency in data centers, it is necessary to not only consider hardware models of the physical infrastructure, but also software models that describe the workload’s behavior. Such software models can be learned from historical and online data and can be used to predict the amount of resources that will be used by a submitted workload and for what duration. The authors have conducted experiments on a real Kubernetes cluster and have obtained reductions of 10%–20% in power consumption compared to the kube-scheduler.
The need to reduce power consumption can be also a part of a scheduling strategy that considers multiple other criteria at the same time, as proposed in Reference [
76]. The considered criteria include the utilization rate of the CPU, memory and disk on the nodes, the nodes’ power consumption, the number of running containers, and the time needed for transmitting the container image. The overall goal is to devise a multi-criteria algorithm that selects the node that effectively balances the considered criteria. The evaluation results indicate that this approach contributes to reducing the power consumption, in addition to reductions in the makespan and the average container waiting times.
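Such a weighted multi-criteria score can be sketched in Go as follows; the criteria mirror the description above, but the weights and normalization constants are assumptions that would require per-cluster tuning.

```go
// A hedged sketch of a weighted multi-criteria node score.
package main

import "fmt"

type nodeState struct {
	cpuUtil, memUtil, diskUtil float64 // utilization rates in [0, 1]
	powerWatts                 float64 // current power draw
	runningContainers          float64 // normalized container count in [0, 1]
	imagePullSeconds           float64 // estimated image transfer time
}

// score combines the criteria so that lightly loaded, low-power nodes that
// can start the container quickly rank highest; lower raw values are better,
// so each term is inverted.
func score(n nodeState) float64 {
	return 0.25*(1-n.cpuUtil) +
		0.15*(1-n.memUtil) +
		0.10*(1-n.diskUtil) +
		0.25*(1-n.powerWatts/500) + // assume a 500 W per-node ceiling
		0.10*(1-n.runningContainers) +
		0.15*(1-n.imagePullSeconds/60) // assume a 60 s worst-case pull
}

func main() {
	fmt.Printf("%.3f\n", score(nodeState{0.2, 0.3, 0.1, 150, 0.2, 5}))
}
```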
A comparison of custom schedulers addressing O3 can be found in Table
11.
Discussion: To summarize, we highlight the approaches that can be used by the Kubernetes scheduler to reduce carbon emissions and to minimize the amounts of energy consumed in a Kubernetes cluster. These approaches target the following aspects:
•
Energy sources [
52]: When the cluster contains nodes with renewable energy sources, the scheduler can favor those, since they lead to less carbon emissions.
•
Hardware type [
89]: When the cluster is composed of hardware nodes of varying energy consumption patterns and various performance levels, the scheduler can favor less powerful nodes, which consume less energy, as long as the user explicitly tolerates the performance loss that may occur.
•
Effective infrastructure and workload modeling [
106]: This will allow the scheduler to anticipate the behavior of the cluster in response to the scheduling decisions and their effect on the total energy consumption in particular. When such an approach is used, it is important to ensure that the used models are able to adapt to unseen workload types, their arrival patterns, as well as the dynamics of the cluster (addition of new nodes or node types, node failure).
•
Including the current node power consumption in the scoring criteria [
76]: In this case, the power consumption criterion should be given an appropriate weight (compared to the other criteria) to highlight its importance compared to other, potentially conflicting objectives. In addition, this approach requires the use of a metric collection system that is able to retrieve up-to-date node power consumption information that can be used at scheduling time.
4.10 Specific Contributions
In this section, we describe a number of works that propose custom Kubernetes schedulers in line with the high-level objectives outlined in Figure
3. However, they target specific problems and solution approaches that do not perfectly fit within the categories provided in O1.1 to O3.1 (see Section
3.1).
A scheduler, named
SpeCon, is presented in Reference [
74] with a specific focus on DL training jobs. It addresses the inefficiency characterizing the resource allocation for such jobs at different levels of training progress. For instance, those that are close to convergence do not need a lot of resources allocated to them. As a result, the jobs are classified into three categories based on their training progress. These categories are
progressing,
watching, and
converged. The idea is to migrate the models whose training progress is slow to release resources for the ones that are growing fast. It was shown that the SpeCon scheduler contributes to reducing the job completion time compared to the default scheduler. A similar approach is proposed in Reference [
38].
In Reference [
44], the authors propose to predict the expected runtime of an application on both CPU and GPU nodes. Predictions are made based on different application features (e.g., number of instructions, number of float operations) and the available resources on the nodes. The main goal is to assign the workload to the node that will result in a faster execution time. In addition, the predicted runtime of an application is used to prevent the scheduler from prioritizing workloads with fast runtime. The implementation and evaluation of the proposed scheduling approach are not included and are instead mentioned as future work.
In Reference [
20], the authors present
Commodore as a dynamic cluster autoscaler. Though not directly a custom scheduler, Commodore emerged from the need to address the failure events emitted by the scheduler when no feasible node is found for a scale-up request. Unlike Cluster Autoscaler [
65] where the size of the node pool must be provided in advance, Commodore adopts a dynamic node pool size in addition to advanced auto-scaling features that leverage collected resource usage scores. The evaluations have shown that Commodore successfully responds to increasing scale-up requests, thus leading to reduced application response times.
Similar to the ImageLocality scoring function (see Section 2.1.2), the authors in Reference [37] propose a dependency-aware scheduler, where the presence of a container's dependencies on a given node is taken into account to reduce pod startup times in an edge computing environment. They propose an image-match policy (similar to ImageLocality) and a layer-match policy that favors nodes already holding some layers of the requested container image. Evaluations based on trace-driven simulations and a real cluster have shown startup-time reductions compared to the default scheduler, especially with the layer-match policy.
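A minimal sketch of the layer-match idea follows: a node is scored by the fraction of the requested image's layer digests it already caches. The function and digest names are illustrative, not taken from Reference [37]:

```go
package main

import "fmt"

// layerMatchScore returns the fraction of the requested image's layer
// digests already cached on a node, so nodes holding more layers rank
// higher and have less data to pull at startup.
func layerMatchScore(required []string, cached map[string]bool) float64 {
	if len(required) == 0 {
		return 0
	}
	hits := 0
	for _, digest := range required {
		if cached[digest] {
			hits++
		}
	}
	return float64(hits) / float64(len(required))
}

func main() {
	required := []string{"sha256:aaa", "sha256:bbb", "sha256:ccc"}
	node := map[string]bool{"sha256:aaa": true, "sha256:ccc": true}
	fmt.Printf("layer-match score: %.2f\n", layerMatchScore(required, node)) // 0.67
}
```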
The authors in Reference [25] propose a custom scheduler that addresses two levels of fairness issues that may arise in a cloud computing environment. The first level stems from priority-based scheduling, which results in lower-priority requests being preempted even while higher-priority requests enjoy a QoS above their targets. The second level concerns the lack of fairness when preempting requests of the same priority, which drives the need to reduce the QoS variability among such requests. To address these issues, their proposed QoS-driven scheduler modifies the preemption and sorting logic of the default scheduler, such that instances whose Service Level Indicators (SLIs) exceed their Service Level Objectives (SLOs) get preempted and the released resources are allocated to instances whose SLIs are below their SLOs. Simulations and cluster-based experiments have shown that, under moderate contention levels, the proposed scheduler improves the QoS of low-priority instances without degrading the performance of the other instances.
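The victim-selection rule can be sketched as follows, under the assumption that per-instance SLIs are available at preemption time; field names and values are illustrative:

```go
package main

import "fmt"

// instance pairs a measured Service Level Indicator with its Objective.
type instance struct {
	name string
	sli  float64 // measured indicator, e.g., fraction of time served
	slo  float64 // target objective
}

// pickVictims returns the instances whose SLI exceeds their SLO, i.e.,
// those that can be preempted without immediately violating their target.
func pickVictims(all []instance) []instance {
	var victims []instance
	for _, in := range all {
		if in.sli > in.slo {
			victims = append(victims, in)
		}
	}
	return victims
}

func main() {
	pool := []instance{
		{"a", 0.999, 0.99}, // above target: eligible victim
		{"b", 0.950, 0.99}, // below target: should receive freed resources
	}
	for _, v := range pickVictims(pool) {
		fmt.Println("eligible victim:", v.name)
	}
}
```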
The authors in Reference [3] model the pod assignment problem using stable matching theory. In this model, pods set their preferences over the cluster nodes based on the nodes' current resource usage, while nodes set their preferences over pods based on the pods' resource demands. Simulation results show that the response times of services deployed using the proposed scheduler are reduced compared to the Kubernetes scheduler.
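For intuition, the snippet below runs a pod-proposing Gale-Shapley matching over hard-coded preference lists; in the actual work, these preferences would be derived from node resource usage and pod demands:

```go
package main

import "fmt"

func main() {
	// podPrefs[p] lists node indices from most to least preferred.
	podPrefs := [][]int{{0, 1}, {0, 1}}
	// nodeRank[n][p] is node n's rank of pod p (lower is preferred).
	nodeRank := [][]int{{0, 1}, {1, 0}}

	matchOfNode := []int{-1, -1}       // node -> matched pod (-1 = free)
	next := make([]int, len(podPrefs)) // next node each pod will propose to
	free := []int{0, 1}                // pods still unmatched

	for len(free) > 0 {
		p := free[0]
		free = free[1:]
		n := podPrefs[p][next[p]]
		next[p]++
		switch cur := matchOfNode[n]; {
		case cur == -1:
			matchOfNode[n] = p // node was free: tentatively accept
		case nodeRank[n][p] < nodeRank[n][cur]:
			matchOfNode[n] = p       // node prefers the new pod
			free = append(free, cur) // displaced pod proposes again
		default:
			free = append(free, p) // rejected: try the next node
		}
	}
	for n, p := range matchOfNode {
		fmt.Printf("node %d <- pod %d\n", n, p)
	}
}
```

The deferred-acceptance loop guarantees a stable outcome: no pod and node would both prefer each other over their assigned partners.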
The authors in Reference [114] propose to combine an ant colony optimization (ACO) approach with a particle swarm optimization (PSO) approach, with the objective of minimizing the costs associated with cluster resource usage. More specifically, the proposed approach considers the price of using a CPU or memory unit, the node load, as well as the amount of CPU and memory requested by the pod. To validate the proposed approach, the authors conducted a set of simulations that showed a reduction in the incurred usage costs and the node load compared to the default scheduler.
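The sketch below only illustrates the kind of per-placement cost such a search might minimize; the actual ACO/PSO machinery of Reference [114] is omitted, and all prices and weights are assumptions:

```go
package main

import "fmt"

// placementCost combines resource pricing with the node's current load;
// prices, load, and the load weight are all illustrative assumptions.
func placementCost(cpuReq, memReq, cpuPrice, memPrice, nodeLoad, wLoad float64) float64 {
	return cpuReq*cpuPrice + memReq*memPrice + wLoad*nodeLoad
}

func main() {
	// Compare placing a pod requesting 2 CPU units and 4 memory units on a
	// cheap-but-loaded node versus a pricier, mostly idle one.
	fmt.Printf("busy node: %.2f\n", placementCost(2, 4, 0.5, 0.1, 0.9, 2.0)) // 3.20
	fmt.Printf("idle node: %.2f\n", placementCost(2, 4, 0.8, 0.2, 0.1, 2.0)) // 2.60
}
```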
In Reference [18], the authors propose a distributed scheduler, called agentified scheduler, that operates on a single cluster. Each cluster node runs an “agent” that executes the scheduling logic, and after a negotiation step, the pod is assigned to the winning node. Results obtained from a small-scale testbed show that scheduling the first pod of a replica set takes longer than with a centralized scheduler, while scheduling subsequent replicas takes less time.
kube-safe-scheduler [60] is a repository managed by IBM that contains multiple proposals for customizing the Kubernetes scheduler based on the scheduler extender mechanism. The first scheduler achieves safe overloading of nodes based on up-to-date resource information. The second, called Pigeon, solves an optimization problem with a pre-defined objective function to make scheduling decisions. The congestion scheduler takes node congestion into account, while KubeRL is inspired by reward-based Reinforcement Learning (RL) approaches and scores nodes based on their runtime performance for different workload types.
Table 12 lists the aforementioned works along with the different criteria used to compare them.
4.11 Multi-cluster Setups
This subsection presents a set of works where the proposed custom scheduler is not limited to the boundaries of a single cluster, but is instead targeted towards multi-cluster scenarios.
An example is the RLSK scheduler proposed in Reference [46], where reinforcement learning is used to find the most suitable cluster for a given job, with the goal of achieving a balanced resource allocation within and among clusters. When the clusters are too loaded to receive the job, no action is taken and the job remains pending until it is attempted again. Simulation results indicate that RLSK can achieve improvements in load balancing and resource utilization, at the cost of a slight increase in makespan.
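Leaving the learning machinery aside, a toy value-based selector conveys the shape of RLSK's decision: pick the feasible cluster with the highest learned value, or leave the job pending when none fits. All numbers below are illustrative:

```go
package main

import "fmt"

func main() {
	// q[c] is an assumed learned value of dispatching this job to cluster c;
	// RLSK would update such values from observed balancing rewards.
	q := []float64{0.42, 0.61, 0.55}
	fits := []bool{true, true, false} // cluster 2 is too loaded for the job

	best := -1
	for c := range q {
		if fits[c] && (best == -1 || q[c] > q[best]) {
			best = c
		}
	}
	if best == -1 {
		fmt.Println("no feasible cluster: job stays pending and is retried later")
	} else {
		fmt.Printf("dispatch job to cluster %d\n", best)
	}
}
```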
The authors in Reference [68] propose a scheduler for federated Kubernetes clusters, with the goals of facilitating the mobility of services between different clusters and expanding beyond the resources of a single cluster. The proposed scheduler builds upon the two-stage scheduling process in Kubernetes by applying filtering and scoring to the set of clusters taking part in the federation.
A low-carbon Kubernetes scheduler is proposed in Reference [50]. The scheduler ranks data center (DC) locations based on their carbon intensities and air temperatures. Empirical results have shown that the proposed scheduler correctly identifies the most suitable target DC, thereby reducing carbon emissions.
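A simple ranking over the two signals might look like the sketch below; the weighting of temperature against carbon intensity, and all values, are our own assumptions rather than details from Reference [50]:

```go
package main

import (
	"fmt"
	"sort"
)

// dc pairs the two ranking signals; values below are illustrative.
type dc struct {
	name            string
	carbonIntensity float64 // gCO2/kWh of the local grid
	airTempC        float64 // cooler air implies cheaper cooling
}

// penalty folds both signals into one number; wTemp is an assumed weight.
func penalty(d dc, wTemp float64) float64 {
	return d.carbonIntensity + wTemp*d.airTempC
}

func main() {
	dcs := []dc{
		{"eu-north", 40, 8},
		{"us-east", 350, 22},
	}
	sort.Slice(dcs, func(i, j int) bool {
		return penalty(dcs[i], 2.0) < penalty(dcs[j], 2.0) // lower ranks first
	})
	fmt.Println("preferred DC:", dcs[0].name)
}
```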
Admiralty [1], previously known as multicluster-scheduler, is a “system of Kubernetes controllers that intelligently schedules workloads across clusters.” Instead of relying on low-accuracy aggregate data to filter nodes belonging to target clusters, Admiralty delegates this task to the target clusters themselves. After performing the filtering step, these clusters return a set of “virtual” nodes to the scheduler in the source cluster, which in turn scores them and decides the pod placement accordingly. Admiralty adopts the scheduler plugins as its extension mechanism.
Table 13 shows the common as well as the distinguishing aspects of the surveyed multi-cluster schedulers.
Discussion: As can be noted from the previous works, the proposals made to support scheduling across multiple clusters inherit the concepts of filtering and scoring from the default, single-cluster Kubernetes scheduler. However, in most of these works [46, 50, 68], filtering and scoring are performed at the cluster level instead of the node level. In addition, these works envision the filtering and scoring steps being performed at a global level, where the scheduler has a full view of the individual clusters’ information. In this case, it is important to devise suitable mechanisms for aggregating the current state information of the clusters and synchronizing this information with the scheduler, so that it can make decisions based on accurate data. In contrast, in Admiralty [1], node-level filtering and scoring are performed: the target cluster is responsible for filtering nodes, whereas the source cluster is responsible for scoring the filtered nodes.
6 Conclusion
Since its release in 2014, Kubernetes has steadily gained popularity among organizations running containerized workloads and has become a major player in the container orchestration landscape. Despite its rich feature set, its default scheduler was not able to cope with the requirements driven by emerging business needs and use cases. As a result, the number of contributions proposing a custom Kubernetes scheduler has increased recently. In this survey, we identified the relevant contributions that have been made in this area, while highlighting how the main drivers for custom scheduling in Kubernetes have evolved over the years. We also presented a methodology for classifying the reviewed works based on criteria such as their high-level objectives, their target environments and workloads, the specific operation(s) of the scheduling process that they have customized, as well as their implementation and evaluation approaches. In addition, we provided a detailed description of the reviewed custom scheduling contributions based on their specific objectives, while analyzing the main trends that have been observed per objective.
Overall, our survey shows that, depending on the complexity of the envisioned objective, it may not be sufficient to customize only the scheduler. In fact, custom resources may need to be developed to further support the intended scheduling behavior. This may also require more fine-grained, low-level telemetry data to support predictive scheduling capabilities. It is also worth noting that Kubernetes in general, and its scheduler in particular, undergo continual improvements from the community. This means that problems currently addressed by custom scheduling approaches may be handled by default in future versions of the scheduler. It also means that it is important to closely follow recent advancements made to the scheduler, to ensure that any customizations remain in line with current best practices.