1 Introduction
Many organizations have adopted virtualization technologies to increase the cost efficiency of their IT infrastructure. Virtualization allows physical infrastructure resources to be logically divided and shared among different applications. Traditionally, this was achieved via the use of
virtual machines (VMs). A VM enables an application to be packaged along with its dependencies and the
operating system (OS) it relies on, into a single unit. However, since the OS is included in the VM, this results in increased VM sizes as well as slow startup times. To address these issues, an alternative lightweight virtualization technique leveraging Linux containers can be used. Containers allow the application and its dependencies to be packaged into a single deployable unit, while the OS kernel is shared with other containers. As a result, containers can be much smaller in size and can be deployed faster [
14]. Docker [
29] is an open source project that provides an implementation of Linux containers.
To facilitate the automated deployment and management of such containers at scale, organizations rely on the use of container orchestration frameworks. Examples of well-known orchestration frameworks include Kubernetes [
64], Docker Swarm [
30], and Apache Mesos [
6]. Apache Mesos, developed by the University of California at Berkeley, aims at managing compute clusters by abstracting their resources from the machines on which they run. Mesos adopts a two-level scheduling mechanism, where task scheduling is delegated to the underlying client framework (e.g., Hadoop), whereas Mesos itself remains responsible for distributing resource offers across those client frameworks [
90]. In the case of Mesos, a task is the basic scheduling unit; it can refer either to a single command or to a container to be executed by a Mesos agent. Docker Swarm Mode allows multiple hosts running Docker applications to be grouped into a single cluster.
It is integrated with the Docker Engine
Command Line Interface (CLI), therefore allowing it to be easily managed by existing Docker users. It also supports scaling, load-balancing, and rolling updates [
30]. Kubernetes was first released by Google in 2014 and its subsequent developments were made under the
Cloud-Native Computing Foundation (CNCF) umbrella. It is shipped with a rich feature set, including self-healing, horizontal scaling, automated roll-outs and roll-backs, in addition to having an extensible design. Reference [
90] provides a system classification for job scheduling in the aforementioned frameworks. Classification criteria include:
•
Node selection: Whether the scheduler considers all or a subset of cluster nodes while making scheduling decisions.
•
Preemption: Whether the corresponding scheduler supports preemption of low-priority tasks in favor of higher-priority ones.
•
Rescheduling: Whether the scheduler can deal with rescheduling, especially in cases of preemption and failure.
•
Placement constraints: Such constraints allow the scheduler to be fine-tuned to satisfy the needs of workload owners and cluster administrators. It is worth noting that this feature is not applicable in the case of Apache Mesos, since it adopts a two-level scheduling architecture where the application framework implements the placement logic [
90].
•
Scheduling strategies: This refers to the approach used to split the load among the different cluster nodes.
Table
1 provides an updated view of the classification in Reference [
90]. We omit Apache Mesos, since multiple criteria are not applicable to it due to its different architecture. As can be seen from Table
1, despite a few similarities between the schedulers in Kubernetes and Docker Swarm Mode, Kubernetes offers more flexibility in its scheduler's configuration options.
Among the three orchestration frameworks, the focus of this article is on Kubernetes in particular. In fact, Kubernetes keeps witnessing increased adoption rates, as stated in the Cloud Native Survey 2020 [
21]. More specifically, the survey shows that the use of Kubernetes in production environments has increased to 83%, compared to 78% in 2019. The Sysdig 2022 Cloud-Native Security and Usage Report [
103] indicates that 96% of Sysdig’s clients (spanning multiple industries and organizational sizes) use Kubernetes for orchestration. In addition, major cloud providers currently offer managed Kubernetes services, such as Amazon Elastic Kubernetes Service, Google Kubernetes Engine, Azure Kubernetes Service, and Alibaba Cloud Container Service for Kubernetes. Finally, a performance evaluation conducted in Reference [
2] demonstrates that Kubernetes outperforms Docker Swarm and Mesos in complex application scenarios.
This survey focuses specifically on custom scheduling in Kubernetes, which is conducted to meet user requirements that cannot be satisfied by the default Kubernetes scheduler. The topic of scheduling in Kubernetes is of paramount importance, since improper scheduling decisions may contribute to degraded performance for workload owners, which may violate their
Service Level Agreements (SLAs) with Kubernetes service providers. Additionally, non-optimal scheduling can affect the cluster resource utilization, thus resulting in increased costs associated with cluster operations. The importance of this topic has therefore led to an increase in the number of related contributions, both in academic papers and in industry-driven open source projects. In fact, we have been able to identify 65 relevant contributions that were made between 2017 and 2021 and that we included in our survey. Given such a rich set of contributions, it is important to have a study that classifies them by identifying the situations where such custom schedulers serve best and what options Kubernetes offers to implement such customizations. Such a study should also highlight the open issues that may prevent Kubernetes-based custom schedulers from being implemented or evaluated properly. This would be beneficial for multiple stakeholders involved in the Kubernetes community, including Kubernetes Scheduling
Special Interest Group (SIG) [
66], Kubernetes cluster administrators and workload owners, as well as large-scale service providers having Kubernetes offerings.
At the time of writing, we did not find any survey that addresses this specific topic. Related surveys focus instead on container orchestration in general. For example, References [
17,
83,
90,
108] discuss different container orchestration frameworks, including Kubernetes. Such surveys mostly focus on classifying the surveyed frameworks based on their architectures and features. Even though they consider Kubernetes, these works do not dive deeply into the possible approaches that could be used to customize its scheduler. Additionally, there exist other surveys that address the broad topic of scheduling, not necessarily based on Kubernetes. Such surveys may focus on scheduling in the cloud [
10] and its related issues [
98], scheduling in fog computing environments [
45], or big data applications in data center networks [
112]. In contrast, the scope of our survey is specifically targeted towards contributions aiming at the design and implementation of a custom Kubernetes scheduler, in addition to those that present a novel scheduling approach and validate it on top of a Kubernetes cluster. Table
2 positions our survey with regard to the aforementioned related surveys.
The contributions of this survey can be summarized as follows:
•
We provide a study that can be used as a guide containing all the relevant contributions addressing custom scheduling in Kubernetes. This would allow readers looking to dive into this topic to discover the different facets characterizing the subject.
•
We highlight the different objectives that have been addressed in the surveyed custom Kubernetes schedulers and the different approaches that could be taken to achieve those objectives. This would allow an easy identification of the lessons learned from related literature for readers targeting specific objectives.
•
We also highlight custom schedulers that were made open source, either by academics or by major industrial players, therefore facilitating adoption and improvements upon such schedulers.
The remainder of this article is organized as follows: In Section
2, we provide the necessary background on Kubernetes and its scheduler, along with an overview of how the related works in this area have evolved over time. In Section
3, we outline the methodology we followed for classifying the surveyed contributions both in terms of their general characteristics and the specific objectives they address. In Section
4, we detail the different works belonging to each considered objective, while highlighting the main aspects that they have in common as well as their differences. In Section
5, we identify the main aspects that are currently missing from the custom Kubernetes scheduling literature. In Section
6, we conclude our work.
4 Custom Kubernetes Schedulers: Objective-oriented Analysis
In this section, we summarize the different contributions addressing each one of the objectives defined in Section
3.1. If a paper addresses multiple objectives, then we classify it based on the most relevant one. Additionally, we dedicate a subsection (see Section
4.10) to those papers that make specific contributions that do not fit well within the identified objectives. We explain the different approaches taken by the contributions to achieve the listed objectives as well as the corresponding results, if any. Each subsection includes a comparative table of the reviewed works, based on the criteria presented in the classification of Section
3.2.
4.1 Interference and Colocation
The works discussed in this section propose different solutions to prevent the scheduler from co-locating workloads that contend for the same resource type and therefore have a high risk of interfering with each other.
As an example, in Reference [
75], the authors propose that application developers create labels for their applications based on their expected resource usage. Example application labels include high CPU, low CPU, high disk I/O, and low disk I/O. The scheduler is able to take these labels into account by calculating a penalty for each node using the labels of the applications that are already running on it. As shown in their evaluation results based on a real cluster, the proposed scheduler successfully places applications that consume a high amount of a given resource away from each other.
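To make this concrete, the following minimal Go sketch computes such a penalty; the label values (high-cpu, high-disk-io) and the unit penalty are illustrative assumptions rather than the exact scheme of Reference [75].

```go
// A minimal sketch of label-based penalty scoring, assuming hypothetical
// label values such as "high-cpu" and "high-disk-io".
package main

import "fmt"

// penalty counts how many already-running applications on a node share a
// "high usage" label with the pod being scheduled; nodes with a lower
// penalty are preferred.
func penalty(podLabels []string, nodeAppLabels [][]string) int {
	p := 0
	for _, apps := range nodeAppLabels {
		for _, l := range apps {
			for _, pl := range podLabels {
				if l == pl && (l == "high-cpu" || l == "high-disk-io") {
					p++ // co-locating two heavy consumers of the same resource
				}
			}
		}
	}
	return p
}

func main() {
	pod := []string{"high-cpu"}
	nodeA := [][]string{{"high-cpu"}, {"low-disk-io"}} // one heavy CPU app running
	nodeB := [][]string{{"low-cpu"}, {"high-disk-io"}} // no CPU contention
	fmt.Println("node A penalty:", penalty(pod, nodeA)) // 1
	fmt.Println("node B penalty:", penalty(pod, nodeB)) // 0
}
```

Nodes with lower penalties would then be preferred during scoring.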
The work in Reference [
11] addresses the problem of interference among co-located ML jobs. More specifically, they consider distributed ML jobs based on the worker-
parameter server (PS) architecture. The authors propose
Harmony as a generalized scheduling solution based on deep reinforcement learning, in contrast to existing approaches that rely on detailed workload profiles and interference models. Scheduling decisions are made based on the requested number of workers and PSs and their corresponding resource demands, the amount of available resources, as well as the placement matrix of concurrent jobs. A GPU-based testbed has been set up to evaluate the performance of the proposed approach. The obtained results have shown that Harmony was able to reduce job completion times compared to the considered baselines.
Reference [
40] addresses the colocation problem from a network perspective. More specifically, this problem occurs when workloads having a light network footprint suffer from a performance degradation when colocated with those that have heavy traffic generation patterns. To prevent this from happening, the authors propose to tag workloads having a heavy network footprint, so the scheduler uses this tag to place these workloads away from the light ones. This network footprint is learned based on statistics collected from Linux connection tracking utility conntrack [
23]. The authors evaluate the excess flow-processing overhead introduced by the proposed solution and find it to be negligible.
Since coarse metrics such as those collected by Prometheus [
85] or the
Elasticsearch, Fluentd, and Kibana (EFK) tool suite are not representative of the real cluster state, the work in Reference [
109] relies on the use of low-level metrics retrieved at the level of micro-architecture events. These metrics include the
instructions per cycle (IPC) and the read/write traffic on a socket. Such metrics are used to score nodes, thus reducing the interference affecting co-located workloads that share low-level resources such as caches. This, in turn, results in improved application performance compared to the default scheduler.
Similarly, in Reference [
67], the authors use low-level telemetry data retrieved from the CPU micro-architectural components (e.g., Level 3 caches) to influence the scheduler’s behavior. In particular, nodes not having enough available memory bandwidth
are discarded in the filtering process. Then, the feasible nodes are scored based on their available memory bandwidth, their memory latency, as well as their CPU utilization. The authors’ goal was to achieve predictable function performance in
Function as a Service (FaaS) environments. Indeed, the evaluation results obtained from a four-node testbed show that the proposed custom scheduler achieves acceptable function execution times when the number of requests per second made to the functions is less than 7.
The work in Reference [
118] specifically deals with DL workloads that share GPU resources. Since this sharing mechanism may result in decreased workload performance, the authors propose to proactively predict the GPU utilization of each workload before running it. The predicted value is used by a modified version of the scheduler that uses First Fit Decreasing bin-packing. Evaluations conducted in a lab cluster show that this proactive approach results in a lower makespan and a better GPU utilization, compared to the default scheduler and a reactive profiling approach.
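The packing step can be illustrated with a self-contained Go sketch of First Fit Decreasing over predicted GPU utilizations; the utilization values are assumed, and the scheduler of Reference [118] naturally operates on richer workload descriptors.

```go
// A minimal sketch of First Fit Decreasing (FFD) bin-packing over predicted
// per-workload GPU utilizations.
package main

import (
	"fmt"
	"sort"
)

func ffd(predicted []float64, gpuCapacity float64) [][]float64 {
	sort.Sort(sort.Reverse(sort.Float64Slice(predicted))) // decreasing order
	var gpus [][]float64 // each entry is the set of workloads packed on one GPU
	var free []float64   // remaining capacity per GPU
	for _, u := range predicted {
		placed := false
		for i := range gpus { // first fit: the first GPU with enough headroom
			if free[i] >= u {
				gpus[i] = append(gpus[i], u)
				free[i] -= u
				placed = true
				break
			}
		}
		if !placed { // no GPU fits: open a new one
			gpus = append(gpus, []float64{u})
			free = append(free, gpuCapacity-u)
		}
	}
	return gpus
}

func main() {
	// Predicted GPU utilizations (fractions of one GPU) for six DL workloads.
	fmt.Println(ffd([]float64{0.5, 0.7, 0.2, 0.4, 0.1, 0.6}, 1.0))
}
```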
A similar problem is addressed in Reference [
55]. However, in this case, the authors propose to predict the level of interference between the workload currently being scheduled and the ones that are already running in the cluster nodes. This interference level corresponds to the ratio between the execution time of the application when it is co-executed with another application and when it is executed in isolation. The scheduler then finds the pair with the lowest level of interference and selects the corresponding node. The evaluation results show reductions in the average job completion time as well as the makespan.
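At its core, the selection step is a minimum search over predicted interference ratios, as the following hedged Go sketch illustrates; the node names and ratios are assumed values.

```go
// A minimal sketch of the pair-selection logic described above, with assumed
// interference ratios (co-located runtime divided by isolated runtime).
package main

import "fmt"

func main() {
	// interference[node] = predicted ratio for the incoming workload when
	// co-executed with what already runs on that node; 1.0 means no slowdown.
	interference := map[string]float64{"node-1": 1.8, "node-2": 1.1, "node-3": 1.4}
	best, bestRatio := "", 0.0
	for n, r := range interference {
		if best == "" || r < bestRatio {
			best, bestRatio = n, r
		}
	}
	fmt.Printf("schedule on %s (ratio %.1f)\n", best, bestRatio)
}
```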
The authors in Reference [
110] propose a hybrid shared-state scheduling framework that addresses the limitations of centralized and distributed cluster schedulers. More specifically, while centralized schedulers lead to high latencies especially in cases of interference and preemption, distributed schedulers lack full observability of the cluster state and may therefore lead to scheduling conflicts. To mitigate these limitations, the proposed scheduler inherits functionalities from both types of schedulers, in addition to using a shared-state mechanism where copies of the cluster state are present within each application-level scheduler. Additional features of the proposed scheduler include opportunistic scheduling of low priority pods when there are enough resources and correction of scheduling decisions to mitigate co-location interference. The authors have not provided a detailed implementation of their proposed scheduler, but they envision it to be an extension or a replacement of the default scheduler.
Table
3 provides a summary of the aforementioned works based on the criteria listed in Section
3.2.
Discussion: To summarize, we highlight the main approaches that have been used to add interference-awareness to the Kubernetes scheduler:
•
Evaluating the interference level of the current workload with the ones that are currently running on the cluster nodes. This has been achieved in different ways, such as:
—
Labeling/tagging the workloads based on their resource usage (high/low). While this is a simple approach that could work well for well-known application types, using predefined labels could fail to capture some specific application attributes. As a result, it may be better to combine this approach with a clustering algorithm that will categorize applications into different clusters based on their attributes. The resulting cluster identifiers could then be used as labels that the scheduler can leverage while making its decisions.
—
Profiling workloads in an offline manner, storing the resulting metrics, and then using those metrics to predict interference levels. Since predictions occur proactively, this could help save scheduling time, especially for time-sensitive workloads. However, a considerable amount of cluster resources could be wasted while collecting the metrics needed to make predictions.
—
Using a deep reinforcement learning (DRL) agent that observes the set of co-located workloads and decides on a placement that will minimize the interference levels based on an appropriate reward mechanism. In this case, it is important to train the DRL agent using available past experience data (e.g., obtained from logs), instead of relying on an online approach, where there is a risk of making bad decisions, especially during the early stages of the learning process.
•
Using low-level metrics that measure current per-node interference levels as inputs for scoring the nodes. This approach can be useful for discarding or providing low scores to nodes having high interference levels.
•
Correcting scheduling decisions that led to high interference levels. As rescheduling is not envisioned to be part of the main functionalities of the Kubernetes scheduler, such a corrective approach should be implemented in a different component, such as the descheduler [
28].
4.2 Lack of Support for Network QoS
Since guaranteeing bandwidth is important especially for workloads that involve transfers of large amounts of data over the network, a number of works have emerged to provide support for this feature in the Kubernetes scheduler.
For instance, in Reference [
116], the authors propose a “network bandwidth management system” called
NBWguard. Since Kubernetes does not identify the network as a resource (similar to CPU and memory resources), the authors propose the addition of such an extended resource. This would allow users to specify their requests and limits in terms of network bandwidth, which will impact their QoS level. NBWguard leverages the capabilities offered by Linux-based network management tools (e.g.,
tc [
104] and
iptables [
49]) to throttle traffic. Evaluations based on a real cluster have shown that NBWguard effectively takes into account the pods’ QoS when allocating bandwidth. In addition, it ensures that pod traffic does not exceed the specified values.
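While NBWguard's enforcement pipeline is described above only at a high level, the kind of tc invocation such a system builds on can be sketched in Go as follows; the interface name (eth0) and the 100 Mbit/s cap are assumptions, and a per-pod deployment would typically shape the pod's virtual interface instead.

```go
// A hedged sketch of bandwidth throttling with Linux tc invoked from Go.
// Running this requires root privileges on a Linux host.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Attach a token-bucket filter that caps egress on eth0 at 100 Mbit/s.
	cmd := exec.Command("tc", "qdisc", "add", "dev", "eth0", "root",
		"tbf", "rate", "100mbit", "burst", "32kbit", "latency", "400ms")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("tc failed: %v: %s", err, out)
	}
}
```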
A similar problem is addressed in Reference [
73], however with a specific focus on
Remote Direct Memory Access (RDMA) networking. RDMA refers to the mechanism allowing “network adapters to directly transfer data with no CPU involvement,” which enables higher throughput and lower latencies. In this context, the authors propose the following features, which are currently not fully supported in Kubernetes. First, they allow setting requests and limits for RDMA bandwidth in the pod specification. They also propose a scheduler extender that considers the amount of currently used bandwidth. Finally, their proposed approach allows multiple RDMA interfaces to be allocated to a single pod. A subsequent work [
42] shows that the proposed approach successfully meets the bandwidth requirements of the applications. This is achieved at the cost of a slightly increased latency. In addition, the bandwidth limits are also enforced by the scheduler.
The authors in Reference [
92] consider a fog computing environment where latency-sensitive and data-intensive smart city applications are prevalent. The requirements of these applications cannot be effectively met by the default scheduler, since it does not take up-to-date network status information into account. As a result, the authors present a
network-aware scheduler (NAS), where the fog computing infrastructure nodes are labeled based on the round trip times between them. In addition, the available bandwidth is checked against the value requested in the pod definition to filter out unsuitable nodes. The experimental evaluation has shown that NAS was able to reduce network latency by 80% compared to the Kubernetes scheduler. However, it results in an increased scheduling time, since it uses the scheduler extender mechanism, where a call to an external process needs to be performed in each scheduling decision.
The different characteristics of the aforementioned contributions are described in Table
4.
Discussion: As can be noted, a common aspect of the aforementioned approaches is that they add support for specifying requests and limits for network bandwidth in the pod specification. This can be performed by (i) adding network as an extended resource as done in Reference [
116], (ii) using annotations [
42], or (iii) using labels in the pod specification [
92].
To manage lower-level network operations, the aforementioned mechanisms are often combined with the use of existing network management tools (Linux’s
traffic control (tc) [
104] in Reference [
116]) or the implementation of a custom plugin according to the
Container Networking Interface (CNI) (as in References [
42,
73]).
4.3 Topology-awareness
Most of the papers reviewed in this subsection deal with fog/edge computing scenarios, where the cluster nodes are distributed across different locations. As a result, the topology of the nodes should be taken into account to minimize excessive communication costs between them.
In Reference [
53], the authors consider the problem of limited fog node resources, which may be insufficient to host a multi-container pod. As a result, they propose to split such pods into their constituent containers and distribute them across multiple fog nodes, while taking into account the inter-node communication costs. To do so, the authors propose to sort the pod queue by decreasing order of degree of communication between containers. Then, if no feasible node is found, the pod is split such that some nodes are able to host its individual containers. Finally, a topology-aware node scoring is performed, taking into account the distance between the different fog nodes. The authors have proposed to implement these different steps as scheduler plugins; however, no specific implementation was provided.
The authors in Reference [
79] propose to extend the operation of Kubernetes over a
wide area network (WAN), specifically to focus on application deployment at the edge. The authors modify the Kubernetes scheduler to take into account the
autonomous system (AS) path of the
Border Gateway Protocol (BGP), by prioritizing shorter AS paths. The experimental results show that the proposed approach allows the application to obtain shorter access times, which is needed in an edge environment.
In Reference [
34], authors address the need to ensure low-latency replica placements in a fog computing environment, while aiming to minimize the load imbalance by reducing the number of application replicas with a low resource utilization. To do so, the authors propose
Hona as a scheduler for Kubernetes, taking into account the amount of available resources, the inter-node latencies, as well as the traffic sources and their corresponding traffic volumes. This is achieved by leveraging a “random search heuristic” and a heuristic that estimates network latencies using Vivaldi coordinates [
26]. It was shown that these heuristics are able to find acceptable solutions in a short time.
In Reference [
80], the authors consider the problem of increased latencies that occur when users are served from distant fog nodes. In fact, the scheduling mechanism in Kubernetes does not add new replicas to meet increasing demands at certain fog node locations. As a result, the authors propose to tag the nodes with their locations and allow users to specify their requested deployment location in the pod specification, possibly according to different weights. Nodes are then scored such that the ones where the application is most requested are given a higher priority. The evaluation results show that this approach makes it possible to cope with increasing client requests in certain locations, therefore resulting in minimized latencies for the affected users.
The authors in Reference [
32] target industrial automation applications aiming at minimizing the application latency. To achieve that, they formulate the application placement problem in a Kubernetes-based fog computing environment as a cost-minimizing optimization problem, defining the application latency as a cost measure. They propose an approximate solution that takes into account locality constraints, the links between fog nodes, the links between application components, and their associated data demands. It was shown that implementing this algorithm based on native Kubernetes features (such as priority classes and pod affinities) achieves the best results.
The authors in Reference [
43] address the need to support latency-sensitive edge applications using scheduling mechanisms that are aware of the topology of the edge nodes. To this end, they propose to perform periodic delay measurements between the different edge nodes. These measurements are then used to label nodes. Then, when a pod definition specifies a delay constraint, the proposed scheduler checks the node labels and selects the node that can satisfy this delay.
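A minimal Go sketch of this label-based delay filtering is given below; the label key edge.example.com/rtt-ms and the measured values are hypothetical, as Reference [43] does not prescribe a naming scheme.

```go
// A minimal sketch of filtering nodes by a hypothetical delay label.
package main

import (
	"fmt"
	"strconv"
)

// feasible keeps only the nodes whose measured delay label satisfies the
// pod's delay constraint (both in milliseconds).
func feasible(nodes map[string]map[string]string, maxDelayMs int) []string {
	var out []string
	for name, labels := range nodes {
		d, err := strconv.Atoi(labels["edge.example.com/rtt-ms"])
		if err == nil && d <= maxDelayMs {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	nodes := map[string]map[string]string{
		"edge-1": {"edge.example.com/rtt-ms": "12"},
		"edge-2": {"edge.example.com/rtt-ms": "48"},
	}
	fmt.Println(feasible(nodes, 20)) // only edge-1 satisfies a 20 ms bound
}
```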
The authors in Reference [
91] present
ge-kube as an extension to Kubernetes, specifically tailored for geo-distributed deployments. The first contribution of this paper is the use of reinforcement learning to determine the appropriate number of replicas of an application. They also propose to greedily place application instances such that the deployment time and the amount of allocated resources are minimized. To do so, their proposal considers the network delays between nodes in addition to the amount of available resources in those nodes. Experiments conducted using a geo-distributed Redis [
88] cluster have shown that ge-kube results in a 3× increase in the number of operations per second, compared to the default Kubernetes scheduler.
Apart from fog/edge computing, the problem of topology-awareness is also relevant in other scenarios, such as the orchestration of VNFs as well as ML/DL clusters, as detailed next.
For example, the authors in Reference [
93] propose an architecture to emulate virtual network scenarios based on the
Network Function Virtualization (NFV) standard by the
European Telecommunications Standards Institute (ETSI). The proposed architecture,
Megalos, contains a scheduling component that makes VNF scheduling decisions and informs the Kubernetes control plane node about the node that should be selected to host the VNF. Its objective is to reduce unnecessary network traffic exchange among the nodes. This is achieved by organizing the VNFs into an undirected graph and dividing it into balanced partitions. The scheduler can also leverage information extracted from the VNF configuration files to fine-tune its decisions. Testbed evaluations have shown that Megalos can achieve reduced startup times for the different considered network scenarios.
Optimus [
84] is among the early efforts towards proposing a scheduler for Kubernetes-based DL clusters. Its first goal is to improve the cluster’s resource efficiency by leveraging dynamic resource allocation instead of relying on predefined resource requests. This dynamic allocation is determined based on the predicted progress of the training job and the cluster load. In addition, Optimus aims to improve the training time for DL jobs by placing tasks in a way that minimizes the data transfers between parameter servers and workers during training. Evaluation results have shown that Optimus significantly improves the job completion time and the makespan, in addition to achieving a higher CPU utilization.
Authors in Reference [
39] propose
Gatekeeper for AI (GAI) as a scheduler specifically tailored to improve the performance of ML training jobs. To facilitate this task, it organizes the cluster nodes into an in-memory resource tree based on the network conditions between them. It also leverages an aggregated priority vector taking into account multiple characteristics of the ML job (such as its runtime, its type, and the number of preemptions it has already experienced). Experiments, based mostly on a simulator, show that GAI achieves a 28% increase in the scheduling throughput and a 21% increase in the training convergence speed, compared to the default Kubernetes scheduler.
The
HiveD scheduler [
119] developed by Microsoft stems from the observation that the current resource reservation in multi-tenant GPU clusters is based on the number of requested GPUs (i.e., the GPU quota) and not on their topology. Since this may result in a performance issue for multi-GPU jobs, the authors propose a new abstraction called
cell that captures the different levels of affinity for a group of GPUs. This allows each tenant to have virtual clusters based on the cell structure. Extensive simulation results have been provided. In particular, it has been shown that HiveD reduces the queuing delays when the cluster’s load is high.
Table
5 summarizes the most relevant characteristics of the contributions dealing with topology-awareness.
Discussion: The aforementioned works show a set of commonly used approaches to deal with topology awareness, as listed below:
•
Labeling nodes based on delay measurements to other nodes in the cluster [
43,
80]. Such measurements are taken periodically. As a result, special attention has to be paid to the periodicity with which such measurements are made to ensure a good balance between obtaining up-to-date measurement values and reducing the overhead that may result from the measurement process.
•
Labeling nodes based on their locations and allowing users to request a particular location in their pod specification [
80]. This implies the existence of an efficient mechanism for identifying the different fog node locations and the users’ awareness of those locations.
•
The use of heuristics (see References [
32,
34,
91]), since it may not be feasible to find the optimal solution to the placement optimization problem in a reasonable time. Such heuristics may lead to sub-optimal placements, but they do so in an acceptable timeframe, which is beneficial in fog/edge environments.
•
The use of graph/tree structures to model the node connections [
39], the connections between the different application components [
93], or both [
32]. Such a modeling can help identify which application components need to be placed on the same node or on nearby nodes to avoid excessive communication costs.
•
Leveraging the Kubernetes affinity concept can be useful for implementing topology-awareness, either by placing pods with high communication patterns together (i.e., pod affinity as done in Reference [
32]) or by using node affinities to direct a pod towards a specific node [
80]. Since this approach is native to Kubernetes, no additional components need to be developed or modified.
4.4 No Support for Co-scheduling
Co-scheduling (or gang scheduling) refers to the ability to schedule a group of pods at once, as opposed to the default Kubernetes behavior that schedules pods one-by-one.
Among the works targeting this objective, we cite Reference [
35], where the authors focus on serverless workloads that rely on the cloud provider to handle the server-side operations (e.g., the number of servers and the amount of resources allocated to them). For such workloads, the pod-by-pod scheduling approach used in Kubernetes is not efficient, since increased delays occur due to repeated traversals of the set of nodes in each pod scheduling attempt, even though the pods to be scheduled have the same characteristics. Consequently, the authors propose to simultaneously schedule a group of pods that share the same image and resource requirements. Preliminary simulation results demonstrate that this scheduling strategy results in reduced pod start-up times.
As for non-academic contributions, Palantir Technologies has implemented a scheduler extender called
k8s-spark-scheduler [
51] to support gang scheduling for Spark applications hosted on Kubernetes. As introduced in Section
3.1, Spark uses two types of pods on Kubernetes, which are
driver pods and
executor pods. k8s-spark-scheduler first ensures that the cluster nodes have enough resources to host the executor pods, then it proceeds to scheduling the driver pods.
In addition, there is a co-scheduling plugin [
24] that is maintained by the Kubernetes scheduling
Special Interest Group (SIG) and is currently in beta status.
This plugin is not part of the default Kubernetes installation. However, it can be built, configured, and activated separately. It defines the
PodGroup concept, which can be used in the pod specification to indicate the group that the pod belongs to. This plugin modifies the sorting behavior by comparing priorities of pod groups. When such priorities are equal, the pod groups are compared based on their creation time.
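The resulting queue ordering reduces to a comparison function, sketched below in Go with simplified stand-in types for the plugin's actual PodGroup objects.

```go
// A minimal sketch of the queue-sorting rule described above: compare pod
// group priorities first, then fall back to group creation time.
package main

import (
	"fmt"
	"time"
)

type podGroup struct {
	priority  int32
	createdAt time.Time
}

// less returns true if group a should be dequeued before group b.
func less(a, b podGroup) bool {
	if a.priority != b.priority {
		return a.priority > b.priority // higher priority first
	}
	return a.createdAt.Before(b.createdAt) // older group first on ties
}

func main() {
	now := time.Now()
	a := podGroup{priority: 10, createdAt: now}
	b := podGroup{priority: 10, createdAt: now.Add(-time.Minute)}
	fmt.Println(less(a, b)) // false: b was created earlier, so b goes first
}
```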
A comparison of the aforementioned works is provided in Table
6.
Discussion: Supporting co-scheduling requires decisions to be made at the level of pod groups instead of individual pods. This may require modifications that go beyond the scheduler itself, such as the definition of a custom resource like the PodGroup in the co-scheduling plugin [
24]. In addition, since this plugin modifies the sorting behavior according to the scheduling framework, it is important to ensure that the use of this plugin has no impact on other types of workloads that do not use it. This is due to the fact that only one queue sort plugin is allowed to be enabled at a given time [
96].
4.5 No Support for Batch Scheduling
This section highlights the works that add batch scheduling capabilities to the Kubernetes scheduler. For example, the authors in Reference [
12] consider a multi-tenant cloud environment, where it is important to ensure fairness among the different tenants. Since kube-batch [
59] only considers the
“Dominant Resource Fairness (DRF)” [
41] as a fairness policy, the authors propose a scheduler named
KubeSphere, where two other policies are added: demand-aware and demand-DRF-aware. While the former prioritizes users with higher resource demands, the latter takes into account both the user’s demands and their dominant resource share to avoid resource starvation. KubeSphere adopts a multi-tenant task queue to ensure fairness. Evaluations using a real Kubernetes cluster show that KubeSphere allows users to experience shorter waiting times, on average.
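For context, the baseline DRF rule that kube-batch implements can be sketched in a few lines of Go: a tenant's dominant share is its largest per-resource share, and the next allocation goes to the tenant with the smallest dominant share. The tenant names and shares below are assumed values.

```go
// A minimal sketch of Dominant Resource Fairness (DRF) ordering.
package main

import (
	"fmt"
	"sort"
)

type tenant struct {
	name     string
	cpuShare float64 // fraction of the cluster's CPU already allocated
	memShare float64 // fraction of the cluster's memory already allocated
}

// dominantShare is the larger of a tenant's per-resource shares.
func dominantShare(t tenant) float64 {
	if t.cpuShare > t.memShare {
		return t.cpuShare
	}
	return t.memShare
}

func main() {
	tenants := []tenant{
		{"alice", 0.30, 0.10},
		{"bob", 0.05, 0.20},
	}
	sort.Slice(tenants, func(i, j int) bool {
		return dominantShare(tenants[i]) < dominantShare(tenants[j])
	})
	fmt.Println("next allocation goes to:", tenants[0].name) // bob (0.20 < 0.30)
}
```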
The authors in Reference [
113] propose two schedulers to place heterogeneous tasks in a computer science lab cluster. The first one is a batch scheduler to be used in busy hours. This scheduler relies on a queue sorting mechanism based on the tasks’ resource requirements (CPU, memory, IO) to avoid resource fragmentation. The second proposed scheduler is tailored for GPU-intensive tasks and it is mainly responsible for dynamically adjusting the priorities of such tasks based on their waiting and estimated running times. Evaluations indicate that the batch scheduler is able to improve the resource utilization in the cluster, while the dynamic scheduler reduces the task waiting time compared to the default scheduler.
As for non-academic contributions related to batch scheduling, we can cite
kube-batch [
59] and
Apache Yunikorn [
9]. In fact, kube-batch is a project developed by Kubernetes scheduling SIG with the goal of supporting AI/ML, big data and
High Performance Computing (HPC) applications with batch scheduling capabilities. Such capabilities include support for gang scheduling and different job priority classes. kube-batch is used in other well-known Kubernetes-based tools such as Kubeflow [
62] and Volcano [
111]. Apache Yunikorn is an open source scheduler that supports batch scheduling, not only on top of Kubernetes, but also with YARN [
5]. It can support a mixture of workloads, including stateless batch workloads and stateful services. This is achieved using a rich feature set, including hierarchical resource queues, multi-policy job queuing, in addition to ensuring fairness. Evaluations based on Kubemark [
63] have shown that for clusters having thousands of nodes, Yunikorn is able to improve the scheduling throughput and the fairness between queues compared to the default scheduler.
Discussion: As can be noted from the aforementioned contributions, a common feature needed to support batch scheduling is the customization of the queue sorting behavior. This can be done by prioritizing users’ workloads based on their resource requirements [
12,
113] or by organizing the queue into a hierarchical structure and using multiple policies for managing it [
9]. As can be seen in Table
7, the works reviewed in this section do not leverage the scheduling framework. However, if that was the case, then special attention has to be paid to ensure that the modified queuing behavior is suitable for all envisioned workload types through a single queue sort plugin. It is also worth noting that providing support for batch scheduling is a complex task that requires significant modifications to the scheduling behavior (e.g., queuing, fairness, co-scheduling), which may justify the absence of implementation approaches based on the scheduler extender mechanism or the scheduling framework, as shown in Table
7.
4.6 No Support for Data Locality Awareness
Supporting awareness of data locality is important for workloads that require frequent and fast access to nodes storing the data they need. The following contributions propose custom schedulers that support this feature:
Skippy has been presented in Reference [
87] to support scheduling requirements in edge computing environments. These requirements impose the need to consider the proximity between nodes and the bandwidth available between them, in addition to awareness of the locality of the data that edge applications need to access. To this end, Skippy leverages a graph representing the network bandwidth as well as a “storage index” that matches specific data items with the nodes that store the actual data. Furthermore, Skippy is shipped with multiple edge-friendly priority functions, whose weights are determined automatically to address the requested operational objectives. Skippy’s evaluation is based on a simulator fed with real profiling data. The results have shown that the quality of the placements made by Skippy exceeds that obtained by the default scheduler, at the cost of a reduced scheduling throughput caused by the added priority functions.
As for open source software contributions in this area, we list
Stork [
102] and
StorageOS [
101]. Stork is an open source project that supports storage-aware scheduling. It is well-suited for stateful applications such as databases, queues, and key-value stores. It relies on the scheduler extender mechanism, which allows Stork to filter out nodes where the Stork storage driver is not running or is in an error state. Then, it scores nodes based on the performance associated with accessing persistent storage from that node. StorageOS uses the scheduler extender mechanism to add data-awareness to the Kubernetes scheduler. More specifically, it ensures that pods get scheduled on the nodes where their data resides. To this end, nodes are given different scores based on their volume types (master, replica, none, unhealthy). Based on these scores, faster I/O operations can be achieved.
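The role-based scoring used by StorageOS can be illustrated with the following hedged Go sketch; the roles follow the description above, while the numeric scores are assumptions chosen for illustration.

```go
// A minimal sketch of volume-role-based node scoring.
package main

import "fmt"

func volumeScore(role string) int {
	switch role {
	case "master": // the node holds the primary copy of the volume
		return 100
	case "replica": // a synchronized replica is locally available
		return 50
	case "none": // data must be fetched over the network
		return 10
	default: // "unhealthy" or unknown: avoid the node
		return 0
	}
}

func main() {
	for _, r := range []string{"master", "replica", "none", "unhealthy"} {
		fmt.Printf("%-10s -> %d\n", r, volumeScore(r))
	}
}
```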
The differences and commonalities between the aforementioned contributions are outlined in Table
8.
Discussion: The previous works show that awareness of data locality can be achieved (i) via filtering, if it is a hard requirement, or (ii) via scoring, if it is a soft requirement. As pointed out in Section
3.1, Kubernetes does offer a set of scheduler plugins that deal with data volumes. For instance, the filtering plugin
VolumeZone checks whether the specified zone requirements can be met. If such plugins are not enough to satisfy the user’s expectations in terms of awareness of data locality, then it is possible to alter the filtering/scoring behavior of the Kubernetes scheduler, while implementing any additional customizations that may be needed to interface with the storage drivers used by the volume provider.
4.7 Lack of Real Load Awareness
The focus of this section is to highlight contributions proposing custom schedulers that take the actual load of the cluster nodes (i.e., their actual usage level) into account while making scheduling decisions.
Example contributions include References [
19,
82], where the authors consider an edge computing environment and emphasize the need to support scheduling based on up-to-date node resource usage. Real-time monitoring information, such as load, temperature, and liveness, is collected from the devices and used to calculate the node scores. The results have shown faster scheduling times compared to the default scheduler, while the node temperatures were similar, which implies that the proposed scheduler does not compromise the health of the nodes.
The authors in Reference [
71] propose a modified
particle swarm optimization (PSO) algorithm to determine the scheduling decisions. The algorithm takes into account the nodes’ utilization rates in terms of CPU and memory, the workloads’ usage characteristics, as well as any affinities that may be defined towards certain nodes. A comparison of the proposed algorithm with the default Kubernetes scheduler revealed a 20% improvement in the nodes’ resource usage.
To avoid wasting I/O resources, the authors in Reference [
70] specifically focus on real I/O load awareness. In more detail, they use Prometheus [
85] to collect metrics related to the I/O load and the CPU usage. Those metrics are used in a scheduler extender mechanism where the post score behavior is customized. In fact, two scoring functions are introduced, which are
BalancedDiskIOPriority (BDI) and
BalancedCpuDiskIOPriority (BCDI). The evaluation results indicate a more balanced disk I/O utilization throughout the cluster, whereas the use of BCDI results in balancing both the disk I/O and the CPU usage.
Besides the aforementioned academic contributions, we also mention the load-aware scheduler plugins that were developed by IBM and referred to as
Trimaran [
107]. These plugins were developed with the aim of increasing the cluster’s resource utilization by making the scheduler aware of the mismatch that may exist between the current resource allocation and the current resource utilization. Trimaran consists of two scoring plugins:
TargetLoadPacking and
LoadVariationRiskBalancing. TargetLoadPacking scores nodes based on their actual resource usage while maintaining a predefined resource usage level across all nodes. LoadVariationRiskBalancing scores nodes based on the mean and standard deviation of their resource usage. By default, Trimaran uses Kubernetes MetricsServer [
77] as a metric provider, but other providers such as Prometheus [
85] or SignalFx [
99] could also be used.
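To illustrate the second plugin's idea, the following Go sketch derives a node score from the mean and standard deviation of its utilization; this is one plausible formulation, and Trimaran's actual formula may differ.

```go
// A hedged sketch of mean/variance-based node scoring in the spirit of
// LoadVariationRiskBalancing.
package main

import (
	"fmt"
	"math"
)

// score favors nodes whose expected utilization (mean plus one standard
// deviation, capped at 100%) is low, i.e., nodes with low and stable load.
func score(meanUtil, stddevUtil float64) int64 {
	risk := math.Min(meanUtil+stddevUtil, 1.0)
	return int64(math.Round((1.0 - risk) * 100))
}

func main() {
	fmt.Println(score(0.40, 0.05)) // steadily loaded node: 55
	fmt.Println(score(0.40, 0.35)) // bursty node with the same mean: 25
}
```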
Intel also proposes a
telemetry-aware scheduler (TAS) [
105] to support scheduling based on up-to-date telemetry information. The scheduler leverages the extender mechanism and takes into account features such as the node’s power usage, the amount of free RAM, and the CPU temperature. TAS can act at the filtering, prioritization (i.e., scoring), and descheduling levels and can be used to offer resilience and self-healing for NFV scenarios [
16].
Table
9 compares the aforementioned works based on our classification criteria.
Discussion: To achieve real load awareness, (near) real-time resource usage metrics need to be available via external systems that are not natively shipped with Kubernetes. In fact, some authors develop their own metric collection mechanisms. For example, References [
19,
82] employ the lightweight
Constrained Application Protocol (CoAP) to retrieve runtime data from worker nodes at the edge. In contrast, other contributions rely on the functionalities of existing metric servers, such as Prometheus in Reference [
70] or metrics-server in Reference [
107]. It is also worth noting that load awareness can combine multiple resource dimensions such as CPU, memory, and disk I/O, depending on the users’ objectives and the type of bottleneck resources in the cluster.
4.8 GPU Sharing
GAIA [
100] is among the first schedulers proposed for GPU sharing. It organizes the GPU cluster into a tree structure based on the communication costs between the different GPUs in addition to their current allocation status (partially or fully allocated). The scheduler uses this tree to determine the best placement for applications requesting a fraction of a GPU, an entire GPU, or more than one GPU. In particular, fractional sharing is made possible by using the device plugin mechanism in Kubernetes to divide a GPU into a set of virtual GPUs. The experiments performed to evaluate GAIA were conducted on Tencent’s container cloud, and they show a 10% increase in the resource utilization of the GPU cluster in addition to reduced training times compared to kube-scheduler.
The work in Reference [
47] uses the scheduler extender mechanism to discard nodes that do not have a GPU with enough graphic memory as requested by the workload. To this end, they define two extended resources that can be handled similarly to the CPU and memory. These resources are the amount of available graphic memory
RESOURCE-MEM and the number of GPU cards on a node
RESOURCE-COUNT. In addition, they use the NVIDIA Device Plugin [
81] to advertise the GPU cards to the kubelet. Since this was a conceptual proposal, the paper provides neither implementation details nor evaluation results.
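To illustrate the proposal, a hypothetical container requesting both extended resources could be constructed with client-go types as follows; the names example.com/gpu-mem and example.com/gpu-count are stand-ins for the paper's RESOURCE-MEM and RESOURCE-COUNT.

```go
// A hedged sketch of requesting hypothetical GPU extended resources,
// using client-go types.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	limits := v1.ResourceList{
		// Extended resources must be requested in integer amounts.
		"example.com/gpu-mem":   resource.MustParse("4Gi"), // graphic memory
		"example.com/gpu-count": resource.MustParse("1"),   // GPU cards
	}
	container := v1.Container{
		Name:  "trainer",
		Image: "example/dl-job:latest",
		Resources: v1.ResourceRequirements{
			Limits:   limits,
			Requests: limits, // for extended resources, requests must equal limits
		},
	}
	fmt.Printf("%+v\n", container.Resources.Limits)
}
```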
In Reference [
117], the authors present
KubeShare as a framework for GPU sharing in Kubernetes. More specifically, they defined a custom kind of resource called
SharePod, allowing a custom shared device called
virtual GPU (vGPU) to be attached to it. In addition, they developed two custom controllers called
KubeShare-Sched and
KubeShare-DevMgr. The former is responsible for mapping the vGPUs to containers based on the current resource status, the container’s requirements, and affinity constraints. The latter is in charge of managing the vGPU pool. Experiments conducted on a cloud-hosted cluster have shown a significant improvement in the GPU utilization while adding little overhead (less than 10%) due to the performed customizations.
A different perspective is taken in Reference [
33], where the authors propose a custom scheduler for applications that can be executed in both CPUs and GPUs. The main idea is that it allows a pod to be assigned to a CPU node instead of waiting for a GPU node to become available. The assignment decisions are determined based on the node status (idle/running), the expected completion time of the application on a specific node, the pod priority and its category, and the node type. Different experiments have been conducted to compare the proposed scheduler to the default Kubernetes scheduler, and it was shown that the application’s running time was greatly improved (up to 64%).
The authors in Reference [
54] consider an edge computing environment and address the need for efficient use of the GPU resources attached to an edge server. To do so, they create an extended resource to represent the GPU in the same way as a CPU. This allows the scheduler to filter nodes based on this resource. In addition, they include the GPU performance in the node score, more specifically in the NodeAffinity priority function. The experimental results show that the proposed GPU sharing mechanism makes it possible to increase the number of pods that can be placed onto a GPU-based edge server without impacting pod performance. Table
10 summarizes the different contributions supporting GPU sharing in Kubernetes schedulers.
Discussion: Supporting the GPU sharing functionality is not the responsibility of the Kubernetes scheduler on its own. In fact, it requires leveraging different extension mechanisms offered by Kubernetes, including:
•
Using extended resources to represent the GPU in a similar way to the CPU and memory, i.e., allowing requests and allocations of integer amounts of this resource.
•
Using the device plugin mechanism to advertise virtual GPU resources.
•
Using custom resources to define a custom kind of API object supporting GPU sharing.
With such extensions in place, the scheduler can satisfy the user-specified GPU requirements by filtering or scoring the nodes appropriately.
4.9 Environmental Impact (O3)
There exist a few contributions that aim to alter the scheduling behavior such that it results in reduced negative impacts on the environment. These are described next.
The authors of the
KEIDS paper [
52] consider Kubernetes clusters consisting of both edge and cloud nodes, which could have renewable energy sources, e.g., wind or solar. They formulate the scheduling problem as a multi-objective optimization problem, where the objectives are to minimize energy consumption and to reduce interference. To this end, they take into account the carbon footprint of different pod-node mappings, the associated energy consumption, the types of containers used in pods (e.g., CPU- or network-intensive), in addition to constraints related to the job deadline, its required resources, and number of replicas. The simulation results demonstrate that KEIDS effectively reduces the number of active nodes, therefore reducing the overall energy consumption by 14%, in addition to reducing interference and carbon footprint levels by 47% and 31%, respectively.
The proposal in Reference [
89] starts from the observation that there is a tradeoff between the need to reduce a cluster’s energy consumption and the use of high-performance hardware nodes. As a result, they propose a custom scheduler, called
HEATS. It takes as input a user-defined performance-energy tradeoff for the submitted workload. This value indicates the amount of performance reduction the user is willing to accept in order to contribute to energy savings. In addition to this tradeoff value, the scheduler takes into account the current usage in the cluster and the predicted energy consumption of the workload as well as its predicted runtime performance. The evaluation results have shown that the proposed HEATS scheduler outperformed kube-scheduler in terms of energy savings.
In Reference [
106], the authors emphasize that to achieve energy efficiency in data centers, it is necessary to not only consider hardware models of the physical infrastructure, but also software models that describe the workload’s behavior. Such software models can be learned from historical and online data and can be used to predict the amount of resources that will be used by a submitted workload and for what duration. The authors have conducted experiments on a real Kubernetes cluster and have obtained reductions of 10%–20% in power consumption compared to the kube-scheduler.
The need to reduce power consumption can be also a part of a scheduling strategy that considers multiple other criteria at the same time, as proposed in Reference [
76]. The considered criteria include the utilization rate of the CPU, memory and disk on the nodes, the nodes’ power consumption, the number of running containers, and the time needed for transmitting the container image. The overall goal is to devise a multi-criteria algorithm that selects the node that effectively balances the considered criteria. The evaluation results indicate that this approach contributes to reducing the power consumption, in addition to reductions in the makespan and the average container waiting times.
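Such a weighted multi-criteria score can be sketched in Go as follows; the criteria mirror the description above, but the weights and normalization constants are assumptions that would require per-cluster tuning.

```go
// A hedged sketch of a weighted multi-criteria node score.
package main

import "fmt"

type nodeState struct {
	cpuUtil, memUtil, diskUtil float64 // utilization rates in [0, 1]
	powerWatts                 float64 // current power draw
	runningContainers          float64 // normalized container count in [0, 1]
	imagePullSeconds           float64 // estimated image transfer time
}

// score combines the criteria so that lightly loaded, low-power nodes that
// can start the container quickly rank highest; lower raw values are better,
// so each term is inverted.
func score(n nodeState) float64 {
	return 0.25*(1-n.cpuUtil) +
		0.15*(1-n.memUtil) +
		0.10*(1-n.diskUtil) +
		0.25*(1-n.powerWatts/500) + // assume a 500 W per-node ceiling
		0.10*(1-n.runningContainers) +
		0.15*(1-n.imagePullSeconds/60) // assume a 60 s worst-case pull
}

func main() {
	fmt.Printf("%.3f\n", score(nodeState{0.2, 0.3, 0.1, 150, 0.2, 5}))
}
```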
A comparison of custom schedulers addressing O3 can be found in Table
11.
Discussion: To summarize, we highlight the approaches that can be used by the Kubernetes scheduler to reduce carbon emissions and to minimize the amounts of energy consumed in a Kubernetes cluster. These approaches target the following aspects:
•
Energy sources [
52]: When the cluster contains nodes with renewable energy sources, the scheduler can favor those, since they lead to less carbon emissions.
•
Hardware type [
89]: When the cluster is composed of hardware nodes of varying energy consumption patterns and various performance levels, the scheduler can favor less powerful nodes, which consume less energy, as long as the user explicitly tolerates the performance loss that may occur.
•
Effective infrastructure and workload modeling [
106]: This will allow the scheduler to anticipate the behavior of the cluster in response to the scheduling decisions and their effect on the total energy consumption in particular. When such an approach is used, it is important to ensure that the used models are able to adapt to unseen workload types, their arrival patterns, as well as the dynamics of the cluster (addition of new nodes or node types, node failure).
•
Including the current node power consumption in the scoring criteria [
76]: In this case, the power consumption criterion should be given an appropriate weight (compared to the other criteria) to highlight its importance compared to other, potentially conflicting objectives. In addition, this approach requires the use of a metric collection system that is able to retrieve up-to-date node power consumption information that can be used at scheduling time.
4.10 Specific Contributions
In this section, we describe a number of works that propose custom Kubernetes schedulers in line with the high-level objectives outlined in Figure
3. However, they target specific problems and solution approaches that do not perfectly fit within the categories provided in O1.1 to O3.1 (see Section
3.1).
A scheduler, named
SpeCon, is presented in Reference [
74] with a specific focus on DL training jobs. It addresses the inefficiency characterizing the resource allocation for such jobs at different levels of training progress. For instance, those that are close to convergence do not need a lot of resources allocated to them. As a result, the jobs are classified into three categories based on their training progress. These categories are
progressing,
watching, and
converged. The idea is to migrate the models whose training progress is slow to release resources for the ones that are growing fast. It was shown that the SpeCon scheduler contributes to reducing the job completion time compared to the default scheduler. A similar approach is proposed in Reference [
38].
In Reference [
44], the authors propose to predict the expected runtime of an application on both CPU and GPU nodes. Predictions are made based on different application features (e.g., number of instructions, number of float operations) and the available resources on the nodes. The main goal is to assign the workload to the node that will result in a faster execution time. In addition, the predicted runtime of an application is used to prevent the scheduler from prioritizing workloads with fast runtime. The implementation and evaluation of the proposed scheduling approach are not included and are instead mentioned as future work.
In Reference [
20], the authors present
Commodore as a dynamic cluster autoscaler. Though not directly a custom scheduler, Commodore emerged from the need to address the failure events emitted by the scheduler when no feasible node is found for a scale-up request. Unlike Cluster Autoscaler [
65] where the size of the node pool must be provided in advance, Commodore adopts a dynamic node pool size in addition to advanced auto-scaling features that leverage collected resource usage scores. The evaluations have shown that Commodore successfully responds to increasing scale-up requests, thus leading to reduced application response times.
Similar to the ImageLocality scoring function (see Section 2.1.2), the authors in Reference [37] propose a dependency-aware scheduler, where the presence of a container's dependencies on a given node is taken into account to reduce pod startup times in an edge computing environment. They propose an image-match policy (similar to ImageLocality) and a layer-match policy that favors nodes already holding some layers of the requested container image. Evaluations based on trace-driven simulations and a real cluster have shown startup-time reductions compared to the default scheduler, especially with the layer-match policy.
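A minimal sketch of the layer-match idea follows: a node is scored by the fraction of the requested image's layer digests it already caches. The function and digest names are illustrative, not taken from Reference [37]:

```go
package main

import "fmt"

// layerMatchScore returns the fraction of the requested image's layer
// digests already cached on a node, so nodes holding more layers rank
// higher and have less data to pull at startup.
func layerMatchScore(required []string, cached map[string]bool) float64 {
	if len(required) == 0 {
		return 0
	}
	hits := 0
	for _, digest := range required {
		if cached[digest] {
			hits++
		}
	}
	return float64(hits) / float64(len(required))
}

func main() {
	required := []string{"sha256:aaa", "sha256:bbb", "sha256:ccc"}
	node := map[string]bool{"sha256:aaa": true, "sha256:ccc": true}
	fmt.Printf("layer-match score: %.2f\n", layerMatchScore(required, node)) // 0.67
}
```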
The authors in Reference [25] propose a custom scheduler that addresses two levels of fairness issues that may arise in a cloud computing environment. The first level stems from priority-based scheduling, which results in lower-priority requests being preempted even while higher-priority requests enjoy a QoS above their targets. The second level concerns the lack of fairness when preempting requests of the same priority, which drives the need to reduce the QoS variability among such requests. To address these issues, their proposed QoS-driven scheduler modifies the preemption and sorting logic of the default scheduler, such that instances whose Service Level Indicators (SLIs) exceed their Service Level Objectives (SLOs) get preempted and the released resources are allocated to instances whose SLIs are below their SLOs. Simulations and cluster-based experiments have shown that, under moderate contention levels, the proposed scheduler improves the QoS of low-priority instances without degrading the performance of the other instances.
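The victim-selection rule can be sketched as follows, under the assumption that per-instance SLIs are available at preemption time; field names and values are illustrative:

```go
package main

import "fmt"

// instance pairs a measured Service Level Indicator with its Objective.
type instance struct {
	name string
	sli  float64 // measured indicator, e.g., fraction of time served
	slo  float64 // target objective
}

// pickVictims returns the instances whose SLI exceeds their SLO, i.e.,
// those that can be preempted without immediately violating their target.
func pickVictims(all []instance) []instance {
	var victims []instance
	for _, in := range all {
		if in.sli > in.slo {
			victims = append(victims, in)
		}
	}
	return victims
}

func main() {
	pool := []instance{
		{"a", 0.999, 0.99}, // above target: eligible victim
		{"b", 0.950, 0.99}, // below target: should receive freed resources
	}
	for _, v := range pickVictims(pool) {
		fmt.Println("eligible victim:", v.name)
	}
}
```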
The authors in Reference [3] model the pod assignment problem using stable matching theory. In this model, pods set their preferences over the cluster nodes based on the nodes' current resource usage, while nodes set their preferences over pods based on the pods' resource demands. Simulation results show that the response times of services deployed using the proposed scheduler are reduced compared to the Kubernetes scheduler.
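For intuition, the snippet below runs a pod-proposing Gale-Shapley matching over hard-coded preference lists; in the actual work, these preferences would be derived from node resource usage and pod demands:

```go
package main

import "fmt"

func main() {
	// podPrefs[p] lists node indices from most to least preferred.
	podPrefs := [][]int{{0, 1}, {0, 1}}
	// nodeRank[n][p] is node n's rank of pod p (lower is preferred).
	nodeRank := [][]int{{0, 1}, {1, 0}}

	matchOfNode := []int{-1, -1}       // node -> matched pod (-1 = free)
	next := make([]int, len(podPrefs)) // next node each pod will propose to
	free := []int{0, 1}                // pods still unmatched

	for len(free) > 0 {
		p := free[0]
		free = free[1:]
		n := podPrefs[p][next[p]]
		next[p]++
		switch cur := matchOfNode[n]; {
		case cur == -1:
			matchOfNode[n] = p // node was free: tentatively accept
		case nodeRank[n][p] < nodeRank[n][cur]:
			matchOfNode[n] = p       // node prefers the new pod
			free = append(free, cur) // displaced pod proposes again
		default:
			free = append(free, p) // rejected: try the next node
		}
	}
	for n, p := range matchOfNode {
		fmt.Printf("node %d <- pod %d\n", n, p)
	}
}
```

The deferred-acceptance loop guarantees a stable outcome: no pod and node would both prefer each other over their assigned partners.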
The authors in Reference [114] propose to combine an ant colony optimization (ACO) approach with a particle swarm optimization (PSO) approach, with the objective of minimizing the costs associated with cluster resource usage. More specifically, the proposed approach considers the price of using a CPU or memory unit, the node load, as well as the amount of CPU and memory requested by the pod. To validate the proposed approach, the authors conducted a set of simulations that showed a reduction in the incurred usage costs and the node load compared to the default scheduler.
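The sketch below only illustrates the kind of per-placement cost such a search might minimize; the actual ACO/PSO machinery of Reference [114] is omitted, and all prices and weights are assumptions:

```go
package main

import "fmt"

// placementCost combines resource pricing with the node's current load;
// prices, load, and the load weight are all illustrative assumptions.
func placementCost(cpuReq, memReq, cpuPrice, memPrice, nodeLoad, wLoad float64) float64 {
	return cpuReq*cpuPrice + memReq*memPrice + wLoad*nodeLoad
}

func main() {
	// Compare placing a pod requesting 2 CPU units and 4 memory units on a
	// cheap-but-loaded node versus a pricier, mostly idle one.
	fmt.Printf("busy node: %.2f\n", placementCost(2, 4, 0.5, 0.1, 0.9, 2.0)) // 3.20
	fmt.Printf("idle node: %.2f\n", placementCost(2, 4, 0.8, 0.2, 0.1, 2.0)) // 2.60
}
```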
In Reference [18], the authors propose a distributed scheduler, called agentified scheduler, that operates on a single cluster. Each cluster node runs an “agent” that executes the scheduling logic, and after a negotiation step, the pod is assigned to the winning node. Results obtained from a small-scale testbed show that scheduling the first pod of a replica set takes longer than with a centralized scheduler, while scheduling subsequent replicas takes less time.
kube-safe-scheduler [60] is a repository managed by IBM that contains multiple proposals for customizing the Kubernetes scheduler based on the scheduler extender mechanism. The first scheduler achieves safe overloading of nodes based on up-to-date resource information. The second, called Pigeon, solves an optimization problem with a pre-defined objective function to make scheduling decisions. The congestion scheduler takes node congestion into account, while KubeRL is inspired by reward-based Reinforcement Learning (RL) approaches and scores nodes based on their runtime performance for different workload types.
Table 12 lists the aforementioned works along with the different criteria used to compare them.
4.11 Multi-cluster Setups
This subsection presents a set of works where the proposed custom scheduler is not limited to the boundaries of a single cluster, but is instead targeted towards multi-cluster scenarios.
An example is the RLSK scheduler proposed in Reference [46], where reinforcement learning is used to find the most suitable cluster for a given job, with the goal of achieving a balanced resource allocation within and among clusters. When the clusters are too loaded to receive the job, no action is taken and the job remains pending until it is attempted again. Simulation results indicate that RLSK can achieve improvements in load balancing and resource utilization, at the cost of a slight increase in makespan.
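Leaving the learning machinery aside, a toy value-based selector conveys the shape of RLSK's decision: pick the feasible cluster with the highest learned value, or leave the job pending when none fits. All numbers below are illustrative:

```go
package main

import "fmt"

func main() {
	// q[c] is an assumed learned value of dispatching this job to cluster c;
	// RLSK would update such values from observed balancing rewards.
	q := []float64{0.42, 0.61, 0.55}
	fits := []bool{true, true, false} // cluster 2 is too loaded for the job

	best := -1
	for c := range q {
		if fits[c] && (best == -1 || q[c] > q[best]) {
			best = c
		}
	}
	if best == -1 {
		fmt.Println("no feasible cluster: job stays pending and is retried later")
	} else {
		fmt.Printf("dispatch job to cluster %d\n", best)
	}
}
```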
The authors in Reference [68] propose a scheduler for federated Kubernetes clusters, with the goals of facilitating the mobility of services between different clusters and expanding beyond the resources of a single cluster. The proposed scheduler builds upon the two-stage scheduling process in Kubernetes by applying filtering and scoring to the set of clusters taking part in the federation.
A low-carbon Kubernetes scheduler is proposed in Reference [50]. The scheduler ranks data center (DC) locations based on their carbon intensities and air temperatures. Empirical results have shown that the proposed scheduler correctly identifies the most suitable target DC, thereby reducing carbon emissions.
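A simple ranking over the two signals might look like the sketch below; the weighting of temperature against carbon intensity, and all values, are our own assumptions rather than details from Reference [50]:

```go
package main

import (
	"fmt"
	"sort"
)

// dc pairs the two ranking signals; values below are illustrative.
type dc struct {
	name            string
	carbonIntensity float64 // gCO2/kWh of the local grid
	airTempC        float64 // cooler air implies cheaper cooling
}

// penalty folds both signals into one number; wTemp is an assumed weight.
func penalty(d dc, wTemp float64) float64 {
	return d.carbonIntensity + wTemp*d.airTempC
}

func main() {
	dcs := []dc{
		{"eu-north", 40, 8},
		{"us-east", 350, 22},
	}
	sort.Slice(dcs, func(i, j int) bool {
		return penalty(dcs[i], 2.0) < penalty(dcs[j], 2.0) // lower ranks first
	})
	fmt.Println("preferred DC:", dcs[0].name)
}
```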
Admiralty [1], previously known as multicluster-scheduler, is a “system of Kubernetes controllers that intelligently schedules workloads across clusters.” Instead of relying on low-accuracy aggregate data to filter nodes belonging to target clusters, Admiralty delegates this task to the target clusters themselves. After performing the filtering step, these clusters return a set of “virtual” nodes to the scheduler in the source cluster, which in turn scores them and decides the pod placement accordingly. Admiralty adopts the scheduler plugins as its extension mechanism.
Table 13 shows the common as well as the distinguishing aspects of the surveyed multi-cluster schedulers.
Discussion: As can be noted from the previous works, the proposals made to support scheduling across multiple clusters inherit the concepts of filtering and scoring from the default, single-cluster Kubernetes scheduler. However, in most of these works [46, 50, 68], filtering and scoring are performed at the cluster level instead of the node level. In addition, these works envision the filtering and scoring steps being performed at a global level, where the scheduler has a full view of the individual clusters’ information. In this case, it is important to devise suitable mechanisms for aggregating the current state information of the clusters and synchronizing this information with the scheduler, so that it can make decisions based on accurate data. In contrast, in Admiralty [1], node-level filtering and scoring are performed: the target cluster is responsible for filtering nodes, whereas the source cluster is responsible for scoring the filtered nodes.
6 Conclusion
Since its release in 2014, Kubernetes has steadily gained popularity among organizations running containerized workloads and has become a major player in the container orchestration landscape. Despite its rich feature set, its default scheduler was not able to cope with the requirements driven by emerging business needs and use cases. As a result, the number of contributions proposing a custom Kubernetes scheduler has increased recently. In this survey, we identified the relevant contributions that have been made in this area, while highlighting how the main drivers for custom scheduling in Kubernetes have evolved over the years. We also presented a methodology for classifying the reviewed works based on criteria such as their high-level objectives, their target environments and workloads, the specific operation(s) of the scheduling process that they have customized, as well as their implementation and evaluation approaches. In addition, we provided a detailed description of the reviewed custom scheduling contributions based on their specific objectives, while analyzing the main trends that have been observed per objective.
Overall, our survey shows that, depending on the complexity of the envisioned objective, it may not be sufficient to customize only the scheduler. In fact, custom resources may need to be developed to further support the intended scheduling behavior. This may also require more fine-grained, low-level telemetry data to support predictive scheduling capabilities. It is also worth noting that Kubernetes in general, and its scheduler in particular, undergo continual improvements from the community. This means that problems currently addressed by custom scheduling approaches may be handled by default in future versions of the scheduler. It also means that it is important to closely follow recent advancements made to the scheduler, to ensure that any customizations remain in line with current best practices.