
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters

Published: 19 July 2023

Abstract

Colocating multiple jobs on the same server has been widely applied for improving resource utilization in cloud datacenters. However, the colocated jobs contend for the shared resources, which can lead to significant performance degradation. An efficient approach for eliminating performance interference is to partition the shared resources among the colocated jobs, but this makes resource management in datacenters very challenging. In this paper, we propose JointOPT, the first resource management framework that optimizes job assignment and resource partitioning jointly for improving the throughput of cloud datacenters. JointOPT uses a local search based algorithm to find the near-optimal job assignment configuration, and uses a deep reinforcement learning (DRL) based approach to dynamically partition the shared resources among the colocated jobs. In order to reduce the interaction overhead with real systems, it leverages deep learning to estimate job performance without running the jobs on real servers. We conduct extensive experiments to evaluate JointOPT, and the results show that JointOPT significantly outperforms the state-of-the-art baselines, with an advantage ranging from 13.3% to 47.7%.

1 Introduction

Modern cloud datacenters run many throughput-oriented jobs such as big data analytics services, machine learning tasks, scientific computing jobs, and the like. Such jobs are usually consolidated on physical servers to keep hardware costs modest. However, jobs sharing the same server contend for the shared resources, such as CPU cores, Last Level Cache (LLC), memory bandwidth, and so on. Such resource contention can easily lead to performance interference among the jobs and cause significant throughput degradation [11, 23, 41].
An efficient approach for reducing resource contention is to use software or hardware level resource isolation techniques (such as Intel’s CAT [26] and MBA [6]) to partition the shared resources among the colocated jobs. By allocating dedicated resources to each job, performance interference can be prevented or mitigated. We observe that the throughput of a datacenter is determined jointly by how the jobs are assigned among the servers (referred to as the job assignment problem) and how the shared resources are partitioned among the colocated jobs on each server (referred to as the resource partitioning problem). This implies that in order to maximize the system throughput, job assignment and resource partitioning should be optimized jointly.
However, finding the optimal job assignment and resource partitioning configuration to maximize the system throughput faces several critical challenges. First, the solution space of the problem is prohibitively large. Exploring the full search space to find the optimal solution is not feasible as it is very time-consuming. Second, developing precise analytical performance models to guide the exploration of search space is challenging because the relationship is very complex due to multiple impacting factors (e.g., performance interference, resource sensitivities, etc). Third, jobs usually face dynamic workload changes and a robust resource management strategy is desirable to handle dynamic system state changes, which further increases the difficulty.
In order to address the above challenges, we propose JointOPT, which is the first framework that optimizes job assignment and resource partitioning jointly. JointOPT has several promising properties. First, facing the large solution space, it uses a local search based algorithm to find the near-optimal job assignment configuration. Instead of evaluating the colocations in real systems, it uses a deep learning based prediction model to estimate the performance of colocations, which significantly reduces the overhead. Second, JointOPT leverages deep reinforcement learning (DRL) to build the resource partitioning model, which is effective, efficient, and adaptive, maintaining a near-optimal resource partitioning configuration under dynamic workload changes. Experimental results show that JointOPT improves the system throughput by 13.3% to 47.7% compared to the state-of-the-art baselines.
The contributions of this paper are as follows.
We formally define the problem that jointly optimizes job assignment and resource partitioning.
We propose a local search-based job assignment algorithm that can find the near-optimal solution efficiently.
We propose a DRL-based resource partitioning approach, which is effective, efficient, and robust.
We leverage deep-learning based models to predict job performance without running them on real servers.
Experimental results show that the proposed approach outperforms the state-of-the-art baselines significantly.
This paper extends our preliminary work DRLPart [8]. However, DRLPart only considers the resource partitioning issue for a single colocation, which runs on a single server. In contrast, JointOPT is a datacenter-wide resource management framework, which optimizes job assignment and resource partitioning jointly. The experimental results show that JointOPT can further improve system throughput by up to 25.3% compared to DRLPart.
The rest of the paper is organized as follows. Section 2 illustrates the motivation. The joint optimization problem is formally formulated in Section 3. Section 4 presents the detailed design of JointOPT. The evaluations are presented in Section 5. Section 6 discusses the limitations of the framework. Section 7 summarizes the related work. Conclusions and future work are summarized in Section 8.

2 Motivation

2.1 Importance of Jointly Optimizing Job Assignment and Resource Partitioning

We first show the impact of job assignment. Consider a system running three jobs (641.leela_s, 628.pop2_s, 603.bwaves_s from SPEC CPU2017 [3]) on two servers. Figure 1(a) shows the system throughput under different job placement configurations. The system throughput refers to the average IPC (instructions per cycle) of all jobs over an examining period of 60 seconds. The jobs on the same server share the resources (i.e., resource partitioning is not considered). It can be seen that the system performance differs across job assignment configurations, with a gap of up to 0.62 (1.96 vs. 2.58). This is because the resource contention between different jobs is different, leading to diverse performance degradations. The results imply that how the jobs are assigned among the servers has a significant impact on system performance.
Fig. 1. (a) System throughput produced by different job assignment configurations when resources are fully shared. (b) System throughput produced by different resource partitioning configurations.
We next show the impact of resource partitioning. Consider a system running two jobs on one server, where the resources are partitioned among the two jobs. Figure 1(b) shows the system throughput under different resource partitioning configurations. We have two observations. First, resource partitioning can improve system performance compared to fully sharing the resources (i.e., no resource partitioning). Second, the system throughput differs across partitioning configurations, which implies that how the resources are partitioned among jobs also has a significant impact on system performance.
Summary: Both job assignment and resource partitioning have a significant impact on system performance. Therefore, they should be optimized jointly in order to maximize the system performance.

2.2 Limitation of Existing Solutions

Since no existing work studies job placement and resource partitioning jointly, we discuss the existing solutions for these two problems separately.

2.2.1 Job Placement.

The job placement problem has been extensively studied in the literature. Existing solutions can roughly be divided into two categories: bin packing based methods and contention-aware methods.
Bin Packing based Methods. These kinds of methods describe each job by a resource demand vector, and aim to pack the jobs among the servers such that: (1) the number of servers used is minimized or load balance is achieved, and (2) the total resource consumption of each server does not exceed server capacity [18, 19, 33, 37]. However, bin packing based methods have two drawbacks. First, they generally assume a fixed resource demand for each job, which cannot adapt to dynamic workload changes. Second, they do not consider resource fungibility (a job can have the same performance with different resource allocations), thus they cannot utilize the resource complementarity among different jobs to improve server utilization.
Contention-aware Methods. These methods strive to place jobs among servers such that the performance interference among the colocated jobs is minimized [11, 23, 30]. They generally build a prediction model to estimate the performance interference among the colocated jobs, and use heuristic algorithms to search for a near-optimal job placement configuration that minimizes the total performance interference. However, existing contention-aware methods generally assume that the server resources are fully shared among the colocated jobs (i.e., resource partitioning is not considered), so performance interference still exists.
Summary: Existing job placement works either ignore resource partitioning or maintain a static and suboptimal partitioning configuration.

2.2.2 Resource Partitioning.

Existing resource partitioning solutions can be divided into two categories: performance model based methods and online tuning based methods.
Performance Model based Methods. These methods generally build a performance model to evaluate partitioning configurations, and use heuristic algorithms to find the near-optimal partitioning configuration based on the performance model [14, 27, 34, 36]. However, such methods can only cope with a single resource and cannot easily be applied to multiple resources. This is because: (1) building a performance model is difficult for multiple resources, since the contention behavior becomes much more complex; and (2) the solution space grows exponentially when multiple resources are considered coordinately, so using heuristic algorithms to explore the solution space is not feasible.
Online Tuning based Methods. These kinds of methods dynamically tune the partitioning configuration in an online manner according to the feedback from a real system [9, 12, 22, 28, 29]. Such methods can partition multiple resources coordinately. However, due to the lack of a precise performance model, feedback based methods have to evaluate partitioning configurations in real systems. That is rather inefficient when exploring a large solution space.
Summary: None of the existing resource partitioning solutions can achieve optimality, efficiency, and robustness simultaneously.

3 Problem Formulation

Consider a cloud datacenter. Given a set of throughput-oriented jobs, our aim is to properly assign the jobs to the servers (referred to as the job assignment problem) and partition the resources among the colocated jobs on each server (referred to as the resource partitioning problem), so that the system throughput is maximized. The system throughput is defined as the total performance of all the jobs, where the performance of each job is measured by its time-averaged IPC (instructions per cycle). Job assignment is performed before job execution, while resource partitioning is performed during job execution. Since the workloads of jobs change dynamically, the partitioning configuration should be adjusted dynamically in order to maintain optimality consistently.
The joint optimization problem is formally defined as follows. Let \(N_{ser}\) denote the number of servers. Denote by \(\mathcal {J}\) the given set of jobs, and \(N_{job}\) the number of jobs in \(\mathcal {J}\) . Let \(\mathcal {R}\) denote the set of resources that can be partitioned. Let \(\mathcal {C} = \lbrace c_1, c_2, \ldots , c_{N_{ser}}\rbrace\) denote a job assignment configuration, where \(c_i\) refers to the set of jobs (also called a colocation) assigned to the i-th server. Let \(p_{i,t}\) denote the resource partitioning configuration of \(c_i\) at time t. Denote by \(IPC_{j}\) the time-averaged IPC of the j-th job in \(\mathcal {J}\) . Our goal is to find the optimal \(\mathcal {C}\) and \(\lbrace p_{i,t}\rbrace\) such that the system throughput is maximized, i.e.,
\[\max \sum _{1\le j \le N_{job}} IPC_{j} \tag{1}\]
The key notations are summarized in Table 1. We assume the servers are homogeneous, while the proposed solution can easily be extended to the scenario with heterogeneous servers (Section 6.4). We do not consider dynamic job arrivals and departures, and assume the jobs stay in the system for a relatively long time (such examples are common in modern datacenters, e.g., the containers running long-term services). However, the proposed solution can easily be extended to the scenario with dynamic job arrivals and departures (Section 6.3).
\(N_{ser}\) number of servers
\(\mathcal {J}\) set of jobs
\(N_{job}\) number of jobs in \(\mathcal {J}\)
\(\mathcal {R}\) set of resources to be partitioned
\(c_i\) set of jobs assigned to the i-th server
\(\mathcal {C}\) a job assignment configuration, \(\mathcal {C}=\lbrace c_1,c_2, \ldots ,c_{N_{ser}}\rbrace\)
\(p_{i,t}\) resource partitioning configuration for \(c_i\) at time t
\(IPC_j\) time-averaged IPC of the j-th job in \(\mathcal {J}\)
\(IPC_i^j\) time-averaged IPC of the j-th job in \(c_i\)
\(pmc_{i}^j\) time-averaged performance counters of the j-th job in \(c_i\)
\(PMC_{i}\) \(PMC_{i}=\lbrace pmc_{i}^1, pmc_{i}^2, \ldots , pmc_{i}^{|c_i|}\rbrace\)
\(pmc_{i,t}^j\) performance counters of the j-th job in \(c_i\) at time t
\(PMC_{i,t}\) \(PMC_{i,t}= \lbrace pmc_{i,t}^1, pmc_{i,t}^2, \ldots , pmc_{i,t}^{|c_i|}\rbrace\)
\(d_{i}^j\) number of threads of the j-th job in \(c_i\)
\(D_{i}\) \(D_{i}=\lbrace d_{i}^1, d_{i}^2, \ldots , d_{i}^{|c_i|}\rbrace\)
Table 1. Summary of Key Notations

4 Design of JointOPT

In this paper, we propose JointOPT, a framework that jointly optimizes job assignment and resource partitioning, for maximizing the system throughput in cloud datacenters. Briefly, JointOPT uses a heuristic algorithm to find the near optimal job assignment configuration, and a deep reinforcement learning (DRL) based model to dynamically partition the resources among the colocated jobs. The detailed design of JointOPT is presented in the rest of this section.

4.1 Step I: Job Assignment

In order to find the optimal job assignment configuration, two issues need to be addressed: (1) how to evaluate a job assignment configuration (i.e., judge whether a job assignment configuration is good or not); and (2) how to find the optimal job assignment configuration (the one achieving the best performance). For the first issue, the challenge is that job performance is not only dependent on the job assignment configuration, but also impacted by the resource partitioning configuration. However, the resource partitioning configuration is computed online, which is not available in the job assignment phase. To address this issue, we build a deep learning based prediction model to estimate the impact of resource partitioning, instead of measuring it in a real system. For the second issue, the main challenge is that the solution space is huge, thus searching for the optimal solution is very time consuming. To address this issue, we propose a local search based heuristic algorithm to find a near-optimal solution.

4.1.1 Colocation Evaluating Model.

Given a colocation of jobs, the colocation evaluating model predicts the performance of each job under a specific resource partitioning configuration. Note that the performance of a job may vary dynamically over time due to workload changes. To simplify the problem, the prediction model estimates the time-averaged performance of each job, instead of the performance at every time point. With this model, we can roughly estimate the system throughput under any given job assignment configuration and resource partitioning configuration without evaluating them in a real system.
However, developing a precise prediction model is challenging, because the relationship between job performance and the resource partitioning configuration is very complex. To address this issue, we leverage deep learning to build the evaluating model. In our context, the input features of the learning model include the features of the jobs and the resource partitioning configuration. The features of a job consist of the performance counters of the job when it runs alone on a server and the number of threads it has. We use performance counters as the main features of jobs because they characterize the resource sensitivities of jobs effectively and can easily be collected without incurring much overhead [4]. Since the performance counters change dynamically as the workload varies, we use the time-averaged values of the performance counters over the profiling period. This is reasonable because: (1) we are concerned with the average behaviors of jobs, and the average performance counters are sufficient to ensure a good estimation (validated by the results in Table 5); and (2) we want to keep the overhead small, so we try to keep the input features of the machine learning model compact. Modern processors provide hundreds of performance counters, but not all of them are useful in our case. To pick out the most important ones and reduce the input dimension, we use a random forest model to select the top performance counters (see details in Section 5.2). We include the number of threads of each job in the features because multi-threaded jobs are sensitive to the number of CPU cores allocated.
Consider a colocation \(c_i\) (i.e., the set of jobs assigned to the i-th server). Let \(|c_i|\) denote the number of jobs in \(c_i\) . For each \(1\le j\le |c_i|\) , let \(pmc_{i}^j\) denote the time-averaged performance counters of the j-th job when it runs alone on the server, and let \(d_{i}^j\) denote the number of threads of the j-th job. Let \(PMC_i=\lbrace pmc_{i}^1, pmc_{i}^2, \ldots , pmc_{i}^{|c_i|}\rbrace\) and \(D_i=\lbrace d_{i}^1, d_{i}^2, \ldots , d_{i}^{|c_i|}\rbrace\) . Let \(IPC_{i}^j\) denote the time-averaged IPC of the j-th job in \(c_i\) . The prediction model aims to predict \((IPC_{i}^1,IPC_{i}^2,\ldots ,IPC_i^{|c_i|})\) from the input \((PMC_i, D_i, p_i)\) .
The design of the colocation evaluating model is shown in Figure 2. Briefly, two fully-connected (fc) layers are used to encode the input features, and a padding mask is used to align the input vector to a fixed length for handling different colocation sizes. A multi-head attention module is used to synthesize the outputs of the fc layers. Multi-head attention [32] is a powerful mechanism for learning the dependencies among input features, and has been widely used in a variety of tasks including reading comprehension, abstractive summarization, and so on. The multi-head attention module is followed by another two fc layers, which generate the final output, i.e., the IPCs of the jobs.
Fig. 2. Design of the job colocation evaluating model.
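To make the architecture concrete, the following is a minimal sketch of such a model. PyTorch is an assumption (the paper only states a Python implementation), and the layer widths, MAX_JOBS, and feature dimensions are illustrative, not the paper's exact hyperparameters.

```python
# A minimal sketch of the colocation evaluating model (assumed PyTorch).
import torch
import torch.nn as nn

MAX_JOBS = 7          # assumed maximum colocation size
N_PMC = 20            # performance counters per job (Section 5.2)
N_PART = 4            # per-job partition fields (llc_l, llc_r, mem_bw, cores)
FEAT = N_PMC + 1 + N_PART  # counters + thread count + partitioning config

class ColocationEvaluator(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Two fc layers encode each job's feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(FEAT, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU())
        # Multi-head attention learns dependencies among colocated jobs.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Two fc layers map the attended features to a per-job IPC.
        self.head = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x, pad_mask):
        # x: (batch, MAX_JOBS, FEAT); pad_mask: True where a slot is padding,
        # which is how one network handles colocations of different sizes.
        h = self.encoder(x)
        h, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        return self.head(h).squeeze(-1)  # (batch, MAX_JOBS) predicted IPCs
```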

4.1.2 Searching the Optimal Job Assignment Configuration.

As mentioned earlier, the solution space of the job assignment problem is huge. Therefore, we propose a local search based heuristic algorithm to find a near-optimal solution. The details are presented in Algorithm 1. There are two steps: in the first step (lines 2 to 12), a greedy algorithm is used to quickly find an initial solution; in the second step (lines 14 to 23), a local search process is conducted to improve the solution.
The basic idea of the greedy algorithm is to assign jobs that are unlikely to incur resource contention to the same colocation. Intuitively, jobs with similar resource sensitivities are more likely to cause resource contention, because their resource demands are similar. As we use performance counters to characterize resource sensitivities, we define the similarity of resource sensitivities between two jobs via the Euclidean distance between the two vectors of performance counters of the two jobs: a smaller distance indicates a higher similarity, and vice versa. The greedy algorithm divides the jobs into \(N_{ser}\) colocations (one for each server), and determines the jobs for each colocation sequentially. To balance the jobs among colocations, there are \(\lfloor N_{job}/N_{ser} \rfloor + 1\) jobs in each of the first \(N_{job} \% N_{ser}\) colocations, and \(\lfloor N_{job}/N_{ser} \rfloor\) jobs in each of the remaining colocations. To determine the jobs for a colocation, we first randomly select a job among those not yet assigned; then, we iteratively choose jobs until a sufficient number has been selected. In each iteration, we choose a job such that: (1) the job has not been assigned; and (2) the total similarity between the job and the already selected jobs is the minimum, with ties broken randomly.
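The following is a minimal sketch of this greedy initialization, under the assumption that each job's profiled counters are available as a NumPy vector; the names `pmc` and `greedy_assign` are illustrative.

```python
# A minimal sketch of the greedy initial assignment (assumed NumPy).
import random
import numpy as np

def greedy_assign(jobs, pmc, n_servers):
    """Group jobs so that dissimilar (low-contention) jobs are colocated."""
    base, extra = divmod(len(jobs), n_servers)
    sizes = [base + 1] * extra + [base] * (n_servers - extra)
    unassigned, colocations = set(jobs), []
    for size in sizes:
        seed = random.choice(sorted(unassigned))   # random first job
        group = [seed]
        unassigned.discard(seed)
        while len(group) < size:
            # Minimum total similarity = maximum total Euclidean distance
            # between the candidate's counters and those of the group.
            best = max(unassigned, key=lambda j: sum(
                np.linalg.norm(pmc[j] - pmc[g]) for g in group))
            group.append(best)
            unassigned.discard(best)
        colocations.append(group)
    return colocations
```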
After the initial solution is obtained, we conduct a local search process to improve it. The local search process iterates for a number of rounds, and performs two local search operations in each round. The first operation is shift, which attempts to move one job from one colocation to another one. We first randomly select a colocation among the top \(x\%\) (a tunable parameter) colocations with the lowest average throughput, and randomly choose a job in the selected colocation. After that, we randomly select a target colocation among the top \(x\%\) colocations with the highest average throughput. The shift operation is accepted if the system throughput is improved after the operation is performed, and rejected otherwise. The system throughput refers to the total throughput of all the colocations, where the throughput of a colocation is defined as the total IPCs of the jobs in the colocation under the optimal partitioning configuration (i.e., the partitioning configuration that produces the maximum throughput). The second operation is swap, which attempts to exchange two jobs in two colocations, where the two jobs are randomly selected from the top \(x\%\) colocations with the lowest average throughput. Again, the operation is accepted if the system throughput is improved after it is performed, and rejected otherwise. The local search process terminates after \(rn_1\) rounds, where \(rn_1\) can be tuned according to practical requirements.
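A sketch of one local-search round is given below. It assumes a `throughput(colocation)` helper that returns the colocation's total IPC under its best partitioning configuration (approximated by the hill climbing sketch after the next paragraph); all names are illustrative.

```python
# A minimal sketch of one local-search round (shift, then swap).
import random

def local_search_round(colocations, throughput, x=0.3):
    ranked = sorted(colocations, key=throughput)       # ascending throughput
    k = max(1, int(len(colocations) * x))
    worst, best = ranked[:k], ranked[-k:]

    # Shift: move a random job from a low-throughput colocation to a
    # high-throughput one; keep the move only if total throughput improves.
    src, dst = random.choice(worst), random.choice(best)
    if src is not dst and src:
        job = random.choice(src)
        before = throughput(src) + throughput(dst)
        src.remove(job); dst.append(job)
        if throughput(src) + throughput(dst) <= before:
            dst.remove(job); src.append(job)           # reject

    # Swap: exchange two jobs between two low-throughput colocations.
    c1, c2 = random.sample(worst, 2) if len(worst) >= 2 else (None, None)
    if c1 and c2:
        j1, j2 = random.choice(c1), random.choice(c2)
        before = throughput(c1) + throughput(c2)
        c1[c1.index(j1)], c2[c2.index(j2)] = j2, j1
        if throughput(c1) + throughput(c2) <= before:
            c1[c1.index(j2)], c2[c2.index(j1)] = j1, j2  # reject
```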
In order to estimate the throughput of a colocation, we need to find the optimal partitioning configuration for the colocation. Searching for the exact optimum is time consuming, so we use a hill climbing algorithm to find an approximate solution. The details are presented in Algorithm 2. Initially, the resources are equally partitioned among the jobs in the colocation (line 1). Then, the algorithm runs for \(rn_2\) rounds (a tunable parameter). In each round, for each resource \(r\in \mathcal {R}\) , we randomly select two jobs and attempt to shift one unit of r from one job to the other. If the throughput is improved, we accept this operation (lines 8 to 10), and reject it otherwise.
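The sketch below mirrors this hill climbing procedure, simplifying each resource to an integer number of units (the paper's LLC allocation is a contiguous range of ways, represented by boundary indices); `predict_throughput` stands in for the colocation evaluating model of Section 4.1.1, and all names are illustrative.

```python
# A minimal sketch of the hill climbing partitioner (Algorithm 2).
import copy
import random

def hill_climb(colocation, resources, predict_throughput, rounds=250):
    n = len(colocation)
    # Start from an equal partition of every resource (line 1).
    part = [{r: units // n for r, units in resources.items()}
            for _ in range(n)]
    best = predict_throughput(colocation, part)
    for _ in range(rounds):
        for r in resources:
            a, b = random.sample(range(n), 2)
            if part[a][r] <= 1:          # keep at least one unit per job
                continue
            trial = copy.deepcopy(part)
            trial[a][r] -= 1             # shift one unit of r from a to b
            trial[b][r] += 1
            score = predict_throughput(colocation, trial)
            if score > best:             # accept only improving moves
                part, best = trial, score
    return part, best
```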
Complexity. Since we only need to store the knowledge of the jobs and servers, the space complexity of the job assignment algorithm is \(O(N_{job} + N_{ser})\) . For the time complexity, the hill climbing algorithm has a complexity of \(O(rn_2|\mathcal {R}|)\) . For the job assignment algorithm, the first step (i.e., the greedy algorithm) has a complexity of \(O(N_{ser}^2)\) . The mean size of colocations is around \(N_{job}/N_{ser}\) , so the second step (i.e., the local search algorithm) has a complexity of \(O(rn_1(N_{job}/N_{ser})^2 x\%)\) . Overall, the time complexity of the job assignment algorithm is \(O(N_{ser}^2 + rn_1 rn_2 |\mathcal {R}| (N_{job}/N_{ser})^2 x\%)\) .

4.2 Step II: Resource Partitioning

As the workloads of jobs change dynamically, the optimal resource partitioning configuration for each colocation varies over time. In order to achieve optimal system throughput consistently, we need an adaptive resource partitioning strategy that can tune the partitioning configuration dynamically according to workload variations. To address this issue, we leverage deep reinforcement learning (DRL) to build an end-to-end resource partitioning strategy. The DRL-based resource partitioning approach is very efficient: it directly generates the resource partitioning decision according to the real-time system state. Moreover, the DRL-based approach naturally has generalization ability, making it highly robust to dynamic workload changes.

4.2.1 Background of Deep Reinforcement Learning.

Reinforcement Learning (RL) is an approach in which an agent interacts with the environment and learns an optimal policy by trial and error. At each time step t, the agent receives a state \(S_t\) and selects an action \(A_t\) according to a policy \(\pi (A_t|S_t)\) . The policy is the strategy that the agent employs to determine the next action based on the current state, i.e., a mapping from states \(S_t\) to actions \(A_t\) . After the action is performed, the agent receives a scalar reward \(w_t\) , and the environment transitions to the next state \(S_{t+1}\) . The reward is the feedback reflecting how good the action \(A_t\) is in the state \(S_t\) (a larger reward indicates a better action). This process continues until the agent finds the optimal policy, which maximizes the discounted cumulative reward \(\sum _{t=0}^{\infty } \gamma ^t w_{t}\) , where \(\gamma \in (0,1]\) is the discount factor. Finding the optimal policy is crucial for RL, but it is not an easy task. A recently popular approach is to use a deep neural network (DNN) to approximate the policy, which is called Deep Reinforcement Learning (DRL). This is because a DNN has strong fitting ability to establish the mapping from input to output without relying on domain knowledge.
DRL is well suited to resource partitioning. Once the DRL model is well-trained, partitioning decisions can be made very quickly, because neural network inference is fast. Moreover, a well-trained neural network naturally generalizes to inputs that did not appear during training; in our context, it can adapt to unseen workloads, colocations, and jobs, making it highly robust to dynamic system state changes.

4.2.2 The DRL Model for Resource Partitioning.

A DRL model consists of the following concepts: environment, action, state, reward, and policy. In the context of resource partitioning, the concepts are defined as follows:
Environment. The DRL model works on a single server, for partitioning the shared resources among the jobs colocated on the server. So, we define the server as the environment.
State. The state is defined as the server state in our context, which is also the input of the DRL model. Defining the server state is non-trivial, because (1) it must reflect the workload characteristics of the jobs and the resource contention behaviors across jobs; and (2) it must be easy to collect, since the resource partitioning decision should be generated quickly. The server state in our DRL model is composed of three parts: the performance counters associated with the jobs, the current partitioning configuration, and the numbers of threads of the jobs. We include the number of threads of each job in the system state because multi-threaded jobs are sensitive to the number of CPU cores allocated.
Consider the i-th server in the datacenter. Denote by \(c_i\) the set of jobs colocated on the i-th server. We use \(PMC_{i,t}=(pmc_{i,t}^1, \ldots , pmc_{i,t}^{|c_i|})\) to denote the performance counters of the jobs in \(c_i\) at time step t, where \(pmc_{i,t}^j\) refers to the performance counters associated with the j-th job. We use \(p_{i,t}=\lbrace p_{i,t}^1,\ldots , p_{i,t}^{|c_i|}\rbrace\) to denote the resource partitioning configuration of \(c_i\) at time step t, where \(p_{i,t}^j\) refers to the resource allocation for the j-th job. So, the server state at time step t can be represented by \(S_{i,t} = \lbrace PMC_{i,t}, D_i, p_{i,t}\rbrace\) , where \(D_i\) denotes the numbers of threads of the jobs.
Action. The action is the output of the DRL model, which refers to the resource partitioning decision in our context. For the i-th server, we use \(A_{i,t} = \lbrace a_{i,t}^1, \ldots , a_{i,t}^{|c_i|}\rbrace\) to denote the action at time step t, where \(a_{i,t}^j\) refers to the resource allocation for the j-th job.
Reward. The goal of resource partitioning is to maximize the system throughput. Let \(S_{i,t}\) denote the state of the i-th server at time t. Suppose an action \(A_{i,t}\) (i.e., a new resource partitioning configuration) produced by the DRL model is applied to the i-th server at time t. Let \(S_{i,t+1}\) denote the state of the i-th server after \(A_{i,t}\) is applied. The reward \(w_{i,t}\) for \(A_{i,t}\) is defined as the server throughput in state \(S_{i,t+1}\) .
Policy. The DRL method uses a DNN to approximate the policy. The policy model maps the input (i.e., the server state) to the output (i.e., the partitioning decision). Figure 3 shows the design of the DRL policy model. Overall, the policy model employs an encoder-decoder architecture. The encoder first transforms the input to a high-dimensional vector through a stack of fully connected (fc) layers. Then, the transformed features are fed to the decoder network through an attention layer. The decoder adopts the Gated Recurrent Unit (GRU) to generate the partitioning decision. The GRU architecture is commonly used for sequential decision making; in our context, it determines the resource allocations of the jobs one by one in a sequential manner.
Fig. 3. Design of the DRL model.
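The sketch below follows Figure 3 at a high level. PyTorch is an assumption; the attention and GRU dimensions follow Section 5.2 where stated, and the rest (number of heads, state layout) are illustrative.

```python
# A minimal sketch of the encoder-decoder DRL policy (assumed PyTorch).
import torch
import torch.nn as nn

class PartitionPolicy(nn.Module):
    N_FIELDS = 4   # llc left boundary, llc right boundary, mem bw, cores
    N_LEVELS = 10  # 10-way discretization of each field (Section 5.2)

    def __init__(self, state_dim, d_model=256):
        super().__init__()
        # Stack of fc layers transforms the raw server state.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU())
        self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # GRU decodes the per-job allocations sequentially.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        # One categorical head per partition field.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, self.N_LEVELS) for _ in range(self.N_FIELDS))

    def forward(self, state):
        # state: (batch, n_jobs, state_dim); one decision per job, in order.
        h = self.encoder(state)
        h, _ = self.attn(h, h, h)
        h, _ = self.gru(h)                     # sequential decoding over jobs
        # For each job, emit four probability distributions.
        return [torch.softmax(head(h), dim=-1) for head in self.heads]
```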

4.2.3 Reward Prediction for Reducing Training Overhead.

DRL typically requires a large number of interactions between agent and environment to train the model. In the context of resource partitioning, a single interaction may take over 10 seconds, because we need to generate a new partitioning configuration according to the partially-trained DRL model, apply the new generated configuration to real system, observe the reward, and finally update the DRL model using the reward. Since the DRL model needs to be trained by tens of millions of interactions, the total training time is about \(O(10^5)\) hours, which is not feasible even if the training process is one-shot. So, a crucial issue of applying DRL to resource partitioning is how to reduce the training overhead.
To address this issue, we propose a prediction model to estimate the reward of a given action, without interacting with the real system. Suppose the i-th server is in state \(S_i = \lbrace PMC_i, D_i, p_i\rbrace\) at some time point, where \(PMC_i\) denotes the performance counters of jobs, \(p_i\) denotes the current resource partitioning configuration, and \(D_i\) denotes the numbers of threads of jobs. Given an action \(A_i\) (i.e., a resource partitioning decision), let \(IPC_{i}^j\) denote the IPC of the j-th job in \(c_i\) if \(A_i\) is applied to \(S_i\) . The prediction model takes \((S_i, A_i)\) as input to estimate \((IPC_i^1, IPC_{i}^2,\ldots , IPC_{i}^{|c_i|})\) , without applying \(A_i\) to the real system. The throughput of the colocation is defined as the total IPC of all the jobs, which can be computed by \(\sum _{j=1}^{|c_i|} IPC_{i}^j\) .
We leverage DNN to build the resource partitioning reward prediction model. Briefly, an fc layer is used to encode the input features at the current time step, and a multi-head attention module is used to synthesize the outputs of the fc layer. The multi-head attention module is followed by another fc layer, which generates the final output, i.e., the IPCs of the jobs at the next time step.
Training the reward prediction model still needs to interact with real systems to collect training samples. However, as the number of training samples required for training the prediction model is much less than that for training the DRL model, the prediction model can significantly reduce the training time of the DRL model (see details in Section 5.5.7).
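The following sketch shows how the reward predictor substitutes for real-system interaction in one training step. It uses a plain policy-gradient update for brevity (the paper trains with Advantage Actor-Critic), and `policy`, `reward_model`, and `sample_state` are placeholders for the components described above.

```python
# A minimal sketch of DRL training driven by the reward prediction model.
import torch

def train_step(policy, reward_model, sample_state, optimizer):
    state = sample_state()                    # replayed/profiled server state
    dists = [torch.distributions.Categorical(p) for p in policy(state)]
    actions = [d.sample() for d in dists]     # one level index per field
    # Estimate the reward (total predicted IPC) without touching a server.
    with torch.no_grad():
        reward = reward_model(state, actions).sum(dim=-1)
    # REINFORCE-style update for brevity; the paper uses A2C instead.
    log_prob = sum(d.log_prob(a)
                   for d, a in zip(dists, actions)).sum(dim=-1)
    loss = -(reward * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```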

4.3 Putting It All Together

Figure 4 presents the overview of JointOPT. The model training module takes charge of training the machine learning models used in JointOPT, including the colocation evaluating model, the reward prediction model, and the DRL-based resource partitioning model. It is worth noting that the machine learning models are trained using historical data, and this training is performed only once. For a given set of jobs, the job assignment module first computes the job assignment configuration and then runs the colocations on the servers. For each colocation, the DRL-based model is used to dynamically partition the resources among the colocated jobs.
Fig. 4. Overview of JointOPT, composed of the model training module, job assignment module, resource partitioning module, and model refining module.
Although the machine learning models can generalize to jobs never seen before, the accuracy decreases as the number of new jobs grows. Re-training the learning models from scratch is time consuming. To address this issue, we refine the models once the percentage of new jobs exceeds a certain threshold. Model refining is based on transfer learning: we retain the input layer and hidden layer parameters, and retrain the output layer, with part of its parameters randomly reinitialized. Since most of the parameters are unchanged, model refining is fast and incurs small overhead.
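A minimal sketch of the refining step is shown below, assuming PyTorch and a `model.head` attribute naming the output layer; both are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of transfer-learning-based model refining.
import torch

def refine(model, new_samples, epochs=5, lr=1e-4):
    for p in model.parameters():
        p.requires_grad = False          # retain input/hidden parameters
    for p in model.head.parameters():
        p.requires_grad = True           # only the output layer is retrained
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in new_samples:         # freshly collected (input, IPC) pairs
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```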

5 Evaluations

We conduct extensive experiments to evaluate the proposed framework and compare it with the state-of-the-art baselines. This section presents the details.

5.1 Experimental Settings

Hardware Platform. We implement and evaluate the proposed framework on servers with the configuration shown in Table 2. For all the experiments, the Hyper Threading and Turbo Boost features of the evaluated CPU are disabled.
Processor: Intel® Xeon Silver 4114, 2.20 GHz, 10 cores
L1 cache: private, 32KB per core, split D/I, 8 ways
L2 cache: private, 1MB per core, 16 ways
L3 cache: shared, 13.75MB, 11 ways
Memory: 128GB (4 × 32GB DIMMs), DDR4 2133MT/s
OS: CentOS 7.8 with Linux kernel 3.10.0
Table 2. Server Configuration
Resources to be Partitioned. In this paper, we particularly focus on the partitioning of last level cache (LLC), memory bandwidth and CPU cores, which are the most important resources affecting the job performance. However, our proposed framework is applicable to an arbitrary number of resources.
LLC is organized as a set of sequentially indexed cache ways. The partitioning of LLC is implemented with Intel’s CAT [26]. Specifically, the LLC allocation of a job must be a set of contiguously indexed cache ways (e.g., {1,2,3} or {3,4}). Formally, the LLC allocation is represented as \(llc = \lbrace llc^{l}, llc^{r}\rbrace\) , where \(llc^{l}\) is the index of the left boundary and \(llc^{r}\) is the index of the right boundary.
The partitioning of memory bandwidth is implemented with Intel’s MBA [6]. Specifically, for each job, we specify the maximum proportion of the total bandwidth that it can use. The maximum proportion is one of ten throttling levels: \(b \in \lbrace 10\%, 20\%, \ldots , 100\%\rbrace\) .
The CPU is partitioned at core granularity. The number of CPU cores allocated to a job is denoted by an integer. At least one CPU core will be allocated to each job. The CPU cores are exclusively allocated among jobs, i.e., each CPU core cannot be shared by multiple jobs.
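For illustration, the sketch below applies one job's partition through the Linux resctrl interface (which exposes Intel CAT and MBA) and CPU affinity. The paths, group layout, and single-socket id are assumptions about the deployment, not the paper's implementation.

```python
# A minimal sketch of applying one job's partition on Linux, assuming
# resctrl is mounted at /sys/fs/resctrl and the job's group exists.
import os

def apply_partition(group, pids, llc_l, llc_r, bw_percent, cores):
    """Apply one job's allocation via resctrl (CAT/MBA) and CPU affinity."""
    path = f"/sys/fs/resctrl/{group}"
    # CAT requires a contiguous run of ways; build the bitmask from the
    # inclusive left/right boundary indices.
    mask = ((1 << (llc_r - llc_l + 1)) - 1) << llc_l
    with open(f"{path}/schemata", "w") as f:
        f.write(f"L3:0={mask:x}\nMB:0={bw_percent}\n")  # socket 0 only here
    for pid in pids:                  # attach the job's threads to the group
        with open(f"{path}/tasks", "w") as f:
            f.write(str(pid))
    for pid in pids:                  # pin the threads to the allocated cores
        os.sched_setaffinity(pid, cores)
```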
Note that the DRL model only determines the number of CPU cores for each job, not which cores are allocated. We use a greedy algorithm to map CPU cores to jobs, aiming to let jobs keep their original CPU cores as much as possible to minimize the transition penalty. Specifically, if a job’s CPU cores need to be reduced, the cores to be released are randomly chosen from its original cores. If a job’s CPU cores need to be increased, the cores to be added are randomly chosen from those released by other jobs.
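A minimal sketch of this core-mapping step, assuming the total number of allocated cores is unchanged between the old and new configurations:

```python
# A minimal sketch of the transition-minimizing core mapping: each job keeps
# as many of its original cores as its new allocation allows, and grown jobs
# draw from the cores released by shrunk jobs.
import random

def remap_cores(old_map, new_counts):
    """old_map: job -> set of core ids; new_counts: job -> new core count.

    Assumes sum(new_counts.values()) equals the total number of cores
    in old_map, so the released pool exactly covers the grown jobs.
    """
    new_map, released = {}, []
    for job, cores in old_map.items():          # shrink first, collect cores
        keep = min(len(cores), new_counts[job])
        kept = set(random.sample(sorted(cores), keep))
        released += sorted(cores - kept)
        new_map[job] = kept
    random.shuffle(released)
    for job, kept in new_map.items():           # then grow from the pool
        while len(kept) < new_counts[job]:
            kept.add(released.pop())
    return new_map
```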
Workloads. We use a total of 120 jobs for evaluation, with 43 jobs from SPEC CPU2017 [3], 28 jobs from SPEC CPU2006 [2], and 49 jobs from pyperformance [1].
Performance Counters. The performance counters are selected using a random forest model [20]. Specifically, we randomly generate 800 colocations, with 200 colocations each of 4-jobs, 5-jobs, 6-jobs, and 7-jobs. We run each colocation for 60 seconds under a randomly generated resource partitioning configuration and measure the performance counters (around 400 counters) and the IPCs of jobs every 10 seconds. We use these samples to train a random forest model, which takes the performance counters as input to predict the corresponding IPCs of jobs. After that, we compute the Gini importance [5] of each performance counter, which indicates how strongly the counter affects the prediction accuracy.
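A minimal sketch of this selection step is shown below; scikit-learn is an assumption (the paper only names a random forest model), and `X`/`y` stand for the profiled counter and IPC samples described above.

```python
# A minimal sketch of counter selection via random forest importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def top_counters(X, y, counter_names, k=20):
    """X: (samples, ~400 counters); y: the measured IPCs per sample."""
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X, y)
    # feature_importances_ is the impurity-based (Gini) importance.
    order = np.argsort(forest.feature_importances_)[::-1]
    return [counter_names[i] for i in order[:k]]
```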
Table 3 shows how the prediction accuracy varies for our colocation evaluating model (denoted by CE) and reward prediction model (denoted by PR) when a different number of performance counters (with top Gini importance) is used. As can be seen, both models achieve the highest accuracy when 20 performance counters are used. Therefore, we use the top 20 performance counters with the highest Gini importance in our implementation. Table 4 gives the details of the selected performance counters.
K              5        10       15       20       25       30
CE accuracy    73.12%   76.53%   95.15%   96.43%   94.47%   90.35%
PR accuracy    59.71%   65.87%   85.25%   93.52%   90.13%   89.35%
Table 3. The Prediction Accuracy of the Colocation Evaluating Model (CE) and Reward Prediction Model (PR) when a Different Number (K) of Performance Counters is Used
1. ld_blocks.no_sr
2. inst_retired.prec_dist
3. uops_retired.retire_slots
4. L1-dcache-loads
5. LLC-loads
6. branch-misses
7. branch-loads
8. dTLB-load-misses
9. dTLB-loads
10. exe_activity.4_ports_util
11. branch-loads
12. branch-load-misses
13. ld_blocks.store_forward
14. inst_retired.total_cycles_ps
15. LLC-store-misses
16. cpu_clk_unhalted.thread_p
17. L1-dcache-load-misses
18. l2_rqsts.all_demand_miss
19. offcore_requests.demand_data_rd
20. frontend_retired.latency_ge_2_bubbles_ge_3
Table 4. The Performance Counters Selected for Characterizing Jobs

5.2 Implementation of JointOPT

Colocation Evaluating Model. The two fc layers have 64 and 1 neurons, respectively. We use over 20,000 samples, with 70% for training and 30% for testing.
Reward Prediction Model. The two fc layers at the input side have 512 and 512 nodes, respectively. We use the rectified linear unit (ReLU) [25] as the activation function and add a dropout layer after each fc layer to prevent over-fitting and enhance generalization. The hyperparameters of the multi-head attention layer are set according to [32]. The two fc layers at the output side have 512 and 64 nodes, respectively. We use over 20,000 samples, with 70% for training and 30% for testing.
DRL Model. Each of the fc layers in the encoder has 256 nodes. The GRU has a hidden size of 256 and an output layer composed of four vectors of 10 dimensions. The output layer represents four probability distributions, which are used to generate the allocation of LLC (represented by two values, i.e., the indices of the left and right boundaries), memory bandwidth, and CPU cores. The output of the GRU is then decoded by three fc layers with sizes (256, 256, 1). The training process iterates until it converges. In each iteration, we update the DRL model using 20,000 samples according to the Advantage Actor-Critic algorithm.
Model Refining. We refine the machine learning models using transfer learning once the percentage of new jobs exceeds 10%. For the colocation evaluating model, since the model is small, we update all the parameters of the neural network in the refining process. For the reward prediction model and the DRL model, since they are very large, we only update the last layer of the neural networks. Each time, we use 3,600 samples to refine the colocation evaluating model, and 12,000 samples to refine the reward prediction model and the DRL model.
Parameter Setting. We tested a variety of parameter settings in JointOPT, and found that the following settings achieve sufficiently good performance while keeping the overhead acceptable: the DRL model makes a decision every 10 seconds; the loop in the hill climbing algorithm (Algorithm 2) runs for 250 rounds; the local search process in the job assignment algorithm (Algorithm 1) runs for 500 rounds; and the parameter \(x\%\) is set to \(30\%\) .
The entire framework is implemented in Python. On each server, we leave one CPU core to run the framework and use the other 9 CPU cores to run colocations.

5.3 Baselines

We compare JointOPT with several state-of-the-art baselines, including both job assignment solutions and resource partitioning solutions.
The job assignment baselines include:
Random. In this method, each job is assigned to a randomly selected server.
Bin Packing. This method describes each job by a resource demand vector and strives to pack the jobs onto the servers such that load balance is achieved or the minimum number of servers is used. In our implementation, we define the resource demand of each job as its resource consumption when it runs alone on the server. We use the algorithm proposed in [33] to pack the jobs, which is a two-round heuristic algorithm aiming to achieve load balance.
Contention-aware. This method strives to assign the jobs onto the servers such that the performance interference among the colocated jobs is minimized. In our implementation, we use the algorithm proposed by [11] to assign the jobs. It is a greedy algorithm that assigns the jobs one by one, with each job assigned to the server that incurs the minimum performance interference.
The resource partitioning baselines include:
NoPart. NoPart simply runs the jobs on the server without considering resource partitioning.
DCAPS. DCAPS [34] partitions only the LLC among the colocated jobs. It determines a partitioning decision by a simulated-annealing-based search, where the performance of each partitioning decision is estimated by a performance model built on domain knowledge.
CLITE. CLITE [29] leverages Bayesian Optimization to build an approximate performance model online and uses it to guide an intelligent search for the near-optimal resource partitioning configuration. In our implementation, we consider partitioning three resources: CPU cores, LLC, and memory bandwidth.
DRLPart. DRLPart is the preliminary version of JointOPT, which considers a single colocation and uses the DRL-based model to partition resources among the colocated jobs.

5.4 Performance Metrics

Given a set of jobs, our goal is to maximize the system throughput which is defined as the total time-averaged IPC of all the jobs. If a job is multi-threaded, the IPC refers to the total IPC of all the threads. A higher throughput indicates a better system performance.
By default, we run each colocation for a period of 305 seconds, with the first 5 seconds for warm-up. To run a colocation, we start the jobs in the colocation simultaneously and immediately restart any job that completes early. We measure the system state and system throughput every 10 seconds and compute the mean throughput over the examining period. We repeat each experiment five times, and report the mean and standard deviation of the results.

5.5 Results and Discussion

5.5.1 Performance of the Prediction Models.

We first show the performance of the prediction models, including the colocation evaluating model and the DRL reward prediction model. We take the mean accuracy and RMSE (root-mean-square error) as metrics to evaluate the two models. The mean accuracy is defined as \(1-\frac{1}{n}\sum _{i=1}^n |y_i-y^{\prime }_i|/y_i\) , where \(y_i\) is the actual observed value and \(y^{\prime }_i\) is the predicted value. RMSE is a frequently used measure of the variance of prediction errors, defined as \(\sqrt {\frac{1}{n}\sum _{i=1}^n (y_i-y^{\prime }_i)^2}\) . Generally, a smaller RMSE indicates a higher prediction accuracy.
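For concreteness, the two metrics can be computed as follows, assuming mean accuracy is one minus the mean relative error as defined above:

```python
# A minimal sketch of the two evaluation metrics (assumed NumPy).
import numpy as np

def mean_accuracy(y_true, y_pred):
    # One minus the mean relative prediction error.
    return 1.0 - np.mean(np.abs(y_true - y_pred) / y_true)

def rmse(y_true, y_pred):
    # Root-mean-square error of the predictions.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```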
The results for the colocation evaluating model and the DRL reward prediction model are shown in Tables 5 and 6, respectively. As can be seen, the prediction accuracy is very high (over 90%) for both models over different colocation sizes. We also observe that, for both models, the prediction accuracy for colocations of 4-jobs and 7-jobs is slightly worse than for colocations of 5-jobs and 6-jobs. This is because: (1) the difficulty of prediction increases as the colocation size grows, due to a higher input dimension and a more complex relationship; and (2) an individual job in a smaller colocation has more resource allocation configurations and thus a more diverse performance, which also makes prediction more difficult.
                 4-jobs   5-jobs   6-jobs   7-jobs
Mean Accuracy    95.39%   95.48%   96.8%    96.43%
RMSE             0.040    0.034    0.029    0.032
Table 5. Mean Accuracy and RMSE of the Colocation Evaluating Model for Different Colocation Sizes
                 4-jobs   5-jobs   6-jobs   7-jobs
Mean Accuracy    92.87%   95.43%   96.83%   93.52%
RMSE             0.096    0.037    0.028    0.051
Table 6. Mean Accuracy and RMSE of the DRL Reward Prediction Model for Different Colocation Sizes

5.5.2 Impact of Parameters.

We next show how we determine the number of rounds (i.e., the parameter \(rn_2\) ) that the hill climbing algorithm (Algorithm 2) runs for finding the optimal partitioning configuration. We randomly generate 360 colocations with sizes ranging from 4-jobs to 7-jobs. Then, we run Algorithm 2 for each colocation with different numbers of rounds, and compute the mean throughput of the colocations for each colocation size and each number of rounds. Figure 5(a) shows the results. We have several observations. First, the throughput grows as the number of rounds increases for all colocation sizes, because more iterations give more chances to find a better partitioning configuration. Second, the algorithm converges more slowly as the colocation size increases. This is because the solution space grows with the colocation size, and exploring a larger solution space is more difficult. Third, the algorithm converges at 250 rounds for all the colocation sizes. Based on the results, we set the number at 250 in the experiments.
Fig. 5. (a) Impact of running rounds for the hill climbing algorithm; (b) Impact of running rounds for job assignment algorithm; (c) Impact of the parameter \(x\%\) for job assignment algorithm.
We then show how we determine the number of rounds (i.e., the parameter \(rn_1\) ) that the job assignment algorithm (Algorithm 1) runs. We randomly generate three groups of jobs, with 100, 300, and 500 jobs, respectively, and run them on 15, 50, and 90 servers. We run the job assignment algorithm with a different number of rounds for each group of jobs, and compute the system throughput for each group and each number of rounds. Figure 5(b) shows the results. We have several observations. First, the throughput grows as the number of rounds increases for all the groups of jobs, because more rounds give more chances to find a better solution. Second, the algorithm converges more slowly as the number of jobs increases. This is because the solution space grows with the number of jobs, and exploring a larger solution space is more difficult. Third, the algorithm converges at 500 rounds for all the groups. Based on the results, we set the number at 500 in the experiments.
Next, we show how we determine the parameter \(x\%\) in the job assignment algorithm (Algorithm 1). Recall that the jobs in the local search operations are selected from the top \(x\%\) colocations with the highest/lowest throughput (such colocations are more crucial for bringing throughput improvement). We randomly generate 500 jobs and divide them into 90 colocations using the job assignment algorithm. We run the algorithm with different values of \(x\%\) and compute the mean throughput of the resulting colocations. Figure 5(c) shows the results. As can be seen, JointOPT achieves the best performance for \(x\% = 30\%\) . If \(x\%\) is too small, the algorithm converges more quickly, but some important colocations may be missed. In contrast, if \(x\%\) is too large, the local search process has low efficiency, because many unimportant colocations will be measured. Based on the results, we set \(x\%=30\%\) in all the other experiments.

5.5.3 Overall Performance of JointOPT.

We next show the overall performance of JointOPT compared to the baselines. We generate 2,200 jobs and assign them to 400 servers. Figure 6 compares the throughput of JointOPT with the resource partitioning baselines when they are integrated with different job assignment methods. We have several observations from the results. First, for all the job assignment methods, NoPart always performs worst among the baselines. This is because NoPart incurs the largest performance interference, since resource partitioning is not considered. This result confirms the importance of resource partitioning. Second, for all the job assignment methods, DRLPart always performs best among the baselines, which demonstrates the effectiveness of our DRL-based resource partitioning solution. The inefficiency of DCAPS is mainly because it partitions only one resource, while DRLPart partitions multiple resources coordinately. The problem with CLITE is mainly that it evaluates partitioning configurations in real systems, which is inefficient when exploring a large solution space.
Fig. 6. Performance of different resource partitioning baselines integrated with different job assignment methods, compared with the performance of JointOPT (denoted by the red line).
Third, NoPart, DCAPS, and CLITE achieve higher throughput when integrated with our job assignment method than with the other job assignment methods. This is because: (1) the random and bin packing job assignment methods do not consider the resource sensitivities of jobs, and may colocate jobs with serious mutual performance interference; and (2) the contention-aware job assignment method does not consider resource fungibility, and thus cannot fully utilize the server resources. These results demonstrate the effectiveness of our job assignment method. Fourth, JointOPT always achieves the highest throughput compared to the baselines, with advantages from 13.3% to 47.7%, which demonstrates the effectiveness of optimizing job assignment and resource partitioning jointly.

5.5.4 Scalability.

Figure 8 breaks down the results according to different colocation sizes, where the y-axis represents the throughput improvement of each approach over NoPart. All the resource partitioning methods are integrated with the random job assignment method. We have three observations from the results. First, JointOPT always outperforms the other approaches for all the colocation sizes, which demonstrates the good scalability of JointOPT. Second, the advantage of JointOPT over the baselines differs across colocation sizes. Specifically, JointOPT achieves the highest benefit for the colocation size of 5-jobs (57.4%), and less benefit for large colocation sizes (e.g., 35.5% for 7-jobs). This is because each job is allocated fewer resources in a large colocation, so the jobs have poor performance no matter how the resources are partitioned. Third, CLITE performs as badly as NoPart for colocations larger than 5-jobs. This is because the time required by CLITE to generate a decision grows exponentially as the colocation size increases. When the colocation size is larger than 5-jobs, CLITE cannot even make one partitioning decision during the examining period.
We do not consider colocations larger than 7-jobs, because (1) due to the hardware limitation of Intel’s MBA technology (the number of CLOSes in MBA is limited to 8 in our platform), our platform can only deploy fewer than 8 jobs if we want to partition the memory bandwidth; and (2) we only have 9 CPU cores to assign jobs and at least one CPU core should be allocated to each job.

5.5.5 Impact of Colocation Types.

In this experiment, we show how the performance of each approach is impacted by the colocation types. We say a job is sensitive to a specific resource if its performance degradation exceeds 15% when the amount of the assigned resource decreases from the maximum to the minimum. If a job is sensitive to only one resource, we say it is single-sensitive (Single-S). If a job is sensitive to more than one resource, we say it is multiple-sensitive (Multi-S). If a job is not sensitive to any resource, we say it is insensitive (IN-S). We divide colocations into different types according to the types of jobs contained in each colocation. Table 7 summarizes the definition of each colocation type.
Type A: Multi-S jobs only
Type B: Multi-S and Single-S jobs
Type C: Multi-S and IN-S jobs
Type D: IN-S jobs only
Type E: Single-S jobs only
Type F: Single-S and IN-S jobs
Table 7. Definitions of Colocation Types. Each type is defined by the kinds of jobs it contains.
We randomly generate 100 colocations for each colocation type, with colocation sizes ranging from 4-jobs to 7-jobs. Figure 7 shows the throughput improvement over NoPart achieved by each approach for colocations of 4-jobs and 7-jobs. We have two observations from the results. First, for most of the approaches, the throughput improvement is more significant for the colocation types containing resource-sensitive jobs than for the others. For example, for colocations of 4-jobs, the throughput improvement of our approach is 45.0%, 51.5%, 51.2%, and 25.6% for Type-A, Type-B, Type-C, and Type-E, respectively, while it is only 15.3% and 5.0% for Type-D and Type-F. This is because the jobs in colocations of Type-D and Type-F are less sensitive to resources, so the benefit of resource partitioning is insignificant. Second, similar to the previous results, CLITE has no benefit over NoPart for colocations of 7-jobs, which confirms that CLITE fails to handle large colocation sizes due to its high computational overhead.
Fig. 7. The throughput improvement of each resource partitioning approach over NoPart for different colocation types. All the partitioning approaches are integrated with random job assignments.
Fig. 8. Performance of JointOPT over different colocation sizes, compared with the resource partitioning baselines.

5.5.6 Effectiveness of Model Refining.

In this experiment, we randomly generate 360 colocations with sizes from 4-jobs to 7-jobs, which contain a proportion of new jobs. We compare the performance of JointOPT with and without model refining under different proportions of new jobs. In the implementation, we refine the models once the proportion of new jobs exceeds 10%. Figure 9 shows the results. We observe that if model refining is not used, the performance of JointOPT gets worse as the proportion of new jobs increases, although the degradation is not severe (4.2% for 30% new jobs). This implies that although JointOPT generalizes well to new jobs, it still suffers from performance loss when facing a large portion of new jobs. JointOPT with model refining consistently achieves good performance over different proportions of new jobs, which confirms the effectiveness of model refining.
Fig. 9. Performance of JointOPT with/without model refining.

5.5.7 Overhead.

Model Training. All the machine learning models are trained on an Nvidia® GeForce® RTX 2080 Ti GPU. Training the colocation evaluating model takes around 50 minutes, and training the DRL reward prediction model takes around 3 hours. Training the DRL model converges in around 1,000 iterations, which takes around 16 hours. Recall that without the reward prediction model, training the DRL model would take around \(10^5\) hours; the reward prediction model thus reduces the interaction overhead by more than three orders of magnitude. The space consumption of model training is around 2 GB, but training is offline and one-shot.
Job Assignment. The inference time of the colocation evaluating model is around 300 microseconds per query. The hill climbing algorithm takes around 0.075s to find the optimal partitioning configuration for a colocation of 7 jobs. The total running time of the job assignment algorithm is around 37.5s for 500 jobs. Note that throughput-oriented jobs generally have no strict response latency requirements, so the running time of the job assignment algorithm is not an issue. The space consumption of job assignment is around 6 MB.
Resource Partitioning. The DRL-based model makes partitioning decisions very quickly, taking less than 2s per decision on an Nvidia® GeForce® RTX 2080 Ti GPU. It is much faster than DCAPS (which consumes more than 10s due to its heuristic search using simulated annealing) and CLITE (which consumes several minutes due to its high computational overhead). Our approach is therefore better suited to handling fast-paced workload changes. The space consumption of the DRL-based framework is around 7 MB.
Model Refining. Refining the machine learning models takes much less time than training them: around 10 minutes, 0.5 hours, and 3 hours for the colocation evaluating model, the reward prediction model, and the DRL model, respectively. Compared to training the models from scratch, transfer learning thus yields speedups of 5× (50 minutes vs. 10 minutes), 6× (3 hours vs. 0.5 hours), and about 5× (16 hours vs. 3 hours), respectively. The space consumption of model refining is around 972 KB.

6 Discussions

6.1 Application Scenario

The proposed approach is designed for improving the throughput of servers running non-latency-critical applications such as HPC workloads (e.g., scientific computing) or batch jobs (e.g., data analysis, machine learning tasks). However, it cannot be applied to scenarios where application running times are very short (e.g., serverless functions), because the partitioning decision is made according to real-time performance counters, which take a few seconds to collect. We assume that resources can be arbitrarily partitioned among jobs, i.e., there is no resource cap for each job.
For job assignment, we need prior knowledge of the jobs, namely the performance counters collected during a short profiling period (an illustrative profiling pass is sketched below). For resource partitioning, no prior knowledge of jobs is required, since the inputs of the DRL model can be collected online. Training the prediction models and the DRL model requires prior knowledge of jobs, but the training process is one-shot.
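As an illustration of the profiling step, the snippet below collects hardware counters for a running job with Linux perf [4]; the event set and the 5-second window are our assumptions, not the exact configuration used by JointOPT.

```python
# Illustrative short profiling pass with Linux perf [4]. The event list
# and duration are assumptions.
import subprocess

EVENTS = "instructions,cycles,LLC-loads,LLC-load-misses"

def profile_job(pid, seconds=5):
    """Attach perf stat to a running process for a short window and return
    the raw counter summary (perf stat prints it to stderr)."""
    result = subprocess.run(
        ["perf", "stat", "-e", EVENTS, "-p", str(pid),
         "--", "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    )
    return result.stderr
```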

6.2 More Comprehensive Performance Metrics

Note that JointOPT is not limited to optimizing a single performance metric (i.e., the throughput); it can also serve more comprehensive performance metrics. To this end, we only need to design the heuristic algorithms according to the new metric and change the definition of the reward when training the machine learning models. For example, to maximize throughput while guaranteeing fairness (all jobs have similar slowdowns compared to running alone), we only need to: (1) discard partitioning configurations that violate fairness in the heuristic algorithms; and (2) assign a negative reward to partitioning decisions without a fairness guarantee when training the DRL model.
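For example, a fairness-aware reward could look like the sketch below; the 0.05 threshold matches the experiment that follows, while the penalty magnitude and the use of raw throughput as the base reward are assumptions.

```python
# Sketch of a fairness-aware DRL reward. The 0.05 threshold follows the
# experiment below; the penalty value and reward scaling are assumptions.
import statistics

def fairness_aware_reward(throughput, slowdowns,
                          fairness_threshold=0.05, penalty=-1.0):
    """slowdowns: per-job slowdown relative to running alone."""
    if statistics.stdev(slowdowns) > fairness_threshold:
        return penalty      # negative reward: fairness is not guaranteed
    return throughput       # otherwise optimize throughput as before
```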
Table 8 shows the performance of JointOPT for the new performance metric (maximizing throughput with a fairness guarantee). In this experiment, we say fairness is guaranteed if the standard deviation of the slowdowns of the jobs is below 0.05. We randomly generate 100 colocations and report the average performance for each colocation size. We observe that the standard deviation of slowdowns is below 0.05 for all colocation sizes, implying that JointOPT can always guarantee fairness. Moreover, the throughput improvement of JointOPT over NoPart remains significant, from 29% to 43% (the other baselines are not compared because they are not applicable to the new metric), implying that JointOPT can still maximize throughput while guaranteeing fairness. These results confirm that JointOPT remains effective for the new performance metric.
Table 8. Performance Improvement of JointOPT over NoPart for the New Performance Metric (Maximizing Throughput with a Fairness Guarantee)

                   4-jobs   5-jobs   6-jobs   7-jobs
Standard Dev.      0.043    0.045    0.046    0.045
Throughput Imp.    39%      43%      33%      29%

6.3 Dynamic Job Arrivals and Departures

The current JointOPT assumes that jobs arrive simultaneously and processes them in batches, but it can easily be extended to support dynamic job arrivals and departures. For example, when a new job arrives, the job colocating algorithm can simply assign it to the (already loaded) server that yields the highest throughput, as estimated by the performance prediction model. The new job can be started with an arbitrary initial resource allocation, after which the DRL model generates partitioning decisions directly from real-time performance counters. A minimal sketch of this greedy placement follows.
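This is the sketch, assuming a predictor with the interface shown (both the server abstraction and the predictor are hypothetical):

```python
# Greedy placement of a newly arrived job, as described above. The server
# objects and the throughput predictor are hypothetical interfaces.
def place_new_job(new_job, servers, predict_throughput):
    """Assign new_job to the server maximizing predicted throughput."""
    best = max(servers,
               key=lambda s: predict_throughput(s.running_jobs + [new_job]))
    best.start(new_job)   # arbitrary initial partition; the DRL model then
                          # repartitions online from performance counters
    return best
```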

6.4 Heterogeneous Environment

Although our problem setting assumes homogeneous servers, the proposed framework can easily be extended to heterogeneous servers. To this end, we only need to train separate machine learning models (the colocation evaluating model, the reward prediction model, and the DRL model) for each server type, as sketched below. Moreover, the proposed framework does not assume any specific properties of jobs, so it is applicable to both homogeneous and heterogeneous jobs.
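A tiny sketch of the per-server-type model registry this implies (all names are hypothetical):

```python
# Hypothetical per-server-type model registry for a heterogeneous cluster.
_models = {}  # server_type -> (colocation_model, reward_model, drl_model)

def models_for(server_type, train_models_for):
    """Lazily train and cache the three models for each server type."""
    if server_type not in _models:
        _models[server_type] = train_models_for(server_type)  # offline, one-shot
    return _models[server_type]
```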

7 Related Work

Workload consolidation has been extensively studied for improving system performance and resource utilization. Earlier studies generally share server resources fully among the colocated jobs, which can cause significant performance degradation due to resource contention [21, 23, 24, 31, 38, 39, 40].
To avoid resource contention, a commonly used approach is to allocate dedicated resources to each job according to its resource demand and consolidate jobs onto servers in a bin packing manner [18, 33, 37]. However, this approach cannot adapt to dynamic workload changes since it assumes a fixed resource demand for each job. Moreover, it ignores the resource fungibility of jobs and the resource complementarity among different jobs, so it cannot maximize server utilization.
Many works have investigated more efficient and robust resource partitioning solutions. Earlier studies usually establish dedicated analytical models, relying on extensive domain knowledge, to estimate partitioning configurations, and use handcrafted heuristics to search for near-optimal configurations [14, 27, 34, 35, 36]. However, such solutions typically handle only a single resource and cannot easily be extended to multiple resources, which can sacrifice significant performance.
Approaches for partitioning multiple resources in a coordinated manner have also been studied. A simple approach is to dynamically tune the partitioning configuration online according to feedback from the real system [9, 12, 22, 28]. However, this approach is inefficient at exploring a large solution space, because it must evaluate a large number of partitioning configurations on real systems, which is very time consuming. CLITE [29] leverages Bayesian Optimization to make the online tuning process more intelligent, but Bayesian Optimization loses efficiency in high-dimensional spaces and therefore cannot handle large colocations or fast-paced workload changes.
Machine learning has been applied to resource management problems in recent years [16]. For example, the works in [10, 13] leverage general machine learning models to predict the performance interference among colocated jobs. The works in [7, 15] use RL models to place jobs in clusters; however, they focus on scheduling and do not consider resource partitioning. The work in [17] adopts a multi-agent RL approach to partition cache among colocated jobs, but it can only handle a single resource. Our previous work [8] proposed a DRL-based resource partitioning framework, but it does not consider job assignment. Table 9 summarizes the comparison between our approach and the most closely related works.
Table 9. Summary of the Related Work Closest to JointOPT, with Respect to Various Evaluation Criteria

                                  [33] [11] [34] [29] [8] [37] [19] [18] [9] [23] [30] Our
scenario    cluster wide           ✓    ✓                  ✓    ✓    ✓         ✓    ✓    ✓
            single server                    ✓    ✓    ✓                  ✓               ✓
objective   throughput             ✓    ✓    ✓    ✓    ✓                  ✓    ✓    ✓    ✓
            cost                                           ✓    ✓    ✓
problem     job placement          ✓    ✓                  ✓    ✓    ✓         ✓    ✓    ✓
            resource partition               ✓    ✓    ✓                  ✓               ✓
isolation   resource shared             ✓                                      ✓    ✓
            resource isolated      ✓         ✓    ✓    ✓   ✓    ✓    ✓    ✓               ✓
method      heuristics             ✓    ✓    ✓             ✓    ✓    ✓    ✓    ✓    ✓
            intelligent methods                   ✓    ✓                                  ✓

A ✓ marks the criteria addressed by each work.

8 Conclusion

We propose JointOPT, a datacenter-wide resource management framework that addresses job assignment and resource partitioning jointly. For job assignment, we use a local search based algorithm to find the near optimal job assignment configuration. We build a deep learning based prediction model to estimate the performance of colocations, which can significantly reduce the interaction overhead with real systems. For resource partitioning, we leverage deep reinforcement learning (DRL) to build the resource partitioning model, which is effective, efficient, and adaptive to dynamic workload changes. Experimental results show that the proposed framework outperforms the state-of-the-art baselines significantly.
There are several further directions that we would like to investigate. First, we want to extend the proposed framework to the scenario where cloud service dependability and QoS must be maintained. Second, we want to extend the proposed framework to the new cloud computing architectures such as microservice and serverless computing. Third, we want to employ explainable AI approaches to ensure the AI-based framework makes informed resource management decisions. Fourth, we want to extend the proposed framework to edge computing or quantum computing.

References

[1]
2006. The Python Performance Benchmark Suite. https://pyperformance.readthedocs.io/.
[2]
2006. The SPEC CPU® 2006 Benchmark Suite. https://www.spec.org/cpu2006/.
[3]
2017. The SPEC CPU® 2017 Benchmark Suite. https://www.spec.org/cpu2017/.
[4]
2020. perf: Linux Profiling with Performance Counters. https://perf.wiki.kernel.org/index.php/.
[6]
Andrew J. Herdrich, Khawar M. Abbasi, and Marcel D. Cornu. 2019. Introduction to Memory Bandwidth Allocation. https://software.intel.com/en-us/articles/introduction-to-memory-bandwidth-allocation.
[7]
Yixin Bao, Yanghua Peng, and Chuan Wu. 2019. Deep learning-based job placement in distributed machine learning clusters. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 505–513.
[8]
Ruobing Chen, Jinping Wu, Haosen Shi, Yusen Li, Xiaoguang Liu, and Gang Wang. 2020. DRLPart: A deep reinforcement learning framework for optimally efficient and robust resource partitioning on commodity servers. In Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing. 175–188.
[9]
Shuang Chen, Christina Delimitrou, and José F. Martínez. 2019. PARTIES: QoS-aware resource partitioning for multiple interactive services. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). 107–120.
[10]
Yuxia Cheng, Wenzhi Chen, Zonghui Wang, and Yang Xiang. 2017. Precise contention-aware performance prediction on virtualized multicore system. Journal of Systems Architecture 72 (2017), 42–50.
[11]
Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13), Vol. 48. ACM, 77–88.
[12]
Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-efficient and QoS-aware cluster management. ACM SIGPLAN Notices 49, 4 (2014), 127–144.
[13]
Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2019. Poise: Balancing thread-level parallelism and memory system performance in GPUs using machine learning. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2019), 492–505.
[14]
Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, and Daniel Sanchez. 2018. KPart: A hybrid cache partitioning-sharing technique for commodity multicores. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 104–117.
[15]
Yuanxiang Gao, Li Chen, and Baochun Li. 2018. Spotlight: Optimizing device placement for training deep neural networks. In International Conference on Machine Learning. 1662–1670.
[16]
Sukhpal Singh Gill, Minxian Xu, Carlo Ottaviani, et al. 2022. AI for next generation computing: Emerging trends and future directions. Internet of Things 19 (2022), 100514.
[17]
Rahul Jain, Preeti Ranjan Panda, and Sreenivas Subramoney. 2017. A coordinated multi-agent reinforcement learning approach to multi-level cache co-partitioning. In Design, Automation & Test in Europe Conference & Exhibition (DATE’17). IEEE, 800–805.
[18]
Ayaz Ali Khan, Muhammad Zakarya, Rajkumar Buyya, Rahim Khan, Mukhtaj Khan, and Omer Rana. 2021. An energy and performance aware consolidation technique for containerized datacenters. IEEE Transactions on Cloud Computing 9, 4 (2021), 1305–1322.
[19]
Xin Li, Jie Wu, Shaojie Tang, and Sanglu Lu. 2014. Let’s stay together: Towards traffic aware virtual machine placement in data centers. In IEEE INFOCOM 2014. 1842–1850.
[20]
Yi Lin and Yongho Jeon. 2006. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association 101, 474 (2006), 578–590.
[21]
David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. 2014. Towards energy proportionality for large-scale latency-critical workloads. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA’14). 301–312.
[22]
David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. 2015. Heracles: Improving resource efficiency at scale. In International Symposium on Computer Architecture (ISCA’15), Vol. 43. ACM, 450–462.
[23]
Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. 2011. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. 248–259.
[24]
Nikita Mishra, John D. Lafferty, and Henry Hoffmann. 2017. ESP: A machine learning approach to predicting application interference. In 2017 IEEE International Conference on Autonomic Computing (ICAC’17). 125–134.
[25]
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.
[26]
Khang T. Nguyen. 2019. Introduction to Cache Allocation Technology in the Intel® Xeon® Processor E5 v4 Family. https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology/.
[27]
Konstantinos Nikas, Nikela Papadopoulou, Dimitra Giantsidi, Vasileios Karakostas, Georgios Goumas, and Nectarios Koziris. 2019. DICER: Diligent cache partitioning for efficient workload consolidation. In Proceedings of the 48th International Conference on Parallel Processing. 15.
[28]
Jinsu Park, Seongbeom Park, and Woongki Baek. 2019. CoPart: Coordinated partitioning of last-level cache and memory bandwidth for fairness-aware workload consolidation on commodity servers. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–10.
[29]
T. Patel and D. Tiwari. 2020. CLITE: Efficient and QoS-aware co-location of multiple latency-critical jobs for warehouse scale computers. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA’20). 193–206.
[30]
Francisco Romero and Christina Delimitrou. 2018. Mage: Online and interference-aware scheduling for multi-scale heterogeneous systems. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT’18). Article 19, 13 pages.
[31]
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13). 351–364.
[32]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[33]
Zhaorui Wu, Yuhui Deng, Hao Feng, Yi Zhou, and Geyong Min. 2021. Blender: A traffic-aware container placement for containerized data centers. In 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE’21). 986–989.
[34]
Yaocheng Xiang, Xiaolin Wang, Zihui Huang, Zeyu Wang, Yingwei Luo, and Zhenlin Wang. 2018. DCAPS: Dynamic cache allocation with partial sharing. In Proceedings of the Thirteenth EuroSys Conference 2018. 13.
[35]
Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, and Zhenlin Wang. 2019. EMBA: Efficient memory bandwidth allocation to improve performance on Intel commodity processor. In Proceedings of the 48th International Conference on Parallel Processing. 16.
[36]
Cong Xu, Karthick Rajamani, Alexandre Ferreira, Wesley Felter, Juan Rubio, and Yang Li. 2018. dCat: Dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service. In Proceedings of the Thirteenth EuroSys Conference 2018. 14.
[37]
Muhammad Zakarya, Lee Gillam, Khaled Salah, Omer Rana, Santosh Tirunagari, and Rajkumar Buyya. 2022. CoLocateMe: Aggregation-based, energy, performance and cost aware VM placement and consolidation in heterogeneous IaaS clouds. IEEE Transactions on Services Computing (2022), 1–14.
[38]
Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. 2013. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13). 379–391.
[39]
Yunqi Zhang, Michael A. Laurenzano, Jason Mars, and Lingjia Tang. 2014. SMiTe: Precise QoS prediction on real-system SMT processors to improve utilization in warehouse scale computers. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 406–418.
[40]
Jiacheng Zhao, Huimin Cui, Jingling Xue, and Xiaobing Feng. 2016. Predicting cross-core performance interference on multicore processors with regression analysis. IEEE Transactions on Parallel and Distributed Systems 27, 5 (2016), 1443–1456.
[41]
Haishan Zhu and Mattan Erez. 2016. Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’16). 33–47.
