
Optimizing Resource Management for Shared Microservices: A Scalable System Design

Published: 13 February 2024
Abstract

    A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can exhibit significant differences in their impact on end-to-end performance. To make resource management more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements.
    We present an efficient resource management system, namely Erms, for guaranteeing SLAs with high probability in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also designs new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5× and more importantly, lead to a reduction in resource usage by 1.6×, compared to state-of-the-art approaches.

    1 Introduction

Recent years have witnessed a rapid emergence and wide adoption of microservice architecture in cloud data centers [4, 14, 15]. Compared to the conventional monolithic architecture that runs different components of a service within a single application, a microservice system decouples an application into multiple small pieces for ease of management, maintenance, and update [18, 25, 42, 43, 48]. Due to this, microservices are light-weight and loosely-coupled. As a consequence, when microservice architectures are exposed to growing load, the system manager can locate individual microservices that may experience heavy load and scale them independently instead of scaling the whole application [13, 16].
Despite the flexibility, microservice architecture brings several new challenges in providing service-level agreement (SLA) guarantees with efficient resource management. First, a service request may need to be processed by hundreds of microservices [1, 28]. These microservices can form a complex dependency graph consisting of parallel, sequential, and even alternative executions, as shown in Figure 1. It becomes extremely difficult to manage resources at the granularity of microservices so as to maximize resource efficiency and, in the meanwhile, ensure the end-to-end SLA. Second, microservice containers [10] are usually colocated with batch applications [26]. Resource interference can degrade the performance of different microservices to different extents, since some microservices are more sensitive to interference than others. Third, resource interference can further cause performance imbalances between containers of the same microservice, especially when the workload is heavy, as a microservice usually comprises hundreds to thousands of containers.
Fig. 1. A microservice dependency graph shows how microservice T interacts with other microservices. T calls downstream microservices Url and U in parallel and then calls Url and C one after another. The end-to-end latency measures the duration between T receiving the request and returning the result.
Existing approaches provide SLA guarantees for microservice management via handcrafted heuristics, reinforcement learning approaches, or deep learning algorithms [7, 23, 36, 38, 41, 46, 47]. In particular, several heuristics adopt the average and covariance of microservice response time to determine the contribution that each microservice makes toward guaranteeing the end-to-end SLA requirement [23, 47]. One fundamental limitation of such solutions is that the derived contributions are fixed and do not change with the dynamic workload. Reinforcement learning approaches need substantial effort to label critical microservices that have a great impact on the SLA [38]. Moreover, when one service contains multiple critical microservices, tuning them independently can easily lead to sub-optimal results. Deep learning approaches need to evaluate a large number of potential resource configurations in order to find an efficient allocation without SLA violation [36, 46]. However, this is not scalable for complex services in production environments, where a service can consist of 1,000+ microservices with many tiers [28].
Furthermore, no study so far investigates microservice sharing among different services with complex dependencies. However, shared microservices create a new opportunity to improve resource efficiency through global resource management across all services. To demonstrate this, we conduct a simple experiment to show that prioritizing services at a shared microservice can save more than 40% of resources (details are shown in Section 2.3). As such, there is a crucial need for more efficient schemes that can globally manage SLAs for all services.
This paper addresses the aforementioned limitations by introducing Erms, a new efficient resource management system designed for scalable resource management in a shared microservice execution framework that provides SLA guarantees with high probability. Erms characterizes microservice tail latency as a piece-wise function of the workload, the number of deployed containers, and resource interference. With this characterization, Erms manages to dissect the detailed structure of microservice dependency graphs through explicit quantification and global optimization. This makes Erms fundamentally different from deep learning approaches [36, 46] and other heuristic solutions [23, 47].
    Erms determines the latency target of each microservice so as to satisfy the end-to-end SLA requirement with minimum resource usage, based on the observed workload. At a shared microservice, Erms implements priority-based scheduling to orchestrate the execution of all requests from different online services. Under this scheduling, priority is given to services that include more latency-sensitive microservices, so as to significantly improve resource efficiency. Erms adopts a probability-based approach to implement priority scheduling, which can avoid potential starvation. Furthermore, Erms proposes a new interference-aware cluster-wide placement strategy aimed at balancing the latency across microservice containers and enhancing the overall performance of online services. Erms also incorporates careful designs to make the system scalable and applicable to production environments. The key techniques are in the application of convex optimization results and in the design of novel graph algorithms with low complexity.
    We build a prototype of Erms on top of Kubernetes [24]. We evaluate Erms via real deployment on microservice benchmarks including DeathStarBench [18] and TrainTicket [49]. Additionally, we run large-scale simulations with real traces. Experimental results demonstrate Erms can reduce the number of deployed containers by up to 1.6 \(\times\) and reduce SLA violation probability by 5 \(\times\) compared to the state-of-the-art approaches. In summary, Erms has made the following contributions:
    Optimal computation of microservice latency target. To the best of our knowledge, Erms is the first system to systematically determine an optimal latency target for each microservice to meet SLA requirements. Erms is scalable to handle complex dependencies without any restrictions on graph topology.
    New scheduling policy at shared microservices. Another contribution of Erms is to design a new scheduling policy for shared microservices with theoretical performance guarantees. This policy assigns priority to requests from different services, and also globally coordinates resource scaling for all microservices. With this new policy, Erms can further reduce the number of used containers by up to 50%.
    Implementation. We provide a prototype implementation of Erms on top of Kubernetes [24], a widely adopted container orchestration system. We implement dynamic resource provisioning to place containers so as to control the overall resource interference.

    2 Background and Motivation

    2.1 Microservice Background

    A production cluster often deploys various applications and each application contains multiple different online services to serve users’ requests [29]. Usually, a service request is sent to an entering microservice, e.g., Nginx, which will then trigger a set of calls between multiple microservices. A microservice shall proceed to call its multiple downstream microservices either in a sequential manner or in parallel, when handling a call from its upstream microservice. Moreover, a microservice usually runs in multiple containers (with the same configuration) to serve all requests sent to it. In this paper, we adopt the number of containers as a metric for estimating resource usage, which is consistent with prior research [23, 38, 46, 47] and is widely recognized in the industry [28].
    When handling a user request, the set of calls along with the associated microservices form a dependency graph. The performance of a user request, i.e., the end-to-end latency is determined by the longest execution time of all critical paths in the graph. Here, a critical path is a path that starts with a user request and ends with the service response to the corresponding request [38]. It is worth noting that a graph can contain multiple critical paths. For example, the dependency graph in Figure 1 has two critical paths highlighted in blue and green colors respectively, \(CP_1 = \lbrace T,U,C\rbrace\) and \(CP_2 = \lbrace T,Url,C\rbrace\) . In addition, the execution time on each critical path is the sum of all microservice latency along that path.
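    To make this definition concrete, the following toy sketch (not code from the paper, with made-up latency values) computes the end-to-end latency of Figure 1 as the maximum over its two critical paths:

```python
# Toy illustration: end-to-end latency is the longest execution time over all
# critical paths, where each path's time is the sum of the latencies of the
# microservices on that path. Latency values are hypothetical.
def end_to_end_latency(critical_paths, ms_latency):
    return max(sum(ms_latency[m] for m in path) for path in critical_paths)

critical_paths = [["T", "U", "C"], ["T", "Url", "C"]]   # CP1 and CP2 from Figure 1
latencies_ms = {"T": 20.0, "U": 35.0, "Url": 50.0, "C": 15.0}
print(end_to_end_latency(critical_paths, latencies_ms))  # 85.0 (CP2 dominates)
```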
    In addition to complex call dependency, microservices can also be multiplexed among multiple online services. We depict the degree of microservice sharing in Figure 2 for traces collected from Alibaba clusters [1]. These traces include more than 20,000 microservices and 1,000 online services. Figure 2 shows that 40% of microservices are shared by more than 100 online services. A shared microservice needs to process all requests from different services. When the workload of one online service (i.e., the request arrival rate) grows suddenly, the latency of requests from other services experienced at this microservice will increase significantly. Consequently, the end-to-end latency of one service can be greatly impacted by other services in a shared microservices execution framework.
    Fig. 2. The cumulative distribution of microservices shared by a different number of online services from Alibaba traces.

    2.2 Quantification of Microservice Latency

Compared to the end-to-end latency of an online service, microservice latency is a more fine-grained metric for quantifying the resource pressure on deployed containers. Because of this, recent works have begun to investigate how this metric is affected by various factors such as the workload and resource interference on the physical host [7, 23, 47].
As shown in Figure 1, the latency of a request at each microservice includes both the queuing time (in gray color) and processing time (in red color), which, however, are difficult to obtain from a microservice tracing system since measuring them requires probing the Linux kernel with high-overhead tools [19, 48]. By contrast, the timestamp of each SEND event and RECEIVE event of a request and a response in Figure 1 is available from a tracing framework such as Jaeger [2]. Leveraging such information, we can derive the latency of a microservice by subtracting its downstream microservice response time from its own response time. More specifically, let \(R_i^m\) and \(S_i^m\) denote the timestamps at which the ith request arrives at Microservice m (aka RECEIVE) and the corresponding response leaves m (aka SEND), respectively. When d is the only downstream microservice of m, the latency of request i at m is:
    \(\begin{equation} L_i^m = (S_i^m - R_i^m) - (S_i^d - R_i^d). \end{equation}\)
    (1)
If m calls its multiple downstream microservices sequentially, each microservice’s response time, i.e., \((S_i^d - R_i^d)\) , should be subtracted from \((S_i^m - R_i^m)\) in Equation (1). By contrast, if m calls several downstream microservices in parallel, only the maximum response time of these microservices shall be subtracted from \((S_i^m - R_i^m)\) . Note that \(L_i^m\) also includes the transmission latency, which can be obtained from the tracing system directly.
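    A minimal sketch of this derivation is shown below; the Call type and its field names are illustrative stand-ins for span data, not Jaeger's actual schema.

```python
# Minimal sketch of Equation (1) and its variants: deriving the latency of a
# request at microservice m from SEND/RECEIVE timestamps of its own span and
# the spans of its downstream calls.
from dataclasses import dataclass
from typing import List

@dataclass
class Call:
    receive: float  # R: timestamp when the request arrives at the microservice
    send: float     # S: timestamp when the response leaves the microservice

def microservice_latency(own: Call, downstream: List[Call], parallel: bool) -> float:
    """Latency of a request at microservice m, excluding downstream processing."""
    own_response = own.send - own.receive
    if not downstream:
        return own_response
    child_responses = [c.send - c.receive for c in downstream]
    if parallel:
        # Parallel calls: only the longest downstream response overlaps m's span.
        return own_response - max(child_responses)
    # Sequential calls: every downstream response time is subtracted.
    return own_response - sum(child_responses)

# Example: m spends 120 ms overall; two sequential downstream calls take 30 ms and 40 ms.
print(microservice_latency(Call(0.0, 120.0), [Call(10.0, 40.0), Call(50.0, 90.0)], parallel=False))  # 50.0
```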
The study of microservice latency in existing works focuses on its first- and second-order statistics across different workloads. In general, when the tail latency of a microservice grows significantly with the workload, i.e., the call arrival rate, this microservice is considered critical for resource management. However, we observe from both existing microservice benchmarks [18] and Alibaba traces [1] that microservice latency always presents non-uniform behavior when the workload changes. As shown in Figure 3, each curve exhibits a distinct cut-off point (indicated by a black circle), which can be automatically determined as outlined in Section 5.2.1. Before reaching this cut-off point, the tail latency increases gradually and linearly with the workload. Once the workload surpasses this threshold, however, microservice latency grows considerably faster (though still almost linearly). The reason is that each microservice container maintains a certain number of threads to process requests in parallel; therefore, when the workload is heavy and exceeds a certain point, many requests need to be queued, resulting in a rapid increase in response time. As a result, existing characterizations of microservice latency are too coarse and can easily lead to poor scaling decisions, since they rely on a constant mean and variance [23, 47].
    Fig. 3. P95 Microservice latency from different traces (P99 behaves similarly). The four numbers in each bracket represent host resource usage, which include CPU utilization, memory capacity utilization, memory bandwidth usage, and network bandwidth usage. T is for ground truth and F is for fitting using a piece-wise linear function.
Another limitation is that existing studies do not quantify the impact of resource interference on the slope of the latency curve [23, 46, 47]. Here, we measure resource interference in terms of the usage of CPU, memory capacity, memory bandwidth, and network bandwidth on physical hosts. Our quantification of microservice latency reveals that the slope changes when interference varies. As depicted in Figure 3(a), when comparing a host with high resource usage (indicated by the purple line) to one with low resource usage (indicated by the blue line), the rate of increase in microservice latency after the cut-off is five times higher on the former. Additionally, resource interference causes the cut-off point to shift forward. In other words, as interference becomes more severe, microservice latency begins to increase rapidly at an earlier stage.
    These observations motivate us to model microservice latency as a piecewise linear function of the workload. In addition, the slope of the linear curve highly depends on resource interference. With this function, we can quantify the performance of each microservice under different workloads and resource usages. It is possible to improve resource efficiency via globally optimizing resource configurations of all microservices based on the latency model and in the meanwhile, provide SLA guarantees with high probability.
To validate the above idea, we conduct a simple experiment for resource scaling in Figure 4, where there is only one service consisting of two sequentially-executed microservices U and P from the Social Network application in DeathStarBench [18]. Based on the two profiled piece-wise linear functions and host utilization, we compute for both U and P a latency target, which specifies the maximum time each microservice can take to process a request to meet the end-to-end SLA. These two latency targets change with the service workload and their sum equals the end-to-end SLA. The details of the computation are described in Section 4.2. U is given a higher latency target than P since its latency grows faster with the workload. The number of containers for U and P is then scaled such that the resulting microservice latency is below the corresponding target. Figure 4(b) shows that this scaling can lead to a reduction in the number of deployed containers by up to 58% and 6 \(\times\) in heavy-load and light-load settings while keeping the same tail end-to-end latency, compared to the heuristic approaches GrandSLAm [23] and Rhythm [47]. The reason is that the baselines compute latency targets based on the mean of microservice latency, regardless of the workload and interference. Consequently, they tend to allocate a lower latency target to U than our result does, thereby requiring many more containers to be deployed for U, as shown in Figure 4(a).
    Fig. 4. An online service calls userTimeline (U) and postStorage (P) sequentially. The latency of Microservice U is more sensitive to workload changes than that of P. (a) Computed latency targets under different schemes in low-workload and high-workload settings. (b) The normalized total resource usage under different schemes.

    2.3 Challenges and Opportunities from Microservice Multiplexing

As mentioned in Section 2.1, an individual microservice can be multiplexed by hundreds of online services. However, services can form diverse dependency graphs and have different workload patterns. When these services perform scaling separately, their allocated latency targets at a shared microservice can vary a lot; simply taking the minimum latency target for scaling without differentiating services can lead to a waste of resources.
    We construct a simple multiplexing scenario to demonstrate that efficient scheduling at a shared microservice is important. As shown in Figure 5, this scenario consists of two online services that share a common microservice P (postStorage) from DeathStarBench. The first service calls U (userTimeline) and P sequentially while the second service calls H (homeTimeline) and P sequentially. Moreover, U is more sensitive to workload changes than H in terms of latency performance. To ensure a comprehensive and unbiased comparison of different resource allocation approaches, we explore a wide range of resource allocation configurations for microservices. We carefully select the configurations that minimize resource allocation while still satisfying the SLA requirements for each approach.
Fig. 5. Resource usage under microservice multiplexing for ensuring the SLA requirement; the number in each circle represents the amount of CPU allocated to a microservice.
    One straightforward solution is to process the concurrent requests that arrive at the shared microservice following the default policy FCFS (First-Come-First-Serve) and allocate a latency target in each service independently. Specifically, latency targets \(T^U, T^P_1\) for U and P are allocated from the first service based on its SLA requirement \(\mbox{SLA}_1\) , and latency targets \(T^H, T^P_2\) for H and P are computed based on the second SLA requirement \(\mbox{SLA}_2\) . To satisfy all SLA requirements, the final latency target for P is configured by taking the minimum between \(T^P_1\) and \(T^P_2\) , i.e., \(T^P = \min \lbrace T^P_1,T^P_2\rbrace\) .
The second approach is to partition the deployed containers of P into two separate groups, one group serving the first service and the other serving the second service. Under this non-sharing approach, the latency target is allocated in each group independently.
We run an experiment based on the constructed scenario to compare the resource usage under these two schemes. In this experiment, we generate the same static workload (40k requests/minute) for the two services and set \(\mbox{SLA}_1 = \mbox{SLA}_2=300ms\) . Experimental results show the non-sharing scheme requires 9 CPU cores (❷ in Figure 5), whereas the sharing scheme (❶ in Figure 5) requires 10.5 CPU cores to fulfill the SLA requirement. This result seems to violate the rule that sharing should be more cost-effective than non-sharing, since the former can fully utilize resources. We also build an M/M/1 queue to analyze the processing time at P under these two schemes [20]. Indeed, the theoretical result validates that sharing achieves a better mean processing time for a fixed amount of resources. However, under resource scaling with SLA requirements, the bottleneck is the more sensitive microservice, i.e., U in this scenario. Due to this, P is allocated a lower latency target in the first service. In the sharing setting, requests with a higher latency target (from the second service) can easily delay the processing of those with a lower latency target (from the first service). As a result, sharing leads to more resource usage under SLA-guaranteed scaling. This implies that the lack of global coordination in a shared microservice execution framework makes multiplexing inefficient, and it is better to process calls from different services separately. Nevertheless, this non-sharing scheme is inconsistent with the design principle of microservice architecture, i.e., a microservice is designed to be loosely-coupled and functionality-focused only.
To mitigate the delay caused by less-sensitive microservices and improve resource efficiency, we design a priority-based scheduling policy under which requests from the first service are given higher priority at Microservice P (❸ in Figure 5). Under this scheduling, latency targets need to be recomputed for microservices within the second service. The purpose of the recomputation is to set a lower latency target for less-critical microservices, so as to relieve resource pressure on shared microservices. To examine this idea, we rerun the above experiment with the same workload and SLA settings. The result shows this policy only requires 7.5 CPU cores to satisfy the SLA requirements, which is 20% (40%) less than that under the non-sharing scheme (FCFS policy). As such, multiplexing with efficient scheduling provides opportunities to greatly reduce the total resource usage, even in simple settings. However, globally coordinating all services is generally difficult when the number of shared microservices is large, which requires more careful designs.

    3 The Erms Methodology

In this section, we describe the overall architecture of the Erms framework. Erms is a cluster-wide resource manager that periodically adjusts the number of containers deployed for each microservice, with the goal of meeting service SLAs while minimizing total resource usage.
    Erms deploys a Tracing Coordinator (Figure 6) on top of two tracing systems, Prometheus [3] and Jaeger [2]. The Tracing Coordinator generates microservice dependency graphs and extracts individual microservice latency based on historical traces.
    Fig. 6. The system architecture of Erms.
Erms includes an Offline Profiling module with two components, Microservice Latency Profiling and Resource Usage Profiling (Figure 6), which work in the background. This module fetches from the Tracing Coordinator all microservice latency samples and resource usage samples under different workloads for the deployed containers of each microservice. With these data samples, Microservice Latency Profiling builds a fitting model that profiles microservice tail latency as a piece-wise linear function of the workload. Additionally, using the collected samples, Resource Usage Profiling builds a linear model to estimate the resource usage of microservice containers under different workloads.
    The key module of Erms is Online Scaling, which makes scaling decisions according to workload changes. It consists of three components, i.e., Graph Merge, Latency Target Computation, and Priority Scheduling (Figure 6). The Graph Merge component applies graph algorithms to merge a general dependency graph with complex dependencies into a simple structure with sequential dependencies only, based on the observed workload. The purpose of this merge procedure is to simplify latency target computation. The Latency Target Computation component allocates an initial latency target to all microservices within each dependency graph by solving a simple convex problem with low overhead. The Priority Scheduling component assigns each service a different priority at a shared microservice based on this initial latency target. Requests from different services are processed according to this priority. Moreover, the priority also determines a new workload that a shared microservice needs to process under each service. Based on this new workload, the Latency Target Computation component recomputes latency targets for all microservices and scales containers accordingly.
    Erms also contains a Container Placing module (Figure 6) to place all containers from different microservices across physical hosts in the cluster. This module places newly scheduled containers or releases existing containers, as determined by the Online Scaling module. The placement strategy aims to globally reduce the impact of resource interference on the end-to-end latency of online services. Specifically, the strategy takes into account the global resource interference within the physical hosts, which primarily arises from two sources: offline jobs and the microservices that are to be deployed on these hosts. Finally, actions are executed on the underlying Kubernetes cluster through the deployment module.

    4 Resource Scaling Models

In this section, we present the details of the resource scaling models under Erms. First, we define the basic scaling model and our assumptions (Section 4.1). Next, we explain our solution approach and analyze why it works well (Section 4.2). The general principle behind this solution is to solve complex problems with near-optimality, using theoretically grounded yet practically viable techniques. Finally, we develop a multiplexing model to handle shared microservices (Section 4.3).

    4.1 Basic Model

Given a collection of service dependency graphs and all the microservices in each graph - together with the quantified relationship between microservice latency and workload and the container size of each microservice - we must deploy these services in the cluster such that their SLA requirements are satisfied, i.e., the tail end-to-end latency is smaller than a user-defined threshold, while minimizing total resource usage. This yields the following optimization problem:
    \(\begin{equation} \min _{\overrightarrow{n}} \ \ \sum _{i=1}^N n_i \cdot {R_i}, \ \ \mbox{subject to,} \ \ \mbox{latency}_k\big (\overrightarrow{n}\big) \le \mbox{SLA}_k. \end{equation}\)
    (2)
    \(\overrightarrow{n} = \lt n_1,n_2,\cdots ,n_N\gt\) is the decision vector where \(n_i\) denotes the number of containers allocated to Microservice i. N is the total number of unique microservices from all services. \({R_i}\) is the dominant resource demand of Microservice i, i.e.,
    \(\begin{equation} R_i = \max \Big \lbrace R^C_i/C ~,~ R^M_i/M\Big \rbrace , \end{equation}\)
    (3)
    where \(R^C_i\) ( \(R^M_i\) ) is the size of CPU (Memory) configuration of containers from Microservice i, C and M are the overall CPU and Memory capacity in the cluster. \(\mbox{latency}_k(\overrightarrow{n})\) and \(\mbox{SLA}_k\) represent the tail end-to-end latency of requests from service k under resource allocation \(\overrightarrow{n}\) and the SLA requirement of service k, respectively.
    As observed in Section 2.2, microservice latency is a piece-wise linear function of the workload. For ease of modelling, we only consider a specific interval for each microservice in this section. In other words, the tail latency \(L_i\) of Microservice i is described as \(L_i = a_i \frac{\gamma _i}{n_i} + b_i\) . Here, \(a_i\) and \(b_i\) denote the slope and intercept, and \(\gamma _i\) is the workload of Microservice i. The details of choosing intervals are presented in Section 5.3.

    4.2 Design of Optimal Scaling Method

    In the setting where there is only one service consisting of sequential microservices, \(\mbox{latency}_k(\overrightarrow{n})\) can be formulated as:
    \(\begin{equation} \mbox{latency}_k\big (\overrightarrow{n}\big) = \sum _{i=1}^N a_i \frac{\gamma _i}{n_i} + b_i. \end{equation}\)
    (4)
    In this setting, the optimal solution to Equation (2) can be obtained via solving KKT equations corresponding to the convex optimization problem [6]. Consequently, the optimal latency target and the optimal number of containers \(n_i^{o}\) can be expressed by a closed-form result:
    \(\begin{equation} a_i \frac{\gamma _i}{n^o_i} + b_i = \frac{\sqrt {a_i \gamma _i R_i}}{\sum _{i=1}^N \sqrt {a_i \gamma _i R_i}}\Big (\mbox{SLA} - \sum _{i=1}^N b_i\Big) + b_i. \end{equation}\)
    (5)
    Equation (5) states that the optimal latency target of each microservice is in proportion to the square root of the product of \(a_i\) , workload \(\gamma _i\) , and resource demand \(R_i\) . This result implies that when the workload of a microservice increases, it needs to be allocated a higher latency target. Correspondingly, other microservices should be allocated lower latency targets and scheduled more containers.
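    A small sketch of this closed-form computation for a chain of sequential microservices is given below, assuming the parameters \(a_i\) , \(b_i\) , the workloads \(\gamma_i\) , and the resource demands \(R_i\) are already known; the numbers in the example are made up.

```python
# Sketch of the closed-form allocation in Equation (5) for sequential microservices.
import math

def latency_targets(a, b, gamma, R, sla):
    """Return (targets, containers): optimal latency target and container count per microservice."""
    weights = [math.sqrt(a_i * g_i * R_i) for a_i, g_i, R_i in zip(a, gamma, R)]
    slack = sla - sum(b)          # latency budget left after the fixed intercepts
    total = sum(weights)
    targets, containers = [], []
    for a_i, b_i, g_i, w in zip(a, b, gamma, weights):
        t_i = (w / total) * slack + b_i          # optimal latency target (Equation (5))
        n_i = a_i * g_i / (t_i - b_i)            # containers needed to hit that target
        targets.append(t_i)
        containers.append(math.ceil(n_i))
    return targets, containers

# Example: two microservices, SLA = 300 ms; all parameter values are invented.
print(latency_targets(a=[2.0, 0.5], b=[10.0, 5.0], gamma=[40.0, 40.0], R=[0.1, 0.1], sla=300.0))
```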
    A general dependency graph consists of multiple critical paths and one microservice can appear in different paths, complicating the optimal allocation of latency targets since it is difficult to give an exact expression of \(\mbox{latency}_k(\overrightarrow{n})\) . To address this problem, Erms simplifies the graph topology by removing parallel dependencies. We describe the procedure in Figure 7, which shows how to merge parallel dependency within one dependency graph of workload \(\gamma\) . In Figure 7, microservice T first calls microservice Url and U in parallel, and then calls microservice C after the response of Url and U.
    Fig. 7. Erms simplifies the structure of a general graph via gradually removing parallel dependency.
Extracting the complete dependency graph. In highly dynamic execution environments, dependency graphs within one service can vary significantly from each other. To address this issue, Erms compares the differences between dynamic graphs generated from the same online service and merges them into a complete dependency graph. Specifically, Erms first collects all microservices from historical traces and creates a zero matrix of size \(\lt M,M\gt\) , where M is the number of microservices and each element represents an edge between two microservices. Erms then retrieves the dependency graph from each trace and updates the corresponding element from 0 to 1 if there is an edge between two microservices. This iteration is repeated until all graphs have been retrieved. Finally, the resulting adjacency matrix represents the complete dependency graph.
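    A minimal sketch of this merge, assuming each trace has already been reduced to a list of (caller, callee) edges, could look as follows:

```python
# Sketch of merging per-trace dependency graphs into one complete graph via an
# adjacency matrix, as described above. `traces` is assumed to be a list of
# edge lists, each edge being a (caller, callee) pair of microservice names.
import numpy as np

def complete_dependency_graph(microservices, traces):
    index = {m: i for i, m in enumerate(microservices)}
    adj = np.zeros((len(microservices), len(microservices)), dtype=int)
    for edges in traces:
        for caller, callee in edges:
            adj[index[caller], index[callee]] = 1   # union of all observed edges
    return adj

ms = ["T", "Url", "U", "C"]
print(complete_dependency_graph(ms, [[("T", "Url"), ("Url", "C")], [("T", "U"), ("U", "C")]]))
```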
Handling sequential dependency. Erms removes dependencies starting from the last layer, i.e., it first creates a virtual microservice UrlC \(^*\) to merge Url and C, and creates another virtual microservice UC \(^*\) to combine U and C. Let \(\lt a_u,b_u\gt\) and \(\lt a_c,b_c\gt\) be the parameters of the tail latency function associated with Url and C, and \(\lt a_1^{*},b_1^{*}\gt\) and \(\lt a_2^{*},b_2^{*}\gt\) be the parameters of UrlC \(^*\) and UC \(^*\) . An invented virtual microservice should yield the same latency and the same amount of resource usage as the original real microservices. Thus, the new parameters \(\lt a_1^{*},b_1^{*}\gt\) can be characterized by:
    \(\begin{equation} a_1^{*} \frac{\gamma }{n_u + n_c} + b_1^{*} = a_u \frac{\gamma }{n_u} + b_u + a_c \frac{\gamma }{n_c} + b_c. \end{equation}\)
    (6)
    The solution to Equation (6) is given by:
    \(\begin{equation} a_1^{*} = \big (\sqrt {a_u R_u} + \sqrt {a_c R_c}\big)\big (\sqrt {a_u/R_u} + \sqrt {a_c/R_c}\big), \end{equation}\)
    (7)
    \(\begin{equation} b_1^{*} = b_u + b_c. \end{equation}\)
    (8)
    And the virtual resource demand of UrlC \(^*\) is:
    \(\begin{equation} R_1^{*} = \big (\sqrt {a_uR_u} + \sqrt {a_cR_c}\big) \big / \big (\sqrt {a_u/R_u} + \sqrt {a_c/R_c}\big). \end{equation}\)
    (9)
    \(\big \lt a_2^{*},b_2^{*}\big \gt\) can be obtained in the same way.
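    The following sketch applies Equations (7) to (9) to merge one sequential pair; the parameter values in the example are arbitrary.

```python
# Sketch of the sequential-merge step (Equations (7)-(9)): two microservices
# called one after another are replaced by a single virtual microservice with
# the same latency and total resource usage.
import math

def merge_sequential(a_u, b_u, R_u, a_c, b_c, R_c):
    """Merge an upstream/downstream pair into one virtual microservice."""
    s1 = math.sqrt(a_u * R_u) + math.sqrt(a_c * R_c)
    s2 = math.sqrt(a_u / R_u) + math.sqrt(a_c / R_c)
    a_star = s1 * s2           # Equation (7)
    b_star = b_u + b_c         # Equation (8)
    R_star = s1 / s2           # Equation (9): virtual resource demand
    return a_star, b_star, R_star

print(merge_sequential(a_u=2.0, b_u=10.0, R_u=0.1, a_c=1.0, b_c=5.0, R_c=0.2))
```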
    Removing parallel dependency. With the invention of UrlC \(^*\) and UC \(^*\) in Figure 7, it remains to remove the parallel dependency between them. This can be achieved via inventing another virtual microservice UU \(^{**}\) . Let \(\lt a^{**},b^{**}\gt\) be the parameter of UU \(^{**}\) . The optimal latency targets across parallel microservices must be the same, as otherwise, one can increase the lower one to reduce the overall resource usage. Thus, we have:
    \(\begin{equation} a_1^{*} \frac{\gamma }{n^{*}_1} + b_1^{*} = a^{*}_2\frac{\gamma }{n^{*}_2} + b_2^{*} \approx a^{**} \frac{\gamma }{{n^{*}_1} + {n^{*}_2}} + b^{**}. \end{equation}\)
    (10)
    The solution to Equation (10) is as follows:
    \(\begin{equation} a^{**} = a_1^{*} + a^{*}_2, \ b^{**} = \max \left\lbrace b_1^{*} , b_2^{*}\right\rbrace , \end{equation}\)
    (11)
    and the virtual resource demand of UU \(^{**}\) is given by:
    \(\begin{equation} R^{**} = {(n^{*}_1 R^{*}_1 + n^{*}_2 R^{*}_2)}/{(n^{*}_1 + n^{*}_2)}. \end{equation}\)
    (12)
After this merge process, the dependency graph only consists of three (virtual and real) microservices that execute sequentially. Erms computes latency targets and resource allocation for all these microservices based on Equation (5).
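    A companion sketch of the parallel-merge step in Equations (11) and (12) is shown below; the container counts of the two branches are treated as inputs here, and the example values are invented.

```python
# Sketch of the parallel-merge step (Equations (11)-(12)): two virtual
# microservices executed in parallel are combined into one.
def merge_parallel(a1, b1, R1, n1, a2, b2, R2, n2):
    a_pp = a1 + a2                                   # Equation (11)
    b_pp = max(b1, b2)
    R_pp = (n1 * R1 + n2 * R2) / (n1 + n2)           # Equation (12): weighted resource demand
    return a_pp, b_pp, R_pp

print(merge_parallel(a1=3.5, b1=15.0, R1=0.14, n1=4, a2=2.2, b2=12.0, R2=0.12, n2=3))
```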
    Latency target computation. Finally, Erms reverses the above graph merge procedure and computes a latency target for each microservice, as described in Figure 8. First, Erms computes latency targets for microservices T, UU \(^{**}\) , and C with sequential dependencies according to Equation (5). Second, Erms assigns the same latency targets to microservices with parallel dependencies, that is, UrlC \(^*\) and UC \(^*\) ’s latency targets are equal to UU \(^{**}\) ’s latency target. Last, Erms uses these results to compute latency targets for real microservices with sequential dependencies, i.e., {Url,C} based on UrlC \(^*\) and {U,C} based on UC \(^*\) .
    Fig. 8. An example of computing latency target for microservice graph in Figure 7.
Algorithm 1 describes the entire process of resource scaling for a general graph with known microservice characteristics and service workload. It adopts Depth-First Search (DFS) to find all two-tier invocations [28] (Lines 7 to 19). Each two-tier invocation consists of one microservice along with all its downstream microservices, e.g., {T,Url,U,C} is a two-tier invocation formed by T, and {Url,C} is another two-tier invocation formed by Url in Figure 7. The merge function for inventing new virtual microservices (Line 24) starts from the last two-tier invocation found by DFS and ends with the first one. After this, the algorithm computes an optimal latency target for all virtual microservices (Line 20). The worst-case time complexity of the DFS algorithm is \(\mathcal {O}(|V|+|E|)\) for a graph with \(|V|\) nodes and \(|E|\) edges.
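    Since Algorithm 1 itself is not reproduced here, the sketch below only illustrates the DFS enumeration of two-tier invocations and the reverse merge order; the toy adjacency list is chosen to reproduce the two example invocations mentioned above and is not the actual Figure 7 graph.

```python
# Sketch of the DFS used to enumerate two-tier invocations (a microservice plus
# all of its direct downstream microservices). The graph is an adjacency dict
# from a microservice name to its downstream list.
def two_tier_invocations(graph, root):
    order = []
    def dfs(node):
        children = graph.get(node, [])
        if children:
            order.append((node, children))   # one two-tier invocation
        for child in children:
            dfs(child)
    dfs(root)
    return order

graph = {"T": ["Url", "U", "C"], "Url": ["C"]}
# Merging proceeds from the last discovered invocation back to the first.
print(list(reversed(two_tier_invocations(graph, "T"))))
```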

    4.3 Microservice Multiplexing Model

    Erms can extend the basic resource scaling framework to model multiplexing among different services.
Erms schedules high-priority services before low-priority ones whenever there are multiple requests queued at a shared microservice. As such, the response time of low-priority requests at this shared microservice will be delayed by high-priority ones. To explicitly quantify this effect, Erms formulates a new model that incorporates the priorities assigned to different services. Consider the two services illustrated in Figure 5 with workloads \(\gamma _1\) , \(\gamma _2\) and SLA requirements SLA \(_1\) and SLA \(_2\) . When requests from the first service are given higher priority at shared microservice P, and there is no other microservice shared between these two services, the new model is formulated as:
    \(\begin{equation} \sum _{i \in \Phi _1 \setminus \lbrace p\rbrace } a_i \frac{\gamma _1}{n_i} + b_i + a_p \frac{\gamma _1}{n_p} + b_p \le \mbox{SLA}_1, \end{equation}\)
    (13)
\(\begin{equation} \sum _{i \in \Phi _2 \setminus \lbrace p\rbrace } a_i \frac{\gamma _2}{n_i} + b_i + a_p \frac{\gamma _1 + \gamma _2}{n_p} + b_p \le \mbox{SLA}_2, \end{equation}\)
    (14)
    where \(\Phi _1\) and \(\Phi _2\) are the set of microservices included in the first and second services. In the first service, the end-to-end tail latency includes the time of processing \(\gamma _1\) requests per unit of time at P. By contrast, for the shared microservice in the second service, its tail latency is the time to finish processing \((\gamma _1 + \gamma _2)\) requests. This model can be generalized to include more services multiplexing microservice P. It is worth noting that this problem is also convex with respect to the allocation vector \(\overrightarrow{n}\) .
We also make use of convex analysis to quantify the total amount of resource usage under the multiplexing model. Theorem 1 demonstrates that this new model results in less resource usage for satisfying SLAs, compared to other scheduling policies.
    Theorem 1.
    The resource usage obtained by the optimization problem in Equation (13) and Equation (14) is smaller than that under the sharing scheme using FCFS scheduling and the non-sharing approach.
The proof of Theorem 1 below analytically compares Erms’ priority scheduling policy with the other schemes, namely the sharing (FCFS) and non-sharing approaches, in terms of resource usage. The result demonstrates that Erms’ priority scheduling policy is more cost-effective than the baseline schemes in ensuring SLA requirements.
    Proof.
When there is no prioritization under multiplexing, the first service’s SLA requirement, \(\mbox{SLA}_1\) , can be formulated as:
    \(\begin{equation} a_u \frac{\gamma _1}{n_u} + b_u + a_p \frac{\gamma _1 + \gamma _2}{n_p} + b_p \le \mbox{SLA}_1, \end{equation}\)
    (15)
    and \(\mbox{SLA}_2\) is the same as that in Equation (14). We now consider a special setting where \(\mbox{SLA}_1 - b_u - b_p = \mbox{SLA}_2 - b_h -b_p\) . In this setting, the optimal resource allocation can be obtained by solving KKT equations that are similar to Equation (4), resulting in a total amount of resource usage of:
    \(\begin{equation} RU^s = \frac{\Big (\sqrt {a_u \gamma _1 R_u + a_h \gamma _2 R_h} + \sqrt {a_p(\gamma _1 + \gamma _2)R_p}\Big)^2}{\mbox{SLA}_1 - b_u - b_p}. \end{equation}\)
    (16)
    When each service deploys microservice independently with no multiplexing, we can directly use the results in Equation (5) to determine the optimal scaling for each microservice, which yields the following amount of resource usage:
    \(\begin{equation} RU^n = \frac{\gamma _1\big (\sqrt {a_u R_u } + \sqrt {a_p R_p}\big)^2 + \gamma _2\big (\sqrt {a_h R_h } + \sqrt {a_p R_p}\big)^2}{\mbox{SLA}_1 - b_u - b_p}. \end{equation}\)
    (17)
    Applying Cauchy-Schwarz Inequality here, we have \(RU^n \le RU^s\) and the equality is attained if and only if \(a_u R_u = a_h R_h\) .
    However, it is difficult to derive a closed-form solution to the problem formulated in Equation (13) and Equation (14). One approximation is to solve these two equations independently, which yields an upper bound for the total resource usage:
    \(\begin{equation} \begin{split}RU^o \le & \frac{\big (\sqrt { a_h \gamma _2 R_h} + \sqrt {a_p(\gamma _1 + \gamma _2)R_p}\big)^2}{\mbox{SLA}_1 - b_u - b_p} + a_u \gamma _1 R_u + \sqrt { a_u a_p R_u R_p} \gamma _1. \end{split} \end{equation}\)
    (18)
    Moreover, it can be readily shown that the R.H.S. of Equation (18) is less than \(RU^n\) . As such, we have \(RU^o \le RU^n \le RU^s\) . This completes the proof of Theorem 1. □
While this theorem guarantees the optimality of Erms’ scheduling policy, it does not quantify to what extent Erms can improve over the baselines. The proof also implies that the actual improvement depends on the workload and on how sensitive the upstream microservice’s response time is to workload changes.

    5 Erms Deployment

    5.1 Tracing Coordinator

The tracing coordinator in Erms is developed on top of two open-source tracing systems, Prometheus [3] and Jaeger [2]. Prometheus collects OS-level metrics, including CPU and memory utilization, for each microservice container as well as for physical hosts. Jaeger is a system that collects application-level metrics, including all calls sent to each microservice and the service response time. Jaeger adopts a sampling frequency of 10% to control the data collection overhead. It records two spans for each call between a pair of microservices; one starts with the client sending a request and ends with the client receiving the corresponding response, while the other starts with the server receiving the request and ends with it sending the response back to the client.
The Tracing Coordinator extracts microservice dependency graphs based on historical traces from Jaeger. Specifically, it first treats the incoming microservice that receives user requests as the root node. If there is a call between two microservices, the Tracing Coordinator adds an edge between them. In addition, if the client-side span of a newly added call overlaps the spans of existing calls, those calls are marked as parallel calls; otherwise, they are marked as sequential calls. The Tracing Coordinator repeats this process until it has traversed all recorded calls. Based on the microservice dependency graph, the Tracing Coordinator also extracts individual microservice latency.
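    A minimal sketch of the parallel-versus-sequential classification is given below; the ClientSpan type and its fields are illustrative stand-ins, not Jaeger's native data model.

```python
# Sketch of classifying sibling calls of one parent as parallel or sequential
# by checking whether their client-side spans overlap in time.
from dataclasses import dataclass

@dataclass
class ClientSpan:
    callee: str
    start: float   # client sends the request
    end: float     # client receives the response

def classify_sibling_calls(spans):
    """Return {(callee_a, callee_b): 'parallel' | 'sequential'} for the calls of one parent."""
    relation = {}
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            overlap = a.start < b.end and b.start < a.end
            relation[(a.callee, b.callee)] = "parallel" if overlap else "sequential"
    return relation

spans = [ClientSpan("Url", 0.0, 30.0), ClientSpan("U", 5.0, 25.0), ClientSpan("C", 40.0, 70.0)]
print(classify_sibling_calls(spans))
```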

    5.2 Microservice Offline Profiling

In this subsection, we introduce Erms’ offline profiling module in detail. Erms adopts linear models to profile microservice latency and container resource usage. These profiling results are leveraged to facilitate efficient container scaling (Section 4) and scheduling (Section 5.4).

    5.2.1 Latency Offline Profiling.

    As explained in Section 2.2, microservice latency can be described as a piece-wise linear function of workload. At the same time, resource interference can significantly impact the slope of the latency curve. Therefore, Erms primarily considers workload and resource interference when conducting the profiling of microservice latency [7, 27, 37, 38, 47].
In terms of interference, Erms mainly considers the CPU utilization, memory capacity utilization, memory bandwidth utilization, and network bandwidth utilization of the physical host where the microservice container is located. As investigated in Section 2.2, resource interference can have a significant impact on microservice latency [7, 27, 37]. Erms adopts machine learning methods to profile microservice latency in terms of workload and interference. Specifically, Erms collects the tail latency of all samples within the jth minute for each Microservice i from the Tracing Coordinator, i.e., \(L_i^j\) . Erms also counts the total number of calls processed by each deployed container in the jth minute, i.e., \(\gamma _i^j\) . These two, together with the average resource utilization, are regarded as one data sample for Microservice i, i.e., \(d_i^j = (L_i^j, \gamma _i^j, \mbox{C}_i^j, \mbox{MemC}_i^j, \mbox{MemB}_i^j, \mbox{N}_i^j)\) where the last four elements represent CPU utilization, memory capacity utilization, memory bandwidth utilization, and network bandwidth utilization, respectively. Erms fits all these samples into the piece-wise model shown below.
    \(\begin{equation} L_i^j = \left\lbrace \begin{array}{cc} {(\alpha ^1_i \mbox{C}_i^j + \beta ^1_i \mbox{MemC}_i^j + \eta ^1_i \mbox{MemB}_i^j + \delta ^1_i \mbox{N}_i^j + c^1_i)} \gamma _i^j + b^1_i, & \gamma _i^j \le \sigma _i ,\\ {(\alpha ^2_i \mbox{C}_i^j + \beta ^2_i \mbox{MemC}_i^j + \eta ^2_i \mbox{MemB}_i^j + \delta ^2_i \mbox{N}_i^j + c^2_i)} \gamma _i^j + b^2_i, & \mbox{otherwise}. \end{array}\right. \end{equation}\)
    (19)
Provided the resource interference, i.e., \(\mbox{C}_i^j\) , \(\mbox{MemC}_i^j\) , \(\mbox{MemB}_i^j\) , and \(\mbox{N}_i^j\) , remains fixed, \(L_i^j\) can be portrayed as a piece-wise linear function of the workload \(\gamma _i^j\) . Consequently, Erms first iterates over all training samples with the same resource interference to identify the cut-off point \(\sigma _i\) that minimizes the sum of squared residuals of the piece-wise linear function. Subsequently, using the least-squares method, Erms fits the slopes \((a^l_i)_{l=1,2}\) and intercepts \((b^l_i)_{l=1,2}\) of the piece-wise linear function based on the chosen cut-off point \(\sigma _i\) . It is worth noting that \((b^l_i)_{l=1,2}\) are fixed values, unaffected by the resource interference.
    Based on the fitted \(a_i\) and \(\sigma _i\) , Erms proceeds to create a new training dataset for each microservice, capturing the impact of resource interference. In this training dataset, each element for Microservice i consists of \(\lbrace \mbox{C}_i, \mbox{MemC}_i, \mbox{MemB}_i, \mbox{N}_i, \sigma _i, (a^l_i)_{l=1,2}\rbrace\) . To ensure efficient profiling, Erms employs simple yet effective models to quantify the relationship between \(\sigma _i\) , \((a^l)_{l=1,2}\) and the resource utilization \(\lbrace \mbox{C}_i, \mbox{MemC}_i, \mbox{MemB}_i, \mbox{N}_i\rbrace\) . Specifically, the slope \((a^l)_{l=1,2}\) is modeled as a linear function in relation to resource utilization. This means that the parameters \((\alpha ^l_i, \beta ^l_i, \eta ^l_i, \delta ^l_i, c^l_i)_{l=1,2}\) can be learned directly from the training dataset using the least-squares method. The cut-off point \(\sigma _i\) is also a function of resource utilization, and Erms leverages a decision tree model [39] to learn this relationship.
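    A simplified sketch of the cut-off search for one microservice under (approximately) fixed interference is shown below; the actual Erms profiler additionally regresses the slopes and cut-off against the four interference metrics as described above, which is omitted here.

```python
# Simplified sketch of the cut-off search: for each candidate split point, fit a
# line on each side by least squares and keep the split with the lowest total
# sum of squared residuals. The interference regression (Equation (19)) is omitted.
import numpy as np

def fit_piecewise(workload, latency):
    """Return (cutoff_workload, (slope1, intercept1), (slope2, intercept2))."""
    order = np.argsort(workload)
    x, y = np.asarray(workload, float)[order], np.asarray(latency, float)[order]
    best = None
    for k in range(2, len(x) - 1):              # keep at least two samples on each side
        a1, b1 = np.polyfit(x[:k], y[:k], 1)
        a2, b2 = np.polyfit(x[k:], y[k:], 1)
        ssr = np.sum((a1 * x[:k] + b1 - y[:k]) ** 2) + np.sum((a2 * x[k:] + b2 - y[k:]) ** 2)
        if best is None or ssr < best[0]:
            best = (ssr, x[k], (a1, b1), (a2, b2))
    return best[1], best[2], best[3]

workload = [10, 20, 30, 40, 50, 60, 70, 80]     # requests per container per minute (made up)
latency = [12, 14, 16, 18, 30, 45, 60, 75]      # tail latency in ms, rising fast after ~40
print(fit_piecewise(workload, latency))
```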

5.2.2 Resource Usage Profiling.

    Microservices are mainly deployed to handle service requests, so the actual resource usage of microservice containers primarily depends on the service workload. As depicted in Figure 9, both traces from Alibaba clusters and real benchmarks show that the average resource utilization of a running container grows almost linearly with workload. Therefore, Erms adopts a linear regression model to profile container resource usage \(r_i(\cdot)\) :
\(\begin{equation} r_i^{l}(w) = a_i^{l} \cdot w + b_i^{l},\ l \in \lbrace C, \mbox{MemC}, \mbox{MemB}, N\rbrace , \end{equation}\)
    (20)
where w is the number of requests handled per container within a minute for Microservice i, and l indexes the four hardware resources described in Equation (19).
    Fig. 9. Resource utilization of microservice containers grows linearly with microservice workloads.
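    A minimal sketch of this per-resource regression, assuming the utilization samples have already been grouped by resource type, could be:

```python
# Sketch of the per-resource linear model in Equation (20): for each resource
# type, fit container utilization against per-container workload by least squares.
import numpy as np

def fit_resource_usage(workload, utilization_by_resource):
    """utilization_by_resource maps a resource name (e.g., 'C', 'MemC', 'MemB', 'N')
    to utilization samples aligned with `workload`; returns (slope, intercept) per resource."""
    return {res: tuple(np.polyfit(workload, util, 1))
            for res, util in utilization_by_resource.items()}

samples = {"C": [0.11, 0.20, 0.31], "MemC": [0.30, 0.33, 0.35]}   # made-up utilization samples
print(fit_resource_usage([100, 200, 300], samples))
```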

    5.3 Online Resource Scaling

In this section, we present the design details of the Online Scaling module. The key to this module is to carefully apply the resource scaling models developed in Section 4 such that the scaling overhead is well controlled.

    5.3.1 Dependency Merge and Latency Target Computation.

    Erms averages the current resource utilization across all physical hosts and feeds this utilization into the microservice profiling model to obtain parameters that describe the piece-wise linear function. These parameters quantify the sensitivity of microservice latency with respect to the workload of each container. Erms relies on them to allocate latency targets for microservices following Algorithm 1.
One critical challenge herein, however, is that the profiling model describes each microservice with two different sets of parameters, one per interval. It is difficult to optimally choose which set should be used for Latency Target Computation. Exhaustively trying all possible choices is not scalable since the number of candidates is \(2^m\) , where m is the number of microservices in a graph. To address this challenge, Erms first performs the dependency merge and allocates latency targets based on the parameters learned from the second interval, as this interval corresponds to a high workload and hence less resource consumption. After allocating a latency target for each Microservice i, Erms then checks whether the allocated latency target is less than the latency corresponding to the cut-off point \(\sigma _i\) . A positive result means Microservice i requires extra resources and should be allocated a lower latency target. For these microservices, Erms adopts the other set of parameters, from the first interval, to recompute all latency targets. In this way, the dependency graph of each service needs to be processed at most twice for Latency Target Computation.
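    The two-pass logic can be sketched as follows; `compute_targets` is a stand-in for the Latency Target Computation component, and the toy numbers are invented.

```python
# Sketch of the two-pass interval choice (not the actual Erms code): compute
# targets with second-interval parameters first, then recompute with
# first-interval parameters for microservices whose target falls below the
# latency at their cut-off point.
def two_pass_targets(params, compute_targets):
    """params[i] = {'first': (a, b), 'second': (a, b), 'cutoff_latency': latency at sigma_i}."""
    chosen = {i: p["second"] for i, p in params.items()}
    targets = compute_targets(chosen)
    below_cutoff = [i for i, t in targets.items() if t < params[i]["cutoff_latency"]]
    if below_cutoff:
        for i in below_cutoff:
            chosen[i] = params[i]["first"]
        targets = compute_targets(chosen)   # each graph is processed at most twice
    return targets

# Toy usage with a stand-in target computation.
params = {"U": {"first": (0.5, 10.0), "second": (2.0, 10.0), "cutoff_latency": 120.0}}
stub = lambda chosen: {i: 100.0 if a > 1.0 else 90.0 for i, (a, b) in chosen.items()}
print(two_pass_targets(params, stub))   # {'U': 90.0}
```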

    5.3.2 Priority Scheduling.

At a shared microservice, Erms needs to configure the scheduling priority of requests from different online services. To find the schedule that yields the least resource usage, one would have to solve the multiplexing model in Section 4.3 under all possible configurations. However, this is not tractable in practical systems since there are \(n!\) orderings if n services share a microservice. When many microservices are multiplexed among different services, the computational overhead becomes extremely high, not to mention the complexity of the multiplexing model itself. To be more scalable, Erms first calls the Latency Target Computation component for each service to allocate an initial latency target to all microservices. Priority is configured based on this target. In particular, the service that yields a lower latency target at a shared microservice is given higher priority. The intuition is that a lower latency target implies the corresponding service contains more latency-sensitive microservices, so its requests should be handled first.
Based on the configured priority, Erms recomputes microservice latency targets by solving the multiplexing model. However, this model couples all services together and is computationally expensive to deal with. To reduce scaling overhead, Erms instead calls the Latency Target Computation component for each service independently. This call returns the final latency targets of all microservices and the number of containers to be scaled. In this call, Erms adopts a modified workload for a shared microservice to take priority scheduling into account. More specifically, let \(\gamma _{k,i}\) denote the original workload at shared microservice i that comes from service k; the modified workload is \(\sum _{l =1}^k \gamma _{l,i}\) , assuming services are ordered by priority following their index. The result from Latency Target Computation implies that when the workload of a microservice increases, other microservices within the same dependency graph should be set lower latency targets for resource efficiency. Based on this, priority scheduling allocates more resources to non-shared microservices in order to relieve resource pressure on shared microservices, compared to FCFS scheduling.
    Whenever a thread is available in a deployed container and there are requests waiting to be processed, a request from the service with higher priority will be assigned to this thread with higher probability. In particular, requests from the service with the highest priority are scheduled with probability \((1-\delta)\) , and requests from the service with the lth highest priority are scheduled with probability \(\delta ^{l-1}(1-\delta)\) , and the service with the lowest priority is scheduled with probability \(\delta ^{n-1}\) where n is the number of services. Here, a small \(\delta\) is beneficial to the response of high-priority services at the cost of starving the processing of low-priority requests when the workload is heavy. We shall evaluate the impact of \(\delta\) on shared microservices in Section 6.4.2.
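    A minimal sketch of this probabilistic selection is given below; it restricts the draw to services that currently have queued requests, which is a small simplification of the description above.

```python
# Sketch of the probability-based priority pick: with n candidate services
# ordered from highest to lowest priority, the l-th one (1-indexed) is chosen
# with probability delta**(l-1) * (1 - delta) and the last one with delta**(n-1).
import random

def pick_request(queues_by_priority, delta=0.2):
    """queues_by_priority: per-service request queues, highest priority first."""
    waiting = [idx for idx, q in enumerate(queues_by_priority) if q]
    if not waiting:
        return None
    n = len(waiting)
    weights = [delta ** l * (1 - delta) for l in range(n - 1)] + [delta ** (n - 1)]
    chosen = random.choices(waiting, weights=weights, k=1)[0]
    return queues_by_priority[chosen].pop(0)

queues = [["req-a1"], ["req-b1", "req-b2"], []]     # service 0 has the highest priority
print(pick_request(queues, delta=0.2))
```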

    5.3.3 Overhead of Resource Scaling.

By careful design, Erms only needs to call Latency Target Computation twice for each dependency graph. In addition, the Latency Target Computation component applies a graph traversal algorithm twice to compute latency targets, yielding a complexity of \(\mathcal {O}(|V|+|E|)\) for a graph with \(|V|\) nodes and \(|E|\) edges. In production clusters, dependency graphs behave like trees [28], and the number of edges is usually several times the number of nodes. As such, the computational overhead of resource scaling scales linearly with the total number of microservices included in all services.

    5.4 Interference-aware Containers Scheduling

    To improve scalability, the Online Scaling module takes into account only the average resource interference across multiple hosts when performing resource scaling. However, it is important to note that scheduled containers belonging to a single microservice may be deployed across different hosts, resulting in varying degrees of resource interference. This variation in interference can subsequently lead to significant performance imbalances among containers within the same microservice.
    A simple method to tackle performance imbalance involves bridging the disparity in host resource utilization [30]. However, this approach neglects the potential impact of resource interference on the performance degradation of different microservices to varying extents. In contrast, Erms strategically places containers in response to performance degradation in order to minimize end-to-end latency. To attain the optimal container placement, we develop an optimization problem with the objective function of minimizing the aggregate latency of all microservices (per Equation (19)). It is worth noting that this objective function accounts for resource interference originating not only from offline jobs but also from the microservices that will be placed on the hosts. The formulation of this optimization is as follows:
    \(\begin{align} \min _{\mathbf {p}} & \ \sum _{h \in \Phi } \sum _{i \in \Omega } \sum _{k =1}^{n_i} p_{i,k}^h \bigg \lbrace \Big (\sum _{l \in \Psi } \big (c^l_i \cdot (\sum _{i \in \Omega } \sum _{k =1}^{n_i} p_{i,k}^h \cdot r^l_i+ {b_h^l})/{R^l_h} \big) \Big) \gamma _i^j + b_i \bigg \rbrace & \end{align}\)
    (21)
    \(\begin{align} \mbox{s.t.} & \ \sum _{h\in \Phi }p_{i,k}^h = 1, \ \ \forall i,k \quad \mbox{and} \ \ p_{i,k}^h \in \lbrace 0,1\rbrace , \ \ \forall i,k,h, \end{align}\)
    (22)
\(\begin{align} & \ \sum _{i\in \Omega } \sum _{k=1}^{n_i} p_{i,k}^h r^l_i \le R^l_h, \ \ \forall h,l. \end{align}\)
    (23)
    The explanation for each parameter can be found in Table 1, and the last constraint arises from the fact that the combined resource consumption of all containers on each host must not exceed the host’s capacity. The resource usage of host h, as quantified in the objective function in Equation (21), comprises two parts: usage from microservice containers to be deployed, \(\sum _{i \in \Omega } \sum _{k =1}^{n_i} p_{i,k}^h \cdot r^l_i\) and usage from existing jobs, \({b_h^l}\) . Given the workload \(\gamma _i^j\) , resource usage of microservice containers can be estimated based on Equation (20), while the resource usage of existing jobs can be retrieved through Erms’s Tracing Coordinator.
Table 1. Notations for Placement Optimization under Erms
    \(c^l_i\) : the interference coefficient of resource l for Microservice i (Equation (19))
    \(r^l_i\) : the usage of resource l by Microservice i’s containers
    \(b^l_h\) : the usage of resource l by existing jobs on Host h
    \(R^l_h\) : the capacity of resource l on Host h
    \(p_{i,k}^h\) : whether the kth container of Microservice i is placed on Host h
    \(n_i\) : the number of containers scheduled for Microservice i
    \(\Psi\) : the set of the four resource types: CPU, memory capacity, memory bandwidth, and network bandwidth
    \(\Omega\) : the set of all microservices
    \(\Phi\) : the set of all physical hosts
    It is worth noting that in this problem, \(\mathbf {p} = \lbrace p_{i,k}^h\rbrace _{i,k,h}\) serves as the sole optimization variable. Meanwhile, the problem is a non-linear integer program, which is NP-hard and challenging to solve. To address this, we relax the integer constraint \(p_{i,k}^h \in \lbrace 0,1\rbrace\), allowing \(p_{i,k}^h\) to take fractional values, i.e., \(\widehat{p}_{i,k}^h \in [0,1]\). As a result, the problem becomes a convex program, which can be solved efficiently using the ADMM approach [22]. The resulting fractional solutions are then rounded back to binary values through uniform random sampling, i.e., \(p_{i,k}^h\) equals one with a probability of \(\widehat{p}_{i,k}^h\). A significant limitation of this method is its high complexity, particularly when a production cluster contains a vast number of hosts and microservices. This complexity may result in substantial scheduling overhead, thereby restricting the approach's applicability. To alleviate this overhead, Erms statically divides a cluster's hosts into multiple equal-sized groups and solves a considerably smaller-scale optimization problem over the hosts within each group.
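    The rounding step can be realized as below. This is a minimal sketch assuming the fractional solution \(\widehat{p}\) has already been computed (e.g., via ADMM); sampling each container's host from its fractional row keeps constraint (22) satisfied by construction, while constraint (23) would still need to be re-checked (and violating containers re-sampled) after rounding.
```python
import numpy as np

rng = np.random.default_rng(0)

def round_placement(p_hat):
    """Round a fractional placement p_hat (containers x hosts, rows sum to 1) to binary.

    Each container picks host h with probability p_hat[container, h], i.e., the
    uniform random sampling step described above.
    """
    num_containers, num_hosts = p_hat.shape
    P = np.zeros_like(p_hat, dtype=int)
    for ck in range(num_containers):
        h = rng.choice(num_hosts, p=p_hat[ck] / p_hat[ck].sum())
        P[ck, h] = 1
    return P

# Example: a fractional solution for 3 containers and 2 hosts.
p_hat = np.array([[0.7, 0.3],
                  [0.5, 0.5],
                  [0.2, 0.8]])
print(round_placement(p_hat))
```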
    Globally optimizing the placement of all containers may trigger migrations of containers across hosts. To mitigate the migration overhead, Erms solves the optimization problem based on the current deployment of containers in the cluster. If Erms decides to scale out Microservice i from \(n_i\) to \(n_i^{*}\) containers, it only needs to determine the placement of the \((n_i^{*} - n_i)\) new containers.

    5.5 Erms Implementation

    We implement a prototype of Erms on top of Kubernetes [24], a widely adopted container orchestration framework. At runtime, Erms queries Prometheus to obtain real-time data for scheduling resources. The Online Scaling and Resource Provisioning modules are written with the Kubernetes Python client library, in approximately 3 KLOC of Python.
    Erms implements priority-based scheduling in the network layer of each container. More specifically, it relies on the Linux traffic control interface tc to manage the different incoming network flows of a container. This interface can provide prioritization through a queuing discipline, i.e., pfifo_fast, so Erms only needs to specify the priority of each flow. Originally, tc is designed for controlling outgoing rather than incoming traffic. To work around this, Erms activates a virtual network interface on the physical host and binds this interface to the desired container, so that the container's incoming traffic can be shaped as the virtual interface's outgoing traffic.
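    The exact tc configuration is not spelled out above; the snippet below is only a hedged illustration of flow prioritization with tc, expressed as Python subprocess calls. It uses the classful prio qdisc with u32 filters instead of pfifo_fast, and the interface name veth-ms0 and port 8080 are hypothetical placeholders rather than Erms' actual setup.
```python
import subprocess

IFACE = "veth-ms0"        # hypothetical virtual interface bound to the target container
HIGH_PRIO_PORT = "8080"   # hypothetical port carrying high-priority service traffic

def sh(cmd):
    # Run a tc command; requires root privileges on the host.
    subprocess.run(cmd.split(), check=True)

# Attach a 3-band priority qdisc (band 0 is served first) to the interface.
sh(f"tc qdisc add dev {IFACE} root handle 1: prio")
# Classify traffic destined to the high-priority port into the first band.
sh(f"tc filter add dev {IFACE} parent 1: protocol ip prio 1 "
   f"u32 match ip dport {HIGH_PRIO_PORT} 0xffff flowid 1:1")
# Everything else falls into a lower band.
sh(f"tc filter add dev {IFACE} parent 1: protocol ip prio 2 "
   f"u32 match u32 0 0 flowid 1:2")
```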

    6 Evaluation of Erms

    6.1 Experiment Setup

    Benchmarks: We evaluate Erms using two open-source microservice benchmarks, DeathStarBench [18] and TrainTicket [49]. DeathStarBench consists of the Social Network, Media Service, and Hotel Reservation applications. These applications contain 36, 38, and 15 unique microservices and include 3, 1, and 4 different services, respectively. Moreover, the Social Network and Hotel Reservation applications both have 3 shared microservices. The TrainTicket application contains about ten services, such as ticket booking and ticket querying, and these services form dynamic dependency graphs at runtime. Moreover, there are 23 shared microservices among these services.
    Cluster Setup: We deploy Erms in a local private cluster of 20 two-socket physical hosts. Each host is configured with 32 CPU cores and 64 GB RAM. Each microservice container is configured with 0.1 core and 200MB memory.
    Workload Generation: We find that 100,000 requests per minute is the maximum throughput that our cluster can support for the benchmark [18]. As such, we generate multiple static workloads ranging from 600 (low) to 100,000 (high) requests per minute for each service. In addition, we also adopt dynamic workloads from Alibaba clusters [28]. SLA targets are set with respect to the 95th percentile end-to-end latency, ranging from 50 ms (low) to 200 ms (high) for all applications.
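    As a hedged illustration of how a static load level can be turned into request timestamps (the actual load generator used in the experiments is not described here), one may draw exponential inter-arrival times at the target rate:
```python
import numpy as np

def arrival_times(requests_per_minute, duration_minutes, seed=0):
    """Poisson arrivals at the given rate; returns request timestamps in seconds."""
    rng = np.random.default_rng(seed)
    rate_per_sec = requests_per_minute / 60.0
    expected = int(rate_per_sec * duration_minutes * 60 * 1.2) + 10   # oversample, then truncate
    gaps = rng.exponential(1.0 / rate_per_sec, size=expected)
    t = np.cumsum(gaps)
    return t[t < duration_minutes * 60]

low = arrival_times(600, 30)        # low static workload: 600 requests/minute for 30 minutes
high = arrival_times(100_000, 30)   # high static workload: 100,000 requests/minute
```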
    Dependency Graph: In DeathStarBench, online services generally exhibit static dependency graphs while processing various requests. However, TrainTicket presents dynamic dependency graphs at runtime, influenced by distinct request arguments, such as the number of stations involved. We explore two representative TrainTicket services: ticket booking and ticket querying. Under the arguments for 1 and 10 stations, we generate simple and complex graphs, respectively. Varying the mix of simple and complex graphs serves to control the dynamic nature of these dependency graphs.
    Baseline Schemes: We compare Erms against GrandSLAm [23], Rhythm [47], and Firm [38]. Moreover, we include the original Erms implementation (Erms-IPM) [30] as an additional baseline scheme. Unless otherwise specified, we set \(\delta\) to 0.05.
    GrandSLAm: It computes a latency target for each microservice such that the target is proportional to the microservice's average latency under different workloads.
    Rhythm: It evaluates the contribution of each microservice as the normalized product of the mean latency and the variance of latency across different workloads, as well as the correlation coefficient between the microservice latency and the end-to-end service latency.
    Firm: It first identifies a critical microservice on each critical path that has a heavy impact on the end-to-end latency, and then applies reinforcement learning to tune resource allocation for this microservice.
    Erms-IPM: To mitigate performance imbalances between containers of microservices, it minimizes the gap in resource utilization across hosts through container placement.

    6.2 Microservice Profiling Accuracy

    To validate the accuracy of Erms' microservice profiling module, we run DeathStarBench and TrainTicket in our local cluster and collect one day of running samples for each microservice. We fix the interference level on each host by injecting iBench workloads [11] during each hour, and collect one sample per minute for each microservice. In addition, we collect one-day samples for all microservices of the Taobao application in the Alibaba traces [1]. Taobao mainly serves online shopping and consists of 2,000+ microservices. It is worth noting that microservices are usually co-located with batch jobs on the same host to increase resource utilization in Alibaba clusters [28]. Therefore, Alibaba microservices tend to experience more diverse types of resource interference than microservices in a dedicated cluster.
    We train Erms' profiling model for each microservice using the first 22 hours of samples and perform testing on the remaining samples. We also implement XGBoost [8] and a three-layer Neural Network (NN) with 64 neurons as baseline schemes. As shown in Figure 10(a), the testing accuracy under Erms ranges from 83% to 97% for microservices from both DeathStarBench [18] and the Alibaba traces. In this case, the testing accuracy is similar across all schemes. To investigate the generalization ability of Erms, we also evaluate the testing accuracy under different sizes of the training data set collected from Taobao. As shown in Figure 10(b), Erms achieves a testing accuracy of 85% using 70% of the training samples. In contrast, the testing accuracy under NN drops dramatically as the number of training samples decreases. Considering that Erms only needs the slope and intercept of a piece-wise linear function for resource scaling, this testing accuracy is sufficient for resource management, even in production environments.
    Fig. 10. Profiling accuracy using different algorithms on DeathStarBench and Alibaba traces.
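    For intuition, the following sketch fits a two-segment piece-wise linear latency model against workload only. It is a simplified, hypothetical stand-in for Erms' profiling, which also incorporates resource usage and interference as features; the breakpoint is chosen by a simple grid search.
```python
import numpy as np

def fit_piecewise_linear(x, y, n_candidates=50):
    """Fit a two-segment piece-wise linear model y ~ f(x) by grid-searching the breakpoint."""
    best = None
    for bp in np.linspace(x.min(), x.max(), n_candidates)[1:-1]:
        # Basis: intercept, x, and the hinge max(x - bp, 0) so the two segments join at bp.
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - bp, 0.0)])
        coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(((X @ coef - y) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, bp, coef)
    return best[1], best[2]   # breakpoint, [intercept, slope, additional slope after bp]

# Synthetic example: latency is nearly flat at low load and grows once load exceeds ~500 rpm.
rng = np.random.default_rng(1)
load = rng.uniform(0, 1000, 300)
latency = 5 + 0.002 * load + 0.03 * np.maximum(load - 500, 0) + rng.normal(0, 0.5, 300)
bp, coef = fit_piecewise_linear(load, latency)
print(bp, coef)
```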
    Moreover, we also evaluate the profiling results of resource usage using traces generated from DeathStarBench and TrainTicket, which collectively comprise nearly 120 microservices. Furthermore, we validate the linear regression model on more than 1,000 microservices from Alibaba clusters. The results highlight that the prediction accuracy under these benchmarks and the Alibaba traces can be as high as 92.2% and 91.2%, respectively.

    6.3 Resource Efficiency and Performance

    6.3.1 Static Workload.

    In this part, we evaluate the resource usage and end-to-end latency of services under different static workloads and SLA settings. In each setting, we run all services for 30 minutes.
    We quantify resource usage in terms of the number of containers allocated to all services. Figure 11(a) shows the distribution of resource usage under different static workloads. The result reveals that more than 83% of workloads require fewer than 200 containers under Erms, while these workloads need about 310 containers under both GrandSLAm and Rhythm. GrandSLAm and Rhythm have similar distributions of resource usage as they allocate resources based on statistics of microservice latency. Firm tends to tune resource configurations for critical microservices only, and it needs to allocate more resources under high workloads to ensure the SLA. As a result, Firm exhibits the longest tail in the CDF of resource allocation, as shown in Figure 11(a). In an extreme case, Firm needs more than 3\(\times\) the resources of Erms. To be more comprehensive, we also compare these schemes in each specific setting, as shown in Figure 11(b). On average, Erms saves about 27.8%, 91.1% and 30.1% of containers relative to Firm, GrandSLAm, and Rhythm, respectively. As the workload goes up, the improvement of Erms also grows. One key reason is that shared microservices need to deploy more containers to handle requests from different services, especially when the workload is high, which gives Erms more opportunities to optimize resource allocation. Similar behavior can be observed when we vary SLA requirements. In the low-SLA scenario, the reduction in resource usage under Erms is more significant than under the high-SLA setting. A low SLA means a low latency target allocated to each microservice and therefore leaves a large room to optimize resource usage.
    Fig. 11. Containers allocated with static workloads.
    Meanwhile, we also characterize the end-to-end performance of service requests under different scenarios. As shown in Figure 12(a), on average, the SLA violation probability under Erms is less than 4%, whereas it is as high as 25.2%, 16.4%, and 7.2% under Firm, GrandSLAm and Rhythm, respectively. Moreover, both higher workloads and lower SLAs lead to a higher SLA violation probability under all schemes. In terms of the actual end-to-end delay, Erms reduces this metric by 18% compared to other schemes, as depicted in Figure 12(b). Moreover, in the high-workload and low-SLA scenarios, the gap between the end-to-end latency and the SLA is larger than in the low-workload and high-SLA settings.
    Fig. 12. Tail latency under different schemes.

    6.3.2 Dynamic Workload.

    In this part, we generate a dynamic workload based on Alibaba traces and set the SLA target to 200 ms. In this experiment, we dynamically scale containers for microservices from the Social Network application so as to satisfy the SLA. As shown in Figure 13(a), all schemes respond to the workload changes promptly. However, Erms saves up to 30% of containers compared to other schemes on average. In Figure 13(b), we depict the corresponding tail latency of requests submitted over time. It shows that Erms satisfies the SLA requirement at all times without violation, even when the workload grows quickly. However, other schemes can easily violate the SLA at peak workloads. In particular, Firm can violate the SLA by up to 50% due to its late detection of bottleneck microservices.
    Fig. 13. Performance under the dynamic workload.

    6.3.3 Dynamic Dependency Graph.

    We evaluate the resource allocation and end-to-end latency of services with dynamic dependency graphs using various schemes. To obtain the optimal resource allocation for a dynamic graph, we progressively decrease the number of containers for distinct microservices until SLA violations arise. The corresponding resource allocation can then be deemed optimal. To accommodate dynamic dependency graphs, baseline schemes allocate resources for the complete graph rather than its subgraph to prevent SLA violations. We vary the mix of complex and simple graphs to control the graph's dynamic nature.
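    A minimal sketch of this search procedure is shown below; deploy and sla_violated are hypothetical hooks that apply an allocation to the cluster and run the workload, and the decrement order across microservices is arbitrary here.
```python
def minimal_allocation(alloc, deploy, sla_violated, step=1):
    """Progressively shrink each microservice's container count until the SLA breaks.

    alloc: dict mapping microservice name -> current container count.
    deploy / sla_violated: hypothetical hooks that apply an allocation and check the SLA.
    """
    for ms in list(alloc):
        while alloc[ms] > 1:
            trial = dict(alloc)
            trial[ms] = alloc[ms] - step
            deploy(trial)
            if sla_violated():
                break                 # last feasible count for this microservice reached
            alloc = trial
    return alloc
```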
    As illustrated in Figure 14(a), optimal resource allocation yields an average savings of approximately 5%, 10%, 14%, and 20% compared to Erms, Firm, GrandSLAm, and Rhythm, respectively. These findings demonstrate that Erms outperforms other baseline schemes in dynamic dependency graph scenarios, although a minor gap still exists between Erms and optimal resource allocation. Moreover, the gap between Erms and optimal resource allocation remains stable as the proportion of simple graphs increases, while the gap between other schemes and optimal resource allocation widens with the growth of simple graphs. This is because Erms’ accurate modeling of the dependency graph can adapt to the dynamic nature of the graph. Additionally, Figure 14(b) reveals that Erms enhances service performance by 3% compared to other baseline schemes. As the proportion of simple graphs increases, Erms can gradually improve performance due to the benefit of overprovisioning, while the performance of other baseline schemes varies. Consequently, Erms can achieve high performance even under dynamic dependency graph scenarios.
    Fig. 14. Performance under dynamic dependency graphs.

    6.4 Evaluation of Individual Modules

    In this subsection, we separately quantify the benefit brought by different components and modules of Erms, including Latency Target Computation, Priority Scheduling, and Resource Provisioning.

    6.4.1 Latency Target Computation.

    In this experiment, we evaluate the improvement brought by the Latency Target Computation component by implementing Erms with the default FCFS policy to schedule requests at shared microservices. We compare the overall resource usage across different schemes under various static workloads and SLA settings. The distribution of resource usage is depicted in Figure 15(a). In an extreme case, Latency Target Computation alone reduces the overall resource usage by 2\(\times\) against Firm. On average, Erms outperforms Firm, GrandSLAm and Rhythm by 63.2%, 42.3%, and 61.5%, respectively, indicating that the performance of Erms can degrade considerably without efficient scheduling at shared microservices.
    Fig. 15. The benefit brought by individual modules.

    6.4.2 Benefit of Priority Scheduling.

    We proceed to quantify the benefit brought by Erms' scheduling policy at shared microservices. We also implement priority scheduling under GrandSLAm and Rhythm. Because Firm tunes resources online using a reinforcement learning engine, it is not possible to prioritize requests under Firm. Therefore, we only compare Erms to GrandSLAm and Rhythm in this experiment. It is worth noting that priority scheduling requires Erms to recompute latency targets and adjust resource allocation for non-shared microservices as well.
    As shown in Figure 15(b), with priority scheduling, Erms saves about 19% of containers. However, the benefit of priority scheduling for GrandSLAm (Rhythm) is marginal, i.e., less than 10%. This is because directly applying priority scheduling under GrandSLAm (Rhythm) only reduces resource usage at shared microservices without impacting other microservices. By contrast, Erms relies on priority scheduling to optimize resource allocation for all microservices, which increases resource usage for non-shared microservices. However, sacrificing these microservices substantially benefits shared microservices and therefore greatly reduces the overall resource usage, as illustrated in Figure 5. This result demonstrates that coordinating latency target computation and scheduling is critical for resource management in shared environments.
    We also investigate the impact of the \(\delta\) parameter on shared microservices to determine the optimal \(\delta\) value under various workload and SLA conditions, as depicted in Figure 16. For each scenario, we utilize two configurations, with the outcomes represented by green and blue lines in Figure 16. In the workload scenario, we modify the workload levels of shared microservices for high-priority and low-priority requests.
    Fig. 16. The delay of requests from services with different priorities (Low P and High P) at shared microservices under various \(\delta\).
    The green line in Figure 16(a) reveals that a small \(\delta\) value, ranging from 0.05 to 0.1, significantly reduces the latency of low-priority requests under low workloads, while only slightly increasing the latency of high-priority requests under high workloads. Specifically, when \(\delta\) is set at 0.1, the latency of low-priority requests decreases by 7.8%, while the latency of high-priority requests increases by a mere 1.3%. Consequently, a \(\delta\) value between 0.05 and 0.1 offers high performance for this configuration. As the workload for low-priority requests rises and that for high-priority requests diminishes, as denoted by the red line in Figure 16(a), the \(\delta\) value exhibits minimal influence on the latency of low-priority requests until it surpasses 0.4. This occurs because low-priority services necessitate a higher \(\delta\) value to decrease queuing time as their workload increases. A similar observation is evident in distinct SLA scenarios. With a \(\delta\) value set between 0.05 and 0.1, the latency of low-priority requests substantially declines, while the latency of high-priority requests experiences a minor increment, as demonstrated in Figure 16(b).

    6.4.3 Interference-aware Container Placement.

    In this section, we assess the performance improvement achieved through the implementation of an interference-aware container placement module under the Erms framework (refer to Section 5.4). We employ the iBench benchmark [11] to introduce varying degrees of interference, subsequently examining total resource consumption and tail latency under three different approaches: the Erms container placement policy, Erms-IPM, and the default deployment scheme of Kubernetes (K8S).
    As illustrated in Figure 17(a), the K8S scheduler necessitates over 50% more containers to fulfill SLA requirements in comparison to Erms-IPM, owing to its lack of resource interference awareness during container placement. Conversely, Erms achieves a 10% reduction in allocated containers relative to Erms-IPM by optimizing end-to-end latency in the presence of resource interference. In high SLA scenarios, the interference-aware container placement module can decrease resource utilization by up to 2 \(\times\) , a more significant effect than in low SLA settings. Two factors contribute to this observed phenomenon. First, high SLA settings result in diminished resource allocation, rendering microservice performance more susceptible to interference from background workloads. Second, high SLA settings lead to high latency targets for each microservice. As microservice latency escalates with interference, resource usage increases to maintain the same latency target under intensified interference. This demonstrates the importance of profiling microservice performance while considering interference-awareness in order to optimize resource allocation effectively.
    Fig. 17. The benefit of interference (ITF) aware deployment.
    We further assess the end-to-end latency for services under Erms, Erms-IPM, and K8S while utilizing the same amount of resources. As depicted in Figure 17(b), Erms significantly improves latency performance by 10% and 1.2 \(\times\) on average when compared to Erms-IPM and K8S, respectively. Notably, Erms outperforms K8S by 2.2 \(\times\) in high interference scenarios and by 2 \(\times\) in high SLA settings, showing its enhanced efficiency in optimizing service latency.

    6.5 Trace-driven Simulations

    To evaluate Erms on a large scale, we replay Alibaba microservice workloads to conduct trace-driven simulations for Taobao Application. This application includes 500+ services and each service contains 50 microservices on average. The total number of shared microservices is 300+.

    6.5.1 End-to-End Performance.

    We depict the distribution of the total number of containers deployed for each service in Figure 18(a). It shows that more than 80% of services require fewer than 2,000 containers under Erms, whereas these services need 6,000 containers under both GrandSLAm and Rhythm. In addition, Erms reduces the number of allocated containers by 1.6\(\times\) on average compared to the baseline schemes, as shown in Figure 18(b). This improvement is much larger than that under the real benchmarks, demonstrating that Erms has more opportunities to improve resource efficiency for services with complex call dependencies. We also evaluate the improvement of Latency Target Computation and Priority Scheduling, respectively. Results in Figure 18(b) show that Latency Target Computation alone can reduce resource usage by up to 1.2\(\times\). By contrast, Priority Scheduling leads to a reduction in resource usage of 50%. This improvement is also much higher than that obtained on the benchmarks since there are more shared microservices in the Alibaba traces.
    Fig. 18. Simulation results using Alibaba traces.

    6.5.2 Scalability of Erms.

    We evaluate the scaling overhead of Erms using Alibaba traces since their scale is much larger than that of DeathStarBench. The average overhead of Latency Target Computation is 15 ms on an Intel Xeon CPU. For the largest graph with 1,000+ microservices, the computational overhead is 300 ms. In addition, the overhead of resource provisioning is 200 ms on average. Most of the time, Erms only needs to scale no more than 1,000 containers across 5,000 hosts. Therefore, the overall scaling overhead is quite small, since a container usually requires several seconds to start [38].

    7 Discussion

    In this section, we discuss several practical issues concerning the deployment of Erms in a production environment.
    Modeling latency using linear functions. Erms chooses to quantify microservice latency using piece-wise linear functions. The key reason is that these functions can model microservice behavior well, as explained in Section 2.2. Moreover, the piece-wise linear function can achieve up to 86% profiling accuracy on Alibaba production workloads and DeathStarBench, even outperforming complicated models such as XGBoost and neural networks. Another advantage is that Erms can leverage piece-wise linear functions to derive closed-form expressions that assign optimal latency targets to each microservice. As a result, Erms achieves better performance than existing heuristics while remaining scalable to large-scale problems. In fact, linear functions are unsatisfactory for only a very small fraction of microservices, i.e., less than 3% in DeathStarBench, with a profiling accuracy around 62%. This is because the latency of these microservices is relatively small, making it difficult to predict accurately. Nonetheless, these microservices have a negligible impact on the end-to-end SLA, and Erms only allocates a small amount of resources to them.
    Handling resource-related exceptions. Resource-related exceptions, such as out-of-memory errors, rarely happen under Erms for two reasons. First, Erms computes latency targets across microservices based on the SLA requirement and the current workload, and it assigns each microservice a proper number of containers based on its latency target to avoid overload. Second, Erms rounds up the number of containers per microservice to an integer, which absorbs the negative impact of mispredictions and helps avoid exceptions. Moreover, this over-provisioning due to rounding up is negligible relative to the total number of containers per microservice (typically hundreds to thousands in production environments).

    8 Related Work

    Microservice autoscaling. GrandSLAm builds an execution framework for ML-based microservices [23]. However, it allocates microservice latency targets independently among different services without global coordination. Microscaler [45] adopts a Bayesian optimization approach to scale the number of instances for important microservices. Rhythm [47] builds an advanced model to quantify the contribution of each microservice. Firm [38] leverages machine-learning techniques to localize critical microservices that can have a heavy impact on the overall service performance under low-level resource interference. Most recently, Sinan [46] presents a CNN-based cluster resource manager for microservice architectures to guarantee QoS while maintaining high resource utilization. DeepRest [9] and GRAF [36] employ graph neural networks to accurately estimate resource allocation in microservices, particularly those with intricate dependency graphs. Meanwhile, ORION [31] models serverless latency as a stochastic distribution and subsequently utilizes convolution operations to determine the end-to-end latency for serverless applications. SLAOrchestrator [35] designs a double nested learning algorithm to dynamically provision the number of containers for ad-hoc data analytics. ATOM [21] and MIRAS [44] tune resources for microservices to improve the overall system throughput. None of these works investigates shared microservices.
    Microservice sharing. To handle microservice sharing, Q-Zilla [34] designs a decoupled size-interval task scheduling policy to minimize microservice tail latency based on resource reservation. \(\mu\)steal [33] partitions resources at shared microservices and makes use of work stealing to improve utilization. However, these schemes are not suitable for practical microservice architectures since they need to know the processing time of each microservice call in advance. Moreover, optimizing individual microservice latency cannot provide SLA guarantees on the end-to-end performance of online services.
    Graph analysis. Sage [17] builds a graphical model to identify the root cause of unpredictable microservice performance and dynamically adjusts resources accordingly. This is not scalable in a production environment since a practical application can consist of hundreds of microservices with complicated parallel or sequential dependencies. Parslo [32] adopts a gradient descent-based approach to break the end-to-end SLA into small unit SLOs. However, such an iterative approach is generally time-consuming and cannot be applied to dynamic workloads. Llama [40] and Kraken [5] aim to optimize performance for serverless systems and cannot be applied to general microservices.
    Interference mitigation. The problem of resource interference in cloud-related systems has been extensively investigated in the literature [7, 12, 27, 37]. These works focus on the co-scheduling of different applications, aiming to maximize application performance. The intention of Erms is different: Erms aims to mitigate interference-induced performance imbalance across different hosts so as to improve resource efficiency and provide end-to-end performance guarantees.

    9 Conclusion

    This paper presents a new method for dynamically allocating resources in shared microservice architectures through the use of explicit modeling. Our designs incorporate prioritization among various services, providing valuable insights into the effective deployment of online services. However, one limitation of Erms is its tendency to overprovision resources for online services with highly dynamic dependency graphs, as demonstrated in our experiments. A more promising approach would involve estimating resource allocation for graphs exhibiting different levels of dynamics, rather than relying solely on a complete graph. This would enable the scaling of minimal resources to satisfy the SLA for online services with diverse dependency graphs.

    References

    [3]
    Prometheus. 2022. https://prometheus.io/ (2022).
    [5]
    Vivek M. Bhasi, Jashwant Raj Gunasekaran, Prashanth Thinakaran, Cyan Subhra Mishra, Mahmut Taylan Kandemir, and Chita Das. 2021. Kraken: Adaptive container provisioning for deploying dynamic DAGs in serverless platforms. In Proceedings of SoCC.
    [6]
    S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press, Chapter 5.
    [7]
    Shuang Chen, Christina Delimitrou, and José F. Martínez. 2019. PARTIES: QoS-aware resource partitioning for multiple interactive services. In Proceedings of ASPLOS.
    [8]
    Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of SIGKDD.
    [9]
    Ka-Ho Chow, Umesh Deshpande, Sangeetha Seshadri, and Ling Liu. 2022. DeepRest: Deep resource estimation for interactive microservices. In Proceedings of EuroSys.
    [10]
    Docker containers. 2022. https://www.docker.com/ (2022).
    [11]
    Christina Delimitrou and Christos Kozyrakis. 2013. IBench: Quantifying interference for datacenter applications. In Proceedings of IISWC.
    [12]
    Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of ASPLOS.
    [13]
    Nicola Dragoni, Ivan Lanese, Stephan Thordal Larsen, Manuel Mazzara, et al. 2018. Microservices: How to make your application scale. In Lecture Notes in Computer Science.
    [14]
    Alibaba Cloud Microservices Engine. 2022. https://www.alibabacloud.com/product/microservices-engine (2022).
    [15]
    Google Kubernetes Engine. 2022. https://cloud.google.com/kubernetes-engine (2022).
    [16]
    Susan Fowler. 2016. Production-ready Microservices: Building Standardized Systems across an Engineering Organization. O'Reilly Media.
    [17]
    Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: Practical & scalable ML-driven performance debugging in microservices. In Proceedings of ASPLOS.
    [18]
    Yu Gan, Yanqi Zhang, et al. 2019. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of ASPLOS.
    [19]
    Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In Proceedings of ASPLOS.
    [20]
    Anshul Gandhi and Amoghvarsha Suresh. 2019. Leveraging queueing theory and OS profiling to reduce application latency. In International Middleware Conference Tutorials.
    [21]
    Alim Ul Gias, Giuliano Casale, and Murray Woodside. 2019. ATOM: Model-driven autoscaling for microservices. In Proceedings of ICDCS.
    [22]
    Mingyi Hong and Zhi Quan Luo. 2016. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming 162 (2016).
    [23]
    Ram Srivatsa Kannan, Lavanya Subramanian, Ashwin Raju, Jeongseob Ahn, and Jason Mars. 2019. GrandSLAm: Guaranteeing SLAs for jobs in microservices execution frameworks. In Proceedings of Eurosys.
    [24]
    Kubernetes. 2022. https://kubernetes.io (2022).
    [25]
    Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou. 2023. Ditto: End-to-end application cloning for networked cloud services. In Proceedings of ASPLOS.
    [26]
    Qixiao Liu and Zhibin Yu. 2018. The elasticity and plasticity in semi-containerized co-locating cloud workload: A view from Alibaba trace. In Proceedings of ACM SoCC.
    [27]
    David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. 2015. Heracles: Improving resource efficiency at scale. In Proceedings of ISCA.
    [28]
    Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, and Chengzhong Xu. 2021. Characterizing microservice dependency and performance: Alibaba trace analysis. In Proceedings of ACM SoCC.
    [29]
    Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Jian He, and Cheng-Zhong Xu. 2022. An in-depth study of microservice call graph and runtime performance. IEEE Transactions on Parallel and Distributed Systems.
    [30]
    Shutian Luo, Huanle Xu, Kejiang Ye, Guoyao Xu, Liping Zhang, Jian He, Guodong Yang, and Chengzhong Xu. 2023. Erms: Efficient resource management for shared microservices with SLA guarantees. In Proceedings of ASPLOS.
    [31]
    Ashraf Mahgoub, Edgardo Barsallo Yi, Karthick Shankar, Sameh Elnikety, Somali Chaterji, and Saurabh Bagchi. 2022. ORION and the three rights: Sizing, bundling, and prewarming for serverless DAGs. In Proceedings of OSDI.
    [32]
    Amirhossein Mirhosseini, Sameh Elnikety, and Thomas F. Wenisch. 2021. Parslo: A gradient descent-based approach for near-optimal partial SLO allotment in microservices. In Proceedings of ACM SoCC.
    [33]
    Amirhossein Mirhosseini and Thomas F. Wenisch. 2021. \(\mu\) steal: A theory-backed framework for preemptive work and resource stealing in mixed-criticality microservices. In Proceedings of ICS.
    [34]
    Amirhossein Mirhosseini, Brendan L. West, Geoffrey W. Blake, and Thomas F. Wenisch. 2020. Q-Zilla: A scheduling framework and core microarchitecture for tail-tolerant microservices. In Proceedings of HPCA.
    [35]
    Jennifer Ortiz, Brendan Lee, Magdalena Balazinska, Johannes Gehrke, and Joseph L. Hellerstein. 2018. SLAOrchestrator: Reducing the cost of performance SLAs for cloud data analytics. In Proceedings of ATC.
    [36]
    Jinwoo Park, Byungkwon Choi, Chunghan Lee, and Dongsu Han. 2021. GRAF: A graph neural network based proactive resource allocation framework for SLO-oriented microservices. In Proceedings of ACM CoNext.
    [37]
    Tirthak Patel and Devesh Tiwari. 2020. CLITE: Efficient and QoS-aware co-location of multiple latency-critical jobs for warehouse scale computers. In Proceedings of HPCA.
    [38]
    Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. FIRM: An intelligent fine-grained resource management framework for SLO-oriented microservices. In Proceedings of OSDI.
    [39]
    J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81–106.
    [40]
    Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines. In Proceedings of ACM SoCC.
    [41]
    Krzysztof Rzadca, Pawel Findeisen, et al. 2020. Autopilot: Workload autoscaling at Google. In Proceedings of EuroSys.
    [42]
    Akshitha Sriraman and Thomas F. Wenisch. 2018. \(\mu\) Tune: Auto-tuned threading for OLDI microservices. In Proceedings of OSDI.
    [44]
    Zhe Yang, Phuong Nguyen, Haiming Jin, and Klara Nahrstedt. 2019. MIRAS: Model-based reinforcement learning for microservice resource allocation over scientific workflows. In Proceedings of ICDCS.
    [45]
    Guangba Yu, Pengfei Chen, and Zibin Zheng. 2019. Microscaler: Automatic scaling for microservices with an online learning approach. In Proceedings of ICWS.
    [46]
    Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, G. Edward Suh, and Christina Delimitrou. 2021. Sinan: ML-based and QoS-aware resource management for cloud microservices. In Proceedings of ASPLOS.
    [47]
    Laiping Zhao, Yanan Yang, Kaixuan Zhang, Xiaobo Zhou, Tie Qiu, Keqiu Li, and Yungang Bao. 2020. Rhythm: Component-distinguishable workload deployment in datacenters. In Proceedings of EuroSys.
    [48]
    Hao Zhou, Ming Chen, Qian Lin, Yong Wang, Xiaobin She, Sifan Liu, Rui Gu, Beng Chin Ooi, and Junfeng Yang. 2018. Overload control for scaling WeChat microservices. In Proceedings of ACM SoCC.
    [49]
    Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. 2018. Poster: Benchmarking microservice systems for software engineering research. In Proceedings of ICSE.
