
Network-Aware Reliability Modeling and Optimization for Microservice Placement

Fangyu Zhang, Yuang Chen, Hancheng Lu, and Yongsheng Huang

Fangyu Zhang, Yuang Chen, Hancheng Lu, and Yongsheng Huang are with CAS Key Laboratory of Wireless-Optical Communications, University of Science and Technology of China, Hefei 230027, China (email: fv215b@mail.ustc.edu.cn; hclu@ustc.edu.cn; yuangchen21@mail.ustc.edu.cn; ysh6@mail.ustc.edu.cn).
Abstract

Optimizing microservice placement to enhance the reliability of services is crucial for improving the service level of microservice architecture-based mobile networks and Internet of Things (IoT) networks. Despite extensive research on service reliability, the impact of network load and routing on service reliability remains understudied, leading to suboptimal models and unsatisfactory performance. To address this issue, we propose a novel network-aware service reliability model that effectively captures the correlation between network state changes and reliability. Based on this model, we formulate the microservice placement problem as an integer nonlinear programming problem, aiming to maximize service reliability. Subsequently, a service reliability-aware placement (SRP) algorithm is proposed to solve the problem efficiently. To reduce bandwidth consumption, we further discuss the microservice placement problem with the shared backup path mechanism and propose a placement algorithm based on the SRP algorithm using shared path reliability calculation, known as the SRP-S algorithm. Extensive simulations demonstrate that the SRP algorithm reduces service failures by up to 29% compared to the benchmark algorithms. By introducing the shared backup path mechanism, the SRP-S algorithm reduces bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism. It also reduces service failures by up to 21% compared to the SRP algorithm with the shared backup path mechanism.

Index Terms:
Microservice Placement, Reliability Model, Network State, Fault Tolerance, Shared Backup Path

I Introduction

Cloud-native technologies [1] empower the creation and operation of applications in massively scalable distributed infrastructures, leveraging microservices architecture (MSA) [2] alongside platform technologies [3] like containers and virtual machines. With the Cloud Native Computing Foundation’s (CNCF) promotion of cloud-native technologies, MSA, which aims to improve software agility, is gradually coming into the limelight of both academia and industry. By splitting applications into microservices and interconnecting them using a lightweight application programming interface (API), MSA granulates complex services and provides an easier means of maintaining and updating software, accelerating new feature launches, and reducing manual costs. The benefits of MSA have led several major service providers such as Amazon, Netflix, and Spotify to use MSA to place their services [2], as well as making the Internet of Things (IoT) paradigm consider using MSA for smart manufacturing, Internet of Vehicles (IoV), and Industrial IoT (IIoT) [4]. Additionally, to fulfill service latency requirements [5] while avoiding the flooding of the backbone network with a large number of 5G and IoT devices [6], placing MSA-based services on infrastructure paradigms that are close to the users, such as edge or fog platforms [5], is emerging as a new trend for service placement in 5G environments [7].

Ensuring the reliability of 5G application services is crucial for improving users' quality of experience (QoE), and the MSA complicates this issue. For ultra-reliable low-latency communication (URLLC) services in 5G, such as telemedicine and autonomous driving, end-to-end service reliability of five nines (99.999%) or more is typically required [8]. To improve reliability, traditional monolithic applications typically need to consider both the software reliability of the program itself and the reliability of the hardware on which the application is placed [9]. However, with the introduction of MSA, placing microservices in a distributed manner means that the service needs to bear more risk of failure from both hardware and software [10]. Therefore, how to improve the overall reliability of services when placing microservices has become an urgent problem.

A number of studies have been conducted to analyze the reliability model [11, 12, 13, 10, 14, 15] and place microservices more reliably. Reliability models can be divided into two categories: hardware reliability models [11, 12, 13] and software reliability models [14, 15, 16]. Based on the research on hardware and software reliability models, the reliability modeling studies for MSA-based services [17] comprehensively consider the overall reliability of the service after placing distributed software into the hardware. Since the placement strategy affects the service reliability, several works have been done to study microservice placement to enhance reliability [18, 19, 20, 21, 22, 23]. Microservice placement work can be categorized into placement in the cloud [18, 19, 20] and placement in the edge or fog [21, 22, 23, 24] based on the application scenario. Due to the variety of application scenarios, they have addressed different issues in terms of resources, quality of service, and reliability, and hence differ in service reliability modeling.

However, the dynamic nature of the network state caused by network load [25] and routing [26] has not been well studied in service reliability modeling, which brings new challenges to microservice placement. Network load has been shown to be negatively correlated with hardware reliability, i.e., the higher the load, the lower the reliability. In this case, hardware reliability is always changing dynamically during microservice placement. Network routing refers to the routing between microservices. On the one hand, the hardware reliability on the communication path is also load-dependent. On the other hand, with the maturity of multipath routing technologies, multipath routing can also have a significant impact on service reliability [27, 28, 29]. As a result, the impact of dynamic network state caused by network load and routing cannot be ignored in service reliability modeling as well as microservice placement.

In this paper, we propose a network-aware service reliability model to address the aforementioned issue. Firstly, to consider the impact of network state changes on reliability modeling, the network state is sensed by building a load-dependent hardware node reliability model and a routing-dependent path reliability model. Then, as each microservice placement and routing between microservices may change the reliability of the infrastructure network, a network-aware placement algorithm is proposed to achieve optimal service reliability performance. Furthermore, to reduce the bandwidth consumption of the infrastructure network, we investigate the microservice placement problem with the shared backup path mechanism. In this case, shared backup path contention due to simultaneous backup failures brings the microservice placement problem a new network state change factor. Since contending paths may lead to routing failures of backup instances, the contention probability of shared backup paths is considered carefully when placing microservices. The main contributions of this article are summarized as follows:

  • We propose a network-aware service reliability model to characterize the dynamic network states, with consideration of the load-dependent node reliability and the path reliability of multipath routing as well as the impacts of hardware and software decoupling and backup instances on the reliability of microservices. Simulation results validate the proposed reliability model with different algorithms in terms of the number of service failures.

  • Based on the proposed service reliability model, we formulate a microservice placement problem and then propose a service reliability-aware placement (SRP) algorithm to achieve maximum service reliability. The proposed algorithm evaluates the network-aware reliability of each microservice as it is placed. Simulation results show that the proposed algorithm reduces the number of service failures by up to 29% compared to the benchmark algorithms.

  • To reduce bandwidth consumption, we further investigate the microservice placement problem with the shared backup path mechanism and propose an algorithm based on the SRP algorithm using shared path reliability calculation (i.e., the SRP-S algorithm). The proposed algorithm approximates the contention probability by calculating the probability that a single failure causes contention on the shared backup path and then reduces the occurrence of contention by combining this probability with network-aware service reliability. Simulation results show that the SRP-S algorithm reduces the bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism and reduces the number of service failures by up to 21% compared to the SRP algorithm with the shared backup path mechanism.

The rest of the paper is organized as follows. Section II discusses the work related to reliability modeling and reliability-aware microservice placement over protected or shared paths. Section III gives the modeling procedure for the system model and the proposed service reliability model. In Section IV, we formulate the microservice placement problem with the fully protected and shared backup paths. The corresponding algorithms to solve them are proposed in Section V and Section VI, respectively. Simulation results are discussed in Section VII. Finally, the paper concludes in Section VIII.

II Related Work

Due to the distributed nature of MSA-based services, assessing service reliability is critical to ensure the failure tolerance of the MSA-based service, which has been addressed by research on reliability models and microservice placement.

II-A Reliability Model

Studies on reliability models can be classified into two main categories based on failure causes: hardware reliability and software reliability [15]. For hardware reliability, most studies have modeled the arrival of hardware failures as Poisson processes [30]. Wang et al. [9] proposed an instance-sharing reliability model to aggregate multiple services into a composite and proposed an algorithm to improve its reliability. Zhu et al. [12] proposed a load-dependent node reliability model to capture the relationship between failure probabilities and workloads and introduced a recovery strategy to handle workload variations. Mtawa et al. [31] proposed a link reliability model to assess the all-pair reliability of the network and tested it on both conventional networks and SDN. Similar to hardware failures, the arrival of software failures is also usually modeled as a Poisson process [10]. Liu et al. [14] considered the software reliability problem in the framework of uncertainty theory and proposed a software reliability growth model based on uncertain differential equations.

Due to the different causes of hardware and software failures, it is inaccurate to consider the combination of hardware and placed software as a singular entity when modeling service reliability. Therefore, the service reliability model in the microservice placement problem should consider both hardware and software reliability [15, 22]. Qiu et al. [15] investigated the reliability model and fault recovery of cloud computing platforms, where the reliability model considered both hardware reliability and software reliability. Martin et al. [22] pointed out that the unavailability of a service is determined by software failures and hardware failures together and proposed a hardware-software decoupled reliability model.

II-B Microservice Placement

Microservice placement has been studied in a variety of scenarios [10, 32, 33]. Liu et al. [10] proposed an approach based on multi-agent systems to maximize the reliability of services in cloud environments. Zhao et al. [32] considered the heterogeneity of edge environments and the uncertainty of service requests. They modeled the microservice placement problem as a stochastic optimization problem and proposed a statistics-based approach to solve it. Baranwal et al. [33] investigated the truthfulness of fog owners in fog environments and proposed a heuristic algorithm to ensure the truthfulness of fog owners and the reliability of services.

Since the reliability of MSA-based services depends on the microservice placement strategy, there have been many studies that investigated placing microservices with the goal of improving service reliability [8, 22, 23, 24]. Zeng et al. [8] formulated the microservice placement problem as an integer nonlinear programming problem and proposed a deep reinforcement learning scheme based on expert intervention to ensure high reliability and low latency of the service. Martin et al. [22] modeled the microservice placement problem as a multi-objective optimization problem and proposed a meta-heuristic algorithm to deal with the conflict between reliability and cost in the optimization objectives. Dadashi et al. [23] enhanced the reliability of the service by using backups and proposed a reliability-aware and delay-efficient heuristic algorithm to solve the microservice placement problem. The authors in [24] formulated the microservice placement problem in fog as a multi-objective optimization problem and proposed a fault-tolerant mechanism to improve the reliability of microservices while reducing power consumption and latency. However, while most works have used backup instances to enhance service reliability, the impact of network routing on service reliability has not been well studied.

Turning the perspective to routing, it can be seen that multipath routing techniques can significantly enhance the reliability of paths between microservices [27, 28, 29]. Le et al. [27] proposed a reliable service provisioning scheme to optimize network resource utilization by using multipath routing. The authors in [28] proposed a topology-agnostic multipath source routing scheme and orchestration architecture and verified its performance in improving communication reliability. Qu et al. [29] formalized the microservice placement problem as a mixed-integer linear programming problem and proposed a delay-aware hybrid multipath routing scheme to improve the reliability of network services.

To reduce bandwidth consumption, some researchers have considered using the shared backup path mechanism for network routing [34, 35, 36]. Saidi et al. [34] proposed two shared path mechanisms, including shared backup paths and shared all paths, to conserve bandwidth resources during network routing. Zheng et al. [35] used shared backup path protection to improve bandwidth capacity limits for elastic optical networks and used backup paths to improve system reliability during network routing. Ergenc et al. [36] used shared backup paths for service placement. They asserted that the proposed shared backup capacity model can bring up to 70% capacity gain and provide more than 90% fault tolerance for single node failure.

In the aforementioned related work, few works are aware of the impact of the load state and routing state in the network on the service reliability model as well as the microservice placement strategy. In addition, the reliability gain of backup instances of microservices after the decoupling of hardware and software has also not been well studied.

Figure 1: The process of placing microservices into the infrastructure network.

III System Model

In this section, we give the system model and network-aware service reliability modeling. We first introduce the infrastructure network model for microservice placement and the MSA-based service request model in Sec. III-A and Sec. III-B, respectively. Then, to clarify the difference between existing work and ours, we introduce the hardware and software reliability models used in this paper in Sec. III-C, which serve as the basis for our network-aware service reliability modeling. In Sec. III-D, we formalize the network-aware service reliability model, which is sensitive to load-dependent node reliability and routing-dependent multipath reliability, and which accounts for the impact of hardware and software reliability decoupling and backup instances on service reliability.

III-A Infrastructure Network Model

The infrastructure network model is established to provide the underlying network for microservice placement. As shown in Fig. 1, we represent the infrastructure network with an undirected graph, $G=(N,E)$, where $N=\{n_1,n_2,\cdots,n_{|N|}\}$ denotes the set of physical nodes, and $E=\{e_{12},e_{23},\cdots,e_{(|N|-1)|N|}\}$ represents the set of physical links in the infrastructure network. For any physical node $n_i$, considering the most commonly utilized CPU core resources, we use $c(n_i)$ to denote the amount of resources that have been allocated, and $C(n_i)$ to represent its total resource capacity. For any physical link $e_{ij}$, the allocated bandwidth resources and total bandwidth resources are denoted by $bw(e_{ij})$ and $BW(e_{ij})$, respectively.
Additionally, the reliability of physical nodes and links is denoted as $r_n$ and $r_e$, respectively, which describes the probability of no failure.
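
As an illustration only, the infrastructure graph described above could be held in a small Python structure such as the following sketch. The class names, the method names (`add_node`, `add_link`, `link`, `allocate_cpu`), and all numeric values are assumptions introduced here for illustration and do not come from the paper; only the quantities $c(n)$, $C(n)$, $bw(e)$, $BW(e)$, $r_n$, and $r_e$ are taken from the model.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalNode:
    capacity: int              # C(n): total CPU cores
    allocated: int = 0         # c(n): CPU cores already allocated
    reliability: float = 1.0   # r_n: probability of no failure


@dataclass
class PhysicalLink:
    capacity: float            # BW(e): total bandwidth
    allocated: float = 0.0     # bw(e): bandwidth already allocated
    reliability: float = 1.0   # r_e: probability of no failure


@dataclass
class Infrastructure:
    nodes: dict = field(default_factory=dict)
    links: dict = field(default_factory=dict)

    def add_node(self, name, capacity, reliability=1.0):
        self.nodes[name] = PhysicalNode(capacity, 0, reliability)

    def add_link(self, a, b, capacity, reliability=1.0):
        # A frozenset key makes e_ab and e_ba the same undirected link.
        self.links[frozenset((a, b))] = PhysicalLink(capacity, 0.0, reliability)

    def link(self, a, b):
        return self.links[frozenset((a, b))]

    def allocate_cpu(self, name, cores):
        # Enforce c(n) <= C(n) when a microservice is placed on the node.
        node = self.nodes[name]
        if node.allocated + cores > node.capacity:
            raise ValueError("insufficient CPU on " + name)
        node.allocated += cores


# Hypothetical two-node network.
g = Infrastructure()
g.add_node("n1", capacity=8)
g.add_node("n2", capacity=16)
g.add_link("n1", "n2", capacity=100.0, reliability=0.999)
g.allocate_cpu("n1", 3)
```

Keying links by an unordered pair is one simple way to respect the undirected nature of $G$; a graph library would serve equally well.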

III-B Service Request Model

The service request model is established to describe information related to service requests. We use $S$ to represent the set of all service requests. Each service request is represented as $s^i=(G^i,B^i,\Upsilon^i,D^i,\Omega^i)$, where $G^i$ is a directed acyclic graph (DAG) used to represent the microservices in the service request and their dependencies, $B^i$ denotes the maximum number of backups for microservices in the service request, $\Upsilon^i$ represents the amount of data transferred between microservices in service request $s^i$, $D^i$ denotes the set of latency deadlines for each microservice in the service request, and $\Omega^i$ denotes the lifetime of service request $s^i$.
In the service request model described above, nodes in the microservice dependency graph $G^i$ represent microservices, and directed edges represent microservice links and invocation relationships between source and destination microservices. The microservice dependency graph $G^i$ and the microservice model with $B^i$ backup constraints are described in detail as follows.

The microservice dependency graph $G^i=(M^i,L^i)$ consists of the microservice set $M^i=\{m^i_0,m^i_1,m^i_2,\cdots,m^i_{|M|}\}$ and the microservice link set $L^i=\{l^i_{m_1m_2},l^i_{m_2m_3},\cdots,l^i_{m_{|M|-1}m_{|M|}}\}$.
In microservice set $M^i$, $m^i_0$ is a special virtual microservice that represents the request access location and does not consume the computational resources of the access node. $m^i_1$ represents the root node of the microservice graph. Additionally, each microservice $m$ has a fixed CPU core resource requirement $c(m)$ and reliability $r_m$. Microservice link $l$ has a bandwidth resource requirement and reliability, represented as $bw(l)$ and $r_l$, respectively.

Microservice backups are placed as new instances of the primary microservices to enhance the overall reliability of the service. When the primary microservice instance is operating normally, backup microservice instances need to occupy computing resources on physical nodes and utilize bandwidth resources on physical links to provide failure tolerance. When the primary microservice instance fails, if a backup microservice instance is available, the microservice can still connect to upstream or downstream microservices through it, allowing the service to continue running. We use $m^i_j(b)$ to denote the $b$-th microservice instance of microservice $m^i_j$, where $b\in B^i_m$ and $B^i_m$ represents the set of backup instance indexes for microservice $m$. For convenience, we abbreviate the primary microservice $m^i_j(1)$ as $m^i_j$ in the following. In addition, $B^i$ in the service request model ensures that the number of microservice backups is limited to prevent unlimited resource consumption.
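
To make the request tuple $s^i=(G^i,B^i,\Upsilon^i,D^i,\Omega^i)$ concrete, the sketch below stores it in a Python dataclass and checks that the dependency graph is acyclic. The field names, the `topological_order` helper, and the sample values are hypothetical and serve only as illustration; the sketch relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass
class ServiceRequest:
    microservices: dict   # M^i: name -> CPU core requirement c(m)
    links: dict           # L^i: (parent, child) -> bandwidth requirement bw(l)
    max_backups: int      # B^i: maximum number of backups per microservice
    data: dict            # Upsilon^i: (parent, child) -> data volume
    deadlines: dict       # D^i: name -> latency deadline
    lifetime: float       # Omega^i: lifetime of the request

    def topological_order(self):
        """Return microservices parents-first; raise CycleError if G^i is not a DAG."""
        predecessors = {m: set() for m in self.microservices}
        for parent, child in self.links:
            predecessors[child].add(parent)
        return list(TopologicalSorter(predecessors).static_order())


# Hypothetical request: virtual access microservice m0, root m1, two children.
req = ServiceRequest(
    microservices={"m0": 0, "m1": 2, "m2": 1, "m3": 1},
    links={("m0", "m1"): 10.0, ("m1", "m2"): 5.0, ("m1", "m3"): 5.0},
    max_backups=2,
    data={("m0", "m1"): 1.0, ("m1", "m2"): 0.5, ("m1", "m3"): 0.5},
    deadlines={"m1": 0.05, "m2": 0.05, "m3": 0.05},
    lifetime=3600.0,
)
order = req.topological_order()
```

Here `m0` carries a CPU requirement of zero, mirroring the statement that the virtual access microservice does not consume computational resources.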

Since microservice instances typically have latency requirements, we consider that each microservice link has a latency, including transmission latency, propagation latency, and processing latency from child microservices, and represent the latency between a pair of parent-child microservices $d_{\tau m}$ as follows:

$$d_{\tau m}=\frac{\upsilon^i_{\tau m}}{bw(l_{\tau m})}+\min_{p\in P_{l_{\tau m}}}\Big\{\sum_{e\in p}d_e\Big\}+d_m, \tag{1}$$

where $\tau$ is a parent microservice instance of $m$, $\upsilon^i_{\tau m}\in\Upsilon^i$ represents the amount of data transmitted from microservice $m$ to $\tau$ after processing, $d_e$ represents the propagation latency of physical link $e$, $P_{l_{\tau m}}$ denotes the set of paths on which link $l_{\tau m}$ is placed, and $d_m$ represents the processing latency of microservice $m$. When a microservice link is placed on multiple paths, only the latency of the shortest path is considered.
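
A minimal Python sketch of Eq. (1) is given below. The function name `link_latency`, its argument names, and the units and numbers in the example are assumptions made for illustration; each path is passed as a list of per-link propagation latencies $d_e$.

```python
def link_latency(data_volume, bandwidth, paths, processing_latency):
    """Evaluate Eq. (1): transmission + shortest-path propagation + processing.

    data_volume        -- upsilon^i_{tau m}
    bandwidth          -- bw(l_{tau m}), must be positive
    paths              -- P_{l_{tau m}}: iterable of paths, each a list of d_e values
    processing_latency -- d_m
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    paths = list(paths)
    if not paths:
        raise ValueError("the link must be placed on at least one path")
    transmission = data_volume / bandwidth
    # Only the path with the smallest total propagation latency counts.
    propagation = min(sum(path) for path in paths)
    return transmission + propagation + processing_latency


# Hypothetical numbers: 10 Mb over 100 Mb/s, two candidate paths, 10 ms processing.
latency = link_latency(
    data_volume=10.0,
    bandwidth=100.0,
    paths=[[0.002, 0.003], [0.004]],
    processing_latency=0.010,
)
```

With these assumed values, the transmission term is 0.1 s, the second path gives the smaller propagation term of 0.004 s, and the total is roughly 0.114 s.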

III-C Hardware And Software Reliability Model

This subsection describes the reliability modeling of the elements involved in the microservice placement process, including hardware and software. First, the load dependency of hardware node reliability is the basis for service reliability to sense the network load, and the hardware link reliability constitutes the smallest unit of network routing-aware path reliability. Second, software reliability directly affects the service reliability gain of the backup instances. Additionally, software link reliability will directly affect the path reliability gained from multipath routing. The specific model is as follows:

  1. Hardware Reliability

    The reliability of physical nodes has been demonstrated to be correlated with the workload running on them [12, 11]. Load-dependent network node reliability affects the placement strategy, as densely placed microservices lead to lower reliability of physical nodes, while uniformly distributed microservices maintain a low load state of nodes at the cost of consuming more bandwidth resources. To describe the load dependence of dynamic network node reliability, we refer to the work of Zhu et al. [12] and represent the reliability $r_n$ of a physical node $n$ as a two-segment function as follows:

    $$r_n(c(n))=\begin{cases} r^L_n, & c(n)\leq\xi(n)\\ r^H_n, & \xi(n)<c(n)\leq C(n) \end{cases}, \tag{2}$$

    where $\xi(n)$ represents the load threshold at which the reliability of the physical node changes, and $r^L_n$ and $r^H_n$ denote the reliability of the physical node under low-load and high-load conditions, respectively.

    The failure arrival of a physical link $e$ is usually modeled as a Poisson process [30]. Therefore, we model its reliability $r_e$ as follows:

    $$r_e(t)=Pr(x=0)=e^{-\lambda t}, \tag{3}$$

    where $Pr(\cdot)$ denotes probability, $\lambda$ represents the mean failure arrival rate, and $t$ denotes the time that the physical link has been operational.

  2.

    Software Reliability

    Software reliability encompasses the reliability of microservices and microservice links. Because software failures often stem from human factors, such as program design and environment configuration, they are typically modeled using specific software failure statistics. Nonetheless, the choice of software failure model does not affect the subsequent analyses of service reliability, so our work is compatible with arbitrary models. In this work, the arrival of software failures is modeled as a Poisson process. Therefore, the reliability $r_m$ and $r_l$ of microservice $m$ and microservice link $l$ can be respectively represented as

    $$r_m(t) = Pr(x = 0) = e^{-\lambda_1 t}, \tag{4}$$
    $$r_l(t) = Pr(x = 0) = e^{-\lambda_2 t}, \tag{5}$$

    where $Pr(x = 0)$ denotes the probability that no failure has occurred by time $t$, and $\lambda_1$ and $\lambda_2$ are the mean failure arrival rates for microservices and microservice links, respectively.
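The hardware and software reliability models above can be illustrated with a minimal Python sketch. The function names and the numeric values are hypothetical and chosen only for illustration; the sketch assumes the two-segment form of Eq. (2) and the Poisson form shared by Eqs. (3) to (5).

```python
import math


def node_reliability(load, threshold, capacity, r_low, r_high):
    """Two-segment node reliability in the spirit of Eq. (2).

    load      -- current CPU occupancy c(n)
    threshold -- load threshold xi(n)
    capacity  -- CPU capacity C(n)
    r_low     -- reliability under low load, r^L_n
    r_high    -- reliability under high load, r^H_n
    """
    if load < 0 or load > capacity:
        raise ValueError("load must lie in [0, capacity]")
    return r_low if load <= threshold else r_high


def poisson_reliability(failure_rate, t):
    """Probability of zero failures in [0, t] for a Poisson failure process.

    This single form covers Eqs. (3), (4) and (5); only the rate differs.
    """
    return math.exp(-failure_rate * t)


# Hypothetical values: a node below its threshold and a link with rate 1e-4.
print(node_reliability(load=3.0, threshold=4.0, capacity=8.0,
                       r_low=0.999, r_high=0.99))
print(poisson_reliability(failure_rate=1e-4, t=100.0))
```

Because Eqs. (3) to (5) differ only in the failure rate, one helper is enough here; a different software failure model could be swapped in without touching the node model.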

III-D Network-Aware Service Reliability Model

In this subsection, network-aware service reliability is modeled to assess the reliability level of the microservice placement strategy. Network-aware service reliability considers network-aware reliability, reliability of microservice dependencies, and reliability gain of backup microservice instances. Among them, network-aware reliability specifically refers to routing-dependent multipath reliability consisting of load-dependent physical node reliability and physical link reliability.

First, we consider network-aware reliability. Since the reliability of a physical node is related to the load of microservices running on it, we define a binary variable $x^m_n$ to indicate whether microservice $m$ is placed on node $n$ or not, where $x^m_n = 1$ means that microservice $m$ is placed on node $n$. In addition, we define an extra binary variable $y^l_e$ to indicate whether the microservice link $l$ is placed on the link $e$. Then we can use the defined binary variable $x^m_n$ to represent the physical node $n^m$ where microservice $m$ is placed as follows:

$$n^m = \sum_{n \in N} x^m_n n. \tag{6}$$

In addition, the CPU resource occupancy $c(n)$ can be represented as

$$c(n) = \sum_{s_i \in S} \sum_{m \in M^i} x^m_n c(m). \tag{7}$$

Thus, the dynamic reliability of a physical node can be determined by Eq. (2) and Eq. (7).
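As a minimal sketch of how Eq. (7) feeds Eq. (2), the following Python code computes the CPU occupancy of each node from a placement and then looks up the load-dependent reliability. The dictionary-based placement (microservice name mapped to node name, equivalent to $x^m_n = 1$), the function names, and all numeric values are assumptions made for illustration.

```python
def cpu_occupancy(placement, cpu_demand, nodes):
    """Eq. (7): sum the CPU demand c(m) of every microservice placed on n.

    placement  -- dict: microservice -> node (x^m_n = 1 iff placement[m] == n)
    cpu_demand -- dict: microservice -> c(m)
    nodes      -- iterable of node names
    """
    load = {n: 0.0 for n in nodes}
    for m, n in placement.items():
        load[n] += cpu_demand[m]
    return load


def dynamic_node_reliability(load, threshold, capacity, r_low, r_high):
    """Apply the two-segment rule of Eq. (2) to every node."""
    reliability = {}
    for n, c in load.items():
        if c > capacity[n]:
            raise ValueError(f"node {n} exceeds its CPU capacity")
        reliability[n] = r_low[n] if c <= threshold[n] else r_high[n]
    return reliability


# Hypothetical example: two microservices share n1, one runs alone on n2.
nodes = ["n1", "n2"]
placement = {"a": "n1", "b": "n1", "c": "n2"}
cpu_demand = {"a": 2.0, "b": 3.0, "c": 1.0}
load = cpu_occupancy(placement, cpu_demand, nodes)
rel = dynamic_node_reliability(
    load,
    threshold={"n1": 4.0, "n2": 4.0},
    capacity={"n1": 8.0, "n2": 8.0},
    r_low={"n1": 0.999, "n2": 0.999},
    r_high={"n1": 0.99, "n2": 0.99},
)
print(load, rel)
```

In this example the dense placement on n1 pushes it past its threshold, so it takes the high-load reliability while n2 keeps the low-load value.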

Path reliability is considered since not only does the operation of microservices require node reliability, but also microservice links require that all physical nodes and links in their paths are reliable. We denote the $j$-th path where the link between two microservice instances is placed by $p^{mm'}_j$, which is a set that contains all nodes and edges on the path, but not the source and destination nodes. We can then represent the path reliability $r_p$ for a single path $p$ as follows:

$$r_p = \prod_{n \in p} r_n \prod_{e \in p} r_e. \tag{8}$$

Next, we can represent the total path reliability $r_P$ of a path set $P$ containing multiple paths as follows:

$$r_P = 1 - \prod_{p \in P} (1 - r_p). \tag{9}$$
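Eqs. (8) and (9) can be sketched in a few lines of Python. The sketch assumes that each path is described by the intermediate nodes and the links it traverses, and that the node product in Eq. (8) runs over the nodes on the path; the function names and numeric values are hypothetical.

```python
from math import prod


def path_reliability(nodes, links, r_node, r_link):
    """Eq. (8): a path works only if all of its nodes and links work."""
    return prod(r_node[n] for n in nodes) * prod(r_link[e] for e in links)


def path_set_reliability(path_reliabilities):
    """Eq. (9): the path set fails only if every path in it fails."""
    return 1.0 - prod(1.0 - r for r in path_reliabilities)


# Hypothetical example: two paths between the same pair of end nodes.
r_node = {"a": 0.9, "b": 0.8}
r_link = {"sa": 0.99, "at": 0.99, "sb": 0.99, "bt": 0.99}
r_p1 = path_reliability(["a"], ["sa", "at"], r_node, r_link)
r_p2 = path_reliability(["b"], ["sb", "bt"], r_node, r_link)
print(r_p1, r_p2, path_set_reliability([r_p1, r_p2]))
```

Note that Eq. (9) treats path failures as independent, which is why the text later restricts attention to internally disjoint paths.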

For the subsequent calculations, we need to record the path information when calculating the reliability of each path or multiple paths. We define a function $\mathsf{P}(\cdot)$ that is used to query the corresponding path set from the calculated path reliability as follows:

$$\mathsf{P}(r_P) = P, \tag{10}$$

where $P$ is the set of paths corresponding to the total path reliability $r_P$.

Next, we consider the total path reliability of a single microservice link placed on multiple paths. In a general network, the two-terminal network reliability is the most accurate measure of total path reliability. However, in the microservice placement problem, routes are fixed at placement time to ensure resource provisioning and cannot be freely switched at runtime. Therefore, we consider the reliability sum of a finite number of paths instead of the two-terminal reliability. Specifically, we consider all internally disjoint paths (IDPs) to avoid common cause faults (CCFs) [13].

We propose a microservice path reliability matrix to describe the reliability of all IDPs between two microservices. First, we express the reliability of a physical node $n_i$ and the link $e_{ij}$ connected to it as $\mathsf{r}_{ij} = r_{n_i} r_{e_{ij}}$. Then, we propose a one-step path reliability matrix $R$ using the concept of the adjacency matrix as follows:

$$R = R^{(1)} = \begin{bmatrix} 0 & \mathsf{r}_{12} & \mathsf{r}_{13} & \cdots & \mathsf{r}_{1|N|} \\ \mathsf{r}_{21} & 0 & \mathsf{r}_{23} & \cdots & \mathsf{r}_{2|N|} \\ \vdots & & \ddots & & \vdots \\ \mathsf{r}_{|N|1} & \mathsf{r}_{|N|2} & \cdots & & 0 \end{bmatrix}, \tag{11}$$

where $\mathsf{r}_{ij} = 0$ if the physical link $e_{ij}$ does not exist. Then, for ease of representation in subsequent calculations, we define the path reliability operators as follows:

$$x \oplus y = 1 - (1 - x)(1 - y) = x + y - xy, \qquad x, y \in [0, 1], \tag{12}$$

$$x \ominus y = \frac{x - y}{1 - y}, \qquad x \in [0, 1],\ y \in [0, x), \tag{13}$$

$$x \otimes y = \begin{cases} 0, & p_x \cap p_y \neq \emptyset,\ p_y \in \mathsf{P}(y),\ p_x \in \mathsf{P}(x) \\ xy, & \text{else} \end{cases}, \qquad x, y \in [0, 1], \tag{14}$$

where $x \oplus y$ denotes the reliability sum of two paths, $x \ominus y$ denotes the reliability sum of multiple paths minus the reliability of one of the paths, and $x \otimes y$ denotes the reliability of two paths merged.
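The three operators can be sketched in Python as follows. The function names are hypothetical. For Eq. (14), the sketch assumes one particular reading of the condition, namely that the result is zero as soon as any path behind $x$ shares an element with any path behind $y$, and it passes the recorded path sets explicitly instead of looking them up through $\mathsf{P}(\cdot)$.

```python
def oplus(x, y):
    """Eq. (12): reliability sum of two independent paths."""
    return x + y - x * y


def ominus(x, y):
    """Eq. (13): remove the contribution y from a reliability sum x."""
    if not (0.0 <= y < x <= 1.0):
        raise ValueError("requires 0 <= y < x <= 1")
    return (x - y) / (1.0 - y)


def otimes(x, y, paths_x, paths_y):
    """Eq. (14): reliability of two merged paths.

    paths_x, paths_y -- iterables of sets holding the nodes and links of
    the paths behind x and y (an explicit stand-in for P(x) and P(y)).
    Assumption: any shared element between any pair of paths gives 0.
    """
    if any(px & py for px in paths_x for py in paths_y):
        return 0.0
    return x * y


# oplus and ominus are inverse to each other on the stated domain.
total = oplus(0.9, 0.8)
print(total, ominus(total, 0.8))

# Merging disjoint segments multiplies; overlapping segments give zero.
print(otimes(0.9, 0.8, [{"n1", "e12"}], [{"n2", "e23"}]))
print(otimes(0.9, 0.8, [{"n1", "e12"}], [{"n1", "e13"}]))
```

The inverse relation between `oplus` and `ominus` is what allows one path to be taken out of an already aggregated reliability without recomputing the rest.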

Based on the path reliability operator and the preservation of the path information corresponding to reliability, we define the multiplication of the path reliability matrix as follows:

$$A \wedge B = C = [c_{ij}], \qquad c_{ij} = \begin{cases} a_{i1} \otimes b_{1j} \oplus a_{i2} \otimes b_{2j} \oplus \cdots \oplus a_{in} \otimes b_{nj}, & i \neq j \\ 0, & i = j \end{cases}, \tag{15}$$

where $c_{ij}$ represents the reliability sum of equal-length paths from physical node $i$ to $j$. Thus, we can represent the $k$-th order path reliability matrix as follows:

$$R^{(k)} = R^{(k-1)} \wedge R^{(1)}, \qquad k \geq 2, \tag{16}$$

where $k$ represents the length of the path.

Next, we define path reliability matrix addition as follows:

$$A \vee B = C = [c_{ij}], \qquad c_{ij} = \begin{cases} a_{ij}, & p_a \cap p_b \neq \emptyset,\ p_a \in \mathsf{P}(a_{ij}),\ p_b \in \mathsf{P}(b_{ij}) \\ a_{ij} \oplus b_{ij}, & \text{else} \end{cases}. \tag{17}$$

As a result, we can represent the reliability matrix of all paths with a maximum length of $k$ as follows:

$$\hat{R}^{(k)} = R^{(1)} \vee R^{(2)} \vee \cdots \vee R^{(k)}, \tag{18}$$

where the elements $\hat{\mathsf{r}}^{(k)}_{ij}$ in $\hat{R}^{(k)}$ represent the reliability of all IDPs from physical node $n_i$ to physical node $n_j$ with a length not greater than $k$. However, the total path reliability is inaccurate because each path includes the source node within it and does not meet the definition of an IDP. Therefore, we denote the path set $\mathsf{P}'(\hat{\mathsf{r}}^{(k)}_{ij})$ after removing the source nodes of all paths as follows:

$$\mathsf{P}'(\hat{\mathsf{r}}^{(k)}_{ij}) = \{ p' \mid p' = p \backslash \{n_i\},\ p \in \mathsf{P}(\hat{\mathsf{r}}^{(k)}_{ij}) \}. \tag{19}$$

Finally, we can represent the network-aware reliability matrix as follows:

$$\mathcal{R}^{(k)} = \left[ r^{(k)}_{ij} \right], \qquad r^{(k)}_{ij} = \begin{cases} r_{\mathsf{P}'(\hat{\mathsf{r}}^{(k)}_{ij})}, & i \neq j \\ 1, & i = j \end{cases}. \tag{20}$$
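To make the meaning of an entry $r^{(k)}_{ij}$ concrete, the following Python sketch computes a comparable quantity for one node pair. It does not implement the matrix recursion of Eqs. (15) to (18). Instead it swaps in a brute-force approach: enumerate simple paths of at most $k$ links, greedily keep internally disjoint ones (shortest first), and combine them with Eqs. (8) and (9), leaving out the source and destination nodes. The greedy selection is an assumption and does not necessarily find the largest set of disjoint paths; all names and values are hypothetical.

```python
from math import prod


def simple_paths(adj, src, dst, max_hops):
    """Yield simple paths (node lists) from src to dst with <= max_hops links."""
    stack = [[src]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == dst:
            yield path
            continue
        if len(path) - 1 == max_hops:
            continue
        for nxt in adj.get(last, ()):
            if nxt not in path:
                stack.append(path + [nxt])


def network_aware_reliability(adj, r_node, r_link, src, dst, max_hops):
    """Reliability of greedily chosen internally disjoint paths src -> dst.

    r_link is keyed by frozenset({u, v}), i.e. links are undirected.
    Source and destination nodes are excluded from every path.
    """
    if src == dst:
        return 1.0
    used, chosen = set(), []
    candidates = sorted(simple_paths(adj, src, dst, max_hops),
                        key=lambda p: (len(p), p))
    for path in candidates:
        links = [frozenset(e) for e in zip(path, path[1:])]
        elements = set(path[1:-1]) | set(links)
        if elements & used:
            continue  # shares an intermediate node or link: not disjoint
        used |= elements
        chosen.append(prod(r_node[n] for n in path[1:-1])
                      * prod(r_link[e] for e in links))
    return 1.0 - prod(1.0 - r for r in chosen)


# Hypothetical 4-node network with two disjoint 2-hop routes from s to t.
adj = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"], "t": ["a", "b"]}
r_node = {"s": 0.95, "a": 0.9, "b": 0.8, "t": 0.95}
r_link = {frozenset(e): 0.99 for e in [("s", "a"), ("a", "t"),
                                       ("s", "b"), ("b", "t")]}
print(network_aware_reliability(adj, r_node, r_link, "s", "t", max_hops=2))
```

With `max_hops=1` the same call returns 0.0, since no direct link between `s` and `t` exists in this example.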

After obtaining the network-aware reliability matrix, we can analyze the reliability of the microservice dependency graph and the reliability gain brought by backup microservice instances. We divide the analysis of network-aware service reliability into the following four steps.

First, analyze the reliability between a single parent instance and a single child instance. We use $\tau_m^f$ to represent the $f$-th parent microservice of microservice $m$. For each child microservice instance, its parent microservice instance is connected to it through a microservice link placed on one or more paths. We can therefore treat the reliability of the microservice link from the $b_1$-th instance of the child microservice to the $b_2$-th instance of the $f$-th parent microservice, together with the network-aware reliability between its two ends, as a whole. We call this whole the effective probability of microservice link $\kappa^{(k)}_{m(b_1),\tau^f_m(b_2)}$ and denote it as follows:

$$\kappa^{(k)}_{m(b_1),\tau^f_m(b_2)} = r_{l_{m(b_1)\tau^f_m(b_2)}} \, r^{(k)}_{n^{m(b_1)} n^{\tau^f_m(b_2)}}, \tag{21}$$

where $r_{l_{m(b_1)\tau^f_m(b_2)}}$ represents the software reliability of the microservice link and $r^{(k)}_{n^{m(b_1)} n^{\tau^f_m(b_2)}}$ represents the network-aware reliability between the nodes where the parent and child microservices are placed. In addition, since the virtual microservice $m_0$ used to represent the access location does not have a parent microservice, we let $\kappa_{m_0} = 1$.

Second, analyze the reliability of a single microservice instance and all its parent microservice links. To achieve this, understanding the impact of the node reliability on the effective probability of microservice links is essential. Node reliability may be reused by the network-aware reliability of multiple microservice instances and links, leading to spurious reliability. Therefore, node reliability should be considered only once in the calculation. However, backup instances introduce a new problem: node availability may be a non-essential condition for service availability. To address this issue, we focus on nodes whose failure would inevitably lead to service failure and call them critical nodes. We denote the set of critical nodes by $N^i_c$, which can be represented as follows:

$$N^i_c = \Big\{ n' \,\Big|\, n' = n,\ \exists m \in M^i,\ n \prod_{b \in B^i_m} x^{m(b)}_n \neq 0,\ n \in N \Big\}, \tag{22}$$

where $B^i_m$ denotes the set of backup instance indexes of microservice $m$.
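The condition in Eq. (22) can be read as follows: a node is critical when some microservice has all of its instances placed on that node, so the product of the placement variables over its instances is nonzero. A minimal Python sketch under that reading is given below; the input format (microservice name mapped to the list of nodes hosting its instances) and the names are assumptions for illustration.

```python
def critical_nodes(instance_placement):
    """Return the set of critical nodes in the sense of Eq. (22).

    instance_placement -- dict: microservice -> list of nodes, one entry per
    instance (primary and backups). A node is critical if at least one
    microservice has every instance on it, so its failure removes that
    microservice entirely.
    """
    critical = set()
    for hosts in instance_placement.values():
        if hosts and len(set(hosts)) == 1:
            critical.add(hosts[0])
    return critical


# Hypothetical service: "auth" has two instances on the same node,
# "cart" is spread over two nodes, "db" has a single instance.
placement = {"auth": ["n1", "n1"], "cart": ["n2", "n3"], "db": ["n3"]}
print(sorted(critical_nodes(placement)))
```

Here n1 and n3 are critical, while n2 is not, because the second instance of "cart" on n3 can still serve requests if n2 fails.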

Now, we need to correct the network-aware reliability of the microservice link paths with the set of critical nodes. We denote the total path reliability $\hat{r}_P$ corrected by the set of critical nodes as follows:

$$\hat{r}_P = \prod_{n \in p \backslash (p \cap N^i_c)} r_n \prod_{e \in p} r_e. \tag{23}$$

Network-aware reliability can be corrected as follows:

$$\hat{r}^{(k)}_{ij} = \begin{cases} \hat{r}_{\mathsf{P}'(\hat{\mathsf{r}}^{(k)}_{ij})}, & i \neq j \\ 1, & i = j \end{cases}. \tag{24}$$

The corrected effective probability of microservice link $\hat{\kappa}^{(k)}_{m(b_1),\tau^f_m(b_2)}$ can be expressed as follows:

$$\hat{\kappa}^{(k)}_{m(b_1),\tau^f_m(b_2)} = r_{l_{m(b_1)\tau^f_m(b_2)}} \, \hat{r}^{(k)}_{n^{m(b_1)} n^{\tau^f_m(b_2)}}. \tag{25}$$

Now we can denote the reliability of a single microservice instance $m(b)$ and all its parent microservice links by $\sigma^{(k)}_{m(b),n}$, which is as follows:

$$\sigma^{(k)}_{m(b),n} = r_{m(b)} \prod_{f \in F_{m(b)}} \Big( 1 - \prod_{b' \in B_\tau} \big( 1 - x^{m(b)}_n \hat{\kappa}^{(k)}_{m(b),\tau^f_m(b')} \big) \Big), \tag{26}$$

where $B_\tau$ denotes the set of backup instance indexes of microservice $\tau^f_m$ and $F_{m(b)}$ denotes the set of indexes of the parent microservice instances. For convenience, we omit the superscript $(k)$ of $\sigma^{(k)}_{m(b),n}$ in the subsequent analyses, which simply denotes the maximum path length in the network-aware reliability.

Third, analyze the reliability of all instances of a microservice and the links from their parents. We first use $\sigma_{m,n}$ to denote the reliability of all instances of microservice $m$ that are placed on the same node $n$, which can be expressed as follows:

$$\sigma_{m,n} = 1 - \prod_{b \in B_{m}} (1 - \sigma_{m(b),n}). \tag{27}$$
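
Eq. (27) is a standard parallel-redundancy expression: the group of co-located instances works if at least one of them works. A minimal sketch, with a function name chosen only for this example:

```python
from math import prod

def colocated_reliability(instance_sigmas):
    """Eq. (27): 1 minus the probability that every instance on the node fails."""
    return 1.0 - prod(1.0 - s for s in instance_sigmas)

single = colocated_reliability([0.9])       # one instance only
pair = colocated_reliability([0.9, 0.8])    # primary plus one backup
```

With one instance the expression reduces to that instance's own value; adding a second instance with reliability 0.8 raises the group value to 0.98.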

Now we can obtain the reliability of all instances of microservice $m$ on all nodes. We denote it by $\sigma_{m}$ as follows:

$$\sigma_{m} = 1 - \prod_{n \in N \backslash N^{i}_{c}} r_{n} (1 - \sigma_{m,n}) \prod_{n \in N^{i}_{c}} (1 - \sigma_{m,n}), \tag{28}$$

where we temporarily disregard the reliability of critical nodes to avoid their reuse.

Fourth, analyze the reliability of the entire microservice dependency graph, i.e., network-aware service reliability. We denote the service reliability, which consists of the reliability of all microservices and the reliability of critical nodes, by $r_{G^{i}}$ as shown below:

$$r_{G^{i}} = \prod_{n \in N^{i}_{c}} r_{n} \prod_{m \in M^{i}} \sigma_{m}. \tag{29}$$
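
Eq. (29) is a series composition: the service works only if every critical node and every microservice works. The sketch below assumes the $\sigma_{m}$ values and critical-node reliabilities are already available as plain lists; the function name is illustrative.

```python
from math import prod

def service_reliability(critical_node_rel, microservice_sigma):
    """Eq. (29): product over critical nodes times product over microservices."""
    return prod(critical_node_rel) * prod(microservice_sigma)

before = service_reliability([0.99], [0.98, 0.90])
# Suppose a backup raises the weaker microservice from 0.90 to 0.99.
after = service_reliability([0.99], [0.98, 0.99])
```

Because the terms multiply, the weakest microservice caps the overall value, so raising it yields the largest gain per backup. This matches the weakest-first backup selection that Algorithm 3 applies later.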

IV Problem Formulation

The microservice placement problem is defined as a mapping $\psi: G_{i} \rightarrow G$. Microservice placement involves two tasks: node placement and path selection. Node placement involves placing each microservice instance (including backup microservice instances) from a service request onto a single physical node in the infrastructure network, while path selection involves mapping the link between any two microservice instances to one or more consecutive physical links. After a service request expires, microservice placement is revoked and the occupied resources are released. In this paper, the primary objective of microservice placement is to maximize the service reliability of a single service request while meeting latency and resource constraints. Therefore, for a single service request $s$ arriving at the current time, we can formalize the microservice placement problem as an integer nonlinear programming problem and represent it as follows:

$$\mathcal{P}1: \max r_{G^{i}}, \tag{30}$$

s.t.

$$\sum_{s^{i} \in S} \sum_{l \in L^{i}} y^{l}_{e}\, bw(l) \leq BW(e), \quad \forall e \in E, \tag{31a}$$
$$\sum_{s^{i} \in S} \sum_{m \in M^{i}} x^{m}_{n}\, c(m) \leq C(n), \quad \forall n \in N, \tag{31b}$$
$$\sum_{n \in N} x^{m(b)}_{n} = 1, \quad \forall b \in B_{m},\ m \in M, \tag{31c}$$
$$d_{\tau m} \leq D_{\tau m}, \quad \forall m \in M, \tag{31d}$$
$$\sum_{m \in M} |B_{m}| \leq B + |M|, \tag{31e}$$
$$B < |M|, \tag{31f}$$
$$x^{m}_{n} \in \{0, 1\}, \quad \forall m \in M,\ n \in N, \tag{31g}$$
$$y^{l}_{e} \in \{0, 1\}, \quad \forall l \in L,\ e \in E, \tag{31h}$$
$$c(n), c(m), C(n) \geq 0, \quad \forall m \in M,\ n \in N, \tag{31i}$$
$$bw(l), bw(e), BW(e) \geq 0, \quad \forall l \in L,\ e \in E, \tag{31j}$$

where constraints (31a)-(31b) ensure that the resource requirements of service requests do not exceed the resource limits of physical nodes and links in the infrastructure network, constraint (31c) ensures that each microservice instance is placed on only one physical node, constraint (31d) ensures that the latency of each microservice link does not exceed its latency requirements, constraint (31e) ensures that the number of backup microservices does not exceed the backup limit of the service request, constraint (31f) ensures that the backup limit does not exceed the number of microservices in the microservice dependency graph, and constraints (31g)-(31j) specify the value ranges of variables and resources.
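
To make the resource constraints concrete, the following sketch checks (31a) and (31b) for a candidate placement. All names and the dictionary layout are assumptions for illustration. Representing the placement as a mapping from instance to node satisfies (31c) by construction, and latency (31d) is left out for brevity.

```python
from collections import defaultdict

def violated_constraints(node_of, path_of, cpu, bw, cpu_cap, bw_cap):
    """Return the (constraint, resource) pairs that a placement violates.

    node_of -- instance -> physical node (one node per instance, cf. 31c)
    path_of -- microservice link -> list of physical edges it is mapped to
    cpu, bw -- demands c(m) and bw(l); cpu_cap, bw_cap -- limits C(n) and BW(e)
    """
    used_cpu = defaultdict(float)
    for inst, node in node_of.items():
        used_cpu[node] += cpu[inst]
    used_bw = defaultdict(float)
    for link, edges in path_of.items():
        for edge in edges:
            used_bw[edge] += bw[link]
    issues = [("31a", e) for e, used in used_bw.items() if used > bw_cap[e]]
    issues += [("31b", n) for n, used in used_cpu.items() if used > cpu_cap[n]]
    return issues

node_of = {"m1": "A", "m2": "A", "m2_backup": "B"}
path_of = {("m1", "m2"): [], ("m1", "m2_backup"): [("A", "B")]}
cpu = {"m1": 2, "m2": 2, "m2_backup": 2}
bw = {("m1", "m2"): 5, ("m1", "m2_backup"): 5}
bw_cap = {("A", "B"): 10}

feasible = violated_constraints(node_of, path_of, cpu, bw, {"A": 4, "B": 4}, bw_cap)
overloaded = violated_constraints(node_of, path_of, cpu, bw, {"A": 3, "B": 4}, bw_cap)
```

In the first call node A is used exactly up to its limit and nothing is reported; lowering its capacity to 3 makes the same placement violate (31b) on node A.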

In Section III-B, we mentioned that backup microservice instances consume computing resources on physical nodes and bandwidth resources on physical links. However, since backup microservice instances mostly remain inactive (becoming active only when the primary microservice instance fails), providing dedicated bandwidth protection for them is not always necessary. Therefore, we consider the concept of the shared backup path, which allows backup microservice instances to share network bandwidth resources to reduce bandwidth consumption. However, the shared backup path mechanism creates a new problem: how to prevent multiple backup instances from becoming active at the same time and contending for network bandwidth, which can lead to service request failures. To solve this problem, we first introduce an upper limit, denoted as $\hat{BW}(e)$, on the shared bandwidth capacity. The upper limit is defined as $\omega$ times the protected bandwidth limit $BW(e)$:

$$\hat{BW}(e) = \omega BW(e), \tag{32}$$

where $\omega \geq 0$. Then we modify constraint (31a) of problem $\mathcal{P}1$ and propose problem $\mathcal{P}2$, which aims to maximize service reliability with the shared backup path mechanism. Problem $\mathcal{P}2$ is as follows:

$$\mathcal{P}2: \max r_{G^{i}}, \tag{33}$$

s.t.

$$\sum_{s^{i} \in S} \sum_{l \in L^{i}_{1}} y^{l}_{e}\, bw(l) \leq BW(e), \quad \forall e \in E, \tag{34a}$$
$$\sum_{s^{i} \in S} \sum_{l \in L^{i} \backslash L^{i}_{1}} y^{l}_{e}\, bw(l) \leq \hat{BW}(e), \quad \forall e \in E, \tag{34b}$$
$$\text{(31b)-(31j)}, \tag{34c}$$

where $L^{i}_{1}$ represents the set of links between primary microservices. We denote it as follows:

$$L^{i}_{1} = \{\, l \mid l = l_{m(1)m'(1)},\ m, m' \in M^{i} \,\}. \tag{35}$$

Constraint (34a) ensures that protected bandwidth consumption does not exceed the protected bandwidth limit, while constraint (34b) guarantees that the shared bandwidth consumption does not exceed $\omega$ times the protected bandwidth limit.
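
The split between protected and shared bandwidth can be sketched per physical link as follows. The function name and the flat lists of demands are assumptions for this example; primary links are counted against $BW(e)$ and all remaining (backup) links against $\hat{BW}(e) = \omega BW(e)$.

```python
def edge_is_feasible(primary_demands, backup_demands, protected_cap, omega):
    """Check constraints (34a) and (34b) on a single physical link e."""
    shared_cap = omega * protected_cap                      # Eq. (32)
    protected_ok = sum(primary_demands) <= protected_cap    # (34a)
    shared_ok = sum(backup_demands) <= shared_cap           # (34b)
    return protected_ok and shared_ok

ok = edge_is_feasible([40, 30], [20, 20], protected_cap=100, omega=0.5)
too_much_backup = edge_is_feasible([40, 30], [30, 30], protected_cap=100, omega=0.5)
```

With $\omega = 0.5$ the link accepts up to 50 units of backup demand, so the first call passes and the second (60 units of backup demand) fails on (34b). Setting $\omega = 0$ admits no backup link on the edge at all.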

V Proposed SRP Algorithm

In this section, we propose a service reliability-aware placement (SRP) algorithm, a heuristic for solving problem $\mathcal{P}1$. The SRP algorithm takes the network state and the service request as input and outputs a microservice placement strategy that includes microservice instance placement and backup object selection.

V-A Algorithm Description

The main process of the SRP algorithm is shown in Algorithm 1. When a service request arrives, SRP initiates a breadth-first search starting from the root microservice $m_{1}$ and adds the microservices in the microservice dependency graph $G^{i}$ to the placement queue $Q$ (line 3). In line 4, we initialize the backtracking counter $\epsilon$, which has a predefined upper limit $\Delta$, and the node blacklist $N^{m}_{bl}$ for each microservice. The loop from line 5 to line 17 ensures that each microservice of the service request is placed. In line 6, SRP extracts the first unplaced microservice $m$ from queue $Q$. SRP then calls Algorithm 2 to place the microservice $m$ and obtains the placement result, which indicates whether the placement succeeded or failed. Lines 8-16 deal with the case of microservice placement failure. If the microservice is not the root microservice and the number of backtracks has not exceeded the limit, SRP cancels the placement of all parents of microservice $m$ and their children. It also adds the node where the parent was placed to the node blacklist of the parent and then proceeds to the next iteration. Otherwise, SRP returns a placement failure (line 14). In line 18, we call Algorithm 3 for backup object selection and backup instance placement. Finally, in line 19, the algorithm returns the placement success message.

Algorithm 2 describes the microservice placement process. It traverses the nodes in the candidate node set (lines 4-26). In lines 5-7, if the node is in the blacklist of microservice $m$ or has insufficient resources, the algorithm starts the next iteration directly; otherwise, the algorithm calculates $\sigma_{m}$ for microservice $m$ in lines 8-16. Specifically, we use $n_{f_{b}}$ to denote the placement node of the parent microservice instance $\tau^{f}_{m}(b)$. In line 11, the algorithm searches for the set of IDPs between nodes $n_{j}$ and $n_{f_{b}}$ that meet the constraints. Line 12 calculates the total path reliability of the path set $P_{jf_{b}}$. Lines 13-16 calculate the reliability of $m$. Since node reliability is load-dependent, line 17 calculates the reliability of node $n_{j}$ after placing microservice $m$, and lines 18-22 consider the effect of the critical node set on service reliability. Lines 23-25 track the node with the highest total reliability. Line 27 places the microservice $m$ on the node with the highest total reliability and places all links connected to it on the path. At the same time, the algorithm records the current $\sigma_{m}$ for subsequent selection of backup objects. Since the computation of the current $\sigma_{m}$ is performed simultaneously with the placement of the microservice links, no additional time complexity is added. Finally, lines 28-32 return the placement result.

Algorithm 3 outlines the strategy for selecting backup objects. After initializing the backup counter $b$ and the set of backup objects $BM^{i}$ in line 1, the algorithm enters a loop in lines 2-16, which continues while the number of backup instances is below the limit and the set of backup objects is not empty. In lines 4-9, the algorithm iterates over $BM^{i}$ to obtain the microservice $m_{min}$ with the smallest $\sigma_{m}$ value. Since the instantaneous $\sigma_{m}$ obtained when placing the microservice increases with the number of instances, the latest $\sigma_{m}$ needs to be obtained in line 5. The algorithm then calls Algorithm 2 in line 10 to place the microservice $m_{min}$. In lines 11-15, if the placement fails, the microservice is removed from the set of backup objects and the next iteration starts; otherwise, the backup counter is increased in line 15.

Algorithm 1 Service Reliability-aware Placement (SRP) Algorithm
1:  Input edge network $G$, service request $s^{i}$.
2:  Output placement result of $G^{i}$ with backups.
3:  Add microservices from $M^{i} \backslash \{m^{i}_{0}\}$ to the placement queue $Q$ starting from $m_{1}$ through breadth-first search.
4:  $\epsilon \leftarrow 0$, $N^{m}_{bl} \leftarrow \emptyset, m \in Q$.
5:  while unplaced microservices in $Q$ exist do
6:     Obtain the first unplaced microservice $m$ from $Q$.
7:     $res \leftarrow$ place $m$ using Alg. 2.
8:     if $res =$ false then
9:        if $m \neq m_{1}$ and $\epsilon < \Delta$ then
10:           Undo the placement of all the parents of microservice $m$ and all their children.
11:           $\epsilon \leftarrow \epsilon + 1$, $N^{\tau_{m}}_{bl} \leftarrow N^{\tau_{m}}_{bl} \cup \{n^{\tau_{m}}\}$
12:           continue
13:        else
14:           return  false
15:        end if
16:     end if
17:  end while
18:  Backup Placement Process (Alg. 3).
19:  return true
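
The queue construction in line 3 of Algorithm 1 can be sketched as a plain breadth-first traversal of the dependency graph. The adjacency-dictionary representation and the function name are assumptions for this example; a microservice with several parents is enqueued only once.

```python
from collections import deque

def placement_queue(children, root):
    """Return microservices in breadth-first order starting from the root."""
    order = []
    seen = {root}
    frontier = deque([root])
    while frontier:
        m = frontier.popleft()
        order.append(m)
        for child in children.get(m, []):
            if child not in seen:       # a node with two parents is queued once
                seen.add(child)
                frontier.append(child)
    return order

children = {"m1": ["m2", "m3"], "m2": ["m4"], "m3": ["m4"]}
queue = placement_queue(children, "m1")
```

Breadth-first order guarantees that every parent of a microservice appears in the queue before the microservice itself, which Algorithm 2 relies on when it looks up the nodes of already placed parent instances.
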
Algorithm 2 Microservice Placement Process
1:  Input $G$, $s^{i}$, $m$, $N^{m}_{bl}$.
2:  Output Placement result of microservice $m$.
3:  $n_{max} \leftarrow null$, $\mathsf{r}_{max} \leftarrow 0$.
4:  for $n_{j} \in N$ do
5:     if $n_{j} \in N^{m}_{bl}$ or $n_{j}$ is under-resourced then
6:        continue
7:     end if
8:     $\mathsf{r}_{b} \leftarrow 0$, $\mathsf{r}_{f} \leftarrow 1$.
9:     for $f \in F_{m}$ do
10:        for $b \in B_{\tau^{f}_{m}}$ do
11:           Denote the node of $\tau^{f}_{m}(b)$ as $n_{f_{b}}$ and get the set of paths $P_{jf_{b}}$ that meet the constraints.
12:           Calculate the total path reliability $r_{P_{jf_{b}}}$.
13:           $\mathsf{r}_{b} \leftarrow \mathsf{r}_{b} \oplus r_{P_{jf_{b}}}$.
14:        end for
15:        $\mathsf{r}_{f} \leftarrow \mathsf{r}_{f} \mathsf{r}_{b}$.
16:     end for
17:     Calculate the $r'_{n_{j}}$ after placing $m$ to $n_{j}$.
18:     if $n_{j} \in N_{c}$ then
19:        $\mathsf{r}_{f} \leftarrow \mathsf{r}_{f} \frac{r'_{n_{j}}}{r_{n_{j}}}$.
20:     else
21:        $\mathsf{r}_{f} \leftarrow \mathsf{r}_{f} r'_{n_{j}}$.
22:     end if
23:     if $\mathsf{r}_{max} < \mathsf{r}_{f}$ then
24:        $\mathsf{r}_{max} \leftarrow \mathsf{r}_{f}$, $n_{max} \leftarrow n_{j}$.
25:     end if
26:  end for
27:  Place $m$ to $n_{max}$ while recording the current $\sigma_{m}$.
28:  if placement succeeds then
29:     return true
30:  else
31:     return false
32:  end if
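
The scoring part of Algorithm 2 (lines 8-25) can be sketched as below. Several points are assumptions of this sketch rather than statements from the paper: the operator $\oplus$ is read as the parallel combination $a \oplus b = 1 - (1-a)(1-b)$, $\mathsf{r}_{b}$ is reset for every parent so that the result mirrors the per-parent product in Eq. (26), and the path reliabilities and node reliabilities are passed in as precomputed numbers.

```python
def oplus(a, b):
    """Assumed meaning of the paper's ⊕: at least one of two options works."""
    return 1.0 - (1.0 - a) * (1.0 - b)

def candidate_score(path_rel_by_parent, r_after, r_before, is_critical):
    """Score one candidate node n_j (cf. lines 8-22 of Algorithm 2).

    path_rel_by_parent -- one list per parent f with the total path
                          reliability towards each instance of that parent
    r_after, r_before  -- node reliability with and without m placed on n_j
    """
    r_f = 1.0
    for path_rels in path_rel_by_parent:
        r_b = 0.0                       # reset per parent (assumption, see lead-in)
        for r_path in path_rels:
            r_b = oplus(r_b, r_path)
        r_f *= r_b
    # Critical nodes are already counted once, so only the change is applied.
    r_f *= r_after / r_before if is_critical else r_after
    return r_f

def best_node(candidates):
    """Keep the candidate with the highest score (cf. lines 23-25)."""
    best, best_score = None, 0.0
    for node, args in candidates.items():
        score = candidate_score(*args)
        if best_score < score:
            best, best_score = node, score
    return best, best_score

candidates = {
    "A": ([[0.9, 0.8]], 0.97, 0.99, False),   # parent has a primary and a backup
    "B": ([[0.95]], 0.90, 0.99, False),       # single path, heavily loaded node
}
winner, score = best_node(candidates)
```

Node A wins here because the two parent instances together give a link factor of 0.98 and the node itself stays at 0.97, whereas node B loses more through its lower post-placement node reliability.
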
Algorithm 3 Backup Placement Process
1:  $b \leftarrow 0$, $BM^{i} \leftarrow M^{i} \backslash \{m^{i}_{0}\}$.
2:  while $b < B^{i}$ and $BM^{i} \neq \emptyset$ do
3:     $m_{min} \leftarrow null$, $r_{min} \leftarrow 1$.
4:     for $m \in BM^{i}$ do
5:        Obtain the $\sigma_{m}$ of the last instance of $m$.
6:        if $r_{min} > \sigma_{m}$ then
7:           $r_{min} \leftarrow \sigma_{m}$, $m_{min} \leftarrow m$.
8:        end if
9:     end for
10:     $res \leftarrow$ place $m_{min}$ using Alg. 2.
11:     if $res =$ false then
12:        $BM^{i} \leftarrow BM^{i} \backslash \{m_{min}\}$.
13:        continue
14:     end if
15:     $b \leftarrow b + 1$.
16:  end while
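
The weakest-first loop of Algorithm 3 can be sketched as follows. The callback `try_place` is a hypothetical stand-in for Algorithm 2: it is assumed to return the updated $\sigma_{m}$ on success and `None` on failure. The toy update rule in the example (a second, identical instance) is likewise only an assumption used to produce numbers.

```python
def select_backups(sigma, backup_limit, try_place):
    """Greedily add backups to the microservice with the lowest sigma_m.

    sigma        -- dict microservice -> current sigma_m (updated in place)
    backup_limit -- maximum number of backup instances (B^i in Algorithm 3)
    try_place    -- hypothetical stand-in for Alg. 2: new sigma_m or None
    """
    candidates = list(sigma)
    placed = []
    while len(placed) < backup_limit and candidates:
        weakest = min(candidates, key=lambda m: sigma[m])
        new_sigma = try_place(weakest)
        if new_sigma is None:
            candidates.remove(weakest)   # cannot be backed up, drop it
            continue
        sigma[weakest] = new_sigma       # latest value is used in the next round
        placed.append(weakest)
    return placed

sigma = {"m1": 0.95, "m2": 0.90, "m3": 0.99}

def try_place(m):
    if m == "m2":                        # pretend no node can host a backup of m2
        return None
    return 1.0 - (1.0 - sigma[m]) ** 2   # toy rule: one more identical instance

backups = select_backups(sigma, backup_limit=2, try_place=try_place)
```

In this run `m2` is tried first and dropped, `m1` then receives a backup and rises above `m3`, so the second backup goes to `m3`. Re-reading $\sigma_{m}$ in every round is what makes the second choice differ from a one-shot ranking.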

V-B Complexity Analysis

The complexity analysis starts with the time complexity of Algorithm 2 and Algorithm 3. First, we use the IDP-searching algorithm based on the Edmonds-Karp algorithm [37], which has a time complexity of $O(|N||E|^{2})$, to obtain the set of paths between two nodes. The complexity of Algorithm 2 is therefore $O(|N|^{2}|E|^{2}|M|)$. In Algorithm 3, the loop in lines 2 to 16 is executed at most $|M|$ times. Since the complexity of each iteration is $O(|M| + |N|^{2}|E|^{2}|M|) = O(|N|^{2}|E|^{2}|M|)$, the total complexity of Algorithm 3 is $O(|N|^{2}|E|^{2}|M|^{2})$. Looking back at Algorithm 1, we can see that its complexity is the same as that of Algorithm 3. Therefore, we conclude that the complexity of Algorithm 1 is $O(|N|^{2}|E|^{2}|M|^{2})$.
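
One way to read these bounds is to multiply the loop counts factor by factor, as in the block below. The grouping is our interpretation rather than a derivation given in the paper; in particular, the $O(|M|)$ bound on the number of parent instances is an assumption, motivated by constraints (31e)-(31f), which limit the total number of instances to fewer than $2|M|$.

```latex
\begin{aligned}
T_{\text{Alg.\,2}} &= \underbrace{|N|}_{\text{candidate nodes}} \cdot
                     \underbrace{O(|M|)}_{\text{parent instances}} \cdot
                     \underbrace{O(|N||E|^{2})}_{\text{IDP search}}
                   = O(|N|^{2}|E|^{2}|M|), \\
T_{\text{Alg.\,3}} &= \underbrace{|M|}_{\text{iterations}} \cdot
                     \bigl( \underbrace{O(|M|)}_{\text{lines 4-9}} + T_{\text{Alg.\,2}} \bigr)
                   = O(|N|^{2}|E|^{2}|M|^{2}).
\end{aligned}
```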

VI Proposed SPRC Algorithm

In this section, we propose a new heuristic algorithm based on the SRP algorithm to solve problem $\mathcal{P}2$ by using the shared path reliability computation (SPRC) algorithm. When selecting placement nodes for backup instances, in addition to considering the adequacy of the remaining shared bandwidth on the physical link, we also need to consider the shared backup path contention problem caused by simultaneous failures. Specifically, if other backup links have already been activated (i.e., switched from occupying virtual bandwidth to occupying protected bandwidth) when the current backup link needs to be activated for failure tolerance, the protected bandwidth may be insufficient, causing the current backup link to fail. Therefore, we must consider the reliability of the primary instance of each backup link placed on the physical link, which determines the distribution of the occupied protected bandwidth. For a more comprehensive assessment of network-aware service reliability, we propose replacing lines 9-16 in Algorithm 2 with SPRC. In the following, we refer to the SRP algorithm using SPRC as the SRP-S algorithm.

VI-A Algorithm Description

Algorithm 4 modifies lines 9-16 of Algorithm 2. In lines 6-11 of Algorithm 4, we traverse the set $M_p$, which represents the set of all backup microservice instances that have links on the path $p$. Line 7 obtains the probability of inactivity of the backup microservice instance $m'$, which is the reliability of all instances with backup indexes less than the backup index of $m'$. In lines 8-10, we evaluate whether the protected bandwidth on the physical link is sufficient for the microservice links belonging to both $m'$ and $m$. If the bandwidth is insufficient, activating $m'$ may cause the activation of $m$ to fail. In fact, each backup instance in $M'$ may be inactive or active, so there are $2^{|M'|}$ possible events, of which the single most probable one is the event in which every backup instance is inactive. Because of running-time constraints and the low probability of multiple simultaneous failures, we ignore the simultaneous failure of two or more backup instances in this algorithm and only consider the $|M'|$ individual failure events.
Finally, in line 12, we multiply the backup path inactivation probability by the original path reliability to produce the new path reliability, and in line 14 we calculate $\hat{r}_{P_{jf_{b}}}$ with the new path reliability. Subsequent lines 15-17 are consistent with Algorithm 2.
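As a rough check on why ignoring two or more simultaneous activations is a mild approximation, the sketch below compares the probability mass of the events that are kept (all inactive, or exactly one active) with the mass that is dropped. It assumes, purely for illustration, that backup instances activate independently and uses made-up inactivity probabilities; neither assumption comes from the paper.

```python
from itertools import product
from math import prod


def event_mass(inactive_probs):
    """Return (P[all inactive], P[exactly one active], P[two or more active]).

    Assumes independent activations (illustrative assumption).
    """
    all_inactive = prod(inactive_probs)
    exactly_one = 0.0
    for i, s in enumerate(inactive_probs):
        others = prod(p for j, p in enumerate(inactive_probs) if j != i)
        exactly_one += (1.0 - s) * others
    return all_inactive, exactly_one, 1.0 - all_inactive - exactly_one


def multi_activation_mass(inactive_probs):
    """Brute-force the same dropped mass over all 2^k activation patterns."""
    total = 0.0
    for pattern in product([0, 1], repeat=len(inactive_probs)):  # 1 = active
        if sum(pattern) >= 2:
            total += prod((1.0 - s) if a else s
                          for s, a in zip(inactive_probs, pattern))
    return total
```

With three backups that each stay inactive with probability 0.999, the dropped mass is on the order of $10^{-6}$, while the brute-force enumeration grows as $2^{k}$ in the number of backups.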

VI-B Complexity Analysis

The additional complexity introduced by Algorithm 4 relative to Algorithm 2 is mainly in the loops in lines 6-11. Combined with the complexity of the path-searching algorithm, we can obtain the time complexity of Algorithm 4 as $O(|M|(|N||E|^{2}+|M|))$. Thus the time complexity of the SRP-S algorithm is $O(|M|(|M|+|N||M|(|N||E|^{2}+|M|)))=O(|N|^{2}|E|^{2}|M|^{2}+|N||M|^{3})$.
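The simplification in the last expression follows from expanding the product and dropping the dominated term, as the short derivation below spells out. It uses only the symbols already present in the bound.

```latex
\begin{aligned}
|M|\bigl(|M| + |N||M|(|N||E|^{2} + |M|)\bigr)
  &= |M|^{2} + |N|^{2}|E|^{2}|M|^{2} + |N||M|^{3}, \\
O\bigl(|M|^{2} + |N|^{2}|E|^{2}|M|^{2} + |N||M|^{3}\bigr)
  &= O\bigl(|N|^{2}|E|^{2}|M|^{2} + |N||M|^{3}\bigr),
\end{aligned}
```

where the second line holds because $|M|^{2} \le |N|^{2}|E|^{2}|M|^{2}$ whenever $|N| \ge 1$ and $|E| \ge 1$.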

Algorithm 4 Shared Path Reliability Calculation (SPRC) Algorithm
1:  for $f \in F_{m}$ do
2:     for $b \in B_{\tau^{f}_{m}}$ do
3:        Denote the node of $\tau^{f}_{m}(b)$ as $n_{f_{b}}$ and get the set of paths $P_{jf_{b}}$ that meet the constraints.
4:        for $p \in P_{jf_{b}}$ do
5:           $Pr_{p} \leftarrow 0$.
6:           for $m' \in M_{p}$ do
7:              Obtain the inactivation probability $\sigma'_{m'}$ for $m'$.
8:              if $\exists e \in p \cap \{e \mid y^{l_{m'\tau}}_{e} = 1\}, BW(e) < bw(e) + bw(l_{m'\tau}) + bw(l_{m\tau^{f}_{m}(b)})$ then
9:                 $Pr_{p} \leftarrow Pr_{p} + \sigma'_{m'}$.
10:              end if
11:           end for
12:           $\hat{r}_{p} \leftarrow Pr_{p} r_{p}$.
13:        end for
14:        Calculate $\hat{r}_{P_{jf_{b}}}$ using modified path reliability.
15:        $\mathsf{r}_{b} \leftarrow \mathsf{r}_{b} \oplus \hat{r}_{P_{jf_{b}}}$.
16:     end for
17:     $\mathsf{r}_{f} \leftarrow \mathsf{r}_{f} \mathsf{r}_{b}$.
18:  end for
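The bandwidth test in line 8 of Algorithm 4 can be read as a simple predicate over the edges shared by the candidate path and another backup link. The sketch below illustrates only that test. It assumes that $bw(e)$ denotes the bandwidth already occupied on edge $e$, and the function name, argument names, and dictionary-based inputs are all hypothetical.

```python
def contends(path_edges, other_backup_edges, capacity, used, bw_other, bw_new):
    """Illustrative reading of line 8 of Algorithm 4.

    Returns True if some edge shared by the candidate path p and the link of
    another backup instance m' cannot carry the current load plus both
    backup links at once, i.e. BW(e) < bw(e) + bw(l_m') + bw(l_m).

    Assumption: `used[e]` plays the role of bw(e), the occupied bandwidth.
    """
    shared = set(path_edges) & set(other_backup_edges)
    return any(capacity[e] < used[e] + bw_other + bw_new for e in shared)
```

Edges that the two links do not share never trigger the condition, regardless of how loaded they are.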

VII Performance Evaluation

We validate our modeling and algorithmic work through extensive simulations. Our simulation code can be accessed online [38].

VII-A Simulation Setting

We first use the Erdős-Rényi model [39] to create an infrastructure network topology with 50 nodes and an edge creation probability of 0.2. We then select the one-fifth of the nodes with the smallest degrees as access nodes for receiving service requests. For each microservice, microservice link, and physical link, we set the failure arrival rate to $0.00001$ per time unit, ensuring that their reliability meets the "five nines" reliability level at the initial moment. For dynamic node reliability, we set the reliability $r_{L}$ between $0.9999$ and $0.99999$ for low load and $r_{H}$ between $0.999$ and $0.9999$ for high load. Unless otherwise stated, the backup number limit is set equal to the number of primary microservices (full backup). The other parameters are listed in Table I. In all simulations, $100$ service requests reach the access nodes according to a Poisson distribution with parameter $1$, implying that the average arrival rate of service requests is $1$ per time unit. Additionally, all simulations are performed $100$ times and the results are averaged.
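The topology setup described above can be sketched with the standard library alone. This is not the authors' simulation code [38]; the helper names are hypothetical, ties between equal-degree nodes are broken by node index (an arbitrary choice), and the reliability helper assumes an exponential failure model, which is an assumption made here for illustration.

```python
import math
import random


def erdos_renyi(n, p, rng):
    """Undirected G(n, p) graph as an adjacency dict of sets."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def pick_access_nodes(adj):
    """Pick the one-fifth of nodes with the smallest degrees."""
    k = len(adj) // 5
    ranked = sorted(adj, key=lambda v: (len(adj[v]), v))
    return ranked[:k]


def reliability(rate, t):
    """Survival probability after time t under an exponential failure model."""
    return math.exp(-rate * t)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
topology = erdos_renyi(50, 0.2, rng)
access_nodes = pick_access_nodes(topology)
```

Under the exponential assumption, a failure arrival rate of $0.00001$ per time unit gives a reliability of about $0.99999$ after one time unit, which matches the "five nines" level quoted in the text.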

TABLE I: Parameters
Element   Parameter          Range
$n$       $C(n)$             $[8,16]$ cores
$n$       $\xi(n)$           $0.5$
$e$       $BW(e)$            $[100,1000]$ MBps
$e$       $d_{e}$            $[1,10]$ ms
$e$       $\omega$           $1$
$s$       $\Omega$           $[1,100]$ units
$s$       $|M|$              $[1,5]$
$m$       $c(m)$             $[0.1,1]$ cores
$m$       $\upsilon_{\tau m}$  $[0.1,5]$ MB
$m$       $d_{m}$            $[10,50]$ ms
$l$       $bw(l)$            $[0.1,10]$ MBps
$l$       $D_{l}$            $[0.03,50.15]$ s

VII-B Benchmark

In our simulations, four algorithms are used for comparison with the SRP and SRP-S algorithms. They are described in detail as follows:

  1. Delay-efficient and Availability-aware Placement (DAIP) [23]

    The DAIP algorithm considers backup instances. It focuses on the reliability of the nodes when placing services and chooses the path with the least latency when placing links. In addition, it adopts a round-robin strategy for backup object selection. Although the DAIP algorithm models the reliability gain from backup instances in detail, it does not, in contrast to our work, account for dynamic network-aware reliability or for hardware-software reliability decoupling.

  2. Reliable Redundant Services Placement (RRSP) [20]

    The RRSP algorithm considers only the total node reliability after placement and does not consider path selection. Unlike DAIP, RRSP does not focus on the placement of each instance but directly generates a certain number of solutions and selects the best one among them. In addition, it does not specify how the backup objects are selected, so we generate backup instances for the primary instances in descending order of the degrees of the nodes in the dependency graph. We use this benchmark to show the performance of an algorithm that considers only node reliability.

  3. Greedy Placement (Grd)

    The greedy algorithm is a classical heuristic algorithm. It selects the node with the highest reliability for each microservice instance and chooses the shortest path for each microservice link. To emphasize the effect of backups on reliability improvement, the basic version of the greedy algorithm does not place backup instances.

  4. Greedy Placement with Backup (Grd-B)

    Greedy Placement with Backup is an advanced version of Greedy Placement in which the round-robin policy is used for backup object selection. The instance placement and path selection of this algorithm are consistent with Greedy Placement.
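The greedy node choice and the round-robin backup selection used by the Grd and Grd-B benchmarks can be sketched as follows. This is a simplified illustration rather than the benchmark implementations: node reliabilities are treated as static numbers, the CPU capacity check is an added assumption, shortest-path link placement is omitted, and all names are hypothetical.

```python
from itertools import cycle, islice


def greedy_nodes(instances, node_reliability, free_cores, demand):
    """Place each instance on the most reliable node that still has capacity.

    Returns a dict instance -> node, or None if some instance cannot be placed.
    """
    free = dict(free_cores)  # work on a copy so the input is not mutated
    placement = {}
    for m in instances:
        feasible = [n for n in free if free[n] >= demand[m]]
        if not feasible:
            return None
        best = max(feasible, key=lambda n: node_reliability[n])
        placement[m] = best
        free[best] -= demand[m]
    return placement


def round_robin_backups(instances, budget):
    """Assign `budget` backups by cycling through the instances in order."""
    return list(islice(cycle(instances), budget))
```

In this sketch every instance lands on the most reliable node until its cores run out, which is the per-instance behaviour the benchmark descriptions attribute to the greedy strategy.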

VII-C Validation of Network-Aware Service Reliability Model

Figure 2: Reliability performance with the fully protected path mechanism. (a) Service request distribution for different service reliability. (b) Number of service failures over time.
Figure 3: Reliability performance with the shared backup path mechanism. (a) Service request distribution for different service reliability. (b) Number of service failures over time.

In this subsection, we verify the effectiveness of the network-aware service reliability model.

Fig. 2(a) and Fig. 2(b) show the network-aware service reliability evaluation and the number of service failures for different placement algorithms with the fully protected path mechanism, respectively. From Fig. 2(a) and Fig. 2(b) we can see that the more service requests an algorithm places with high network-aware service reliability, the fewer failures occur. This result shows that the evaluation given by the proposed reliability model is generally consistent with the observed number of failures.

Fig. 3 shows the reliability performance with the shared backup path mechanism. As seen in Fig. 3(a) and Fig. 3(b), the proposed network-aware service reliability model remains effective for evaluating service reliability with the shared backup path mechanism. This is because, when bandwidth resources are not strained, shared backup paths do not need to account for path reliability changes due to backup path contention. However, when bandwidth resources are extremely scarce, the network-aware service reliability becomes inaccurate because of backup path contention.

VII-D Validation of the SRP Algorithm

In this subsection, we validate the performance of the SRP algorithm by executing a series of simulations with the fully protected path mechanism. Although we verified the effectiveness of the proposed model in Sec. VII-C, the number of service failures remains the more accurate measure of algorithm performance. This is because once different algorithms apply different placement strategies, the resource conditions of the network gradually diverge, leading to differences in the solution space for subsequent microservice placement. As a result, the reliability of services placed by different algorithms is comparable only at the initial moment. Therefore, to directly reflect the fault tolerance of the different placement algorithms, we evaluate algorithm performance in the subsequent simulations by comparing the number of service failures.

Our first simulation result is shown in Fig. 2(b). The result demonstrates that the SRP algorithm outperforms the other algorithms. Taking the number of failures of the worst-performing algorithm as a criterion, the SRP algorithm reduces failures by up to 29% compared to the latest DAIP algorithm. This is because the proposed algorithm takes into account the network-aware reliability of each placed part when placing microservice instances. In addition, when selecting the backup object, the SRP algorithm always provides a new backup instance for the microservice with the lowest current network-aware reliability.

Figure 4: Reliability performance with different topologies. (a) Number of service failures with different edge creation probabilities. (b) Number of service failures with a different number of nodes.

Second, we validate the performance of the algorithms with different network topologies. We begin by generating six sets of topologies with different edge creation probabilities, each set containing 100 different random topologies. Fig. 4(a) shows the number of failures for services placed by different algorithms with these 600 topologies. From the figure, we can see that the SRP algorithm maintains the lowest number of service failures for all the topologies with different edge creation probabilities. This is because the benchmark algorithms all determine the placement of the microservice instances based solely on node reliability, without considering the reduction in reliability caused by network routing or the increase in reliability provided by multipath routing. In contrast, the SRP algorithm senses network routing by calculating network-aware reliability, thus achieving superior performance. We then generate seven sets of topologies with different numbers of nodes, each set containing 100 different random topologies. Fig. 4(b) shows the number of failures for services placed by different algorithms with these 700 topologies. It can be seen that the number of failures of the different algorithms first decreases and then stabilizes when the number of nodes exceeds 40. This is because the more nodes the network has, the more of them are in a highly reliable state. Although the benchmark algorithms consider node reliability, they do not consider the reliability of the entire microservice dependency graph. Instead, they select the most reliable node for each instance in isolation, which causes each node to reach a high load state quickly. The SRP algorithm achieves the best performance because, on the one hand, it calculates the change in node reliability as a node approaches a high load state, and on the other hand, multipath reliability makes distributed placement of instances less costly.
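The two routing effects mentioned above (longer routes lower reliability, multiple paths raise it) can be illustrated with the textbook series and parallel reliability formulas. This sketch is not the paper's network-aware reliability model, which is defined in an earlier section; it assumes independent link failures and fully disjoint paths, and the function names are hypothetical.

```python
from math import prod


def path_reliability(link_reliabilities):
    """Series system: a path works only if every link on it works."""
    return prod(link_reliabilities)


def multipath_reliability(path_reliabilities):
    """Parallel system: the connection works if at least one path works."""
    return 1.0 - prod(1.0 - r for r in path_reliabilities)
```

Under these assumptions, a single three-hop path of links with reliability 0.999 is less reliable than any one of its links, whereas two such disjoint paths together are more reliable than a single link.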

Figure 5: Reliability performance with different resource requirements. (a) Number of service failures with different CPU requirements. (b) Number of service failures with different bandwidth requirements.

Third, we verify the performance of the algorithm under different resource requirement conditions. To avoid the impact of modifying the topology on the algorithm performance, we do not change the number of nodes in the simulation of Fig. 5(a), but instead scale up the CPU requirements of all the microservices by 10-50 times. The curves in the figure decrease because the number of successfully placed services decreases as the CPU requirement increases, leading to a decrease in the number of service failures. We can see that the SRP algorithm reduces the number of service failures by up to 24% compared to the benchmark algorithms. This is because the benchmark algorithms keep looking for the most reliable node even when most of the nodes are under high load, whereas the SRP algorithm provides more placement strategies that do not occupy high-load nodes, at the cost of occupying the bandwidth of multiple paths. Similarly, we scale up the bandwidth requirement for microservice links in the simulation of Fig. 5(b). We can see that as the bandwidth requirement increases, the number of service failures for most of the benchmark algorithms decreases as fewer services are successfully placed. This is because the inflated bandwidth requirement compresses their solution space. In contrast, the SRP algorithm discovers more of the solution space through backtracking. Moreover, although the SRP algorithm also places fewer services, it ensures the reliability of successfully placed services through multipath routing and backup. Thus, even though it consumes more bandwidth, the SRP algorithm still achieves superior performance.

Figure 6: Reliability performance with random backups. (a) Service request distribution for different service reliability. (b) Number of service failures over time.

Finally, we validate the performance of different algorithms for backup object selection. We set the upper limit of backups per service request to a random value between 1 and the number of microservices, because when the number of backups decreases, proper backup object selection yields a higher reliability gain. From Fig. 6, we can see that the SRP algorithm can reduce the number of failures by up to 23.8% despite the reduction in the number of backups compared to the full backup simulation. This is because the backup object selection strategy of the SRP algorithm prioritizes compensating the microservices with the lowest network-aware reliability.
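The idea of always backing up the currently weakest microservice can be sketched as below. This is only an illustration of the selection rule, not the SRP implementation: it substitutes plain per-microservice reliabilities for the paper's network-aware reliability, and it assumes that each backup is an independent copy with the same reliability as its primary. All names and numbers are hypothetical.

```python
from math import prod


def with_backups(r, k):
    """Reliability of a microservice with k independent, identical backups."""
    return 1.0 - (1.0 - r) ** (k + 1)


def lowest_first(base, budget):
    """Give each backup to the microservice that is currently least reliable."""
    counts = {m: 0 for m in base}
    for _ in range(budget):
        weakest = min(counts, key=lambda m: with_backups(base[m], counts[m]))
        counts[weakest] += 1
    return counts


def service_reliability(base, counts):
    """A service works only if every one of its microservices works."""
    return prod(with_backups(base[m], counts[m]) for m in base)
```

With a limited backup budget, this rule spends the backups where the service reliability product gains the most, whereas a round-robin order may spend them on microservices that are already reliable.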

VII-E Validation of the SRP-S Algorithm

In this subsection, we first verify the significant contribution of the shared backup path mechanism in reducing bandwidth consumption. Then, we verify the fault tolerance performance of the SRP and SRP-S algorithms with the shared backup path mechanism through two simulations.

Figure 7: Bandwidth consumption over time.

Fig. 7 shows the average bandwidth consumption of different algorithms over time, where the algorithms marked with brackets operate with the shared backup path mechanism. As seen in Fig. 7, the SRP-S algorithm is able to reduce the bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism, which demonstrates its significant potential in reducing bandwidth consumption.

The results of the first simulation for verifying fault tolerance are shown in Fig. 3. From Fig. 3(a) we can see that there is almost no difference between the SRP and SRP-S curves. Theoretically, in a single placement under the same conditions, the service reliability of a service request placed by the SRP-S algorithm is not higher than that of a service request placed by the SRP algorithm. This is because the SRP-S algorithm pursues a corrected reliability that accounts for the shared backup path contention probability rather than the network-aware service reliability. Fig. 3(b) illustrates the number of service failures of the services placed by each algorithm. In Fig. 3(b), the performance of the benchmark algorithms and the SRP algorithm is similar to the case with the fully protected path mechanism, while the performance of the SRP-S algorithm is similar to that of the SRP algorithm, as the two are essentially the same when there is not much pressure on bandwidth resources.

Figure 8: Reliability performance with different bandwidth requirements.

To verify the performance of the SRP-S algorithm in more extreme cases, the bandwidth requirements of the microservice links are amplified in the second simulation. From Fig. 8 we can see that the SRP algorithm and the SRP-S algorithm produce fewer service failures in different cases and the SRP-S algorithm performs better in the more extreme cases. The reason for the superiority of the SRP algorithm lies in the consideration of network state and backup object selection, whereas the reason for the superiority of the SRP-S algorithm is the consideration of the shared backup path contention caused by simultaneous failure events. Fig. 8 shows that the SRP-S algorithm reduces the number of service failures by up to 21% compared to the SRP algorithm in extreme cases.

VIII Conclusion

In this paper, we address the challenges of microservice placement with a focus on enhancing the reliability of MSA-based 5G and IoT services. The network-aware service reliability model considers the impact of network load and routing on service reliability, offering insights into system reliability assessment. Based on the proposed service reliability model, we propose a heuristic SRP algorithm that effectively addresses the microservice placement problem with the fully protected path mechanism. To reduce bandwidth consumption, we further propose the SRP-S algorithm, which considers the shared backup path contention caused by simultaneous failures and effectively tackles the microservice placement problem with the shared backup path mechanism. Simulation results validate the proposed service reliability model and show that the SRP algorithm can reduce the number of failures by up to 29% compared to the benchmark algorithms with the fully protected path mechanism. With the shared backup path mechanism, the SRP-S algorithm can reduce bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism, and reduce the number of service failures by up to 21% compared to the SRP algorithm with the shared backup path mechanism.

For future work, we plan to extend our proposed reliability model to more diverse backup mechanisms to reduce bandwidth consumption even further. We also aim to improve service reliability by adjusting network resource utilization through microservice migration or scaling.

References

  • [1] M. Usman, S. Ferlin, A. Brunstrom, and J. Taheri, “A survey on observability of distributed edge & container-based microservices,” IEEE Access, vol. 10, pp. 86904–86919, 2022.
  • [2] M. Söylemez, B. Tekinerdogan, and A. Kolukısa Tarhan, “Challenges and solution directions of microservice architectures: A systematic literature review,” Applied Sciences, vol. 12, no. 11, p. 5507, 2022.
  • [3] K. Kaur, F. Guillemin, and F. Sailhan, “Container placement and migration strategies for cloud, fog, and edge data centers: A survey,” International Journal of Network Management, vol. 32, no. 6, p. e2212, 2022.
  • [4] H. Siddiqui, F. Khendek, and M. Toeroe, “Microservices based architectures for IoT systems - state-of-the-art review,” Internet of Things, vol. 23, p. 100854, 2023.
  • [5] R. Kumar and N. Agrawal, “Analysis of multi-dimensional industrial IoT (IIoT) data in edge–fog–cloud based architectural frameworks: A survey on current state and research challenges,” Journal of Industrial Information Integration, vol. 35, p. 100504, 2023.
  • [6] Y. Chen, H. Lu, L. Qin, C. Zhang, and C. W. Chen, “Statistical QoS provisioning analysis and performance optimization in xURLLC-enabled massive MU-MIMO networks: A stochastic network calculus perspective,” IEEE Transactions on Wireless Communications, pp. 1–1, 2024.
  • [7] S. Pallewatta, V. Kostakos, and R. Buyya, “Placement of microservices-based IoT applications in fog computing: A taxonomy and future directions,” ACM Comput. Surv., vol. 55, Jul. 2023.
  • [8] Y. Zeng, Z. Qu, S. Guo, B. Ye, J. Zhang, J. Li, and B. Tang, “Safedrl: Dynamic microservice provisioning with reliability and latency guarantees in edge environments,” IEEE Transactions on Computers, vol. 73, no. 1, pp. 235–248, 2024.
  • [9] Y. Wang, L. Zhang, P. Yu, K. Chen, X. Qiu, L. Meng, M. Kadoch, and M. Cheriet, “Reliability-oriented and resource-efficient service function chain construction and backup,” IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 240–257, 2021.
  • [10] G. Baranwal and D. P. Vidyarthi, “Trappy: a truthfulness and reliability aware application placement policy in fog computing,” The Journal of Supercomputing, vol. 78, pp. 7861–7887, Apr 2022.
  • [11] Y. Qiu, J. Liang, V. C. Leung, X. Wu, and X. Deng, “Online reliability-enhanced virtual network services provisioning in fault-prone mobile edge cloud,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 7299–7313, 2022.
  • [12] M. Zhu, F. He, and E. Oki, “Resource allocation model against multiple failures with workload-dependent failure probability,” IEEE Transactions on Network and Service Management, vol. 19, no. 2, pp. 1098–1116, 2022.
  • [13] L. Rui, X. Chen, X. Wang, Z. Gao, X. Qiu, and S. Wang, “Multiservice reliability evaluation algorithm considering network congestion and regional failure based on petri net,” IEEE Transactions on Services Computing, vol. 15, no. 2, pp. 684–697, 2022.
  • [14] Z. Liu, S. Yang, M. Yang, and R. Kang, “Software belief reliability growth model based on uncertain differential equation,” IEEE Transactions on Reliability, vol. 71, no. 2, pp. 775–787, 2022.
  • [15] X. Qiu, Y. Dai, Y. Xiang, and L. Xing, “A hierarchical correlation model for evaluating reliability, performance, and power consumption of a cloud service,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 3, pp. 401–412, 2016.
  • [16] B. Schroeder and G. A. Gibson, “A large-scale study of failures in high-performance computing systems,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337–350, 2010.
  • [17] A. Zhou, S. Wang, B. Cheng, Z. Zheng, F. Yang, R. N. Chang, M. R. Lyu, and R. Buyya, “Cloud service reliability enhancement via virtual machine placement optimization,” IEEE Transactions on Services Computing, vol. 10, no. 6, pp. 902–913, 2017.
  • [18] L. Zhu, Q. Zhuang, H. Jiang, H. Liang, X. Gao, and W. Wang, “Reliability-aware failure recovery for cloud computing based automatic train supervision systems in urban rail transit using deep reinforcement learning,” Journal of Cloud Computing, vol. 12, no. 1, p. 147, 2023.
  • [19] Z. Liu, G. Fan, H. Yu, and L. Chen, “An approach to modeling and analyzing reliability for microservice-oriented cloud applications,” Wireless Communications and Mobile Computing, vol. 2021, p. 5750646, Aug 2021.
  • [20] H. Huang, H. Zhang, T. Guo, J. Guo, and C. He, “Reliable redundant services placement in federated micro-clouds,” in 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), pp. 446–453, IEEE, 2019.
  • [21] M. Ibrar, L. Wang, N. Shah, O. Rottenstreich, G.-M. Muntean, and A. Akbar, “Reliability-aware flow distribution algorithm in SDN-enabled fog computing for smart cities,” IEEE Transactions on Vehicular Technology, vol. 72, no. 1, pp. 573–588, 2023.
  • [22] J. Paul Martin, A. Kandasamy, and K. Chandrasekaran, “Crew: cost and reliability aware eagle-whale optimiser for service placement in fog,” Software: Practice and Experience, vol. 50, no. 12, pp. 2337–2360, 2020.
  • [23] M. Dadashi and A. Rajabzadeh, “DAIP: a delay-efficient and availability-aware IoT application placement in fog environments,” Computing, vol. 105, pp. 2007–2035, Sep. 2023.
  • [24] Y. Ramzanpoor, M. Hosseini Shirvani, and M. Golsorkhtabaramiri, “Multi-objective fault-tolerant optimization algorithm for deployment of IoT applications on fog computing infrastructure,” Complex & Intelligent Systems, vol. 8, no. 1, pp. 361–392, 2022.
  • [25] Y. Qiu, J. Liang, V. C. M. Leung, X. Wu, and X. Deng, “Online reliability-enhanced virtual network services provisioning in fault-prone mobile edge cloud,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 7299–7313, 2022.
  • [26] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, Y. Ganjali, and C. Diot, “Characterization of failures in an operational IP backbone network,” IEEE/ACM Transactions on Networking, vol. 16, no. 4, pp. 749–762, 2008.
  • [27] G. Le, S. Ferdousi, A. Marotta, S. Xu, Y. Hirota, Y. Awaji, S. Savas, M. Tornatore, and B. Mukherjee, “Reliable provisioning with degraded service using multipath routing from multiple data centers in optical metro networks,” IEEE Transactions on Network and Service Management, vol. 20, no. 3, pp. 3334–3347, 2023.
  • [28] R. S. Guimarães, C. Dominicini, V. M. G. Martínez, B. M. Xavier, D. R. Mafioletti, A. C. Locateli, R. Villaca, M. Martinello, and M. R. N. Ribeiro, “M-polka: Multipath polynomial key-based source routing for reliable communications,” IEEE Transactions on Network and Service Management, vol. 19, no. 3, pp. 2639–2651, 2022.
  • [29] L. Qu, C. Assi, M. J. Khabbaz, and Y. Ye, “Reliability-aware service function chaining with function decomposition and multipath routing,” IEEE Transactions on Network and Service Management, vol. 17, no. 2, pp. 835–848, 2020.
  • [30] L. Tang, G. Zhao, C. Wang, P. Zhao, and Q. Chen, “Queue-aware reliable embedding algorithm for 5G network slicing,” Computer Networks, vol. 146, pp. 138–150, 2018.
  • [31] Y. Al Mtawa, A. Haque, and H. Lutfiyya, “Migrating from legacy to software defined networks: A network reliability perspective,” IEEE Transactions on Reliability, vol. 70, no. 4, pp. 1525–1541, 2021.
  • [32] H. Zhao, S. Deng, Z. Liu, J. Yin, and S. Dustdar, “Distributed redundant placement for microservice-based applications at the edge,” IEEE Transactions on Services Computing, vol. 15, no. 3, pp. 1732–1745, 2022.
  • [33] G. Baranwal and D. P. Vidyarthi, “Trappy: a truthfulness and reliability aware application placement policy in fog computing,” The Journal of Supercomputing, vol. 78, pp. 7861–7887, Apr 2022.
  • [34] M.-Y. Saidi and B. Cousin, “Resource saving: Which resource sharing strategy to protect primary shortest paths?,” in 2016 13th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 297–298, 2016.
  • [35] W. Zheng, M. Yang, C. Zhang, Y. Zheng, and Y. Zhang, “Robust design against network failures of shared backup path protected sdm-eons,” Journal of Lightwave Technology, vol. 41, no. 10, pp. 2923–2939, 2023.
  • [36] D. Ergenç, J. Rak, and M. Fischer, “Service-based resilience via shared protection in mission-critical embedded networks,” IEEE Transactions on Network and Service Management, vol. 18, no. 3, pp. 2687–2701, 2021.
  • [37] J. Edmonds and R. M. Karp, “Theoretical improvements in algorithmic efficiency for network flow problems,” Journal of the ACM (JACM), vol. 19, no. 2, pp. 248–264, 1972.
  • [38] F. Zhang, “Microservice placement simulations.” https://github.com/ZfyInfonet/SRP, 2024.
  • [39] P. Erdős and A. Rényi, “On random graphs I,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.