Network-Aware Reliability Modeling and Optimization for Microservice Placement

Fangyu Zhang, Yuang Chen, Hancheng Lu, Yongsheng Huang

Fangyu Zhang, Yuang Chen, Hancheng Lu, and Yongsheng Huang are with the CAS Key Laboratory of Wireless-Optical Communications, University of Science and Technology of China, Hefei 230027, China (email: fv215b@mail.ustc.edu.cn; hclu@ustc.edu.cn; yuangchen21@mail.ustc.edu.cn; ysh6@mail.ustc.edu.cn).

Abstract

Optimizing microservice placement to enhance the reliability of services is crucial for improving the service level of microservice architecture-based mobile networks and Internet of Things (IoT) networks. Despite extensive research on service reliability, the impact of network load and routing on service reliability remains understudied, leading to suboptimal models and unsatisfactory performance. To address this issue, we propose a novel network-aware service reliability model that effectively captures the correlation between network state changes and reliability. Based on this model, we formulate the microservice placement problem as an integer nonlinear programming problem, aiming to maximize service reliability. Subsequently, a service reliability-aware placement (SRP) algorithm is proposed to solve the problem efficiently. To reduce bandwidth consumption, we further discuss the microservice placement problem with the shared backup path mechanism and propose a placement algorithm based on the SRP algorithm using shared path reliability calculation, known as the SRP-S algorithm. Extensive simulations demonstrate that the SRP algorithm reduces service failures by up to 29% compared to the benchmark algorithms. By introducing the shared backup path mechanism, the SRP-S algorithm reduces bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism. It also reduces service failures by up to 21% compared to the SRP algorithm with the shared backup mechanism.

Index Terms:

Microservice Placement, Reliability Model, Network State, Fault Tolerance, Shared Backup Path

I Introduction

Cloud-native technologies [1] empower the creation and operation of applications in massively scalable distributed infrastructures, leveraging microservices architecture (MSA) [2] alongside platform technologies [3] like containers and virtual machines. With the Cloud Native Computing Foundation’s (CNCF) promotion of cloud-native technologies, MSA, which aims to improve software agility, is gradually coming into the limelight of both academia and industry. By splitting applications into microservices and interconnecting them using a lightweight application programming interface (API), MSA granulates complex services and provides an easier means of maintaining and updating software, accelerating new feature launches, and reducing manual costs. The benefits of MSA have led several major service providers such as Amazon, Netflix, and Spotify to use MSA to place their services [2], as well as making the Internet of Things (IoT) paradigm consider using MSA for smart manufacturing, Internet of Vehicles (IoV), and Industrial IoT (IIoT) [4]. Additionally, to fulfill service latency requirements [5] while avoiding the flooding of the backbone network with a large number of 5G and IoT devices [6], placing MSA-based services on infrastructure paradigms that are close to the users, such as edge or fog platforms [5], is emerging as a new trend for service placement in 5G environments [7].

Ensuring the reliability of 5G application services is crucial for improving the users' quality of experience (QoE), and the MSA complicates this issue. For ultra-reliable low-latency communication (URLLC) services in 5G, such as telemedicine and autonomous driving, end-to-end service reliability of five nines ( $99.999\%$ ) or more is typically required [8]. To improve reliability, traditional monolithic applications typically need to consider both the software reliability of the program itself and the reliability of the hardware on which the application is placed [9]. However, with the introduction of MSA, placing microservices in a distributed manner means that the service is exposed to more risk of failure from both hardware and software [10]. Therefore, improving the overall reliability of services when placing microservices has become an urgent problem.

A number of studies have been conducted to analyze the reliability model [11, 12, 13, 10, 14, 15] and place microservices more reliably. Reliability models can be divided into two categories: hardware reliability models [11, 12, 13] and software reliability models [14, 15, 16]. Based on the research on hardware and software reliability models, the reliability modeling studies for MSA-based services [17] comprehensively consider the overall reliability of the service after placing distributed software into the hardware. Since the placement strategy affects the service reliability, several works have been done to study microservice placement to enhance reliability [18, 19, 20, 21, 22, 23]. Microservice placement work can be categorized into placement in the cloud [18, 19, 20] and placement in the edge or fog [21, 22, 23, 24] based on the application scenario. Due to the variety of application scenarios, they have addressed different issues in terms of resources, quality of service, and reliability, and hence differ in service reliability modeling.

However, the dynamic nature of the network state caused by network load [25] and routing [26] has not been well studied in service reliability modeling, which brings new challenges to microservice placement. Network load has been shown to be negatively correlated with hardware reliability, i.e., the higher the load, the lower the reliability. In this case, hardware reliability is always changing dynamically during microservice placement. Network routing refers to the routing between microservices. On the one hand, the hardware reliability on the communication path is also load-dependent. On the other hand, with the maturity of multipath routing technologies, multipath routing can also have a significant impact on service reliability [27, 28, 29]. As a result, the impact of dynamic network state caused by network load and routing cannot be ignored in service reliability modeling as well as microservice placement.

In this paper, we propose a network-aware service reliability model to address the aforementioned issue. Firstly, to consider the impact of network state changes on reliability modeling, the network state is sensed by building a load-dependent hardware node reliability model and a routing-dependent path reliability model. Then, as each microservice placement and routing between microservices may change the reliability of the infrastructure network, a network-aware placement algorithm is proposed to achieve optimal service reliability performance. Furthermore, to reduce the bandwidth consumption of the infrastructure network, we investigate the microservice placement problem with the shared backup path mechanism. In this case, shared backup path contention due to simultaneous backup failures brings the microservice placement problem a new network state change factor. Since contending paths may lead to routing failures of backup instances, the contention probability of shared backup paths is considered carefully when placing microservices. The main contributions of this article are summarized as follows:

• We propose a network-aware service reliability model to characterize the dynamic network states, with consideration of the load-dependent node reliability and the path reliability of multipath routing as well as the impacts of hardware and software decoupling and backup instances on the reliability of microservices. Simulation results validate the proposed reliability model with different algorithms in terms of the number of service failures.

• Based on the proposed service reliability model, we formulate a microservice placement problem and then propose a service reliability-aware placement (SRP) algorithm to achieve maximum service reliability. The proposed algorithm evaluates the network-aware reliability of each microservice as it is placed. Simulation results show that the proposed algorithm reduces the number of service failures by up to 29% compared to the benchmark algorithms.

• To reduce bandwidth consumption, we further investigate the microservice placement problem with the shared backup path mechanism and propose an algorithm based on the SRP algorithm using shared path reliability calculation (i.e., the SRP-S algorithm). The proposed algorithm approximates the contention probability by calculating the probability that a single failure causes contention on the shared backup path and then reduces the occurrence of contention by combining the probability with network-aware service reliability. Simulation results show that the SRP-S algorithm reduces the bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism and reduces the number of service failures by up to 21% compared to the SRP algorithm with the shared backup path mechanism.

The rest of the paper is organized as follows. Section II discusses the work related to reliability modeling and reliability-aware microservice placement over protected or shared paths. Section III gives the modeling procedure for the system model and the proposed service reliability model. In Section IV, we formulate the microservice placement problem with the fully protected and shared backup paths. The corresponding algorithms to solve them are proposed in Section V and Section VI, respectively. Simulation results are discussed in Section VII. Finally, the paper concludes in Section VIII.

II Related Work

Due to the distributed nature of MSA-based services, assessing service reliability is critical to ensure the failure tolerance of the MSA-based service, which has been addressed by research on reliability models and microservice placement.

II-A Reliability Model

Studies on reliability models can be classified into two main categories based on failure causes: hardware reliability and software reliability [15]. For hardware reliability, most studies have modeled the arrival of hardware failures as Poisson processes [30]. Wang et al. [9] proposed an instance-sharing reliability model to aggregate multiple services into a composite and proposed an algorithm to improve its reliability. Zhu et al. [12] proposed a load-dependent node reliability model to capture the relationship between failure probabilities and workloads and introduced a recovery strategy to handle workload variations. Mtawa et al. [31] proposed a link reliability model to assess the all-pair reliability of the network and tested it on both conventional network and SDN. Similar to hardware failures, the arrival of software failures is also usually modeled as a Poisson process [10]. Liu et al. [14] considered the software reliability problem in the framework of uncertainty theory and proposed a software reliability growth model based on uncertain differential equations.

Due to the different causes of hardware and software failures, it is inaccurate to consider the combination of hardware and placed software as a singular entity when modeling service reliability. Therefore, the service reliability model in the microservice placement problem should consider both hardware and software reliability [15, 22]. Qiu et al. [15] investigated the reliability model and fault recovery of cloud computing platforms, where the reliability model considered both hardware reliability and software reliability. Martin et al. [22] pointed out that the unavailability of a service is determined by software failures and hardware failures together and proposed a hardware-software decoupled reliability model.

II-B Microservice Placement

Microservice placement has been studied in a variety of scenarios [10, 32, 33]. Liu et al. [10] proposed an approach based on multi-agent systems to maximize the reliability of services in cloud environments. Zhao et al. [32] considered the heterogeneity of edge environments and the uncertainty of service requests. They modeled the microservice placement problem as a stochastic optimization problem and proposed a statistics-based approach to solve it. Baranwal et al. [33] investigated the truthfulness of fog owners in fog environments and proposed a heuristic algorithm to ensure the truthfulness of fog owners and the reliability of services.

Since the reliability of MSA-based services depends on the microservice placement strategy, there have been many studies that investigated placing microservices with the goal of improving service reliability [8, 22, 23, 24]. Zeng et al. [8] formulated the microservice placement problem as an integer nonlinear programming problem and proposed a deep reinforcement learning scheme based on expert intervention to ensure high reliability and low latency of the service. Martin et al. [22] modeled the microservice placement problem as a multi-objective optimization problem and proposed a meta-heuristic algorithm to deal with the conflict between reliability and cost in the optimization objectives. Dadashi et al. [23] enhanced the reliability of the service by using backups and proposed a reliability-aware and delay-efficient heuristic algorithm to solve the microservice placement problem. The authors in [24] formulated the microservice placement problem in fog as a multi-objective optimization problem and proposed a fault-tolerant mechanism to improve the reliability of microservices while reducing power consumption and latency. However, while most works have used backup instances to enhance service reliability, the impact of network routing on service reliability has not been well studied.

Turning the perspective to routing, it can be seen that multipath routing techniques can significantly enhance the reliability of paths between microservices [27, 28, 29]. Le et al. [27] proposed a reliable service provisioning scheme to optimize network resource utilization by using multipath routing. The authors in [28] proposed a topology-agnostic multipath source routing scheme and orchestration architecture and verified its performance in improving communication reliability. Qu et al. [29] formalized the microservice placement problem as a mixed-integer linear programming problem and proposed a delay-aware hybrid multipath routing scheme to improve the reliability of network services.

To reduce bandwidth consumption, some researchers have considered using the shared backup path mechanism for network routing [34, 35, 36]. Saidi et al. [34] proposed two shared path mechanisms, including shared backup paths and shared all paths, to conserve bandwidth resources during network routing. Zheng et al. [35] used shared backup path protection to improve bandwidth capacity limits for elastic optical networks and used backup paths to improve system reliability during network routing. Ergenc et al. [36] used shared backup paths for service placement. They asserted that the proposed shared backup capacity model can bring up to 70% capacity gain and provide more than 90% fault tolerance for single node failure.

In the aforementioned related work, few works are aware of the impact of the load state and routing state in the network on the service reliability model as well as the microservice placement strategy. In addition, the reliability gain of backup instances of microservices after the decoupling of hardware and software has also not been well studied.

Figure 1: The process of placing microservices into the infrastructure network.

III System Model

In this section, we give the system model and network-aware service reliability modeling. We first introduce the infrastructure network model for microservice placement and the service request model based on the MSA in Sec. III-A and Sec. III-B, respectively. Then, to clarify the difference between existing work and our work, we introduce the hardware reliability model and the software reliability model used in this paper in Sec. III-C, which serve as the basis for our network-aware service reliability modeling. In Sec. III-D, we formalize the network-aware service reliability model, which accounts for load-dependent node reliability and routing-dependent multipath routing reliability, as well as the impact of hardware and software reliability decoupling and backup instances on service reliability.

III-A Infrastructure Network Model

The infrastructure network model is established to provide the underlying network for microservice placement. As shown in Fig. 1, we represent the infrastructure network with an undirected graph, $G=(N,E)$ , where $N=\{n_{1},n_{2},\cdots,n_{|N|}\}$ denotes the set of physical nodes, and $E=\{e_{12},e_{23},\cdots,e_{(|N|-1)|N|}\}$ represents the set of physical links in the infrastructure network. For any physical node $n_{i}$ , considering the most commonly utilized CPU core resources, we use $c(n_{i})$ to denote the amount of resources that have been allocated, and $C(n_{i})$ to represent its total resource capacity. For any physical link $e_{ij}$ , the allocated bandwidth resources and total bandwidth resources are denoted by $bw(e_{ij})$ and $BW(e_{ij})$ , respectively. Additionally, the reliability of physical nodes and links is denoted as $r_{n}$ and $r_{e}$ , respectively, which describes the probability of no failure.

III-B Service Request Model

The service request model is established to describe information related to service requests. We use $S$ to represent the set of all service requests. Each service request is represented as $s^{i}=(G^{i},B^{i},\Upsilon^{i},D^{i},\Omega^{i})$ , where $G^{i}$ is a directed acyclic graph (DAG) used to represent the microservices in the service request and their dependencies, $B^{i}$ denotes the maximum number of backups for microservices in the service request, $\Upsilon^{i}$ represents the amount of data transferred between microservices in service request $s^{i}$ , $D^{i}$ denotes the set of latency deadlines for each microservice in the service request, and $\Omega^{i}$ denotes the lifetime of service request $s^{i}$ . In the service request model described above, nodes in the microservice dependency graph $G^{i}$ represent microservices, and directed edges represent microservice links and invocation relationships between source and destination microservices. The microservice dependency graph $G^{i}$ and the microservice model with $B^{i}$ backup constraints are described in detail as follows.

The microservice dependency graph $G^{i}=(M^{i},L^{i})$ consists of the microservice set $M^{i}=\{m^{i}_{0},m^{i}_{1},m^{i}_{2},\cdots,m^{i}_{|M|}\}$ and the microservice link set $L^{i}=\{l^{i}_{m_{1}m_{2}},l^{i}_{m_{2}m_{3}},\cdots,l^{i}_{m_{|M|-1}m_{|M|}}\}$ . In microservice set $M^{i}$ , $m^{i}_{0}$ is a special virtual microservice that represents the request access location and does not consume the computational resources of the access node. $m^{i}_{1}$ represents the root node of the microservice graph. Additionally, each microservice $m$ has fixed CPU core resource requirement $c(m)$ and reliability $r_{m}$ . Microservice link $l$ has bandwidth resource requirement and reliability, represented as $bw(l)$ and $r_{l}$ , respectively.

Microservice backups are placed as new instances of the primary microservices to enhance the overall reliability of the service. When the primary microservice instance is operating normally, backup microservice instances need to occupy computing resources on physical nodes and utilize bandwidth resources on physical links to provide failure tolerance. When the primary microservice instance fails, if a backup microservice instance is available, the microservice can still connect to upstream or downstream microservices through it, allowing the service to continue running. We use $m^{i}_{j}(b)$ to denote the $b$ -th microservice instance of microservice $m^{i}_{j}$ , where $b\in B^{i}_{m}$ and $B^{i}_{m}$ represents the set of backup instance indexes for microservice $m$ . For convenience, we abbreviate the primary microservice $m^{i}_{j}(1)$ as $m^{i}_{j}$ in the following. In addition, $B^{i}$ in the service request model ensures that the number of microservice backups is limited to prevent unlimited resource consumption.
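As an illustration of the service request tuple described above, the following minimal sketch stores its five components in plain Python data classes. The class names, field names, and the `parents_of` helper are hypothetical and chosen only for this example. The sketch also assumes that each edge of $G^{i}$ is keyed as (caller, callee) and that the virtual microservice $m_{0}$ has reliability 1.

```python
from dataclasses import dataclass


@dataclass
class Microservice:
    name: str
    cpu: float          # c(m): CPU cores required by each instance
    reliability: float  # r_m: software reliability of the microservice


@dataclass
class ServiceRequest:
    microservices: dict  # name -> Microservice (nodes of G^i)
    links: dict          # (caller, callee) -> bw(l), the edges of G^i
    max_backups: int     # B^i: maximum number of backups
    data: dict           # (caller, callee) -> data amount (Upsilon^i)
    deadlines: dict      # name -> latency deadline (D^i)
    lifetime: float      # Omega^i: lifetime of the request

    def parents_of(self, name):
        """Names of the microservices that invoke `name` (assumed edge direction)."""
        return sorted(src for (src, dst) in self.links if dst == name)


request = ServiceRequest(
    microservices={
        # m0 is the virtual access microservice; zero CPU, reliability 1 assumed.
        "m0": Microservice("m0", cpu=0.0, reliability=1.0),
        "m1": Microservice("m1", cpu=2.0, reliability=0.999),
        "m2": Microservice("m2", cpu=1.0, reliability=0.998),
    },
    links={("m0", "m1"): 10.0, ("m1", "m2"): 5.0},
    max_backups=2,
    data={("m0", "m1"): 4.0, ("m1", "m2"): 1.5},
    deadlines={"m1": 20.0, "m2": 10.0},
    lifetime=3600.0,
)

print(request.parents_of("m2"))
```

Keeping the edges in a dictionary keyed by microservice pairs makes it cheap to look up the parents of a microservice, which the latency and reliability definitions rely on.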

Since microservice instances typically have latency requirements, we consider that each microservice link has a latency, including transmission latency, propagation latency, and processing latency from child microservices, and represent the latency between a pair of parent-child microservices $d_{\tau m}$ as follows:

$$
d_{\tau m}=\frac{\upsilon^{i}_{\tau m}}{bw(l_{\tau m})}+\min_{p\in P_{l_{\tau m}}}\Big\{\sum_{e\in p}d_{e}\Big\}+d_{m}, \tag{1}
$$

where $\tau$ is a parent microservice instance of $m$ , $\upsilon^{i}_{\tau m}\in\Upsilon^{i}$ represents the amount of data transmitted from microservice $m$ to $\tau$ after processing, $d_{e}$ represents the propagation latency of physical link $e$ , $P_{l_{\tau m}}$ denotes the set of paths where link $l_{\tau m}$ is placed, and $d_{m}$ represents the processing latency of microservice $m$ . When a microservice link is placed on multiple paths, only the latency of the shortest path is considered.
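The latency in Eq. (1) can be evaluated directly once the placed paths are known. The sketch below is a minimal illustration with hypothetical function and argument names: each path is a list of physical-link identifiers, and the numerical values are invented for the example.

```python
def link_latency(data_amount, bandwidth, paths, link_delay, processing_delay):
    """Latency between a parent-child pair, following Eq. (1).

    data_amount      -- data transferred between the two instances
    bandwidth        -- bw(l), bandwidth of the microservice link
    paths            -- list of paths; each path is a list of physical-link ids
    link_delay       -- dict: physical-link id -> propagation latency d_e
    processing_delay -- d_m, processing latency of the child microservice
    """
    transmission = data_amount / bandwidth
    # With several placed paths, only the lowest-latency path is counted.
    propagation = min(sum(link_delay[e] for e in path) for path in paths)
    return transmission + propagation + processing_delay


# Hypothetical values: two placed paths between the same pair of instances.
delays = {"e12": 2.0, "e23": 1.0, "e13": 4.0}
paths = [["e12", "e23"], ["e13"]]
latency = link_latency(10.0, 5.0, paths, delays, 0.5)
print(latency)
```

Here the two-hop path has a propagation latency of 3.0 and the direct path 4.0, so the two-hop path determines the result.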

III-C Hardware And Software Reliability Model

This subsection describes the reliability modeling of the elements involved in the microservice placement process, including hardware and software. First, the load dependency of hardware node reliability is the basis for service reliability to sense the network load, and the hardware link reliability constitutes the smallest unit of network routing-aware path reliability. Second, software reliability directly affects the service reliability gain of the backup instances. Additionally, software link reliability will directly affect the path reliability gained from multipath routing. The specific model is as follows:

1. Hardware Reliability

The reliability of physical nodes has been demonstrated to be correlated with the workload running on them [12, 11]. Load-dependent network node reliability affects the placement strategy, as densely placed microservices lead to lower reliability of physical nodes, while uniformly distributed microservices maintain a low load state of nodes at the cost of consuming more bandwidth resources. To describe the load dependence of dynamic network node reliability, we refer to the work of Zhu et al. [12] and represent the reliability $r_{n}$ of a physical node $n$ as a two-segmented function as follows:

$$
r_{n}(c(n))=\begin{cases} r^{L}_{n}, & c(n)\leq\xi(n)\\ r^{H}_{n}, & \xi(n)<c(n)\leq C(n)\end{cases}, \tag{2}
$$

where $\xi(n)$ represents the load threshold at which the reliability of the physical node changes, and $r^{L}_{n}$ and $r^{H}_{n}$ denote the reliability of the physical node under low-load and high-load conditions, respectively.

The failure arrival of a physical link $e$ is usually modeled as a Poisson process [30]. Therefore, we model its reliability $r_{e}$ as follows:

$$
r_{e}(t)=Pr(x=0)=e^{-\lambda t}, \tag{3}
$$

where $Pr(\cdot)$ denotes probability, $\lambda$ represents the mean failure arrival rate, and $t$ denotes the time that the physical link has been operational.

2. Software Reliability

Software reliability encompasses the reliability of microservices and microservice links. Because human factors such as program design and environment configuration can lead to software failures, software failures are typically modeled using specific software failure statistics. Nonetheless, the choice of software failure model does not affect the subsequent analysis of service reliability, so our work is compatible with arbitrary models. In this paper, the arrival of software failures is modeled as a Poisson process. Therefore, the reliability $r_{m}$ and $r_{l}$ of microservice $m$ and microservice link $l$ can be respectively represented as

$$
r_{m}(t)=Pr(x=0)=e^{-\lambda_{1}t}, \tag{4}
$$

$$
r_{l}(t)=Pr(x=0)=e^{-\lambda_{2}t}, \tag{5}
$$

where $Pr(x=0)$ denotes the probability of no failure at moment $t$ , and $\lambda_{1}$ and $\lambda_{2}$ are the mean failure arrival rates for microservices and microservice links, respectively.
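The exponential form shared by Eqs. (3) to (5) can be checked numerically. The following minimal sketch uses invented failure rates and a hypothetical helper name. It also assumes that the failures of a microservice and of its link are independent, so that their joint survival probability is the product of the two reliabilities.

```python
import math


def poisson_reliability(failure_rate, t):
    """Probability of zero failure arrivals up to time t for a Poisson process."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


lam_1 = 1e-4  # hypothetical microservice failure rate (failures per hour)
lam_2 = 5e-5  # hypothetical microservice-link failure rate (failures per hour)
t = 100.0     # operating time in hours

r_m = poisson_reliability(lam_1, t)
r_l = poisson_reliability(lam_2, t)

# Under the independence assumption the rates simply add in the exponent.
r_both = r_m * r_l
print(r_m, r_l, r_both)
```

Because the rates add in the exponent, a chain of independent Poisson failure sources behaves like a single source whose rate is the sum of the individual rates.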

III-D Network-Aware Service Reliability Model

In this subsection, network-aware service reliability is modeled to assess the reliability level of the microservice placement strategy. Network-aware service reliability considers network-aware reliability, reliability of microservice dependencies, and reliability gain of backup microservice instances. Among them, network-aware reliability specifically refers to routing-dependent multipath reliability consisting of load-dependent physical node reliability and physical link reliability.

First, we consider network-aware reliability. Since the reliability of a physical node is related to the load of microservices running on it, we define a binary variable $x^{m}_{n}$ to indicate whether microservice $m$ is placed on node $n$ or not, where $x^{m}_{n}=1$ means that microservice $m$ is placed on node $n$ . In addition, we define an extra binary variable $y^{l}_{e}$ to indicate whether the microservice link $l$ is placed on the link $e$ . Then we can use the defined binary variable $x^{m}_{n}$ to represent the physical node $n^{m}$ where microservice $m$ is placed as follows:

$$
n^{m}=\sum_{n\in N}x^{m}_{n}n. \tag{6}
$$

In addition, the CPU resource occupancy $c(n)$ can be represented as

$$
c(n)=\sum_{s^{i}\in S}\sum_{m\in M^{i}}x^{m}_{n}c(m). \tag{7}
$$

Thus, the dynamic reliability of a physical node can be determined by Eq. (2) and Eq. (7).
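To make the interaction between Eq. (2) and Eq. (7) concrete, the sketch below accumulates the CPU occupancy of each node from a placement and then evaluates the two-segment node reliability. The function names, the dictionary-based placement (which lists only the entries with $x^{m}_{n}=1$), and all numerical values are assumptions made for this example.

```python
from collections import defaultdict


def cpu_occupancy(placement, cpu_demand):
    """Eq. (7): c(n) is the sum of c(m) over the microservices placed on n.

    placement  -- dict: microservice id -> node id (entries with x = 1)
    cpu_demand -- dict: microservice id -> c(m)
    """
    load = defaultdict(float)
    for m, n in placement.items():
        load[n] += cpu_demand[m]
    return dict(load)


def node_reliability(load, threshold, capacity, r_low, r_high):
    """Eq. (2): r_low up to the load threshold, r_high above it."""
    if not 0 <= load <= capacity:
        raise ValueError("load must lie within [0, capacity]")
    return r_low if load <= threshold else r_high


# Hypothetical placement: two microservices share n1, one runs alone on n2.
placement = {"m1": "n1", "m2": "n1", "m3": "n2"}
cpu_demand = {"m1": 2, "m2": 3, "m3": 1}

load = cpu_occupancy(placement, cpu_demand)
reliability = {
    n: node_reliability(c, threshold=4, capacity=8, r_low=0.999, r_high=0.99)
    for n, c in load.items()
}
print(load, reliability)
```

In this example node n1 crosses the threshold and drops to the high-load reliability, while n2 stays in the low-load segment, which is the placement-dependent effect the model is meant to capture.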

Path reliability is considered since not only does the operation of microservices require node reliability, but also microservice links require that all physical nodes and links in their paths are reliable. We denote the $j$ -th path where the link between two microservice instances is placed by $p^{mm^{\prime}}_{j}$ , which is a set that contains all nodes and edges on the path, but not the source and destination nodes. We can then represent the path reliability $r_{p}$ for a single path $p$ as follows:

$$
r_{p}=\prod_{n\in p}r_{n}\prod_{e\in p}r_{e}. \tag{8}
$$

Next, we can represent the total path reliability $r_{P}$ of a path set $P$ containing multiple paths as follows:

$$
r_{P}=1-\prod_{p\in P}(1-r_{p}). \tag{9}
$$

For the subsequent calculations, we need to record the path information when calculating the reliability of each path or multiple paths. We define a function $\mathsf{P}(\cdot)$ that is used to query the corresponding path set from the calculated path reliability as follows:

$$
\mathsf{P}(r_{P})=P, \tag{10}
$$

where $P$ is the set of paths corresponding to the total path reliability $r_{P}$ .
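The single-path and multi-path reliabilities of Eq. (8) and Eq. (9) can be sketched as follows. The sketch assumes that element failures are independent and that the paths share no elements. It uses hypothetical names and values, and represents each path as a pair of lists holding its intermediate nodes and its links. Keeping the path list next to the computed value plays the role of the lookup $\mathsf{P}(\cdot)$.

```python
from math import prod


def path_reliability(path, r_node, r_link):
    """Eq. (8): product over the intermediate nodes and the links of one path.

    path -- (intermediate_nodes, links), source and destination excluded
    """
    nodes, links = path
    return prod(r_node[n] for n in nodes) * prod(r_link[e] for e in links)


def total_path_reliability(paths, r_node, r_link):
    """Eq. (9): the connection survives if at least one path survives."""
    return 1.0 - prod(1.0 - path_reliability(p, r_node, r_link) for p in paths)


# Hypothetical network: n1 reaches n4 either through n2 or through n3.
r_node = {"n2": 0.9, "n3": 0.8}
r_link = {"e12": 0.99, "e24": 0.99, "e13": 0.99, "e34": 0.99}
paths = [(["n2"], ["e12", "e24"]), (["n3"], ["e13", "e34"])]

per_path = [path_reliability(p, r_node, r_link) for p in paths]
total = total_path_reliability(paths, r_node, r_link)
print(per_path, total)
```

The total is always at least as large as the best single path, which is the reliability gain that multipath routing contributes in the model.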

Next, we consider the total path reliability of a single microservice link placed on multiple paths. Calculating the two-terminal network reliability is the most accurate measure of total path reliability in a general network. However, in the microservice placement problem, routes are determined at the time of placement to ensure resource provisioning, rather than being freely switchable at runtime. Therefore, we consider the reliability sum of a finite number of paths instead of the two-terminal reliability. In this case, we consider all internally disjoint paths (IDPs) to avoid common cause faults (CCFs) [13].

We propose a microservice path reliability matrix to describe the reliability of all IDPs between two microservices. First, we express the reliability of a physical node $n_{i}$ and the link $e_{ij}$ connected to it as $\mathsf{r}_{ij}=r_{n_{i}}r_{e_{ij}}$ . Then, we propose a one-step path reliability matrix $R$ using the concept of the adjacency matrix as follows:

$$
R=R^{(1)}=\begin{bmatrix}0&\mathsf{r}_{12}&\mathsf{r}_{13}&\cdots&\mathsf{r}_{1|N|}\\ \mathsf{r}_{21}&0&\mathsf{r}_{23}&\cdots&\mathsf{r}_{2|N|}\\ \vdots&&\ddots&&\vdots\\ \mathsf{r}_{|N|1}&\mathsf{r}_{|N|2}&\cdots&&0\end{bmatrix}, \tag{11}
$$

where $\mathsf{r}_{ij}=0$ if the physical link $e_{ij}$ does not exist. Then, for ease of representation in subsequent calculations, we define the path reliability operator as follows:

$$
\begin{aligned} x\oplus y&=1-(1-x)(1-y)\\ &=x+y-xy,\qquad x,y\in[0,1], \end{aligned} \tag{12}
$$

$$
x\ominus y=\frac{x-y}{1-y},\qquad x\in[0,1],\ y\in[0,x), \tag{13}
$$

$$
x\otimes y=\begin{cases} 0, & p_{x}\cap p_{y}\neq\emptyset,\ p_{y}\in\mathsf{P}(y),\ p_{x}\in\mathsf{P}(x)\\ xy, & \text{else}\end{cases},\qquad x,y\in[0,1], \tag{14}
$$

where $x\oplus y$ denotes the reliability sum of two paths, $x\ominus y$ denotes the reliability sum of multiple paths minus the reliability of one of the paths, and $x\otimes y$ denotes the reliability of two paths merged.
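A small numerical sketch may help to see how the three operators behave. The function names are hypothetical. For $\otimes$, the sketch assumes one particular reading of the condition in Eq. (14): the result is zero as soon as some path behind $x$ shares an element with some path behind $y$. Each path is represented as a `frozenset` of element identifiers.

```python
def oplus(x, y):
    """Eq. (12): reliability sum of two independent alternatives."""
    return 1.0 - (1.0 - x) * (1.0 - y)


def ominus(x, y):
    """Eq. (13): remove the contribution y from the reliability sum x."""
    if not 0.0 <= y < x <= 1.0:
        raise ValueError("ominus requires 0 <= y < x <= 1")
    return (x - y) / (1.0 - y)


def otimes(x, paths_x, y, paths_y):
    """Eq. (14) under the assumed reading: zero if any two paths overlap."""
    overlap = any(px & py for px in paths_x for py in paths_y)
    return 0.0 if overlap else x * y


both = oplus(0.9, 0.8)        # two alternatives combined
restored = ominus(both, 0.8)  # removing one alternative recovers the other

disjoint = otimes(0.9, [frozenset({"n2"})], 0.8, [frozenset({"n3"})])
shared = otimes(0.9, [frozenset({"n2"})], 0.8, [frozenset({"n2", "n3"})])
print(both, restored, disjoint, shared)
```

The round trip through `oplus` and `ominus` illustrates that $\ominus$ inverts $\oplus$, and the last two calls show that merging is only credited when the segments do not reuse an element.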

Based on the path reliability operator and the preservation of the path information corresponding to reliability, we define the multiplication of the path reliability matrix as follows:

$$
A\wedge B=C=[c_{ij}],\qquad c_{ij}=\begin{cases} a_{i1}\otimes b_{1j}\oplus a_{i2}\otimes b_{2j}\oplus\cdots\oplus a_{in}\otimes b_{nj}, & i\neq j\\ 0, & i=j\end{cases}, \tag{15}
$$

where $c_{ij}$ represents the reliability sum of equal-length paths from physical node $i$ to $j$ . Thus, we can represent the $k$ -th order path reliability matrix as follows:

$$
R^{(k)}=R^{(k-1)}\wedge R^{(1)},\qquad k\geq 2, \tag{16}
$$

where $k$ represents the length of the path.

Next, we define path reliability matrix addition as follows:

$$
A\vee B=C=[c_{ij}],\qquad c_{ij}=\begin{cases} a_{ij}, & p_{a}\cap p_{b}\neq\emptyset,\ p_{a}\in\mathsf{P}(a_{ij}),\ p_{b}\in\mathsf{P}(b_{ij})\\ a_{ij}\oplus b_{ij}, & \text{else}\end{cases}. \tag{17}
$$

As a result, we can represent all path reliability matrices with a maximum length of $k$ as follows:

$$
\hat{R}^{(k)}=R^{(1)}\vee R^{(2)}\vee\cdots\vee R^{(k)}, \tag{18}
$$

where the elements $\hat{\mathsf{r}}^{(k)}_{ij}$ in $\hat{R}^{(k)}$ represent the reliability of all IDPs from physical node $n_{i}$ to physical node $n_{j}$ with a length not greater than $k$ . However, the total path reliability is inaccurate because each path includes the source node within it and does not meet the definition of an IDP. Therefore, we denote the path set $\mathsf{P}^{\prime}(\hat{\mathsf{r}}^{(k)}_{ij})$ after removing the source nodes of all paths as follows:

$$
\mathsf{P}^{\prime}(\hat{\mathsf{r}}^{(k)}_{ij})=\{p^{\prime}\mid p^{\prime}=p\backslash\{n_{i}\},\ p\in\mathsf{P}(\hat{\mathsf{r}}^{(k)}_{ij})\}. \tag{19}
$$

Finally, we can represent the network-aware reliability matrix as follows:

$$
\mathcal{R}^{(k)}=\left[r^{(k)}_{ij}\right],\qquad r^{(k)}_{ij}=\begin{cases} r_{\mathsf{P}^{\prime}(\hat{\mathsf{r}}^{(k)}_{ij})}, & i\neq j\\ 1, & i=j\end{cases}. \tag{20}
$$

After obtaining the network-aware reliability matrix, we can analyze the reliability of the microservice dependency graph and the reliability gain brought by backup microservice instances. We divide the analysis of network-aware service reliability into the following four steps.

First, analyze the reliability between a single parent instance and a single child instance. We use $\tau_{m}^{f}$ to represent the $f$ -th parent microservice of microservice $m$ . For each child microservice instance, its parent microservice instance is connected to it through a microservice link placed on one or more paths. So we can consider the reliability of the microservice link from the $b_{1}$ -th instance of the child microservice to the $b_{2}$ -th instance of the $f$ -th parent microservice and the network-aware reliability between the ends as a whole. We call this whole the effective probability of microservice link $\kappa^{(k)}_{m(b_{1}),\tau^{f}_{m}(b_{2})}$ and denote it as follows:

\small\kappa^{(k)}_{m(b_{1}),\tau^{f}_{m}(b_{2})}=r_{l_{m(b_{1})\tau^{f}_{m}(b_{2})}}r^{(k)}_{n^{m(b_{1})}n^{\tau^{f}_{m}(b_{2})}},

(21)

where $r_{l_{m(b_{1})\tau^{f}_{m}(b_{2})}}$ represents the software reliability of the microservice link and $r^{(k)}_{n^{m(b_{1})}n^{\tau^{f}_{m}(b_{2})}}$ represents the network-aware reliability between the nodes where the child and parent microservice instances are placed. In addition, since the virtual microservice $m_{0}$ used to represent the access location does not have a parent microservice, we let $\kappa_{m_{0}}=1$.

Second, analyze the reliability of a single microservice instance and all its parent microservice links. To achieve this, understanding the impact of node reliability on the effective probability of microservice links is essential. The reliability of the same node may enter the network-aware reliability of several microservice instances and links, so multiplying it in every time would count it repeatedly and yield a spurious reliability value. Therefore, node reliability should be considered only once in the calculation. However, backup instances introduce a new problem: the availability of a node may not be a necessary condition for service availability. To address this issue, we focus on nodes whose failure would inevitably lead to service failure and call them critical nodes. We denote the set of critical nodes by $N^{i}_{c}$, which can be represented as follows:

\small N^{i}_{c}=\{n^{\prime}|n^{\prime}=n,\exists m\in M^{i},n\prod_{b\in B^{i}_{m}}x^{m(b)}_{n}\neq 0,n\in N\},

(22)

where $B^{i}_{m}$ denotes the set of backup instance indexes of microservice $m$ .
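As an illustration of Eq. (22), the following minimal sketch collects the critical nodes of one service request. It assumes the placement is available as a dictionary from each microservice to the list of nodes hosting its instances; the function name `critical_nodes` and this data layout are hypothetical and not part of the original formulation.

```python
def critical_nodes(placement):
    """Return the nodes whose failure takes down at least one microservice.

    placement: dict mapping a microservice to the list of nodes that host its
    instances (primary first, then backups). This layout is an assumption.
    A node is critical when some microservice has every instance on it, which
    is the case where the product of x over all instances is non-zero.
    """
    critical = set()
    for nodes in placement.values():
        if nodes and len(set(nodes)) == 1:
            critical.add(nodes[0])
    return critical


placement = {
    "m1": ["n1"],          # no backup: n1 is critical
    "m2": ["n2", "n3"],    # backup on another node: neither node is critical
    "m3": ["n4", "n4"],    # backup on the same node: n4 is still critical
}
print(sorted(critical_nodes(placement)))  # ['n1', 'n4']
```

Under this reading, a backup instance only removes a node from the critical set when it is placed on a different node than the other instances of the same microservice.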

Now, we need to correct the network-aware reliability of the microservice link paths with the set of critical nodes. We denote the total path reliability $\hat{r}_{P}$ corrected by the set of critical nodes as follows:

\small\hat{r}_{P}=\prod_{n\in p\backslash(p\cap N^{i}_{c})}r_{n}\prod_{e\in p}r_{e}.

(23)

Network-aware reliability can be corrected as follows:

\small\hat{r}^{(k)}_{ij}=\left\{\begin{aligned} &\hat{r}_{\mathsf{P}^{\prime}(\hat{\mathsf{r}}^{(k)}_{ij})},&&i\neq j\\ &1,&&i=j\end{aligned}.\right.

(24)

The corrected effective probability of microservice link $\hat{\kappa}^{(k)}_{m(b_{1}),\tau^{f}_{m}(b_{2})}$ can be expressed as follows:

\small\hat{\kappa}^{(k)}_{m(b_{1}),\tau^{f}_{m}(b_{2})}=r_{l_{m(b_{1})\tau^{f}_{m}(b_{2})}}\hat{r}^{(k)}_{n^{m(b_{1})}n^{\tau^{f}_{m}(b_{2})}}.

(25)

Now we can denote the reliability of a single microservice instance $m(b)$ and all its parent microservice links by $\sigma^{(k)}_{m(b),n}$ , which is as follows:

\small\sigma^{(k)}_{m(b),n}=r_{m(b)}\prod_{f\in F_{m(b)}}\Big(1-\prod_{b^{\prime}\in B_{\tau}}\big(1-x^{m(b)}_{n}\hat{\kappa}^{(k)}_{m(b),\tau^{f}_{m}(b^{\prime})}\big)\Big),

(26)

where $B_{\tau}$ denotes the set of backup instance indexes of microservice $\tau^{f}_{m}$ and $F_{m(b)}$ denotes the set of indexes of the parent microservice instances. For convenience, we omit the superscript $(k)$ of $\sigma^{(k)}_{m(b),n}$ in the subsequent analyses, which simply denotes the maximum path length in the network-aware reliability.
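To make the structure of Eq. (26) concrete, the sketch below evaluates it for one instance that is placed on the node under consideration (so $x^{m(b)}_{n}=1$). The function name `instance_reliability` and the nested-list input are assumptions made for illustration; the corrected effective probabilities $\hat{\kappa}$ are taken as given.

```python
from math import prod


def instance_reliability(r_instance, kappa_by_parent):
    """Evaluate Eq. (26) for an instance placed on the node (x = 1).

    r_instance: software reliability of the instance itself.
    kappa_by_parent: one list per parent microservice, holding the corrected
        effective link probabilities towards each instance of that parent
        (hypothetical input layout).
    For every parent, at least one link to one of its instances must work;
    all parents are needed, so the per-parent terms are multiplied.
    """
    return r_instance * prod(
        1.0 - prod(1.0 - k for k in kappas) for kappas in kappa_by_parent
    )


# One parent with a primary and a backup instance, one parent with a single instance.
sigma = instance_reliability(0.999, [[0.9, 0.8], [0.95]])
print(round(sigma, 6))  # 0.930069
```

The first parent contributes $1-(1-0.9)(1-0.8)=0.98$, which shows how a backup instance of a parent raises the reliability seen by the child.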

Third, analyze the reliability of all instances of a microservice and the links from their parents. We first use $\sigma_{m,n}$ to denote the reliability of all instances of microservice $m$ that are placed on the same node $n$ , which can be expressed as follows:

\small\sigma_{m,n}=1-\prod_{b\in B_{m}}(1-\sigma_{m(b),n}).

(27)

Now we can obtain the reliability of all instances of microservice $m$ on all nodes. We denote it by $\sigma_{m}$ as follows:

\small\sigma_{m}=1-\prod_{n\in N\backslash N^{i}_{c}}r_{n}(1-\sigma_{m,n})\prod_{n\in N^{i}_{c}}(1-\sigma_{m,n}),

(28)

where we temporarily disregard the reliability of critical nodes to avoid their reuse.

Fourth, analyze the reliability of the entire microservice dependency graph, i.e., network-aware service reliability. We denote the service reliability, which consists of the reliability of all microservices and the reliability of critical nodes, by $r_{G^{i}}$ as shown below:

\small r_{G^{i}}=\prod_{n\in N^{i}_{c}}r_{n}\prod_{m\in M^{i}}\sigma_{m}.

(29)
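The third and fourth steps can be illustrated with a small numerical sketch of Eq. (27) and Eq. (29). The per-microservice values $\sigma_{m}$ of Eq. (28) are taken as given inputs here, and the function names and dictionary layout are hypothetical.

```python
from math import prod


def sigma_same_node(instance_sigmas):
    """Eq. (27): reliability of all instances of one microservice on one node,
    i.e. the probability that at least one of them works."""
    return 1.0 - prod(1.0 - s for s in instance_sigmas)


def service_reliability(sigma_by_microservice, node_reliability, critical):
    """Eq. (29): product of the critical-node reliabilities and of the
    per-microservice reliabilities sigma_m (assumed precomputed via Eq. (28))."""
    return prod(node_reliability[n] for n in critical) * prod(
        sigma_by_microservice.values()
    )


print(round(sigma_same_node([0.9, 0.8]), 6))  # 0.98

r_service = service_reliability(
    {"m1": 0.99, "m2": 0.98},
    {"n1": 0.999, "n2": 0.9999},
    critical={"n1"},
)
print(round(r_service, 7))  # 0.9692298
```

Only the critical node `n1` enters the final product directly, which mirrors the rule that the reliability of a critical node is counted exactly once.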

IV Problem Formulation

The microservice placement problem is defined as a mapping $\psi:G^{i}\rightarrow G$. Microservice placement involves two tasks: node placement and path selection. Node placement places each microservice instance (including backup microservice instances) of a service request onto a single physical node in the infrastructure network, while path selection maps the link between any two microservice instances to one or more consecutive physical links. After a service request expires, its microservice placement is revoked and the occupied resources are released. In this paper, the primary objective of microservice placement is to maximize the service reliability of a single service request while meeting latency and resource constraints. Therefore, for a single service request $s^{i}$ arriving at the current time, we can formalize the microservice placement problem as an integer nonlinear programming problem and represent it as follows:

\small\mathcal{P}1:\max r_{G^{i}},

(30)

s.t.

	$\small\sum_{s^{i}\in S}\sum_{l\in L^{i}}y^{l}_{e}bw(l)\leq BW(e),\forall e\in E,$		(31a)
	$\small\sum_{s^{i}\in S}\sum_{m\in M^{i}}x^{m}_{n}c(m)\leq C(n),\forall n\in N,$		(31b)
	$\small\sum_{n\in N}x^{m(b)}_{n}=1,\forall b\in B_{m},m\in M,$		(31c)
	$\small d_{\tau m}\leq D_{\tau m},\forall m\in M,$		(31d)
	$\small\sum_{m\in M}\|B_{m}\|\leq B+\|M\|,$		(31e)
	$\small B<\|M\|,$		(31f)
	$\small x^{m}_{n}\in\left\{0,1\right\},\forall m\in M,n\in N,$		(31g)
	$\small y^{l}_{e}\in\left\{0,1\right\},\forall l\in L,e\in E,$		(31h)
	$\small c(n),c(m),C(n)\geq 0,\forall m\in M,n\in N,$		(31i)
	$\small bw(l),bw(e),BW(e)\geq 0,\forall l\in L,e\in E,$		(31j)

where constraints (31a)-(31b) ensure that the resource requirements of service requests do not exceed the resource limits of physical nodes and links in the infrastructure network, constraint (31c) ensures that each microservice instance is placed on only one physical node, constraint (31d) ensures that the latency of each microservice link does not exceed its latency requirements, constraint (31e) ensures that the number of backup microservices does not exceed the backup limit of the service request, constraint (31f) ensures that the backup limit does not exceed the number of microservices in the microservice dependency graph, and constraints (31g)-(31j) specify the value ranges of variables and resources.
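The resource constraints (31a)-(31b) can be checked for a candidate placement with a few lines of code. The sketch below is only illustrative: it assumes that the remaining capacities already account for previously placed requests, and all names (`placement_is_feasible`, the dictionary arguments) are hypothetical. Constraint (31c) holds by construction because every instance maps to exactly one node.

```python
def placement_is_feasible(instance_node, cpu, node_free, link_edges, bw, edge_free):
    """Check the capacity constraints for one candidate placement.

    instance_node: instance -> hosting node (one node per instance).
    cpu: instance -> CPU demand c(m); node_free: node -> remaining C(n).
    link_edges: microservice link -> list of physical edges it is routed on.
    bw: link -> bandwidth demand bw(l); edge_free: edge -> remaining BW(e).
    """
    used_cpu = {}
    for inst, node in instance_node.items():
        used_cpu[node] = used_cpu.get(node, 0.0) + cpu[inst]

    used_bw = {}
    for link, edges in link_edges.items():
        for edge in edges:
            used_bw[edge] = used_bw.get(edge, 0.0) + bw[link]

    nodes_ok = all(load <= node_free[n] for n, load in used_cpu.items())
    edges_ok = all(load <= edge_free[e] for e, load in used_bw.items())
    return nodes_ok and edges_ok


instance_node = {"m1": "n1", "m2": "n1", "m2_backup": "n2"}
cpu = {"m1": 0.5, "m2": 0.4, "m2_backup": 0.4}
link_edges = {"l12": [("n1", "n2")]}
bw = {"l12": 5.0}
edge_free = {("n1", "n2"): 10.0}

print(placement_is_feasible(instance_node, cpu, {"n1": 1.0, "n2": 0.3}, link_edges, bw, edge_free))  # False
print(placement_is_feasible(instance_node, cpu, {"n1": 1.0, "n2": 0.5}, link_edges, bw, edge_free))  # True
```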

In Section III-B, we mentioned that backup microservice instances consume computing resources on physical nodes and bandwidth resources on physical links. However, since backup microservice instances mostly remain in an inactive state (becoming active only when the primary microservice instance fails), providing dedicated bandwidth protection for them is not always necessary. Therefore, we consider the concept of the shared backup path, which allows backup microservice instances to share network bandwidth resources to reduce bandwidth consumption. However, the shared backup path mechanism creates a new problem: multiple backup instances may become active at the same time and contend for network bandwidth, which can lead to service request failures. To address this problem, we first introduce an upper limit on the shared bandwidth capacity, denoted as $\hat{BW}(e)$. The upper limit is set to $\omega$ times the protected bandwidth limit $BW(e)$:

\small\hat{BW}(e)=\omega BW(e),

(32)

where $\omega\geq 0$ . Then we modify constraint (31a) of problem $\mathcal{P}1$ and propose problem $\mathcal{P}2$ , which aims to maximize service reliability with the shared backup path mechanism. Problem $\mathcal{P}2$ is as follows:

\small\mathcal{P}2:\max r_{G^{i}},

(33)

s.t.

\small\sum_{s^{i}\in S}\sum_{l\in L^{i}_{1}}y^{l}_{e}bw(l)\leq BW(e),\forall e\in E,

(34a)

\small\sum_{s^{i}\in S}\sum_{l\in L^{i}\backslash L^{i}_{1}}y^{l}_{e}bw(l)\leq\hat{BW}(e),\forall e\in E,

(34b)

(31b)-(31j),

(34c)

where $L^{i}_{1}$ represents the set of links between primary microservices. We denote it as follows:

L^{i}_{1}=\{l|l=l_{m(1)m^{\prime}(1)},m,m^{\prime}\in M^{i}\}.

(35)

Constraint (34a) ensures that protected bandwidth consumption does not exceed the protected bandwidth limit, while constraint (34b) guarantees that the shared bandwidth consumption does not exceed $\omega$ times the protected bandwidth limit.
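A minimal per-edge check of constraints (34a) and (34b) could look as follows. The function name and the list-based inputs are assumptions for illustration; primary links are charged against $BW(e)$ and backup links against $\hat{BW}(e)=\omega BW(e)$.

```python
def edge_within_limits(primary_bw, backup_bw, capacity, omega=1.0):
    """Check one physical edge against (34a) and (34b).

    primary_bw: bandwidth demands of primary links routed on the edge.
    backup_bw: bandwidth demands of backup links routed on the edge.
    capacity: protected bandwidth limit BW(e).
    omega: shared-bandwidth factor from Eq. (32).
    """
    protected_ok = sum(primary_bw) <= capacity
    shared_ok = sum(backup_bw) <= omega * capacity
    return protected_ok and shared_ok


print(edge_within_limits([40, 50], [60, 30], capacity=100, omega=1.0))  # True
print(edge_within_limits([40, 50], [60, 30], capacity=100, omega=0.5))  # False
```

In the first call the edge carries 90 units of primary and 90 units of backup demand on a 100-unit link, which is admissible only because the backup demand is counted against the separate shared budget.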

V Proposed SRP Algorithm

In this section, we propose a service reliability-aware placement (SRP) algorithm, a heuristic for solving problem $\mathcal{P}1$. The SRP algorithm takes the network state and the service request as input and outputs a microservice placement strategy that includes microservice instance placement and backup object selection.

V-A Algorithm Description

The main process of the SRP algorithm is shown in Algorithm 1. When a service request arrives, SRP initiates a breadth-first search starting from the root microservice $m_{1}$ and adds the microservices in the microservice dependency graph $G^{i}$ to the placement queue $Q$ (line 3). In line 4, we initialize the backtracking counter $\epsilon$, which has a predefined upper limit $\Delta$, and the node blacklist $N^{m}_{bl}$ of each microservice. The loop from line 5 to line 17 ensures that each microservice of the service request is placed. In line 6, SRP extracts the first unplaced microservice $m$ from queue $Q$. In line 7, SRP calls Algorithm 2 to place the microservice $m$ and obtains the placement result, which indicates whether the placement succeeded or failed. Lines 8-16 deal with the case of microservice placement failure. If the microservice is not the root microservice and the number of backtracks has not exceeded the limit, SRP cancels the placement of all parents of microservice $m$ and their children. It also adds the node where each parent was placed to the node blacklist of that parent and then proceeds to the next iteration. Otherwise, the algorithm returns a placement failure (line 14). In line 18, we call Algorithm 3 for backup object selection and backup instance placement. Finally, in line 19, the algorithm returns the placement success message.

Algorithm 2 describes the microservice placement process. It traverses the nodes in the candidate node set (lines 4-26). In lines 5-7, if the node is in the blacklist of microservice $m$ or has insufficient resources, the algorithm starts the next iteration directly; otherwise, the algorithm calculates $\sigma_{m}$ for microservice $m$ in lines 8-16. Specifically, we use $n_{f_{b}}$ to denote the placement node of the parent microservice instance $\tau^{f}_{m}(b)$. In line 11, the algorithm searches for the set of IDPs between nodes $n_{j}$ and $n_{f_{b}}$ that meet the constraints. Line 12 calculates the total path reliability of the path set $P_{jf_{b}}$. Lines 13-16 calculate the reliability of $m$. Since node reliability is load-dependent, line 17 calculates the reliability of node $n_{j}$ after placing microservice $m$, and lines 18-22 consider the effect of the critical node set on service reliability. Lines 23-25 track the node with the highest total reliability. Line 27 places the microservice $m$ on the node with the highest total reliability and places all links connected to it on the corresponding paths. At the same time, the algorithm records the current $\sigma_{m}$ for subsequent selection of backup objects. Since the current $\sigma_{m}$ is computed together with the placement of the microservice links, no additional time complexity is added. Finally, lines 28-32 return the placement result.

Algorithm 3 outlines the strategy for selecting backup objects. After initializing the backup counter $b$ and the set of backup objects $BM^{i}$ in line 1, the algorithm enters a loop in lines 2-16, which requires that the number of backup instances does not exceed a limit and that the set of backup objects is not empty. In lines 4-8, the algorithm iterates over $BM^{i}$ to obtain the microservice $m_{min}$ with the smallest $\sigma_{m}$ value. Since the instantaneous $\sigma_{m}$ obtained when placing the microservice increases with the number of instances, the latest $\sigma_{m}$ needs to be obtained in line 5. The algorithm then calls Algorithm 2 in line 10 to place the microservice $m_{min}$ . In lines 11 to 15, if the placement fails, the microservice is removed from the set of backup objects and the next round of iteration starts; otherwise, the backup counter is increased in line 15.

Algorithm 1 Service Reliability-aware Placement (SRP) Algorithm

1: Input: edge network $G$, service request $s^{i}$
2: Output: placement result of $G^{i}$ with backups.
3: Add microservices from $M^{i}\backslash\{m^{i}_{0}\}$ to the placement queue $Q$ starting from $m_{1}$ through breadth-first search.
4: $\epsilon\leftarrow 0$, $N^{m}_{bl}\leftarrow\emptyset,m\in Q$
5: while unplaced microservices in $Q$ exist do
6:   Obtain the first unplaced microservice $m$ from $Q$
7:   $res\leftarrow$ place $m$ using Alg. 2.
8:   if $res=\textbf{false}$ then
9:     if $m\neq m_{1}$ and $\epsilon<\Delta$ then
10:       Undo the placement of all the parents of microservice $m$ and all their children.
11:       $\epsilon\leftarrow\epsilon+1,N^{\tau_{m}}_{bl}\leftarrow N^{\tau_{m}}_{bl}\cup\{n^{\tau_{m}}\}$
12:       continue
13:     else
14:       return false
15:     end if
16:   end if
17: end while
18: Backup Placement Process (Alg. 3).
19: return true
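The control flow of Algorithm 1 (queue processing, bounded backtracking, and parent blacklisting) can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: the callback `try_place` is a hypothetical stand-in for Algorithm 2, the backup step of line 18 is omitted, and the dictionary-based graph representation is an assumption.

```python
def srp_control_loop(order, parents, children, try_place, max_backtracks):
    """Place microservices in breadth-first order with bounded backtracking.

    order: microservices in breadth-first order, root first.
    parents / children: dependency graph as dicts of lists.
    try_place(m, placed, blacklist): hypothetical stand-in for Algorithm 2;
        returns a node, or None when no feasible node exists.
    Returns a dict microservice -> node, or None when placement fails.
    """
    placed = {}
    blacklist = {m: set() for m in order}
    backtracks = 0
    root = order[0]
    while len(placed) < len(order):
        m = next(x for x in order if x not in placed)  # first unplaced one
        node = try_place(m, placed, blacklist[m])
        if node is not None:
            placed[m] = node
            continue
        if m == root or backtracks >= max_backtracks:
            return None  # corresponds to "return false"
        backtracks += 1
        for p in parents.get(m, []):
            if p in placed:
                blacklist[p].add(placed[p])  # do not retry this node for p
                del placed[p]
            for c in children.get(p, []):
                placed.pop(c, None)  # children of p must be placed again
    return placed


def demo_place(m, placed, blacklist):
    # Toy rule: "a" takes the first non-blacklisted node of (0, 1);
    # "b" is only feasible when "a" sits on node 1.
    if m == "a":
        return next((n for n in (0, 1) if n not in blacklist), None)
    return 1 if placed.get("a") == 1 else None


print(srp_control_loop(["a", "b"], {"b": ["a"]}, {"a": ["b"]}, demo_place, 2))
# {'a': 1, 'b': 1}
```

In the toy run, the first attempt puts "a" on node 0, the placement of "b" fails, node 0 is blacklisted for "a", and the second attempt succeeds. With `max_backtracks=0` the same input returns `None`.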

Algorithm 2 Microservice Placement Process

1: Input: $G$, $s^{i}$, $m$, $N^{m}_{bl}$
2: Output: Placement result of microservice $m$
3: $n_{max}\leftarrow null,\mathsf{r}_{max}\leftarrow 0$.
4: for $n_{j}\in N$ do
5:   if $n_{j}\in N^{m}_{bl}$ or $n_{j}$ is under-resourced then
6:     continue
7:   end if
8:   $\mathsf{r}_{b}\leftarrow 0,\mathsf{r}_{f}\leftarrow 1$
9:   for $f\in F_{m}$ do
10:     for $b\in B_{\tau^{f}_{m}}$ do
11:       Denote the node of $\tau^{f}_{m}(b)$ as $n_{f_{b}}$ and get the set of paths $P_{jf_{b}}$ that meet the constraints.
12:       Calculate the total path reliability $r_{P_{jf_{b}}}$
13:       $\mathsf{r}_{b}\leftarrow\mathsf{r}_{b}\oplus r_{P_{jf_{b}}}$
14:     end for
15:     $\mathsf{r}_{f}\leftarrow\mathsf{r}_{f}\mathsf{r}_{b}$
16:   end for
17:   Calculate the $r^{\prime}_{n_{j}}$ after placing $m$ on $n_{j}$
18:   if $n_{j}\in N_{c}$ then
19:     $\mathsf{r}_{f}\leftarrow\mathsf{r}_{f}\frac{r^{\prime}_{n_{j}}}{r_{n_{j}}}$
20:   else
21:     $\mathsf{r}_{f}\leftarrow\mathsf{r}_{f}r^{\prime}_{n_{j}}$
22:   end if
23:   if $\mathsf{r}_{max}<\mathsf{r}_{f}$ then
24:     $\mathsf{r}_{max}\leftarrow\mathsf{r}_{f},n_{max}\leftarrow n_{j}$
25:   end if
26: end for
27: Place $m$ on $n_{max}$ while recording the current $\sigma_{m}$
28: if placement succeeds then
29:   return true
30: else
31:   return false
32: end if

Algorithm 3 Backup Placement Process

1: $b\leftarrow 0,BM^{i}\leftarrow M^{i}\backslash\{m^{i}_{0}\}$
2: while $b<B^{i}$ and $BM^{i}\neq\emptyset$ do
3:   $m_{min}\leftarrow null,r_{min}\leftarrow 1$
4:   for $m\in BM^{i}$ do
5:     Obtain the $\sigma_{m}$ of the last instance of $m$
6:     if $r_{min}>\sigma_{m}$ then
7:       $r_{min}\leftarrow\sigma_{m},m_{min}\leftarrow m$
8:     end if
9:   end for
10:   $res\leftarrow$ place $m_{min}$ using Alg. 2.
11:   if $res=\textbf{false}$ then
12:     $BM^{i}\leftarrow BM^{i}\backslash\{m_{min}\}$
13:     continue
14:   end if
15:   $b\leftarrow b+1$
16: end while
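The backup object selection of Algorithm 3 amounts to repeatedly reinforcing the currently weakest microservice. A compact Python sketch under stated assumptions follows: `current_sigma` and `place` are hypothetical callbacks standing in for the recorded $\sigma_{m}$ values and for Algorithm 2, respectively.

```python
def select_backups(microservices, backup_limit, current_sigma, place):
    """Greedy backup selection: always back up the weakest microservice.

    microservices: candidate backup objects (BM^i).
    backup_limit: maximum number of backup instances (B^i).
    current_sigma(m): latest sigma_m of microservice m (hypothetical callback).
    place(m): tries to place one more instance of m, returns True on success
        (hypothetical stand-in for Algorithm 2).
    Returns the list of microservices that received a backup, in order.
    """
    candidates = list(microservices)
    backed_up = []
    while len(backed_up) < backup_limit and candidates:
        weakest = min(candidates, key=current_sigma)
        if place(weakest):
            backed_up.append(weakest)
        else:
            candidates.remove(weakest)  # no feasible node: stop considering it
    return backed_up


sigma = {"a": 0.90, "b": 0.97, "c": 0.95}


def demo_place(m):
    if m == "c":  # pretend no node can host another instance of "c"
        return False
    sigma[m] = 1.0 - (1.0 - sigma[m]) ** 2  # toy gain from one extra instance
    return True


print(select_backups(["a", "b", "c"], 2, sigma.get, demo_place))  # ['a', 'b']
```

Because $\sigma_{m}$ is re-read in every round, a microservice that has just received a backup is only chosen again if it is still the weakest one.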

V-B Complexity Analysis

The complexity analysis starts with the time complexity of Algorithm 2 and Algorithm 3. First, we use the IDP-searching algorithm based on the Edmonds-Karp algorithm [37], which has a time complexity of $O(|N||E|^{2})$, to obtain the set of paths between two nodes. Since Algorithm 2 runs this search for each candidate node and each parent instance, its complexity is $O(|N|^{2}|E|^{2}|M|)$. In Algorithm 3, the loop in lines 2 to 16 is executed at most $|M|$ times. Since the complexity of each iteration is $O(|M|+|N|^{2}|E|^{2}|M|)=O(|N|^{2}|E|^{2}|M|)$, the total complexity of Algorithm 3 is $O(|N|^{2}|E|^{2}|M|^{2})$. Looking back at Algorithm 1, we can see that its complexity is the same as that of Algorithm 3. Therefore, we conclude that the complexity of Algorithm 1 is $O(|N|^{2}|E|^{2}|M|^{2})$.

VI Proposed SPRC Algorithm

In this section, we propose a new heuristic algorithm based on the SRP algorithm to solve problem $\mathcal{P}2$ by using the shared path reliability calculation (SPRC) algorithm. When selecting placement nodes for backup instances, in addition to considering whether the remaining shared bandwidth on the physical link is adequate, we also need to consider the shared backup path contention caused by simultaneous failures. Specifically, when other backup links have already been activated (i.e., switched from occupying shared bandwidth to occupying protected bandwidth), the protected bandwidth may be insufficient at the moment the current backup link needs to be activated for fault tolerance, so that the current backup link fails. Therefore, we must consider the reliability of the primary instances of the backup links placed on the physical link, which determines the distribution of the occupied protected bandwidth. For a more comprehensive assessment of network-aware service reliability, we propose replacing lines 9-16 in Algorithm 2 with SPRC. In the following, we refer to the SRP algorithm using SPRC as the SRP-S algorithm.

VI-A Algorithm Description

Algorithm 4 modifies lines 9-16 of Algorithm 2. In lines 6-11 of Algorithm 4, we traverse the set $M_{p}$, which represents the set of all backup microservice instances that have links on the path $p$. Line 7 obtains the probability of inactivity of the backup microservice instance $m^{\prime}$, which is the reliability of all instances with backup indexes less than the backup index of $m^{\prime}$. In lines 8-10, we evaluate whether the protected bandwidth on the physical link is sufficient for the microservice links belonging to both $m^{\prime}$ and $m$. If it is insufficient, activating $m^{\prime}$ may cause the activation of $m$ to fail. Each backup instance in $M_{p}$ may be either inactive or active, so there are $2^{|M_{p}|}$ possible events, of which the single most probable one is that every backup instance is inactive. To limit computation time, and because multiple simultaneous failures are unlikely, we ignore the simultaneous failure of two or more backup instances in this algorithm and only consider the $|M_{p}|$ individual failure events. Finally, in line 12, we multiply the backup path inactivation probability by the original path reliability to produce the new path reliability, and in line 14 we calculate $\hat{r}_{P_{jf_{b}}}$ with the new path reliability. The subsequent lines 15-17 are consistent with Algorithm 2.

VI-B Complexity Analysis

The additional complexity introduced by Algorithm 4 relative to Algorithm 2 is mainly in the loops in lines 6-11. Combined with the complexity of the path-searching algorithm, we can obtain the time complexity of Algorithm 4 as $O(|M|(|N||E|^{2}+|M|))$ . Thus the time complexity of the SRP-S algorithm is $O(|M|(|M|+|N||M|(|N||E|^{2}+|M|)))=O(|N|^{2}|E|^{2}|M|^{2}+|N||M|^{3})$ .

Algorithm 4 Shared Path Reliability Calculation (SPRC) Algorithm

1: for $f\in F_{m}$ do
2:   for $b\in B_{\tau^{f}_{m}}$ do
3:     Denote the node of $\tau^{f}_{m}(b)$ as $n_{f_{b}}$ and get the set of paths $P_{jf_{b}}$ that meet the constraints.
4:     for $p\in P_{jf_{b}}$ do
5:       $Pr_{p}\leftarrow 0$
6:       for $m^{\prime}\in M_{p}$ do
7:         Obtain the inactivation probability $\sigma^{\prime}_{m^{\prime}}$ for $m^{\prime}$
8:         if $\exists e\in p\cap\{e|y^{l_{m^{\prime}\tau}}_{e}=1\},BW(e)<bw(e)+bw(l_{m^{\prime}\tau})+bw(l_{m\tau^{f}_{m}(b)})$ then
9:           $Pr_{p}\leftarrow Pr_{p}+\sigma^{\prime}_{m^{\prime}}$
10:         end if
11:       end for
12:       $\hat{r}_{p}\leftarrow Pr_{p}r_{p}$
13:     end for
14:     Calculate $\hat{r}_{P_{jf_{b}}}$ using modified path reliability.
15:     $\mathsf{r}_{b}\leftarrow\mathsf{r}_{b}\oplus\hat{r}_{P_{jf_{b}}}$
16:   end for
17:   $\mathsf{r}_{f}\leftarrow\mathsf{r}_{f}\mathsf{r}_{b}$
18: end for

VII Performance Evaluation

We validate our modeling and algorithmic work through extensive simulations. Our simulation code can be accessed online [38].

VII-A Simulation Setting

We first use the Erdős-Rényi model [39] to create an infrastructure network topology with 50 nodes and an edge creation probability of 0.2. We then select the one-fifth of the nodes with the smallest degrees as access nodes for receiving service requests. For each microservice, microservice link, and physical link, we set the failure arrival rate to $0.00001$ per time unit, ensuring that their reliability meets the "five nines" reliability level at the initial moment. For dynamic node reliability, we set the reliability $r_{L}$ between $0.9999$ and $0.99999$ for low load and $r_{H}$ between $0.999$ and $0.9999$ for high load. Unless otherwise stated, the backup limit equals the number of primary microservices (full backup). The other parameters are listed in Table I. In all simulations, $100$ service requests arrive at the access nodes according to a Poisson process with parameter $1$, implying that the average arrival rate of service requests is $1$ per time unit. Additionally, all simulations are performed $100$ times and the results are averaged.

TABLE I: Parameters

Element	Parameter	Range
$n$	$C(n)$	$[8,16]$ cores
$n$	$\xi(n)$	$0.5$
$e$	$BW(e)$	$[100,1000]$ MBps
$e$	$d_{e}$	$[1,10]$ ms
$e$	$\omega$	$1$
$s$	$\Omega$	$[1,100]$ units
$s$	$\|M\|$	$[1,5]$
$m$	$c(m)$	$[0.1,1]$ cores
$m$	$\upsilon_{\tau m}$	$[0.1,5]$ MB
$m$	$d_{m}$	$[10,50]$ ms
$l$	$bw(l)$	$[0.1,10]$ MBps
$l$	$D_{l}$	$[0.03,50.15]$ s
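The topology and arrival setup described in this subsection can be reproduced in outline with the standard library alone. The sketch below is not the authors' simulation code [38]; the function names are hypothetical, ties between equal-degree nodes are broken by node id (an assumption), and the Poisson arrivals are modeled through exponential inter-arrival gaps. Note that a random graph generated this way is not guaranteed to be connected.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """G(n, p): each unordered node pair becomes an edge with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def pick_access_nodes(adj, share=5):
    """Return the len(adj) // share nodes with the smallest degree."""
    by_degree = sorted(adj, key=lambda v: (len(adj[v]), v))
    return by_degree[: len(adj) // share]


def arrival_times(count, rate, rng):
    """Arrival instants of a Poisson process with the given rate."""
    t, times = 0.0, []
    for _ in range(count):
        t += rng.expovariate(rate)  # exponential gap between arrivals
        times.append(t)
    return times


rng = random.Random(0)  # fixed seed only to make this sketch repeatable
topology = erdos_renyi(50, 0.2, rng)
access = pick_access_nodes(topology)
arrivals = arrival_times(100, 1.0, rng)
print(len(access), len(arrivals))  # 10 100
```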

VII-B Benchmark

In our simulations, four algorithms are used for comparison with the SRP and SRP-S algorithms. They are described in detail as follows:

$\mathbf{1}.$ Delay-efficient and Availability-aware Placement (DAIP) [23]

The DAIP algorithm considers backup instances. It focuses on the reliability of the nodes when placing services and chooses the path with the least latency when placing links. In addition, it adopts a round-robin strategy for backup object selection. While the DAIP algorithm considers the reliability gain from backup instances in detail, it does not, in contrast to our work, consider dynamic network-aware reliability or the decoupling of hardware and software reliability.

$\mathbf{2}.$ Reliable Redundant Services Placement (RRSP) [20]

The RRSP algorithm considers only the total node reliability after placement and does not consider path selection. Unlike DAIP, RRSP does not focus on the placement of each instance but directly generates a certain number of solutions and selects the best solution among them. In addition, it does not specify how the backup objects are selected, so we generate backup instances for the primary instances in descending order of the degrees of the nodes in the dependency graph. We use this benchmark algorithm to show the performance of an algorithm that considers only node reliability.

$\mathbf{3}.$ Greedy Placement (Grd)

The greedy algorithm is a classical heuristic algorithm. It selects the node with the highest reliability for each microservice instance and chooses the shortest path for each microservice link. To emphasize the effect of backups on reliability improvement, the basic version of the greedy algorithm does not place any backup instances.

$\mathbf{4}.$ Greedy Placement with Backup (Grd-B)

Greedy Placement with Backup is an extended version of Greedy Placement in which the round-robin policy is used for backup object selection. The instance placement and path selection of this algorithm are consistent with Greedy Placement.

VII-C Validation of Network-Aware Service Reliability Model

In this subsection, we verify the effectiveness of the network-aware service reliability model.

Fig. 2(a) and Fig. 2(b) show the network-aware service reliability evaluation and the number of service failures for different placement algorithms with the fully protected path mechanism, respectively. From Fig. 2(a) and Fig. 2(b) we can see that the more service requests an algorithm places with high network-aware service reliability, the fewer failures it incurs. This result indicates that the evaluation given by the proposed reliability model is generally consistent with the observed number of failures.

Fig. 3 shows the reliability performance with the shared backup path mechanism. As seen in Fig. 3(a) and Fig. 3(b), the proposed network-aware service reliability model can still work to evaluate service reliability with the shared backup path mechanism. This is because when bandwidth resources are not strained, shared backup paths do not need to consider path reliability changes due to backup path contention. However, when bandwidth resources are extremely scarce, the network-aware service reliability will be inaccurate due to backup path contention.

VII-D Validation of the SRP Algorithm

In this subsection, we validate the performance of the SRP algorithm by executing a series of simulations with the fully protected path mechanism. Although we verified the effectiveness of the proposed model in Sec. VII-C, the number of service failures remains the more accurate measure of algorithm performance. This is because once different algorithms have applied different placement strategies, the resource conditions of the network gradually diverge, leading to differences in the solution space for subsequent microservice placement. As a result, the reliability of services placed by different algorithms is comparable only at the initial moment. Therefore, in order to directly reflect the fault tolerance of different placement algorithms in subsequent simulations, we evaluate the algorithm performance by comparing the number of service failures.

Our first simulation result is shown in Fig. 2(b). The result demonstrates that the SRP algorithm outperforms the other algorithms. Taking the number of failures of the worst-performing algorithm as a criterion, the SRP algorithm reduces failures by up to 29% compared to the latest DAIP algorithm. This is because the proposed algorithm takes into account the network-aware reliability of each placed part when placing microservice instances. In addition, the SRP algorithm always provides a new backup instance for the microservice with the lowest current network-aware reliability when selecting the backup object.

Next, we validate the performance of the algorithms on different network topologies. First, we generate six sets of topologies with different edge creation probabilities, each set containing 100 different random topologies. Fig. 4(a) shows the number of failures for services placed by the different algorithms over these 600 topologies. From the figure, we can see that the SRP algorithm maintains the lowest number of service failures for all edge creation probabilities. This is because the benchmark algorithms all determine the placement of microservice instances based solely on node reliability, without considering the reduction in reliability caused by network routing or the increase in reliability provided by multipath routing. In contrast, the SRP algorithm senses network routing by calculating network-aware reliability, thus achieving superior performance. Second, we generate seven sets of topologies with different numbers of nodes, each containing 100 different random topologies. Fig. 4(b) shows the number of failures for services placed by the different algorithms over these 700 topologies. It can be seen that the number of failures of the different algorithms first decreases and then stabilizes once the number of nodes exceeds 40. This is because the more nodes there are, the more nodes in the network are in a highly reliable state. Although the benchmark algorithms consider node reliability, they do not consider the reliability of the entire microservice dependency graph. Instead, they select the most reliable node for each instance in isolation, which causes each node to reach a high load state quickly. The SRP algorithm achieves the best performance because, on the one hand, it calculates the change in node reliability as a node approaches a high load state, and on the other hand, multipath reliability makes distributed placement of instances less costly.

Third, we verify the performance of the algorithms under different resource requirements. In order to avoid the impact of modifying the topology on the algorithm performance, we do not change the number of nodes in the simulation of Fig. 5(a), but instead scale up the CPU requirements of all the microservices by 10-50 times. The curves in the figure are decreasing because the number of successfully placed services decreases as the CPU requirement increases, leading to a decrease in the number of service failures. We can see that the SRP algorithm reduces the number of service failures by up to 24% compared to the benchmark algorithms. This is because the benchmark algorithms keep searching for the most reliable node even when most of the nodes are under high load, whereas the SRP algorithm provides more placement strategies that do not occupy high-load nodes, at the cost of occupying the bandwidth of multiple paths. Similarly, we scale up the bandwidth requirement for microservice links in the simulation of Fig. 5(b). We can see that as the bandwidth requirement increases, the number of service failures for most of the benchmark algorithms decreases as fewer services are successfully placed. This is because the inflated bandwidth requirement compresses their solution space. In contrast, the SRP algorithm explores more of the solution space through backtracking. Moreover, although the SRP algorithm also places fewer services, it ensures the reliability of successfully placed services through multipath routing and backup. Thus, even though it consumes more bandwidth, the SRP algorithm still achieves superior performance.

Finally, we validate the performance of different algorithms for backup object selection. We set the upper limit of backups per service request to a random value between 1 and the number of microservices, because the fewer backups are available, the more the reliability gain depends on proper backup object selection. From Fig. 6, we can see that the SRP algorithm can reduce the number of failures by up to 23.8% despite the reduction in the number of backups compared to the full backup simulation. This is because the SRP algorithm uses a backup object selection strategy that prioritizes compensating the microservices with the lowest network-aware reliability.

VII-E Validation of the SRP-S Algorithm

In this subsection, we first verify the significant contribution of the shared backup path mechanism in reducing bandwidth consumption. Then, we verify the fault tolerance performance of the SRP and SRP-S algorithms with the shared backup path mechanism through two simulations.

Fig. 7 shows the average bandwidth consumption of different algorithms over time, where the algorithms marked with brackets operate with the shared backup path mechanism. As seen in Fig. 7, the SRP-S algorithm is able to reduce the bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism, which demonstrates its significant potential in reducing bandwidth consumption.

The results of the first simulation for verifying fault tolerance are shown in Fig. 3. From Fig. 3(a) we can see that there is almost no difference between the SRP and SRP-S curves. Theoretically, in a single placement under the same conditions, the service reliability of a service request placed by the SRP-S algorithm is not higher than that of a service request placed by the SRP algorithm. This is because the SRP-S algorithm optimizes the corrected reliability, which accounts for the shared backup path contention probability, rather than the network-aware service reliability. Fig. 3(b) illustrates the number of service failures of the services placed by each algorithm. In Fig. 3(b), the performance of the benchmark algorithms and the SRP algorithm is similar to the case with the fully protected path mechanism, while the performance of the SRP-S algorithm is similar to that of the SRP algorithm, as the two are essentially the same when there is little pressure on bandwidth resources.

To verify the performance of the SRP-S algorithm in more extreme cases, the bandwidth requirements of the microservice links are amplified in the second simulation. From Fig. 8, we can see that the SRP and SRP-S algorithms produce fewer service failures across the different cases, and that the SRP-S algorithm performs better in the more extreme cases. The advantage of the SRP algorithm stems from its consideration of network state and backup object selection, whereas the advantage of the SRP-S algorithm stems from its consideration of the shared backup path contention caused by simultaneous failure events. Fig. 8 shows that the SRP-S algorithm reduces the number of service failures by up to 21% compared to the SRP algorithm in extreme cases.
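
To see why contention matters more as bandwidth becomes scarce, consider the following toy model. It is only a sketch and not the corrected reliability defined in this paper: it assumes independent path failures and a worst case in which the shared backup capacity can serve a single request, so the backup is usable only if no other primary path sharing it has failed. The function `toy_corrected_reliability` and all numeric values are hypothetical.

```python
from math import prod


def toy_corrected_reliability(primary, backup, sharers=()):
    """Reliability of one request protected by a possibly shared backup path.

    primary: reliability of the request's primary path
    backup:  reliability of its backup path
    sharers: reliabilities of the other primary paths sharing the backup
             (empty means the backup is dedicated)
    """
    # Worst-case assumption: any failed sharer occupies the shared capacity.
    no_contention = prod(sharers)
    return primary + (1 - primary) * backup * no_contention


print(toy_corrected_reliability(0.9, 0.9))              # dedicated backup
print(toy_corrected_reliability(0.9, 0.9, (0.9, 0.8)))  # shared with two others
```

In this toy model, the reliability drops from about 0.99 with a dedicated backup to about 0.965 when two other requests share it. A placement that ignores the contention term therefore overestimates reliability, and the gap grows with the number of sharers.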

VIII Conclusion

In this paper, we address the challenges of microservice placement with a focus on enhancing the reliability of MSA-based 5G and IoT services. The proposed network-aware service reliability model captures the impact of network load and routing on service reliability, which supports a more accurate assessment of system reliability. Based on this model, we propose a heuristic SRP algorithm that effectively addresses the microservice placement problem with the fully protected path mechanism. To reduce bandwidth consumption, we further propose the SRP-S algorithm, which considers the shared backup path contention caused by simultaneous failures and effectively tackles the microservice placement problem with the shared backup path mechanism. Simulation results validate the proposed service reliability model and show that the SRP algorithm can reduce the number of failures by up to 29% compared to the benchmark algorithms with the fully protected path mechanism. With the shared backup path mechanism, the SRP-S algorithm can reduce bandwidth consumption by up to 62% compared to the SRP algorithm with the fully protected path mechanism, and reduce the number of service failures by up to 21% compared to the SRP algorithm with the shared backup path mechanism.

For future work, we plan to extend the proposed reliability model to more diverse backup mechanisms to reduce bandwidth consumption even further. We also aim to improve service reliability by adjusting network resource utilization through microservice migration or scaling.

References

[1] M. Usman, S. Ferlin, A. Brunstrom, and J. Taheri, “A survey on observability of distributed edge & container-based microservices,” IEEE Access, vol. 10, pp. 86904–86919, 2022.
[2] M. Söylemez, B. Tekinerdogan, and A. Kolukısa Tarhan, “Challenges and solution directions of microservice architectures: A systematic literature review,” Applied Sciences, vol. 12, no. 11, p. 5507, 2022.
[3] K. Kaur, F. Guillemin, and F. Sailhan, “Container placement and migration strategies for cloud, fog, and edge data centers: A survey,” International Journal of Network Management, vol. 32, no. 6, p. e2212, 2022.
[4] H. Siddiqui, F. Khendek, and M. Toeroe, “Microservices based architectures for IoT systems - state-of-the-art review,” Internet of Things, vol. 23, p. 100854, 2023.
[5] R. Kumar and N. Agrawal, “Analysis of multi-dimensional industrial IoT (IIoT) data in edge–fog–cloud based architectural frameworks: A survey on current state and research challenges,” Journal of Industrial Information Integration, vol. 35, p. 100504, 2023.
[6] Y. Chen, H. Lu, L. Qin, C. Zhang, and C. W. Chen, “Statistical QoS provisioning analysis and performance optimization in xURLLC-enabled massive MU-MIMO networks: A stochastic network calculus perspective,” IEEE Transactions on Wireless Communications, pp. 1–1, 2024.
[7] S. Pallewatta, V. Kostakos, and R. Buyya, “Placement of microservices-based IoT applications in fog computing: A taxonomy and future directions,” ACM Comput. Surv., vol. 55, Jul 2023.
[8] Y. Zeng, Z. Qu, S. Guo, B. Ye, J. Zhang, J. Li, and B. Tang, “SafeDRL: Dynamic microservice provisioning with reliability and latency guarantees in edge environments,” IEEE Transactions on Computers, vol. 73, no. 1, pp. 235–248, 2024.
[9] Y. Wang, L. Zhang, P. Yu, K. Chen, X. Qiu, L. Meng, M. Kadoch, and M. Cheriet, “Reliability-oriented and resource-efficient service function chain construction and backup,” IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 240–257, 2021.
[10] G. Baranwal and D. P. Vidyarthi, “Trappy: a truthfulness and reliability aware application placement policy in fog computing,” The Journal of Supercomputing, vol. 78, pp. 7861–7887, Apr 2022.
[11] Y. Qiu, J. Liang, V. C. Leung, X. Wu, and X. Deng, “Online reliability-enhanced virtual network services provisioning in fault-prone mobile edge cloud,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 7299–7313, 2022.
[12] M. Zhu, F. He, and E. Oki, “Resource allocation model against multiple failures with workload-dependent failure probability,” IEEE Transactions on Network and Service Management, vol. 19, no. 2, pp. 1098–1116, 2022.
[13] L. Rui, X. Chen, X. Wang, Z. Gao, X. Qiu, and S. Wang, “Multiservice reliability evaluation algorithm considering network congestion and regional failure based on Petri net,” IEEE Transactions on Services Computing, vol. 15, no. 2, pp. 684–697, 2022.
[14] Z. Liu, S. Yang, M. Yang, and R. Kang, “Software belief reliability growth model based on uncertain differential equation,” IEEE Transactions on Reliability, vol. 71, no. 2, pp. 775–787, 2022.
[15] X. Qiu, Y. Dai, Y. Xiang, and L. Xing, “A hierarchical correlation model for evaluating reliability, performance, and power consumption of a cloud service,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 3, pp. 401–412, 2016.
[16] B. Schroeder and G. A. Gibson, “A large-scale study of failures in high-performance computing systems,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337–350, 2010.
[17] A. Zhou, S. Wang, B. Cheng, Z. Zheng, F. Yang, R. N. Chang, M. R. Lyu, and R. Buyya, “Cloud service reliability enhancement via virtual machine placement optimization,” IEEE Transactions on Services Computing, vol. 10, no. 6, pp. 902–913, 2017.
[18] L. Zhu, Q. Zhuang, H. Jiang, H. Liang, X. Gao, and W. Wang, “Reliability-aware failure recovery for cloud computing based automatic train supervision systems in urban rail transit using deep reinforcement learning,” Journal of Cloud Computing, vol. 12, no. 1, p. 147, 2023.
[19] Z. Liu, G. Fan, H. Yu, and L. Chen, “An approach to modeling and analyzing reliability for microservice-oriented cloud applications,” Wireless Communications and Mobile Computing, vol. 2021, p. 5750646, Aug 2021.
[20] H. Huang, H. Zhang, T. Guo, J. Guo, and C. He, “Reliable redundant services placement in federated micro-clouds,” in 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), pp. 446–453, IEEE, 2019.
[21] M. Ibrar, L. Wang, N. Shah, O. Rottenstreich, G.-M. Muntean, and A. Akbar, “Reliability-aware flow distribution algorithm in SDN-enabled fog computing for smart cities,” IEEE Transactions on Vehicular Technology, vol. 72, no. 1, pp. 573–588, 2023.
[22] J. Paul Martin, A. Kandasamy, and K. Chandrasekaran, “Crew: cost and reliability aware eagle-whale optimiser for service placement in fog,” Software: Practice and Experience, vol. 50, no. 12, pp. 2337–2360, 2020.
[23] M. Dadashi and A. Rajabzadeh, “DAIP: a delay-efficient and availability-aware IoT application placement in fog environments,” Computing, vol. 105, pp. 2007–2035, Sep 2023.
[24] Y. Ramzanpoor, M. Hosseini Shirvani, and M. Golsorkhtabaramiri, “Multi-objective fault-tolerant optimization algorithm for deployment of IoT applications on fog computing infrastructure,” Complex & Intelligent Systems, vol. 8, no. 1, pp. 361–392, 2022.
[25] Y. Qiu, J. Liang, V. C. M. Leung, X. Wu, and X. Deng, “Online reliability-enhanced virtual network services provisioning in fault-prone mobile edge cloud,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 7299–7313, 2022.
[26] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, Y. Ganjali, and C. Diot, “Characterization of failures in an operational IP backbone network,” IEEE/ACM Transactions on Networking, vol. 16, no. 4, pp. 749–762, 2008.
[27] G. Le, S. Ferdousi, A. Marotta, S. Xu, Y. Hirota, Y. Awaji, S. Savas, M. Tornatore, and B. Mukherjee, “Reliable provisioning with degraded service using multipath routing from multiple data centers in optical metro networks,” IEEE Transactions on Network and Service Management, vol. 20, no. 3, pp. 3334–3347, 2023.
[28] R. S. Guimarães, C. Dominicini, V. M. G. Martínez, B. M. Xavier, D. R. Mafioletti, A. C. Locateli, R. Villaca, M. Martinello, and M. R. N. Ribeiro, “M-PolKA: Multipath polynomial key-based source routing for reliable communications,” IEEE Transactions on Network and Service Management, vol. 19, no. 3, pp. 2639–2651, 2022.
[29] L. Qu, C. Assi, M. J. Khabbaz, and Y. Ye, “Reliability-aware service function chaining with function decomposition and multipath routing,” IEEE Transactions on Network and Service Management, vol. 17, no. 2, pp. 835–848, 2020.
[30] L. Tang, G. Zhao, C. Wang, P. Zhao, and Q. Chen, “Queue-aware reliable embedding algorithm for 5G network slicing,” Computer Networks, vol. 146, pp. 138–150, 2018.
[31] Y. Al Mtawa, A. Haque, and H. Lutfiyya, “Migrating from legacy to software defined networks: A network reliability perspective,” IEEE Transactions on Reliability, vol. 70, no. 4, pp. 1525–1541, 2021.
[32] H. Zhao, S. Deng, Z. Liu, J. Yin, and S. Dustdar, “Distributed redundant placement for microservice-based applications at the edge,” IEEE Transactions on Services Computing, vol. 15, no. 3, pp. 1732–1745, 2022.
[33] G. Baranwal and D. P. Vidyarthi, “Trappy: a truthfulness and reliability aware application placement policy in fog computing,” The Journal of Supercomputing, vol. 78, pp. 7861–7887, Apr 2022.
[34] M.-Y. Saidi and B. Cousin, “Resource saving: Which resource sharing strategy to protect primary shortest paths?,” in 2016 13th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 297–298, 2016.
[35] W. Zheng, M. Yang, C. Zhang, Y. Zheng, and Y. Zhang, “Robust design against network failures of shared backup path protected SDM-EONs,” Journal of Lightwave Technology, vol. 41, no. 10, pp. 2923–2939, 2023.
[36] D. Ergenç, J. Rak, and M. Fischer, “Service-based resilience via shared protection in mission-critical embedded networks,” IEEE Transactions on Network and Service Management, vol. 18, no. 3, pp. 2687–2701, 2021.
[37] J. Edmonds and R. M. Karp, “Theoretical improvements in algorithmic efficiency for network flow problems,” Journal of the ACM (JACM), vol. 19, no. 2, pp. 248–264, 1972.
[38] F. Zhang, “Microservice placement simulations.” https://github.com/ZfyInfonet/SRP, 2024.
[39] P. Erdős and A. Rényi, “On random graphs I,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.