1. Introduction
Under normal circumstances, data networking systems are designed to provide connectivity to all their nodes while simultaneously managing limited resources such as bandwidth, buffers, and the number of simultaneous connections. In the presence of failures or attacks, the design problem becomes much more challenging because it must jointly provide some level of connectivity to the operating nodes through protection schemes, manage the available resources, and offer restoration schemes. Thus, the purpose of resilient design is to ensure that a large portion of a communication network both remains connected after a failure occurs and recovers promptly. In the literature, this is referred to as the reliable path provisioning problem, and it evidences the fundamental trade-off between providing reliable paths and efficiently utilizing network resources. Lastly, correlated failures affecting network nodes have attracted the attention of researchers because they impact multiple nodes, so their consequences for both users and network operators are severe [1,2,3,4,5]. Correlated failures may be triggered by natural phenomena, such as earthquakes and hurricanes, or may be induced intentionally by humans, as in the case of weapons of mass destruction, electromagnetic pulses, or cyber attacks.
Data networking systems are highly homogeneous because the trend in networking has been for all the technologies, at each layer of the architecture, to converge to a single one. The main effect of this tendency is that most of the nodes are purchased from a single vendor and, consequently, they turn out to be identical or very similar devices. Data networks designed in this fashion are termed monoculture networks. From the management, economic, and interoperability points of view, monocultures are appealing. However, operating monoculture networks may severely compromise their survivability. For instance, the lack of diversity in monoculture networks introduces risks shared under attacks/failures such as exploits of 0-day vulnerabilities. Thus, a single attack/failure may affect multiple nodes and entirely disrupt the network operation.
Nowadays, network operators have at hand new, flexible, compatible, and, more importantly, diverse technologies for managing networked systems. For instance, SDN allows network operators to manage and control networking devices from multiple vendors [6]. Besides, NFV techniques implement network functions (NFs) by exploiting software virtualization, and such functions are executed on commodity hardware from multiple vendors [7]. Practical examples of how these two technologies have enabled multivendor implementations are: the CloudNFV platform for cloud computing [8], the SDN-based Packet Transport Network operated by China Mobile [9], the Optical SDN designed by China Telecom [9], and the NFV-based Service Orchestration implemented by Anuta Networks [10]. Remarkably, multivendor interoperability has been stated as a requirement in 5G implementations and applications such as M2M and IoT [11].
Since the tide is now turning towards multivendor environments, we raise the following question: can we exploit the available node technology diversity to improve the resilience of an entire network facing multiple correlated node failures by introducing multiculture networks into the design process? This seems to be a valid question because diversity in biological systems is indeed a valuable commodity for survivability, and researchers have shown that provisioning an adequate number of different species may be one reason for preserving biodiversity [12].
In this paper, we tackle the problem of improving network resilience in the presence of correlated failures by carrying out an optimal multiculture network design. To do so, we divided this complicated problem into simpler, sequential optimization problems. First, we pose a constrained optimization problem for selecting as many different technologies as possible that do not share common risks while still being capable of interconnecting the network nodes. We termed this problem "The optimal selection of the technology set." Second, we propose another constrained optimization problem for selecting the number of nodes to be used from each technology in order to fulfill a network design requirement under a CAPEX restriction. We termed this problem "The fair technology distribution problem." Lastly, we state yet another constrained optimization problem for placing the nodes within the network topology so that its resilience is maximized. We termed this last problem "The reliable node placing problem." The correlated failures regarded in this paper are modeled using shared risk node groups (SRNGs). These failures represent cyber attacks that aim to take advantage of specific vulnerabilities shared by all the network nodes belonging to the same SRNG event. We comment that, in the eight real-world network topologies considered here, our multiculture design enhances network resilience as compared to a monoculture design. In fact, results show that the proposed multiculture network design, which regards individual and shared node risks, can cope with multiple node failures induced by correlated SRNG events. In doing so, it efficiently trades off the number of technologies to be used (i.e., the degree of heterogeneity in the network), the number of nodes to specify per technology, and their location in the network topology. We note that, when our algorithms select larger sets of technologies, an attacker must make more effort to compromise the network integrity, and the impact caused by a particular attack on the network infrastructure is reduced. Additionally, our results show that the node placing algorithm attempts to cluster nodes belonging to the same technology and, remarkably, locates the most vulnerable technologies at sites where the impact of multiple node failures on the entire network connectivity is reduced. Lastly, we comment that the location method proposed here exhibits better results, concerning the average two-terminal reliability (ATTR) metric, than our earlier method [13].
The rest of this paper is organized as follows. In Section 2, we present and summarize the related work in the area. In Section 3, we introduce the principles of multiculture network design that we have employed for increasing resilience in the face of correlated attacks. In Section 4, we explain our methodology, mathematically define the problem of redesigning a network topology, introduce our correlated failure model, and formulate the three optimization problems used to carry out the multiculture network design. Next, we introduce the resilience metrics used here and present the search algorithms we developed for solving the above-mentioned optimization problems. In Section 5, we present and compare the numerical results achieved by our algorithms. Lastly, in Section 6, we draw the conclusions of the paper.
2. Related Work
Diversity has been accepted as a method that plays a decisive role in network resilience [14,15,16]. Furthermore, it has been exploited as a mechanism to avoid fate sharing during correlated cyber attacks, large-scale natural disasters, "buggy" software updates, etc. Sterbenz et al. provide in [14,15] a framework for resilience in communication networks. They formally define the terms resilience, reliability, survivability, and disruption tolerance in communication networks, and present diversity as an important requirement for dealing with attacks by intelligent adversaries. A diverse system ensures that, under a correlated attack, all its parts are unlikely to share the same fate; consequently, they can maintain a partial system operation. In [16], a systematic approach is conceived to build a resilient network, taking actions in a control loop to respond to attacks and recover to normal operation. In such a framework, redundancy and diversity are exploited as defensive mechanisms to maintain reliability in the presence of software faults.
Unlike in communication networks, diversity is a well-established concept for increasing reliability and robustness in software engineering. Pioneering research works such as [17,18,19,20,21] have developed concepts like N-version programming and data or instruction set randomization to introduce diversity. Regarding N-version programming, the term "natural diversity" was coined in [21] to describe the existence of different pieces of either software or OS with the same functionality, which appear spontaneously in the market and are supposed not to exhibit common vulnerabilities. Examples of natural diversity can be found in web browsers, firewalls, virtual machines, routers, etc. In [22,23], the research focus was to disclose which applications and OSs available in the market offer mutually exclusive software risks. Remarkably, more than 50% of the analyzed OSs had, at most, one non-application common vulnerability that could be remotely exploited, while up to 98% of the analyzed software could be exchanged for software with the same functionality yet with different, non-common vulnerabilities. Also, from vulnerability disclosure websites, such as [24,25,26,27,28], it can be observed that, from the extensive list of data routers available in the market, only a few share the same vulnerability risks.
Some research works have been carried out to increase network robustness by exploiting mutually exclusive software risks [13,29,30,31]. In these works, nodes belong to mutually exclusive risk classes, and when a failure occurs at a node, all the nodes in the same class fail simultaneously. Thus, researchers develop algorithms aiming to increase the network connectivity based on node diversity and the way such nodes can connect to each other. In [29], a grid network is created and the number of network devices is divided evenly among the classes. Next, every node in each class is linked to nodes belonging to a different class. Thus, the graph partitioning algorithm creates a topology that maintains a connected network when a single class of nodes breaks down. Caballero et al. introduced two methods for increasing the network resilience against software defects, bugs, and vulnerabilities affecting network routers [30]. The methods aim to locate network nodes using graph clustering and graph partitioning. Routers were clustered according to their risk classes, and the network connectivity was maximized when any class was removed from the graph. We note that, since in both works node classes were considered to be homogeneous, the optimal class balance turned out to be a uniform distribution of nodes per class; this is a major difference from our work. In [31], the Diversity Assignment Problem is proposed: find the optimum assignment, from classes to network nodes, that maximizes the average connectivity among end-nodes. In a case study, the authors showed that, by employing three classes of nodes that fail independently and with different probabilities, a more resilient network can be obtained as compared to a monoculture topology composed solely of nodes from the most reliable class. The optimum assignment is reached when, for every class, there is at least one path, consisting of nodes of the same class, set up between every pair of end-nodes. Prieto et al. proposed in [13] to decompose the network diversification problem into two subproblems: one for identifying how many nodes per class should be used to minimize the vulnerability of the entire network, and another to determine the location of those different node classes to maximize the average connectivity of the entire network. The work presented here is a substantial extension of the ideas exposed in [13]. The key differences between the works are: (i) here the network design starts one step earlier by selecting the network technologies; (ii) classes are mapped here to technologies, bringing our earlier theoretical work to a more applied setting; (iii) the vulnerability index introduced in [13] is now defined through the number of risks exhibited by a technology; (iv) the device placement method is improved here by formulating an optimization problem that aims to locate the more resilient nodes at the most vulnerable network zones; and (v) novel methods for solving the proposed optimization problems are presented here.
Other approaches have also been employed for improving resilience and survivability. Zhang et al. presented a new model for network survivability based upon network heterogeneity [32]. They introduced the concept of a diversity space, where different network elements, like OS, communication media, service models, network protocols, routing mechanisms, etc., are mapped onto different dimensions. Equipped with this definition of a diversity space, the authors used the distance between network elements as a metric, stating that the larger the distance between network elements, the higher the diversity and the smaller the network vulnerability. In [33], Caesar and Rexford proposed a bug-tolerant router. This router contains several virtual implementations running in parallel inside the same hardware. The idea is that software diversification makes it unlikely that every implementation fails at the same time. Finally, the output of the virtual instances enters a voting process that selects the router output.
Two other applications of network diversity for increasing network survivability and reliability were introduced in the contexts of cyber-security [34,35,36,37] and virus containment [38,39,40,41]. In [34,35], attack graphs and attack paths are defined as the ways an attacker can gain access to a network asset. Security metrics were designed to characterize how difficult it is for the attacker to exploit the security mechanism of each node between him and the asset. Thus, node diversity imposes independent efforts on the attacker to gain access to each of them. In [36], Zhang et al. proposed both the least attacking effort and the average attacking effort metrics to compute a distance between an attacker and an asset. These metrics were based upon the number of hops and the number of different types of nodes separating the attacker and the asset. Consequently, the more diverse the types of nodes, the more resilient the network in the face of 0-day attacks. In [37], the authors aimed to allocate heterogeneous security mechanisms at the network nodes, thereby making it difficult for an attacker to access a target asset of interest. Their main research idea relied on allocating nodes in such a way that neighbors do not share the same vulnerabilities. This idea decreases the severity of cyber attacks by reducing the repetition of a single vulnerability along every attack path. To prevent malware from spreading, the theory of perfect coloring, which aims to prevent two neighboring nodes from sharing the same color, was used. In [38], the authors established a relationship between the average degree and the number of classes required to avoid the emergence of a giant component in random networks. In [39], three randomized distributed techniques were developed to sub-optimally solve the NP-hard perfect coloring problem in non-exponential time. Huang et al. proposed the graph multicoloring problem to minimize the amount of shared software executed on neighboring nodes [40]. Should malware compromise the software in one node, it would stay contained in the subgraph containing the node and the neighbors with the common vulnerability. Temizkan et al. considered shared vulnerabilities between software variants and proposed a software allocation mechanism, based on combinatorial optimization and linear programming, that was applied to scale-free networks prone to be infected by viruses [41]. As a direct result, such methodology increased the network resilience against virus and worm attacks.
3. Rationale
Before presenting the rationale of our work, we formally define the terms monoculture and multiculture technology in a communication network.
Definition 1. A data communication network is defined as a monoculture if the technology used to implement the networking nodes is homogeneous. More precisely, the network technology is a monoculture if all its communication nodes belong to the same vendor and the implementations of their OS and protocol stack are the same.
Definition 2. A data communication network is defined as a multiculture if the technology used to implement the networking nodes is heterogeneous. More precisely, the network technology is a multiculture if its communication nodes do not all belong to the same vendor, do not all execute the same OS, or do not all employ the same protocol stack implementation.
We note that Definitions 1 and 2 transpire from the diversity space introduced in [32] to represent the functional capabilities of network nodes and architectures.
Multiculture network design can offer network architects clear advantages as compared to monoculture networks. Figure 1 depicts three networks, with different technologies and different node locations in the topology, showing how a proper multiculture design can improve network resilience. The first case, depicted in Figure 1a, is a monoculture network, where only one kind of technology is employed by all the nodes. The problem here is that one type of exploit is enough to attack all the nodes and, consequently, induce multiple failures in the network. The second case, depicted in Figure 1b, exhibits some degree of diversity. In fact, three technologies, represented by different colors, are deployed on the network. It can be observed that, should a failure or an attack occur on the orange nodes, the rest of the network would become disconnected because of the improper location of such nodes in the topology. The third case, depicted in Figure 1c, shows that a multiculture network design, where several technologies coexist without shared common risks, in conjunction with a proper location of these diverse technologies, effectively reduces the post-failure lack of connectivity as compared to both a monoculture network and an incorrectly designed multiculture network. In this work, we take these issues into consideration and aim to design multiculture networks that are minimally affected by failures of, or attacks on, a single technology.
The optimal multiculture network design involves selecting as many different technologies as possible, which do not share common risks, and properly placing them in an existing network topology. This problem represents a huge challenge in terms of modeling, the definition of objective functions and their associated interdependent constraints, and the huge search spaces in which feasible solutions must be found. Our approach in this work is to simplify this complicated problem by breaking it down into simpler sequential optimization problems.
4. Methodology
In this section, we describe the materials and methods used in our research. First, we present the mathematical models used to represent a communication network and its correlated failures. Next, we formulate three sequential optimization problems, which constitute the core of the multiculture network design method. The first problem introduces diversity in the selection of node technologies by finding the maximum number of different technologies that do not share common risks. The second problem optimally selects the number of nodes of each technology that must be specified to maximize the network resilience. The third problem optimally specifies the location of each technology on each node of a given network topology in order to minimize the impact of a correlated failure on the entire network connectivity. Since the above-mentioned problems are NP-hard, we also introduce the search algorithms we developed for solving them. The materials used in this paper are the eight real-world topologies commonly used in the literature to assess networking methods and the reliability metrics used to evaluate the results of the proposed methods. In summary, the research questions guiding our work are: (i) What is the optimal number of technologies that can be employed to increase the diversity in a given network? (ii) What is the optimal number of nodes of each technology required to maximize the resilience of the entire network? and (iii) What is the optimal node location of each technology for maximizing the network resilience?
4.1. Problem Statement, System Model, and Correlated Failures Model
As mentioned earlier, the goal of this paper is to devise an optimal multiculture network capable of improving the network resilience when correlated failures impair several nodes simultaneously. The communication network topology is mathematically represented by the undirected graph $G = (V, E)$, where $V$ is the set of communication nodes and $E$ is the set of communication links between nodes. Suppose that a network architect must carry out a technological update of network nodes. To do so, a multiculture set of nodes may be used to replace the existing ones, so that different technologies, i.e., vendors, OS, and protocol stack implementations, can be introduced in the topology. We denote here by $\mathcal{T} = \{1, \dots, N\}$ the set of $N$ different networking node technologies available to the architect. We assume also that the $N$ available technologies implement a joint protocol stack with a set of $L$ different communication protocols, where $P_{il} = 1$ (correspondingly, $P_{il} = 0$) denotes that a node equipped with technology $i$ is able to (correspondingly, incapable of) communicate with other nodes through protocol $l$. Next, let us assume that the set $\mathcal{T}^{\ast} \subseteq \mathcal{T}$ represents the technologies to be implemented in the network devices during the upgrade. (This set is defined by solving the problem in Section 4.2.1.)
We are interested here in modeling correlated failures triggered by breakdowns or attacks which diminish the connectivity of the infrastructure by impairing several nodes at the same time and for a long period. Thus, we assume here both that each regarded technology is prone to fail or to be attacked without recovery and that they share common risks.
Definition 3. An SRNG in a data communication network is a set of nodes that may be affected by a common failure of the infrastructure, under the condition that they share a common failure risk.
Suppose now that there exists a set of $M$ different SRNG events that may induce correlated failures in the network. Furthermore, assume that the $r$th event, $A_r$, has a probability of occurrence $\pi_r$. Consequently, when the SRNG event $A_r$ happens, the set of nodes $V$ can be partitioned into two sets: $V_{A_r}$ and $V \setminus V_{A_r}$, where the former denotes the collection of all networking nodes sharing the common risk associated with the SRNG event and the latter denotes all those nodes unaffected by the event.
Definition 4. A PSRNG (probabilistic SRNG) in a data communication network is a set of nodes that fail with a positive failure probability in the event of an SRNG failure. More precisely, the failure probability of the $i$th node, conditional on the SRNG failure event $A_r$, is denoted as $p_{i|A_r}$ and satisfies: $p_{i|A_r} > 0$ for all $i \in V_{A_r}$ and $p_{i|A_r} = 0$ otherwise.
Definition 5. We say that the nodes $i$ and $j$ belonging to a data communication network are correlated if $p_{i|A_r}$ and $p_{j|A_r}$ are both positive for the PSRNG. Moreover, upon the occurrence of the SRNG event, these probabilities are identical for all the pairs of nodes in $V_{A_r}$, that is: $p_{i|A_r} = p_{j|A_r}$ for all $i, j \in V_{A_r}$.
Following [2], we assume that only one PSRNG event may occur at a time. This means that the $M$ shared risks defining the PSRNG events are mutually exclusive; therefore, $\sum_{r=1}^{M} \pi_r = 1$. This otherwise arbitrary definition has been effectively used in the networking community and makes sense in the context of the class of failures considered in this paper [2,3,4]. We note also that, from Definitions 1 to 5, both monoculture and multiculture data network technologies can be affected by more than one SRNG event.
Definition 6. The resilience of a data communication network is defined as its ability to provide and maintain an acceptable level of node connectivity in the face of correlated failures triggered by the above specified PSRNG.
In this paper, we will assess the resilience of a communication network after the occurrence of a PSRNG event by means of two metrics, which are mathematically defined in Section 4.4. One metric quantifies whether the network topology remains connected or is partitioned after an event, while the other quantifies how well connected a network remains after the occurrence of a PSRNG event. With these ideas at hand, we can now introduce quantitative definitions for the resilience of a data communication network, which complement Definition 6.
Definition 7. A data communication network is defined as resilient to correlated failures triggered by the above specified PSRNG, if its ATR metric is equal to one. In addition, the average degree of resilience of a data communication network, in the face of correlated failures triggered by the above specified PSRNG, is given by the ATTR metric.
4.2. Sequential Optimization Problems
The three sequential optimization problems mentioned at the beginning of Section 4 are specified next in full detail. For clarity of presentation, Algorithm 1 is given at the end of the section to summarize the workflow for solving these problems.
Algorithm 1 Optimal Multiculture Network Design
Require: $\mathbf{R}$, $\mathbf{P}$, $G = (V, E)$, $\mathbf{c}$, $B$
Ensure: $\phi^{\ast}$
  $M = \dim(\boldsymbol{\pi})$; $K = \dim(\mathcal{T}^{\ast})$
  $\mathcal{T}^{\ast} \leftarrow$ Optimal Selection of Technology Set $(\mathbf{R}, \mathbf{P})$, where $\boldsymbol{\pi}$ is obtained from $\mathbf{R}$
  Compute $\rho_k$ for every $k \in \mathcal{T}^{\ast}$
  $\mathbf{n}^{\ast} \leftarrow$ Fair Technology Distribution Problem $(\boldsymbol{\rho}, \mathbf{c}, B)$
  $\phi^{\ast} \leftarrow$ Reliable Node Placing $(G, \mathbf{n}^{\ast})$
4.2.1. Optimal Selection of Technology Set
The goal of the proposed multiculture network design is to provide diversity in the communication network nodes. In this work, we aim to specify as many different compatible node technologies as possible, from a given pool of technologies, under the constraint that the selected technologies must be able to communicate and must be mutually exclusive in their risks, that is, they do not belong to the same PSRNG. Consequently, an attack on some vulnerability would not damage more than one kind of technology. In this scenario, and relying on a database with both the node technologies available in the market and the information about their risks and communication protocols, it is possible to formulate the following optimization problem: pick the maximum number of technologies that do not present common risks and are able to communicate with each other. We have called this problem the optimal selection of the technology set.
We depict, in Figure 2, an example of the problem showing the input data (in two tables) and the solution. The columns of the left table list four different types of SRNG events ($r_1$ to $r_4$ represent, respectively, the shared risks number 1 to 4), while the columns of the right table list the different communication protocols that each node is equipped with ($p_1$ to $p_3$ represent, respectively, the communication protocols 1 to 3). Rows list seven different node technologies. In Figure 2, the optimal technology set was computed using exhaustive search and has been marked with a red box. Note that, in this optimal solution, the number of selected technologies is maximal, the technologies do not exhibit shared risks, and they are capable of communicating through at least one available protocol. We note that we have followed the seminal work [42] and used a risk matrix representation to characterize the software risks and their correlations with other network nodes for failure correlation analysis.
The first problem tackled in this paper is the optimal selection of a technology set, that is, optimally specifying how many and which technologies are needed to maximize the diversity of the entire network. This problem can be mathematically posed as:

$\mathbf{x}^{\ast} = \arg\max_{\mathbf{x} \in \mathcal{X}} \sum_{i=1}^{N} x_i$  (1)

subject to:

$\sum_{i=1}^{N} x_i R_{ri} \le 1, \quad r = 1, \dots, M,$  (2)

$x_i x_j \, (\mathbf{P}_i \cdot \mathbf{P}_j) \ge x_i x_j, \quad 1 \le i < j \le N,$  (3)

$x_i \in \{0,1\}, \quad i = 1, \dots, N,$  (4)

where $i$, $j$, and $r$ represent, respectively, the $i$th and $j$th technologies, as well as the $r$th SRNG event in the network; $x_i$ is a binary variable indicating the presence, $x_i = 1$, or absence, $x_i = 0$, of technology $i$ in the solution; $R_{ri} = 1$ represents that the shared risk $r$ affects the technology $i$ and $R_{ri} = 0$ represents otherwise; and $\mathbf{P}_i$ is a row vector containing all the communication protocols offered by technology $i$. In addition, $\mathcal{X} = \{0,1\}^{N}$ is the search space, which corresponds to the collection of every possible combination of technologies that can be selected out of the $N$ available classes, while $\mathbf{x}^{\ast}$ is the element of $\mathcal{X}$ specifying the optimal selection. For the sake of notation, we introduce also the risk vector associated to the $i$th technology as the row vector $\mathbf{R}_i = [R_{1i}, \dots, R_{Mi}]$, the risk matrix $\mathbf{R} = [R_{ri}] \in \{0,1\}^{M \times N}$, and the communication protocol matrix $\mathbf{P} = [P_{il}] \in \{0,1\}^{N \times L}$. We recall that the parameters $M$, $N$, and $L$ are the total number of SRNG events, technologies, and protocols, respectively.
The first set of $M$ constraints, Equation (2), ensures that if a particular technology belongs to the optimal solution, it is the only selected one exposed to the SRNG event $r$. We note that these constraints do not make the problem infeasible; in the worst case, they yield a monoculture network. In addition, the second set of, at most, $N(N-1)/2$ constraints, Equation (3), ensures that technologies are not orthogonal in terms of communication protocols, that is, there must exist at least one shared communication protocol between technologies $i$ and $j$, should they both appear in the optimal solution. (Consequently, such a constraint was formulated in terms of the dot product between the communication protocol vectors associated with every pair of candidate technologies.) Finally, we mention that the failure probability associated with the $r$th SRNG event can be computed in practice from the risk matrix as:

$\pi_r = \frac{\sum_{i=1}^{N} R_{ri}}{\sum_{r'=1}^{M} \sum_{i=1}^{N} R_{r'i}}, \quad r = 1, \dots, M,$  (5)

which means that such probability is given by the frequency of occurrence of an SRNG event among the available technologies in the market.
4.2.2. Fair Technology Distribution Problem
The solution to the problem of Section 4.2.1 provides the $K = |\mathcal{T}^{\ast}|$ technologies, out of the $N$ available, that will be used in the design. Most of the time, the number of these available technologies is less than the number of nodes in the network, meaning that several nodes will use the same technology. The number of required devices per selected technology that minimizes the vulnerability of the entire network is calculated by fairly balancing the total number of SRNG events among the network devices. This design step depends on the number of nodes in the analyzed topology and, in practice, is constrained by a fixed CAPEX. The problem is mathematically stated as:

$\mathbf{n}^{\ast} = \arg\min_{\mathbf{n} \in \mathcal{N}} J(\mathbf{n}), \quad J(\mathbf{n}) = \max_{1 \le k \le K} \rho_k \, n_k,$  (6)

subject to:

$\sum_{k=1}^{K} c_k \, n_k \le B,$  (7)

$\sum_{k=1}^{K} n_k = n,$  (8)

where $\mathbf{n} = [n_1, \dots, n_K]$ is a vector of $K$ elements that specifies the number of nodes of each technology, $\mathcal{N}$ represents the search space in the optimization problem and corresponds to the collection of every possible combination of numbers of nodes per technology that can be selected, $c_k$ is the cost, in some predefined currency, of each node belonging to the $k$th technology, $B$ is the total CAPEX available to purchase network nodes, and $n = |V|$ is the total number of nodes in the topology. (We refer to [13] for further details on this formulation.)
The term $\rho_k$ is a key parameter, termed the risk index associated with the $k$th technology. We introduced this parameter for the first time in [13], and in this paper we redefine it formally, in a more practical manner, using the formula:

$\rho_k = \sum_{r=1}^{M} \pi_r \, R_{rk},$  (9)

which means that the risk index of each technology is given by the failure probabilities of the SRNG events, disclosed in the market, affecting it. Note that we have exploited the assumption about the SRNG events being mutually exclusive.
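To make the two reconstructed formulas above concrete, the following minimal Python sketch computes the SRNG event probabilities of Equation (5) and the risk indices of Equation (9). The risk matrix values are hypothetical, chosen only for illustration; this is not the authors' code.

```python
import numpy as np

# Hypothetical risk matrix: R[r, k] = 1 if shared risk r affects technology k.
R = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])

pi = R.sum(axis=1) / R.sum()  # Eq. (5): frequency of each SRNG event
rho = pi @ R                  # Eq. (9): rho_k = sum_r pi_r * R[r, k]
print("pi =", pi, "rho =", rho, "sum(pi) =", pi.sum())
```

For this example, the estimated probabilities sum to one, consistently with the mutual exclusivity assumed in Section 4.1, and technologies affected by more events receive a larger risk index.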
4.2.3. Reliable Node Placing Problem
The last step in the sequential design method proposed here is to optimally place the selected node technologies on a given topology. The idea of the placing method is that the failure impact of the more vulnerable technologies on the network connectivity should be as low as possible. We remark that, when a node fails, it immediately affects its communication links; therefore, a proper network design must minimize the number of links affected by the failure of the entire set of nodes belonging to the most vulnerable technology. From Network Science theory, we know that the clustering coefficient of each technology in the network is a proper metric to assess the impact of such correlated failures on the network connectivity [43].
With this rationale in mind, we mathematically formulated the reliable node placing problem as:

$\phi^{\ast} = \arg\max_{\phi \in \Phi} \sum_{k=1}^{K} \left( \sum_{(u,v) \in E} e_{uv}^{(k)} - C(G_k) \right),$  (10)

subject to:

$\sum_{u \in V} \mathbb{1}\{\phi(u) = k\} = n_k^{\ast}, \quad k = 1, \dots, K,$  (11)

where $\phi : V \to \mathcal{T}^{\ast}$ is a mapping from $V$ to $\mathcal{T}^{\ast}$ assigning to the $u$th node the technology $\phi(u)$, and $\Phi$ represents the search space of all possible mappings for assigning all $K$ technologies to the $n$ nodes. $G_k$ is the network topology resulting after a failure affecting all the nodes of the $k$th technology, $e_{uv}^{(k)} = 1$ if $(u,v)$ represents a working link after the failure of the $k$th technology, and $e_{uv}^{(k)} = 0$ represents otherwise. The number of connected components in $G_k$ is represented by $C(G_k)$, and $\mathbb{1}\{\phi(u) = k\}$ is the indicator function stating that the $u$th node belongs to the $k$th technology.
We note that, in the cost function Equation (10), the inner summation aims to maximize the number of working links after a failure of the $k$th technology, while the term $C(G_k)$ penalizes the existence of a large number of connected components after a failure. Lastly, we note also that, by introducing the number of connected components emerging after a failure into the cost function Equation (10), we aim to maximize the resilience of a data communication network according to Definitions 6 and 7.
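As an illustration of the reconstructed cost function in Equation (10), the following minimal sketch removes, for each technology in turn, all the nodes assigned to it, counts the surviving links, and subtracts the number of connected components of the surviving topology. The graph and the node-to-technology mapping are toy examples, not the paper's data.

```python
import networkx as nx

G = nx.cycle_graph(6)  # toy 6-node ring topology
phi = {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'C'}  # node -> technology

def placement_cost(G, phi):
    total = 0
    for k in set(phi.values()):
        survivors = [u for u in G if phi[u] != k]
        Gk = G.subgraph(survivors)  # topology after technology k fails
        total += Gk.number_of_edges() - nx.number_connected_components(Gk)
    return total

print(placement_cost(G, phi))
```

Mappings that keep the surviving topology in one piece after any single-technology failure score higher, which is precisely the behavior the placement step rewards.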
4.3. Efficient Search Algorithms Based on Transformations and Metaheuristics
In this section, we describe the algorithms we developed for solving the sequential optimization problems formulated in Section 4.2.
4.3.1. Optimal Selection of Technology Set
The technology diversity problem can be transformed into equivalent formulations to obtain a more convenient representation that reduces it to a well-known NP-hard problem termed "The Clique Problem." The first step of the transformation is blending the risk and communication protocol matrices into the so-called compatibility matrix $\mathbf{C}$. Here, the element $C_{ij}$ of the compatibility matrix $\mathbf{C}$ is equal to "1" if the pair of technologies $i$ and $j$ meets both constraints jointly, and is equal to "0" otherwise. Figure 3 shows an example of the compatibility matrix for the case depicted in Figure 2.
Based on the compatibility matrix, the problem is transformed into finding the largest set of jointly compatible technologies. If $\mathbf{C}$ is interpreted as an adjacency matrix, then it can be represented by a graph, and the above-mentioned problem reduces to the well-known Maximum Clique problem. Clique definitions are rooted in the social sciences [44], and the problem is part of Karp's 21 NP-complete problems [45]; more information is available in [46]. A survey concerning the maximum clique problem and related algorithms for solving it can be found in [47].
The simplest equivalent formulation as an integer programming problem, presented in [47] and termed "the edge formulation of the problem," is used in this work:

$\mathbf{x}^{\ast} = \arg\max_{\mathbf{x}} \sum_{i=1}^{N} x_i$  (12)

subject to:

$x_i + x_j \le 1, \quad \forall (i,j) \in \bar{E}_C,$  (13)

$x_i \in \{0,1\}, \quad i = 1, \dots, N,$  (14)

where $G_C = (V_C, E_C)$ is the graph that transpires from the compatibility matrix, $\bar{G}_C = (V_C, \bar{E}_C)$ is the complement graph of $G_C$, $V_C = \mathcal{T}$, and $\bar{E}_C = \{(i,j) : i, j \in V_C, i \ne j, (i,j) \notin E_C\}$. $x_i$ is a binary variable that indicates if technology $i$ belongs to the maximum clique. The edge formulation given before is also NP-hard; however, it has been implemented in software packages, and its execution takes an acceptable amount of time for the problem sizes analyzed here.
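The whole transformation can be sketched compactly in Python. The matrices below are hypothetical (they mirror the scale of Figure 2 but not its actual values), and a maximum clique is extracted with networkx instead of solving the integer program; the sketch is an illustration, not the authors' implementation.

```python
import itertools
import networkx as nx
import numpy as np

# Hypothetical inputs: R[r, i] = 1 if shared risk r affects technology i;
# P[i, l] = 1 if technology i implements protocol l.
R = np.array([[1, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0, 1]])
P = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

N = P.shape[0]
G = nx.Graph()
G.add_nodes_from(range(N))
for i, j in itertools.combinations(range(N), 2):
    no_shared_risk = np.dot(R[:, i], R[:, j]) == 0  # mutually exclusive risks
    can_talk = np.dot(P[i], P[j]) >= 1              # at least one common protocol
    if no_shared_risk and can_talk:
        G.add_edge(i, j)                            # C_ij = 1: compatible pair

# The optimal technology set is a maximum clique of the compatibility graph.
best = max(nx.find_cliques(G), key=len)
print("selected technologies:", sorted(best))
```

For these toy matrices the sketch selects four technologies, one per shared risk, which is the behavior the edge formulation enforces through constraints (13) and (14).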
4.3.2. Fair Technology Distribution
We propose to solve the fair technology distribution problem through the GA (genetic algorithm) technique. This technique belongs to the more general family of evolutionary algorithms, relies on natural selection ideas and genetic operators, such as mutation and crossover, and is widely employed in poorly structured problems [48].
For the GA technique, we coded the chromosome, which represents a possible solution, as a fixed-length integer-valued string, as depicted in Figure 4. The $j$th position in the chromosome of length $n$ denotes the $j$th node in the network. The $j$th chromosome position stores a non-negative integer value, say, $k$, which specifies that the $j$th node belongs to the $k$th technology.
Regarding the GA operators, we followed standard guidelines from GA theory to set the algorithm parameters at recommended values. Thus, the population size is set to 500 chromosomes. We employ the single-point crossover, executed with a probability of 0.8, as the crossover operator. For mutations, one position of the chromosome is selected randomly and its value is changed, with a probability of 0.01, to one of the other technologies available in the design. For selection, we used fitness-proportional selection, implemented by a roulette wheel.
Following standard procedures to handle constrained optimization problems using GAs, we transformed ours into a non-constrained problem by adding the constraints as penalty functions to the objective function [49,50,51,52]. Following [49], the penalty functions should increase their values as the generation number $g$ does, thereby adding more selective pressure on the GA to converge to feasible solutions. From these ideas, the fitness function, $f_g(\mathbf{n})$, at the generation number $g$ is given by:

$f_g(\mathbf{n}) = \frac{1}{J(\mathbf{n}) + g \, \max\left(0, \sum_{k=1}^{K} c_k n_k - B\right)},$  (15)

which clearly addresses a minimization problem through the fitness-proportional selection. Note that the smaller the cost function, the higher the probability that a chromosome is selected for the next generation, and vice versa. We also note that Equation (15) contains both the cost function Equation (6) (at the left-hand side of the denominator) and the penalty function that represents the inequality constraint Equation (7) (at the right-hand side of the denominator). However, the number-of-nodes constraint Equation (8) is not included as a penalty function in Equation (15) because it is directly handled by the chromosome coding.
Lastly, the stopping criteria for the GA consider two options: (i) the number of iterations carried out reaches a maximum, predefined number; and (ii) the absolute difference between the mean values of the cost function, in two consecutive generations, for the entire population is smaller than some predefined tolerance $\epsilon$. The mean value of the cost function, $\bar{J}_g$, at the generation number $g$ is defined as:

$\bar{J}_g = \frac{1}{H_g} \sum_{h=1}^{H_g} J\left(\mathbf{n}^{(h)}\right),$  (16)

where $H_g$ is the total number of chromosomes in the population at generation $g$, $n_k^{(h)}$ is the number of nodes of the $k$th technology as specified by the $h$th chromosome in the population at generation $g$, and $\mathbf{n}^{(h)} = [n_1^{(h)}, \dots, n_K^{(h)}]$. (For the sake of notation, we have omitted in $\mathbf{n}^{(h)}$ and $n_k^{(h)}$ their dependency on the generation number $g$.) Hence, the second stop criterion is given by:

$\left| \bar{J}_g - \bar{J}_{g-1} \right| < \epsilon.$  (17)

The solution to the optimization problem, which is obtained at the final generation $g_f$, will be the chromosome extracted from the population that reaches:

$\mathbf{n}^{\ast} = \arg\max_{1 \le h \le H_{g_f}} f_{g_f}\left(\mathbf{n}^{(h)}\right).$  (18)
At this point, it is important to note that the fair technology distribution problem in Equations (6)–(8) was formulated considering the information about the nodes yet disregarding the connectivity information provided by the links. In our earlier work, we disclosed that the ATTR metric may not be improved by selecting the best combination of technologies [13]. In fact, sub-optimal technology distributions provided better solutions in terms of the ATTR metric. To overcome this issue, in this paper we have decided to modify the traditional GA methodology and generate, instead of a single solution, a list with the best solutions found by the GA method after its execution. This list of solutions is used as the input to the reliable node placing problem. More precisely, the modified GA method ranks and stores, at each generation, a list with the 10% best feasible solutions. (The total number of different assignments of $t$ technologies to $n$ nodes is given by $t^n$.) Thus, the list is updated in every generation of the GA, from the current chromosomes and the past results. Algorithm 2 depicts a pseudocode for the GA method proposed here.
Algorithm 2 GA for the fair technology distribution problem.
Require: $\boldsymbol{\rho}$, $\mathbf{c}$, $B$, $g_{\max}$ (max iteration number), $\epsilon$
Ensure: $\mathcal{L}$ (ranked solution list)
  $g \leftarrow 0$; Initial population
  Evaluate population; update ranked list $\mathcal{L}$
  Compute $\bar{J}_g$
  repeat
    $g \leftarrow g + 1$
    Single Point Crossover (population, 0.8)
    Mutation (population, 0.01)
    Evaluate population
    Selection (population, proportional selection)
    Evaluate population; update ranked list $\mathcal{L}$ (10% best)
    Compute $\bar{J}_g$
  until ($g \ge g_{\max}$) or ($|\bar{J}_g - \bar{J}_{g-1}| < \epsilon$)
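The chromosome decoding and the fitness evaluation of Algorithm 2 can be illustrated with the short sketch below. It assumes the reconstructed forms of Equations (6) and (15), i.e., a min-max risk-balance cost and a CAPEX penalty that grows with the generation number; all names and values are hypothetical.

```python
import numpy as np

rho = np.array([0.10, 0.25, 0.40])  # hypothetical risk index per technology
c = np.array([1.0, 0.8, 0.5])       # hypothetical cost per node of each technology
B = 20.0                            # CAPEX budget
n_nodes = 25                        # total number of nodes in the topology

def decode(chromosome, K=len(rho)):
    """Chromosome stores one technology index per node; decode to counts n_k."""
    return np.bincount(chromosome, minlength=K)

def fitness(chromosome, g):
    n = decode(chromosome)
    cost = np.max(rho * n)                    # assumed cost J(n), Eq. (6)
    penalty = g * max(0.0, float(c @ n) - B)  # CAPEX violation penalty, Eq. (7)
    return 1.0 / (cost + penalty + 1e-12)     # Eq. (15): smaller cost, larger fitness

# Example: evaluate a random chromosome at generation g = 10.
rng = np.random.default_rng(0)
chrom = rng.integers(0, len(rho), size=n_nodes)
print(decode(chrom), fitness(chrom, g=10))
```

Because the chromosome directly assigns one technology to every node, the equality constraint of Equation (8) holds for every individual by construction, which is why only the CAPEX constraint needs a penalty term.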
4.3.3. Reliable Node Placing
We propose to solve the reliable node placing problem also with a GA method, under the proviso that each one of the best solutions found for the fair technology distribution must be used as an input. Thus, all the solutions found for each of the different inputs are compared to obtain the maximum value for the reliable node placing problem.
For the GA technique, we coded the chromosome, which represents a possible solution, as a positive integer-valued string of length $n$, as depicted in Figure 5. The chromosome represents the list of nodes in the network, and the value in any string position represents the technology associated with that node. Note that the chromosome coding takes into account the problem constraints, which assign a specific number of nodes to each technology.
Regarding the GA operators, we followed standard guidelines from the GA theory to set the algorithm parameters at recommended values. Thus, population size is set to 500 chromosomes. We employ the first order crossover, with a probability of 0.8, as the crossover operator for chromosomes. For mutations, swap mutation, which exchanges the value of two randomly chosen positions in the chromosome, was selected with a probability of mutation of 0.01. For selection, the fitness proportional selection was chosen again as in the fair technology distribution problem.
The fitness function was designed to exert selective pressure on the GA, that is, the larger the generation number $g$, the bigger the difference between the chances of selecting a better solution rather than a poor one to pass to the next generation. Mathematically, the fitness function is given by:

$f_g(\phi) = \left[ \sum_{k=1}^{K} \left( \sum_{(u,v) \in E} e_{uv}^{(k)} - C(G_k) \right) \right]^{g}.$  (19)
The stopping criteria are the same ones employed in Section 4.3.2; however, the mean value of the cost function, $\bar{J}_g$, at the generation number $g$ is defined as:

$\bar{J}_g = \frac{1}{H_g} \sum_{h=1}^{H_g} \sum_{k=1}^{K} \left( \sum_{e \in E_k^{(h)}} e - C\left(G_k^{(h)}\right) \right),$  (20)

where $H_g$ is the total number of chromosomes in the population at generation $g$, $E_k^{(h)}$ is the set of links that remain operative after a failure of the $k$th technology as specified by the $h$th chromosome in the population at generation $g$, $e$ is a specific link of $E_k^{(h)}$ as specified by the $h$th chromosome in the population at generation $g$, and $C(G_k^{(h)})$ is the number of connected components in $G_k^{(h)}$, after a failure of the $k$th technology, as specified by the $h$th chromosome in the population at generation $g$. (For the sake of notation, we have omitted in $E_k^{(h)}$, $e$, $G_k^{(h)}$, and $C(G_k^{(h)})$ their dependency on the generation number $g$.) Thus, the solution is determined, in the population of the last generation, by:

$\phi^{\ast} = \arg\max_{1 \le h \le H_{g_f}} f_{g_f}\left(\phi^{(h)}\right).$  (21)
We remark that the solution to the reliable node placing problem is the maximum value among all the results obtained after executing the above-mentioned procedure for each of the best solutions found for the fair technology distribution problem. For the sake of notation, we have omitted an index in Equations (19)–(21) to denote this dependency. Lastly, Algorithm 3 indicates the way to solve the problem proposed here.
Algorithm 3 GA for the reliable node placing problem.
Require: $G = (V, E)$, $\mathbf{n}^{\ast}$ (ranked solution list), $g_{\max}$ (max iteration number), $\epsilon$
Ensure: RNP (decoded chromosome with maximum fitness)
  $g \leftarrow 0$; Initial population
  Evaluate population
  Compute $\bar{J}_g$
  repeat
    $g \leftarrow g + 1$
    First Order Crossover (population, 0.8)
    Swap Mutation (population, 0.01)
    Evaluate population
    Selection (population, proportional selection)
    Evaluate population
    Compute $\bar{J}_g$
  until ($g \ge g_{\max}$) or ($|\bar{J}_g - \bar{J}_{g-1}| < \epsilon$)
  RNP $\leftarrow$ chromosome with maximum fitness
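As a brief illustration of why the swap mutation in Algorithm 3 keeps every offspring feasible, the sketch below (illustrative code, not the authors') exchanges the technologies of two randomly chosen nodes, which leaves the per-technology counts of Equation (11) untouched.

```python
import random

def swap_mutation(chromosome, p_mut=0.01, rng=random.Random(0)):
    """Exchange the technologies of two random nodes with probability p_mut."""
    child = list(chromosome)
    if rng.random() < p_mut:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]  # counts per technology unchanged
    return child

print(swap_mutation([0, 0, 1, 1, 2, 2], p_mut=1.0))
```

This is why mutation here swaps values rather than redrawing them: redrawing a single position would violate the fixed number of nodes assigned to each technology.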
4.4. Resilience Metrics
We use two metrics to assess the resilience of the networks after solving the optimization problems mentioned above.
The ATTR metric quantifies how well connected a network is after the occurrence of a failure event [2,4,53]. The ATTR is effectively the probability that a pair of networking nodes, chosen at random, is connected. Thus, if a network is fully connected, its value is equal to 1. Since we are considering here different failure events for different technologies, we first modify the traditional definition of the ATTR metric by parameterizing it in terms of the different SRNG event probabilities. That is, the ATTR of a network topology when the failure event associated with the $r$th SRNG event occurs is given by:

$\mathrm{ATTR}(r) = \binom{n_r}{2}^{-1} \sum_{u < v} \delta_{uv}(r),$  (22)

where $\binom{n_r}{2}$ is the binomial coefficient computed over the $n_r$ nodes that remain operating after the event, and $\delta_{uv}(r)$ is a binary variable that takes the value 1 if there is a path between the nodes $u$ and $v$ after a failure of the $r$th technology, and takes the value 0 otherwise. Next, the ATTR of the network is computed as the weighted average over all the technology failures:

$\mathrm{ATTR} = \sum_{r=1}^{M} \pi_r \, \mathrm{ATTR}(r),$  (23)

where $\sum_{r=1}^{M} \pi_r = 1$.
After the occurrence of a failure event associated with an SRNG, the resulting working topology may remain connected or may be partitioned. Here we assess this effect in terms of the number of connected components arising after the failure. A connected component is formally defined as a subgraph in which any two nodes are connected to each other by paths, and which has no connections to other nodes in the supergraph modeling the network before the failure [43]. If the number of connected components is 1, then the network is connected. We note that this quantity is related to the ATR metric commonly used in the networking community, since the ATR is defined as 1 if the network is connected and 0 otherwise [2,4].
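To tie the two metrics together, the following minimal sketch computes the per-event two-terminal reliability over surviving node pairs, following the reconstruction in Equation (22), and then the weighted ATTR of Equation (23). The topology, placement, and event probabilities are toy values, not results from the paper.

```python
from itertools import combinations
import networkx as nx

def attr_after_failure(G, phi, k):
    """Fraction of surviving node pairs still connected after technology k fails."""
    survivors = [u for u in G if phi[u] != k]
    Gk = G.subgraph(survivors)
    pairs = list(combinations(survivors, 2))
    if not pairs:
        return 0.0
    ok = sum(1 for u, v in pairs if nx.has_path(Gk, u, v))
    return ok / len(pairs)

def network_attr(G, phi, pi):
    # pi[k]: probability of the SRNG event that takes down technology k; Eq. (23).
    return sum(pi[k] * attr_after_failure(G, phi, k) for k in pi)

G = nx.cycle_graph(6)
phi = {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'C'}
print(network_attr(G, phi, {'A': 0.5, 'B': 0.3, 'C': 0.2}))
```

An ATTR of 1 under every event corresponds to the connected post-failure topologies reported for most of the networks analyzed in this paper.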
4.5. Topologies
In this paper, we have used eight real-world networks to assess the capability of the proposed multiculture design in improving their resilience.
Figure 6 depicts the topologies and show their average degree. Networks in
Figure 6a–g were extracted from Internet Topology Zoo [
54] and are commonly used in the research community as benchmarks. Moreover, infrastructures having different node degrees were selected to study their effect on the multiculture network design. In addition, topology in
Figure 6h corresponds to the network connecting all the universities in Chile. Lastly, we comment that networks labeled as Navigata, Kreonet, and Reuna correspond to subgraphs of the original networks.
6. Conclusions
In our work, we proposed the idea of exploiting multiculture network design, i.e., introducing node technology diversity, as a means to provide resilience during a network upgrade process. The methodology presented here comprises a series of sequential optimization problems that address the different stages of the network design process: the technology selection, the specification of the number of devices per technology, and the network placement of the selected devices. We comment that our work is not only a contribution to the theory of network resilience through software diversity but also a practical methodology that network architects can use to achieve a resilient network design.
The solution to the first optimization problem presented here allowed us to specify, from a set of available technologies, the largest number of node implementations that do not share common risks. The larger the selected set of technologies, the more effort an attacker must make to compromise the network integrity, and the smaller the impact caused by a particular vulnerability attack on the network infrastructure.
The solution to the second problem allowed us to optimally calculate the number of network devices, from each technology, that must be deployed on the network. The key idea exploited by our method is to balance the number of SRNG events among the devices so that, simultaneously, the technologies presenting a larger number of vulnerabilities are less represented in the network. Besides, the effect of the CAPEX assigned to the network architect on the technology diversification was analyzed. The risk index, which accounts for the number of vulnerabilities in one technology, was also redefined here in a practical manner.
The solution to the third problem enabled us to carry out the optimal placement of nodes on the network topology. Since the problem of computing the number of devices per technology was solved disregarding the topological information of the network, the GA-based solver was modified to supply not one but a group of best solutions. Such a modification trades off the number of nodes per technology against their location for increasing the network resilience, as shown by the results listed in Table 1. Results also show that, in 75% of the real-world network topologies analyzed in this paper, the optimal multiculture network design proposed here yields networks with an ATTR metric of 1. This means that such networks remain connected after failure, since the ATTR represents the probability that a pair of nodes picked at random is connected. For the remaining 25% of the analyzed topologies, whose average node degree was less than 2, the ATTR was at least 0.7867. The latter results mean that both multiculture design and topology connectivity are necessary to achieve network resilience in the presence of correlated failures. Besides, results also show that certain network properties, like clustering, favor connectivity in the presence of correlated failures triggered by common node vulnerabilities. Remarkably, the proposed design method locates the nodes on the network in such a way that the most vulnerable nodes are assigned to locations where network connectivity is affected the least upon a failure.
Our future research work on this subject will involve developing a new model for improving network connectivity, which could be solved as a single optimization problem. To achieve feasible solutions, we will relax the constraint that technology risks must be mutually exclusive.