Emerging reconfigurable data centers introduce unprecedented flexibility in how the physical layer can be programmed to adapt to current traffic demands. These reconfigurable topologies are commonly hybrid, consisting of static and reconfigurable links, enabled by e.g., an Optical Circuit Switch (OCS) connected to top-of-rack switches in Clos networks. Even though prior work has showcased the practical benefits of hybrid networks, several crucial performance aspects are not well understood. For example, many systems enforce artificial segregation of the hybrid network parts, leaving money on the table.
In this article, we study the algorithmic problem of how to jointly optimize topology and routing in reconfigurable data centers, in order to optimize a most fundamental metric, maximum link load. The complexity of reconfiguration mechanisms in this space is unexplored at large, especially for the following cross-layer network-design problem: given a hybrid network and a traffic matrix, jointly design the physical layer and the flow routing in order to minimize the maximum link load.
We chart the corresponding algorithmic landscape in our work, investigating both un-/splittable flows and (non-)segregated routing policies. A topological complexity classification of the problem reveals NP-hardness in general for network topologies that are trees of depth at least two, in contrast to the tractability on trees of depth one. We moreover prove that the problem is not submodular for all these routing policies, even in multi-layer trees.
However, networks that can be abstracted by a single packet switch (e.g., nonblocking Fat-Tree topologies) can be optimized efficiently, and we present optimal polynomial-time algorithms accordingly. We complement our theoretical results with trace-driven simulation studies, where our algorithms can significantly improve the network load in comparison to the state-of-the-art.
1 Introduction
Data centers nowadays empower everyday life in aspects such as business, health, and industry, but also science and social interactions. With the rise of related data-intensive workloads as generated by machine learning, artificial intelligence, and the distributed processing of big data in general, data center traffic is growing very fast [63, 73]. Much of this traffic is internal to data centers, evoking considerable interest in data center design problems [64, 84].
Herein the emergence of a programmable physical layer, enabled by optical circuit switches [29, 50, 82], free-space optics [12, 37], or beamformed wireless connections [44, 45], leads to intriguing new possibilities, as leveraging fully electrically packet switched networks “is increasingly cost prohibitive and likely soon infeasible” [60, 62], see also the recent report by Microsoft [27]. In other words, electrical chips are unlikely to deliver sufficient performance for next-generation networks, and in turn, we must rely on programmable optical topologies for increased bandwidth, connectivity, and power-efficiency [5].
Extensive past work has already shown significant benefits of such reconfigurable data center networks [34, 43], but the underlying complexity is not well understood [11]. For example, many works artificially restrict their flow routing policies to be segregated between programmable and static network parts, aiming to place elephant flows on reconfigurable links [33].
Whereas some general algorithmic results exist w.r.t. latency [32, 37] or specific traffic patterns [10, 80], complexity questions of network-design for the objective of load-optimization are mostly uncharted. The exceptions are the work by Yang et al. [87], which focuses on the hardness induced by wireless interference, the work by Zheng et al. [91], who provide intractability results on general non-data-center topologies, and the results by Dai et al. [20], which uncover the approximation hardness for special settings. However, tree-induced topologies, as commonly employed in data centers, e.g., Fat-Trees, have not yet been subjected to a fine-grained complexity analysis, which can reveal a complexity dichotomy between network designs, as we will show in this article.
At the same time, link load is a most central performance metric [15, 44, 46, 76], and flow routing in traditional networks has been investigated for decades already [3]. We are thus motivated by the desire to take the first steps towards fundamentally understanding the network-design problem for load-optimization in data center networks, jointly considering flow routing and (interference-free) physical layer programmability enabled by, e.g., optical circuit switches.
1.1 Contributions
This article initiates the network-design study of load-optimization in reconfigurable networks with optical circuit switches, leveraging the flexibility of emerging programmable physical layers for flow routing. We investigate multiple problem dimensions, from splittable to unsplittable flows, to fully flexible (non-segregated) versus segregated routing policies. Our results not only include efficient algorithms and complexity characterizations but also simulations on real-world workloads:
(1)
Complexity: We prove strong NP-hardness for non-segregated and segregated routing on tree networks of height greater than or equal to two (excluding star networks), for both un-/splittable flow models; the results are summarized in Table 1. Moreover, all four problem settings are not submodular w.r.t. load-optimization, preventing common approximation techniques.
(2)
Algorithms: In turn, we give polynomial-time optimal algorithms for the hybrid switch model of Venkatakrishnan et al. [79], which applies to non-blocking data center interconnects as, e.g., Fat-Trees. To this end, we leverage a combination of subset matching results and topology-specific insights.
(3)
Evaluations: Our workload-driven simulations (using Facebook, pFabric, and high-performance computing traces) show that our algorithms significantly improve on state-of-the-art methods, decreasing the maximum load by \(1.6\times\) to \(2.0\times\).
Table 1. Network-design Complexity for Load-optimization in Reconfigurable Networks for Un-/splittable and Non-/segregated Routings when the Topologies are Trees of Height \(h=1\) and \(h\ge 2\)
Overview. We start with a formal model and preliminaries in Section 2, followed by complexity (Section 3) results for trees and algorithms for the hybrid switch model (Section 4). We then investigate the performance of our algorithms with trace-driven evaluations in Section 5. Lastly, we discuss related work in Section 6 and conclude in Section 7.
2 Model and Preliminaries
Network model. Let \(N=(V,E,\mathcal {E},C)\) be a hybrid network [56, 79] connecting the n nodes \(V=\lbrace v_1,\dots ,v_n\rbrace\) (e.g., top-of-the-rack switches), using static links E (usually connected by electrical packet switches). The network N also contains a set of reconfigurable (usually optical) links \(\mathcal {E}\). The graph \((V,E\cup \mathcal {E})\) is a bidirected1 graph, i.e., each bidirected link \(\lbrace v_i,v_j\rbrace \in E\) (respectively, \(\lbrace v_i,v_j\rbrace \in \mathcal {E}\)), where \(v_i,v_j\in V\), acts as two (anti-parallel) directed links \((v_i,v_j)\) and \((v_j,v_i)\). We use the symbol \(\overrightarrow{E}\) (respectively, \(\overrightarrow{\mathcal {E}}\)) to denote the set of corresponding directed links of E (respectively, \(\mathcal {E}\)). Moreover, a function \(C: \overrightarrow{E}\cup \overrightarrow{\mathcal {E}}\mapsto \mathbb {R}^+\) defines capacities for both directions of each bidirected link in \(E\cup \mathcal {E}\). Note that \((V,E\cup \mathcal {E})\) can be a multi-graph, e.g., when a reconfigurable link in \(\mathcal {E}\) also connects two endpoints of a static link in E.
Reconfigured network. We say that a hybrid network N is reconfigured by a reconfigurable switch S if some reconfigurable links \(M\subseteq \mathcal {E}\), which must induce a matching,2 are configured (implemented) by S to enhance the static network \((V,E)\). The set of configured (bidirected) links M, i.e., a matching, is called a reconfiguration of N. The enhanced network obtained by integrating the configured links M with the static links E of the hybrid network N is called a reconfigured network, i.e., \(N(M)=\left(V, E\cup M \right)\). The static network \((V,E)\) of the hybrid network N before reconfiguration can also be thought of as a reconfigured network, denoted by \(N(\emptyset)\).
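For concreteness, the following minimal sketch shows one possible in-memory representation of a reconfigured network \(N(M)=\left(V, E\cup M \right)\) as a directed multi-graph with per-direction capacities; the use of NetworkX and the attribute names are our own illustrative assumptions, not the article's implementation.

```python
# Sketch only: one way to materialize N(M); 'kind' and 'capacity' are assumed names.
import networkx as nx

def reconfigured_network(static_links, matching, cap_static, cap_reconfig):
    """static_links: bidirected links E; matching: configured links M (must form a matching);
    cap_static/cap_reconfig: dicts mapping directed links (u, v) to capacities."""
    n = nx.MultiDiGraph()
    for u, v in static_links:            # each bidirected link yields two directed links
        n.add_edge(u, v, kind='static', capacity=cap_static[(u, v)])
        n.add_edge(v, u, kind='static', capacity=cap_static[(v, u)])
    for u, v in matching:                # the configured links M taken from the matching
        n.add_edge(u, v, kind='configured', capacity=cap_reconfig[(u, v)])
        n.add_edge(v, u, kind='configured', capacity=cap_reconfig[(v, u)])
    return n
```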
Hardware. Our results also apply to non-optical switches and links, as long as they match the theoretical properties described in the model. As such, we will only talk about reconfigurable switches and reconfigurable links, implying any appropriate technology that matches our model.
Topologies. Our network model does not place a restriction on the underlying static topology and hence can be applied generally. Notwithstanding, for our hardness results in Section 3, already tree topologies suffice, whereas our positive algorithmic results cover many data center topologies, as we elaborate from Section 4 onwards.
Traffic demands. The resulting network should serve a certain communication pattern, represented as a \(|V| \times |V|\) communication matrix \(D:=(d_{ij})_{|V| \times |V|}\) (demands) with non-negative real-valued entries. An entry \(d_{ij}\in \mathbb {R}^+\) represents the traffic load (frequency) or a demand from the node \(v_i\) to the node \(v_j\). With a slight abuse of notation, let \(D(v_i,v_j)\) also denote a demand from \(v_i\) to \(v_j\) hereafter.
Routing models. For networking, unsplittable routing requires that all flows of a demand must be sent along a single (directed) path, while splittable routing does not restrict the number of paths used for the traffic of each demand. For a reconfigured network, segregated routing requires each flow to be transmitted on either static links or configured links only, whereas non-segregated routing allows configured links to be used as shortcuts for flows along static links [29, 82]. Hence, there are four different routing models: Unsplittable and Segregated (US), Unsplittable and Non-segregated (UN), Splittable and Segregated (SS), and Splittable and Non-segregated (SN).
2.1 Load Preliminaries
“As minimizing the maximum congestion level of all links is a desirable feature of DCNs [44, 46], the objective of our work is to minimize the maximum link utilization of the entire network.”
Yang et al. [87], presented at ACM SIGMETRICS 2020 [88]
Load optimization. Given a reconfigured network \(N(M)\) and demands D, let \(f:\overrightarrow{E}\cup \overrightarrow{M}\mapsto \mathbb {R}^+\) be a feasible flow serving demands D in \(N(M)\) under a routing model \(\tau \in \lbrace \text{US}, \text{UN}, \text{SS}, \text{SN}\rbrace\). The load of each directed link \(e\in \overrightarrow{E}\cup \overrightarrow{M}\) induced by the flow f is defined as \(L(f\left(e\right)): = f\left(e\right)/C\left(e\right)\). Then, for a feasible flow f in \(N(M)\), the maximum load is defined as \({L_\text {max}(f)}:=\max \lbrace L(f(e)): e\in \overrightarrow{E}\cup \overrightarrow{M}\rbrace\), and there must be an optimal flow \(f_{\text{opt}}\) to serve D such that its maximum load is minimized over all feasible flows in \(N(M)\). Such an optimal flow is called a load-optimization flow in \(N(M)\).3 For a reconfigured network \(N(M)\), with a slight abuse of notation, let \(f^{M}_{\text{opt}}\) denote an arbitrary load-optimization flow in \(N(M)\); then we define a function \(L_{\text{min-max}}(N(M)):=L_{\text{max}}(f^{M}_{\text{opt}})\).
Load-optimization reconfiguration problem. Given a hybrid network N, a routing model \(\tau \in \left\lbrace \text{US}, \text{UN}, \text{SS}, \text{SN} \right\rbrace\), and demands D, the \(\tau\)-load-optimization reconfiguration problem is to find an optimal reconfiguration \(M\subseteq \mathcal {E}\) to generate an optimally reconfigured network \(N(M)\) such that \(L_{\text{min-max}}\left(N\left(M\right) \right)\) is minimized over all valid reconfigurations \(M_i\subseteq \mathcal {E}\) of N. The \(\tau\)-load-optimization reconfiguration problem is also abbreviated as the \(\tau\)-reconfiguration problem henceforth. We lastly need to find a load-optimization flow for the optimally reconfigured network.
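As a minimal illustration of the load definitions (a sketch, not the article's implementation), the per-link loads and \(L_\text {max}(f)\) can be computed directly from a flow assignment and the capacity function:

```python
# Sketch: compute L(f(e)) = f(e)/C(e) per directed link and the maximum load L_max(f).
def link_loads(flow, capacity):
    """flow, capacity: dicts mapping directed links (u, v) to non-negative reals."""
    return {e: flow.get(e, 0.0) / capacity[e] for e in capacity}

def max_load(flow, capacity):
    return max(link_loads(flow, capacity).values())
```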
To illustrate the \(\tau\)-load-optimization reconfiguration problem, we give a small example in Figure 1. Figure 1(a) depicts the hybrid network before adding any reconfiguration, with five nodes \(V=\lbrace a,b,c,d,e\rbrace\), four static (bidirected) links E: \(\lbrace d,c\rbrace\), \(\lbrace b,c\rbrace\), \(\lbrace a,c\rbrace\), and \(\lbrace e,c\rbrace\) and six reconfigurable (bidirected) links \(\mathcal {E}\): \(\lbrace a,d\rbrace\), \(\lbrace d,b\rbrace\), \(\lbrace b,e\rbrace\), \(\lbrace a,e\rbrace\), \(\lbrace a,b\rbrace\), and \(\lbrace d,e\rbrace\).
Fig. 1.
We consider the routing model \(\tau =\text{SN}\) and a capacity function \(\forall e\in \overrightarrow{E}\cup \overrightarrow{\mathcal {E}}: C(e)=20\), with the five demands \(D\left(a,b\right) =8\), \(D\left(a,c\right)=6\), \(D\left(c,b\right) =6\), \(D\left(d,b\right) =6\), and \(D\left(a,e\right) =6\). In Figure 1(a), each flow can only be routed along static links, creating a link load of \(20/20=1\) on, e.g., \(\left(a,c\right)\) with the three demands of sizes 8, 6, and 6 originating at a. In order to improve the maximum link load, one could, e.g., greedily add reconfigurable links in order to reduce the maximum load, such as \(\lbrace a,b\rbrace\) in Figure 1(b). Now, the demand \(D\left(a,b\right) =8\) is routed directly, reducing the maximum load to just 0.6. Yet, only one further reconfigurable link can be chosen, \(\lbrace d,e\rbrace\), without violating the matching constraints. In this situation, any further rerouting does not decrease the maximum link load. For example, when attempting to alleviate the load of 0.6 on \(\left(c,b\right)\), the load on \(\left(a,c\right)\) will increase, and vice versa, in the best case canceling each other’s load increase.
Notwithstanding, we can improve the maximum load further. To this end, we select \(\lbrace a,e\rbrace\) and \(\lbrace d,b\rbrace\) as reconfigurable links, as shown in Figure 1(c). At first, this might seem counter-intuitive, as \(D\left(a,e\right)\) and \(D\left(d,b\right)\) are only of size 6 each, leaving a load of 0.7 on the links \(\left(a,c\right)\) and \(\left(c,b\right)\). However, the demand \(D\left(a,b\right) =8\) can additionally use the configured links as shortcuts: splitting it evenly between the static path \(\left(a,c,b\right)\) and the indirect path \(\left(a,e,c,d,b\right)\) yields an optimal maximum link load of 0.5.
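The claimed optimum can be verified by a quick computation (our own illustration of the splittable, non-segregated routing of Figure 1(c), with the uniform capacity of 20 as above):

```python
# Loads in Figure 1(c): configured links {a, e} and {d, b}; D(a, b) = 8 is split into
# 4 units on the static path (a, c, b) and 4 units on the detour (a, e, c, d, b).
capacity = 20.0
flow = {
    ('a', 'c'): 6 + 4,  # D(a, c) plus half of D(a, b)
    ('c', 'b'): 6 + 4,  # D(c, b) plus half of D(a, b)
    ('a', 'e'): 6 + 4,  # D(a, e) on the configured link plus half of D(a, b)
    ('e', 'c'): 4,      # detour of D(a, b)
    ('c', 'd'): 4,      # detour of D(a, b)
    ('d', 'b'): 6 + 4,  # D(d, b) on the configured link plus half of D(a, b)
}
print(max(f / capacity for f in flow.values()))  # 0.5
```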
3 Complexity
In this section, we consider the underlying complexity of the load-optimization problem in reconfigurable networks. We begin with the investigation of NP-hardness, where we study segregated routing (Section 3.1) and non-segregated routing (Section 3.2). For all four routing models, we prove NP-hardness for trees of height two or greater.
Yang et al. [87] considered the case of unsplittable segregated routing on trees and showed weak NP-hardness, i.e., hardness for large demand sizes. Our NP-hardness results also hold for small demand sizes, and we moreover extend the previous result [87] to trees of height one. To show hardness, we can consider special cases where all directed links have the same capacity of \(\gamma \in \mathbb {R}^+\). In particular, we set \(\gamma =1\) in all our NP-hardness proofs, such that the load of each link equals the flow size on it, but our proofs work for arbitrary \(\gamma\).
We then prove in Section 3.3 that all four routing models are not submodular, i.e., resist common approximation schemes. Venkatakrishnan et al. [79] considered different objective functions and showed submodularity for the hybrid switching model, resulting in approximation algorithms which therefore cannot be applied here.
3.1 Segregated Routing
We start with the case of segregated routing w.r.t. NP-hardness. The following and some later proofs will make use of the strongly NP-hard 3-Partition problem, which we define first: given \(3m\) positive integers \(a_1,\dots ,a_{3m}\) with \(\sum _{i} a_i=mB\) and \(B/4 \lt a_i \lt B/2\) for each i, decide whether the integers can be partitioned into m triples such that each triple sums to exactly B.
3.2 Non-segregated Routing
For the non-segregated routing model, we obtain
—
weak NP-hardness for trees of height \(h=1\) in the UN model,
—
strong NP-hardness for trees of height \(h\ge 2\) in the UN model,
—
strong NP-hardness for trees of height \(h\ge 2\) in the SN model.
For the UN model, we start with the weakly NP-hard case of \(h=1\) in Theorem 3.4, followed by the strongly NP-hard case of \(h=2\) in Theorem 3.5. To show weak NP-hardness, we give a reduction from the weakly NP-hard 2-Partition problem, which is defined as follows: given n positive integers \(a_1,\dots ,a_{n}\), decide whether they can be partitioned into two subsets of equal sum.
Now, it remains to cover intractability for the fourth and remaining routing model:
3.3 Non-submodularity
The submodularity of objective functions plays an important role in approximating optimization problems [78], as by Venkatakrishnan et al. [79] for hybrid switch networks. However, their objective function does not consider load-balancing and hence does not apply in our setting, as we show next.
Definition of submodularity. We recall the definition of submodularity [38]: A function \(f:2^{B}\mapsto \mathbb {R}\), where \(2^{B}\) is the power set of a finite set B, is submodular if it satisfies, for every \(X,Y \subseteq B\) with \(X\subseteq Y\) and every \(x\in B \setminus Y\), that \(f(X\cup \lbrace x\rbrace)-f(X) \ge f(Y\cup \lbrace x\rbrace)-f(Y)\).
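To make the inequality concrete, the following brute-force check (illustrative only; the article's non-submodularity proofs rely on specific hybrid-network counter-examples, not on this snippet) tests a set function on a small ground set:

```python
# Sketch: exhaustively test f(X ∪ {x}) - f(X) >= f(Y ∪ {x}) - f(Y) for all X ⊆ Y, x ∉ Y.
from itertools import combinations

def is_submodular(f, ground_set):
    """f maps frozensets over ground_set to reals; ground_set is a set."""
    subsets = [frozenset(s) for r in range(len(ground_set) + 1)
               for s in combinations(ground_set, r)]
    for X in subsets:
        for Y in subsets:
            if not X <= Y:
                continue
            for x in ground_set - Y:
                if f(X | {x}) - f(X) < f(Y | {x}) - f(Y):
                    return False
    return True
```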
Overview. In this subsection, we investigate the submodularity of the objective function \(\Phi\) of a \(\tau\)-reconfiguration problem, which minimizes the maximum load of reconfigured networks \(N(M)\), i.e., \(L_{\text{min-max}}\left(N \left(M\right) \right)\), over all valid reconfigurations M of a given hybrid network N. Moreover, we are also interested in the submodularity of the objective function \(\Omega\) that maximizes the gap of the minimized maximum load between the given hybrid network N before reconfiguration and the reconfigured networks \(N(M)\) for reconfigurations M of N. We will show that neither \(\Phi\) nor \(\Omega\) is submodular, by presenting special instances as counter-examples.
4 Hybrid Switch Networks
As we saw before, already tree networks of height \(\ge 2\) are NP-hard to optimize, and optimizations leveraging submodularity are not possible. Yet it is worth noting that NP-hardness for stars, i.e., trees of depth one, is still open, since the NP-hardness constructions for trees of height \(\ge 2\) collapse on the simple structure of star topologies. In fact, many NP-hard problems become tractable after restricting the input graphs, e.g., minimum vertex cover becomes polynomially solvable on trees by using dynamic programming [19]. This raises the interesting question of whether we can obtain optimal and polynomial-time algorithms for data center networks that can be abstracted as a star topology.
4.1 Non-blocking Data Center Topologies
Common data center topologies have trees of height 2 as subgraphs or minors and hence seem like bad candidates for efficient algorithms at first glance. However, already early designs adapted from telecommunications, such as Clos [18] topologies, have a so-called non-blocking property, which we can use to our advantage. An interconnecting topology \(\mathcal {C}\) is non-blocking if the servability of a flow from \(v_1\) to \(v_2\) via \(\mathcal {C}\) only depends on the utilization of the links \((v_1,\mathcal {C})\) and \((\mathcal {C}, v_2)\): “such an interconnect behaves like a crossbar switch” [89]. In other words, from a load-utilization perspective, the maximum load inside \(\mathcal {C}\) will not be higher than on the egress/ingress links of \(\mathcal {C}\). Non-blocking interconnects have hence become popular data center topologies [4], in particular in the form of folded Clos networks or Fat-Trees [54], depicted in Figure 2(a): the actual topology inside the interconnect (marked in a blue rectangle) is immaterial and we only need to consider the links incident to the nodes4—a fact commonly used, e.g., for bandwidth guarantees of the hose model [25] in Clos topologies [40, Section 4.1].
Fig. 2.
Thus, for our purposes, we can abstract the data center interconnect \(\mathcal {C}\) (which can be understood as a packet switch) by a single center node c, leaving our previous intractability considerations behind. We hence turn our attention to hybrid switch networks as considered by Venkatakrishnan et al. [79], which are represented by a packet and a circuit switch connected to all nodes, see Figure 2(b).
Routing in hybrid switch networks is straightforward (only one path exists for each node pair in the packet switched network), but the addition of a circuit switch adds a large degree of freedom: First, the number of possible matchings grows exponentially, and second, we have to decide for each flow which path to take as well. Notwithstanding, the special structure of hybrid switch networks allows us to solve reconfiguration and routing efficiently.
We structure our approach as follows. We first introduce an auxiliary problem in Section 4.2 and a constant-time triangle graph algorithm in Section 4.3, which we then leverage for our optimal algorithm in Section 4.4. We lastly discuss performance bounds and extensions in Section 4.5.
4.2 Red-target Matching
As each configuration of an OCS must be a matching, we cannot simultaneously create a reconfigurable connection for each demand. Still, intuitively, it is desirable to relieve nodes, respectively node pairs, with high communication intensity by means of reconfigurable links. Later in our algorithms, we will mark some nodes (in red) which must be connected to the OCS in order to satisfy a given load threshold. However, not all reconfigurations, i.e., matchings, are suitable for such a task. Given such red-colored nodes, the question is whether all of them can be matched accordingly, which is formalized in Definition 4.1:
To illustrate Definition 4.1, we give an example in Figure 3. The RTM problem looks for a restricted matching, which not only satisfies the degree bound of a matching but also matches all colored nodes \(V^{\prime }\subseteq V\).
Fig. 3.
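One simple way to decide such an instance (a hedged sketch; the article's own algorithm may proceed differently) is to give every candidate link one unit of weight per red endpoint: any matching then has weight at most \(|V^{\prime }|\), with equality exactly when every red node is matched, so a maximum weight matching settles the question.

```python
# Sketch: decide whether a matching exists that matches all red nodes V'.
import networkx as nx

def red_target_matching(candidate_links, red_nodes):
    """candidate_links: iterable of reconfigurable links (u, v); red_nodes: set of nodes."""
    g = nx.Graph()
    for u, v in candidate_links:
        # weight = number of red endpoints this link would cover
        g.add_edge(u, v, weight=(u in red_nodes) + (v in red_nodes))
    matching = nx.max_weight_matching(g)
    covered = sum((u in red_nodes) + (v in red_nodes) for u, v in matching)
    return matching if covered == len(red_nodes) else None  # None: no suitable matching exists
```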
4.3 Selection of Suitable Reconfigurable Links
In the studied hybrid switch networks, reconfigurable links can be created between any pair of nodes connected to the packet switch, e.g., via an OCS. While we will select the (matching) subset of reconfigurable (bidirected) links in the next subsection, we herein identify the benefit of adding specific reconfigurable links.
It remains to utilize the single-triangle algorithm in a larger context: Lemma 4.4 shows that the optimal flow computed locally in each triangle \(\lbrace v_i,c, v_j\rbrace\) provides a lower bound for the subflow of a globally optimal flow of the hybrid switch network N and demands D in the same triangle, and Lemma 4.5 further shows that a globally optimal flow can be obtained by combining these locally optimal flows in triangles.
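As a much-simplified illustration of the balancing idea inside a single triangle (our own sketch, treating one splittable demand in isolation; it is not the article's triangle algorithm, which works on the full demand set), a demand between \(v_i\) and \(v_j\) can be split between the direct configured link and the two-hop detour via c so that both routes see equal load:

```python
# Sketch: split a single splittable demand d so that the direct link (capacity c_r)
# and the bottleneck of the detour via the center (capacity c_s) carry equal load.
def balance_triangle(d, c_r, c_s):
    direct_share = d * c_r / (c_r + c_s)   # equalizes direct_share/c_r and (d - direct_share)/c_s
    return direct_share, d - direct_share  # (on the configured link, via the center)
```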
4.4 Solving Hybrid Switch Networks Optimally
We now combine our previous results to optimally solve the reconfiguration problem on hybrid switch networks.5
We now briefly show that our algorithms also extend to the case where we can create a reconfigurable link to the central packet switch and also bound the runtime:
For example, the original Blossom algorithm [26] can be used to compute a maximum weight matching in \(\beta = O(|E ||V|^2)\), but faster maximum weight matching algorithms exist, for which we refer to the comprehensive overview by Duan and Pettie [24, Tbl. III].
4.5 Bounds and Extensions
Given that we provided optimal algorithms for hybrid switch networks above, we now investigate theoretical performance bounds and extensions. As such, we provide bounds on the improvement of the load after reconfiguration, prove that maximum matching algorithms do not perform well in terms of competitive analysis, and show how our algorithms can be extended to multiple small reconfigurable switches.
Improvement bounds. If the capacities of reconfigurable links are arbitrarily large, in comparison to the static links, then the maximum load after applying reconfiguration can become arbitrarily small, under selected scenarios. Thus, to understand the intrinsic lower bounds of the reconfiguration problem on hybrid switch networks \(N= (V,E,\mathcal {E},C)\), we investigate the case where the capacity function C is uniform, denoted by \((V,E,\mathcal {E},1)\).
For a hybrid network N with uniform capacities, the improvement of the load on an arbitrary static link \(\lbrace u,v\rbrace \in E\) relies on the additional edge-connectivity provided by the configured links in M between u and v in \(N\left(M\right)\). If a node u has only one static link \(\lbrace u,v\rbrace \in E\), then the edge-connectivity from u to v can be at most two in \(N\left(M\right)\) for any reconfiguration M, which further implies that the traffic originating at u can at best be split across these two outgoing links after reconfiguration.
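For example, under uniform capacities the matching constraint allows u at most one configured link in addition to \(\lbrace u,v\rbrace\), so all traffic originating at u must leave over at most two unit-capacity links; hence, as a direct consequence of the above observation (stated here for illustration), \(L_{\text{min-max}}\left(N\left(M\right)\right) \ge \frac{1}{2}\sum _{w\in V} D(u,w)\) for every reconfiguration \(M\subseteq \mathcal {E}\).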
Competitivity of matching algorithms. We next investigate the theoretical performance of a maximum matching algorithm, as utilized, e.g., in [82]. The idea behind a maximum matching is that for each reconfigurable link \(\lbrace u,v\rbrace \in \mathcal {E}\), we send all flows of the demands \(D(u,v)\) and \(D(v,u)\) on the links \(\left(u,v\right)\) and \(\left(v,u\right)\), respectively, and then find a maximum weight matching that maximizes the total size of flows on the set of configured links M. As it turns out, such an optimization might yield nearly no benefit, even though an optimal algorithm could hit the theoretical lower bound provided in Lemma 4.8.
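A sketch of this baseline follows (our own hedged illustration; systems such as Helios and c-Through [29, 82] differ in implementation details): weight each candidate link \(\lbrace u,v\rbrace\) by the demand volume it could absorb and configure a maximum weight matching.

```python
# Sketch: the maximum-weight-matching baseline with weights D(u, v) + D(v, u).
import networkx as nx

def matching_baseline(demands, nodes):
    """demands: dict (u, v) -> size; nodes: list of nodes; returns configured links."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            w = demands.get((u, v), 0.0) + demands.get((v, u), 0.0)
            if w > 0:
                g.add_edge(u, v, weight=w)
    return nx.max_weight_matching(g)  # maximizes the total demand offloaded to configured links
```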
Extension to smaller reconfigurable circuit switches. In case the number of ports of a single reconfigurable switch does not suffice for all nodes in the network, our algorithms also extend to the case of multiple smaller reconfigurable switches. We can connect subsets of the nodes to a reconfigurable switch each, e.g., grouped by historical data w.r.t. the traffic demands. Our hybrid switch algorithms in Section 4 then take this subset of possible reconfigurable links to work with and proceed as usual, e.g., by assigning non-allowed links a weight (benefit) of 0 in matchings.
4.6 Practical Considerations
For non-blocking6 data-center topologies, where the load-balancing is usually dominated by the last hop, e.g., for incast [69], we can abstract the static topology as a star (tree of depth one) as shown in Figure 2, such that our algorithm can minimize the loads by, e.g., moving elephant flows from the original static network to high-capacity reconfigurable links. Our solution provides an efficient and optimal way to design reconfigurable networks for existing DCNs to optimize load-balancing. As we will show in the practical evaluations of the next section, going beyond the previous theoretical results, it significantly outperforms the conventional methods of selecting reconfigurable links by a maximum weight matching, e.g., [29, 82], or by a greedy approach, e.g., [44, 91].
Our solution can be implemented directly and is generally compatible with pre-installed routing configurations of existing data centers, as it relies on analyzing the matrix of traffic demands to determine which pairs of intensively communicating nodes should be served by reconfigurable links. More specifically, after preprocessing the demands, elephant flows can be sent separately over reconfigurable links, while the remaining demands are still routed through the static network as before; e.g., ECMP, packet-based routing, or flowlet-based routing can be applied for these remaining flows.
For real-time applications, the reconfiguration delay, i.e., the time during which reconfigurable links cannot transfer data while being established, might degrade performance when the traffic pattern changes significantly within a short interval. However, in general, data-center traffic patterns feature significant temporal locality,7 and most transmitted bytes belong to large and long-lasting elephant flows, whose transmission time is large compared to the reconfiguration time. For example, Roy et al. [70] observed that 90% of the bytes flow in elephant flows, and Griner et al. [41] give an example of a 500 MB flow with a transmission time of 100 ms, compared to a reconfiguration time of 15 ms, while many other empirical studies show similar results, e.g., Mellette et al. [61] and Venkatakrishnan et al. [80]. Based on these practical observations, we introduce a factor \(\theta \in [0,1]\) to indicate the ratio of the reconfiguration time to the interval of a demand in our evaluations, and we discuss the results for \(\theta =0\) and \(\theta =0.05\), respectively, in Section 5, which reveals the robustness of our algorithm under the interference of reconfiguration time.
Notwithstanding, in general, the problem of how to deal with the non-availability of optical links during reconfiguration is still an open research problem, as discussed by Nance Hall et al. [43, Section 6]: “Ideally, we want a reconfigurable link to exist before the traffic appears”, with the additional challenge of these changes being consistent [35, 49].
5 Evaluations
In order to study the performance of our algorithms under realistic workloads, we conducted extensive experiments with a simulator, which we will release together with this article (as open source code). In particular, we benchmark our hybrid switch algorithms against several state-of-the-art maximum matching and greedy baselines, considering a spectrum of packet traces on hybrid switch topologies as in Figure 2. We first describe our methodology in Section 5.1 and then discuss our results in Section 5.2. To facilitate reproducibility, our source code is available at https://gitlab.cs.univie.ac.at/ct-papers/2021-tompecs-load-optimization.
5.1 Methodology
Comparison with related work. We consider the following approaches from related work, used in multiple state-of-the-art articles [43], as described next, and implemented the corresponding algorithms for comparison.
—
First, we compare our hybrid switch network algorithms (denoted by HSN-US/SN) with a Maximum Weight Matching algorithm as a baseline, where routing occurs either on direct reconfigurable links or via the central packet switch. The matching algorithm is employed by many state-of-the-art systems [61, Table 1], and also recently, e.g., in Chopin [71]. Its use was spearheaded by Helios and c-Through [29, 82], and it is also optimal w.r.t. the average weighted path length [33] in such a routing model.8
—
Second, we also compare to a Greedy approach used by, e.g., Halperin et al. [44] and Zheng et al. [91]. For the link e that currently has the highest load, we check for the largest flow that can be rerouted on a direct connection, and offload it from the electrically switched network parts. This process is iterated until the load cannot be reduced further, where different links e can be chosen in each iteration.
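A simplified sketch of this greedy baseline on the hybrid switch (star) topology follows (our own hedged reconstruction; the original implementations [44, 91] differ in details such as capacity handling):

```python
# Sketch: repeatedly pick the most loaded static link and offload the largest demand on it
# (both directions) onto a direct configured link, respecting the matching constraint;
# stop when no further improvement is possible.
from collections import defaultdict

def greedy_reconfigure(demands, capacity=1.0):
    """demands: dict (src, dst) -> size, on a star with center 'c'."""
    load = defaultdict(float)
    for (s, t), d in demands.items():          # static path of (s, t) is s -> c -> t
        load[(s, 'c')] += d / capacity
        load[('c', t)] += d / capacity
    matched, configured = set(), set()
    improved = True
    while improved:
        improved = False
        for link in sorted(load, key=load.get, reverse=True):
            cands = [(s, t) for (s, t) in demands
                     if link in {(s, 'c'), ('c', t)}
                     and s not in matched and t not in matched]
            if not cands:
                continue
            s, t = max(cands, key=lambda p: demands[p])
            matched |= {s, t}
            configured.add((s, t))
            for a, b in ((s, t), (t, s)):       # offload both directions of the configured pair
                d = demands.get((a, b), 0.0)
                load[(a, 'c')] -= d / capacity
                load[('c', b)] -= d / capacity
                load[('ocs', a, b)] = d / capacity
            improved = True
            break
    return configured, max(load.values())
```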
Hence, in the following plots, the approaches that correspond to related work are labeled as Max Weight Matching and Greedy, respectively. Lastly, we additionally plot the maximum load on the network before any reconfiguration was applied (labeled as Oblivious).
Traffic workloads. It is known that traffic traces in different networks and running different applications can differ significantly [7, 13, 37, 51, 70]. Thus, we collected a number of real-world and synthetic datasets from which we generate traffic matrices to evaluate and compare the performance of our algorithms. In particular:
—
Data center traces: We consider two data center workloads, based on traces made available by Facebook [28, 70, 90]. The first workload features traces from a cluster running the batch-processing application Hadoop. The second one consists of traces from a cluster running SQL databases. Both workloads differ heavily in their communication patterns and the overall network load. Hence, the structural and temporal patterns of the workloads are quite different [7].
—
HPC traces: We further consider a high performance computing workload, obtained from the CESAR backbone [2]. The workload consists of a collection of MPI traffic, which was collected while running the application Nekbone. The application solves Poisson equations using the conjugate gradient method.
—
Synthetic traces: The synthetic pFabric traces are frequently considered as benchmarks in scientific evaluations [6]. In a nutshell, workloads arrive according to a Poisson process, are embedded in a data center context, and follow a random communication pattern between subsets of nodes. In order to generate traffic traces and produce demand matrices, we use the NS2 simulation script we obtained from the authors, using the parameter \(p=0.5\).
In more detail, for each simulation setting, e.g., 100–1,000 or 1,000–3,000 nodes, we pre-fetch a sequence of requests and keep it in memory. For example, to observe 3,000 distinct nodes in the case of Facebook’s data center traffic, we have to fetch a much larger traffic sequence than in the case of 1,000 distinct nodes. Furthermore, to ensure fairness, the fetched traffic sequence does not stop at the last node discovered, but rather goes slightly beyond that, to allow the last discovered node to eventually be observed a few times in subsequent requests. Subsequently, depending on the current number of nodes n, we only use the requests from the fetched sequence where traffic occurs between those n nodes. Hence, the computational workload for, e.g., 1,000 nodes is higher in the setting of 1,000–3,000 nodes than in the setting of 100–1,000 nodes.
Reconfiguration delay. In order to model the reconfiguration delays of optical circuit switches, our approach is the following. We account for the reconfiguration delay by introducing a penalty parameter \(\theta \in [0,1]\), which denotes the fraction of time per traffic sequence that a switch needs for reconfiguration. We first compute the optimized load of the network as if no reconfiguration delay applies. Then, we query the network for the optical link load and redistribute \((load \cdot \theta)\) bytes from the optical link to the electrical links. Finally, we query the network again for the maximum load.
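For clarity, the penalty step can be summarized by the following sketch (our illustration of the procedure described above; variable names are assumptions, not the simulator's API):

```python
# Sketch: move a theta-fraction of each optical link's load back onto the
# corresponding electrical (static) path, then report the resulting maximum load.
def apply_reconfiguration_penalty(optical_load, electrical_load, theta=0.05):
    """optical_load: dict (s, t) -> load on the configured link for that pair;
    electrical_load: dict link -> load on static links of the star with center 'c'."""
    for (s, t), load in optical_load.items():
        shifted = load * theta
        optical_load[(s, t)] = load - shifted
        electrical_load[(s, 'c')] = electrical_load.get((s, 'c'), 0.0) + shifted
        electrical_load[('c', t)] = electrical_load.get(('c', t), 0.0) + shifted
    return max(list(optical_load.values()) + list(electrical_load.values()))
```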
Experimental setup. All considered topologies, ranging from 40 to 3,000 nodes,9 employ hybrid switch networks as in Figure 2(b).10 We repeat each setting by running it 5 times and display the averaged results, normalizing the workload in the static topology. For the runtime, we also display the averaged results, normalizing them against the results of our HSN-SN algorithm.
Our simulations were run on a machine with two Intel Xeons E5-2697V3 SR1XF with 2.6 GHz, 14 cores11 each and a total of 128 GB RAM. The host machine was running Ubuntu 18.04.3 LTS.
We implemented the algorithms in Python (3.7.3) leveraging the NetworkX library (2.3). For the implementation of the maximum matching algorithm we used the algorithm provided by NetworkX.
5.2 Results and Discussion
We report on the main results obtained in our simulations based on the different datasets. Figures 4 and 6 summarize our evaluation results in terms of load and runtime for the Facebook traces; Figure 5 shows the corresponding results for the HPC and pFabric traces.
Fig. 4.
Fig. 5.
Potential for load optimization. All algorithms significantly improve the load over the Oblivious baseline and provide relatively stable benefits throughout all scenarios investigated.
We evaluate all algorithms with and without a reconfiguration delay, where the dashed lines in the maximum load plots correspond to the results achieved with a reconfiguration delay applied. The reconfiguration delay penalty \(\theta\) is set to 0.05 in all experiments. Hence, \(5\%\) of the load on an optical link is redistributed to the electrical links to account for the reconfiguration delay.
Among these algorithms, the HSN algorithms typically clearly outperform the others.
More specifically, for the database (Figures 4(a) and 6(a)) clusters, the reduction in the maximum load provided by the HSN-SN algorithm is almost a factor of two throughout the spectrum.
Fig. 6.
For the Hadoop clusters (Figures 4(c) and 6(b)), the performance of HSN-SN slightly decreases, but it still achieves \(\approx 60\%\) of the original Oblivious load up until a network size of 1,000 and then stays stable at \(\approx 70\%\) beyond. The three remaining algorithms (Greedy, Max. Weight Matching, and our HSN-US) achieve nearly identical values, with Greedy and HSN-US being slightly better. Above 1,000 nodes, we can observe that their capability to further reduce the load seems to be quite restricted. In some Hadoop workload instances, Max. Weight Matching achieves no or only minimal load reduction, matching Lemma 4.9 in practice. Notwithstanding, these three algorithms always perform significantly worse than HSN-SN, with a comparative load increase of \(\approx 60\%\).
Regarding the HPC traces, we can observe similar results as for the Database Cluster in terms of maximum load reduction. Also for the pFabric traces, our HSN-US algorithm achieves a lower maximum load compared to Greedy or Max. Weight Matching. Here, the variance is slightly higher than in the other experiments; this matches empirical observations on the complexity of the traces produced by this synthetic workload [7].
In regard to the maximum load reduction, we conclude that our HSN-SN algorithm is quite stable w.r.t. the number of nodes in the network. In contrast, Max. Weight Matching and the Greedy algorithm asymptotically approach the maximum load of the unconfigured network.
With respect to the results achieved while using the reconfiguration delay penalty \(\theta\), we can observe that the maximum link load is slightly higher for all algorithms. However, the simulations show that the reconfiguration delay penalty has a larger impact on our HSN-SN algorithm. The reason for this is that the HSN-SN algorithm is capable of distributing the traffic load more equally between the optical and electrical links. Therefore, redistributing \(5\%\) (\(\theta = 0.05\)) of the load from the optical links to the electrical links results in an approximately \(5\%\) increase of the load on the electrical link, which then carries the maximum load. Compared to that, the other algorithms fail to offload a significant amount of traffic to the optical links. Hence, the reconfiguration delay has a minor influence on their maximum link load because the electrical links already carry the vast majority of the traffic.
Runtime performance. The best runtime is generally achieved by the Greedy algorithm, due to its early termination when no link can be added anymore. Our experiments show that in the case of the Greedy algorithm, this is unfortunately happening very early on. Regarding the runtime of the Max. Weight Matching, we want to emphasize that the algorithm is unaware of the underlying problem of reducing the maximum link load. Therefore, a lot of runtime is actually wasted without achieving any further load reduction. Hence, in some cases, e.g., in the larger Facebook clusters, Max. Weight Matching is even slower than HSN-SN. In comparison to Max. Weight Matching, our HSN-US has a similar runtime, while spending all of it searching for the best load reduction matching.
HSN-US is consistently faster than HSN-SN, and the latter features quite a high variance in runtime. Notwithstanding, HSN-US has the benefit of only routing along single paths, which can be beneficial for performance metrics beyond load [72, 87]. On the other hand, such issues can also be alleviated with specialized multipath protocols [23, 68, 83]. Still, in some cases and specific workloads, the routing of related demands becomes easier in the SN model. Hence, HSN-SN can even be slightly faster than HSN-US, such as for the Hadoop cluster at 3,000 nodes, due to the fact that the underlying matching problem is identical for both HSN-US/-SN.
Summary. While all algorithms provide load reductions, the extent of these optimizations and the required runtime differ significantly. Our results suggest that the load optimizations provided by HSN-US might prove beneficial over other segregated routing strategies, particularly because of its low runtime which is comparable to that of the Max. Weight Matching. We conclude that when considering both potential load reduction and runtime, HSN-SN provides a better tradeoff than HSN-US.
6 Related Work
Most related work on flow routing in data center networks focuses on non-reconfigurable topologies [64]. That said, many recent works design and evaluate reconfigurable topologies e.g., [17, 29, 37, 44, 55, 56, 59, 60, 61, 67, 80, 81, 82, 85, 86], often showing significant performance gains over static topologies and proving real-world viability. However, the algorithmic complexity of reconfigurable data center networks is mostly unstudied [34], and many fundamental questions remain open [11].
The scheduling of traffic matrices with specific skew was investigated in [56, 57, 67, 80], but performance guarantees were only obtained by Venkatakrishnan et al. [80] by leveraging submodularity, a condition that does not hold in our setting. Similarly, Avin et al. [8, 9, 10] investigate traffic matrices with low entropy, but they require scalable constant reconfigurable degrees and are oblivious to hybrid networks, as in [16, 65], and thus do not translate to the herein considered model.
The idea of leveraging good connectivity in data center contexts arose from utilizing random graphs [75] and was later extended to deterministic constructions [22, 52, 77]. Xia et al. [86] used this idea to heuristically switch between random graphs and Clos topologies, depending on the traffic pattern, whereas Mellette et al. [60] incorporate it to improve their RotorNet [61] approach: If a flow cannot be delayed respectively buffered, it gets sent along a short route. Both works of Mellette et al. also have the benefit that their reconfigurations are oblivious to the current traffic pattern, but their resulting performance hence also depends on it.
One of the notable works that does not rely on centralized computation is ProjecToR by Ghobadi et al. [37], which instead performs a distributed matching protocol reminiscent of the idea of stable matchings [1]. In their setting, they obtain a \((2+\varepsilon)\)-approximation for the weighted latency objective but do not consider load.
The algorithmic complexity of weighted latency was also considered in [32, 33], where already basic topologies and settings turned out to be intractable. On the other hand, finding a single shortest path in a partially reconfigured network can be done efficiently and hence yields well-performing heuristics [31]. Moreover, some routing models can even be solved optimally. Notwithstanding, it is unclear how to transfer these results to a load-optimization setting: in topologies with unfavorable betweenness centrality, shortest path routing can overload popular links.
Load-optimization in reconfigurable data centers was recently studied by Yang et al. [87], who investigated the impact of wireless interference on cross-layer optimization. Different wireless links are modeled as a conflict graph, where the task is to find sufficiently good independent (link) sets, in order to provide an interference-free reconfiguration. We see our work as orthogonal, as we only consider inherently interference-free technologies, and thus it would be interesting to leverage their results in future work.
Another interesting line of work is by Zheng et al. [91], who study how to enhance the design of Diamond, BCube, and VL2 network topologies with small reconfigurable switches, inspired by Flat-Tree [86]. They target maximum link load as well and present intractability results on general graphs, although these results do not transfer to specific data center topologies or trees, respectively. Different routing models are not analyzed. Moreover, they propose to reconfigure the network with a greedy algorithm, which however does not come with formal performance guarantees. In evaluations on small network sizes, their combination of greedy algorithm and enhanced network design reduces the maximum load by \(12\%\) on average. We observe similar greedy behavior in our evaluations; however, the greedy algorithm’s load improvement decreases to just a few percent as the network size grows.
That being said, even though our work is mostly motivated by technologies emerging in data center networks, it also applies to other reconfigurable technologies, as long as they fulfill our model properties. Fundamentally different, however, are reconfigurable optical wide-area networks, as therein the fiber connectivity is fixed; instead, capacities can be adjusted and alternative failover paths provided, leading to improvements in the scheduling of bulk transfers [21, 48, 49, 58] and in addressing reliability concerns [39, 42, 74, 92].
7 Conclusion
We investigated load minimization in reconfigurable hybrid networks, leveraging the flexibility of emerging programmable physical layers. To this end, we investigated the underlying problem complexity, unveiling that already tree topologies of small height induce intractability for a multitude of routing models, and that one cannot hope for general approximability via submodularity techniques. Notwithstanding, we showed that hybrid switch networks, and in turn, non-blocking data center interconnects, can be optimized efficiently. Trace-driven simulations show that our hybrid switch algorithms significantly outperform not only a state-of-the-art maximum matching baseline but also greedy algorithms.
Footnotes
1
Symmetric connectivity is the standard industry assumption for static cabling, and it holds for reconfigurable links as well. Outside highly experimental hardware, e.g., [37], off-the-shelf products use full-duplex connections [14, 66], and this model assumption is hence prevalent, even in Free-Space Optics proposals [12].
2
In other words, no two links in M are adjacent or share an endpoint, enforced by hardware constraints in practice (exclusive connections between ports). We refer to Hecht [47] for an introduction to the technological background.
3
We note that in other works with analogous definitions, load might also be denoted by utilization, and load-optimization by load-balancing.
4
We note that the non-blocking property can also be restricted to keep distributed routing schemes in mind; we refer to Yuan [89] for an in-depth discussion.
5
Recall that the UN model is NP-hard on hybrid switch networks (Section 3.2).
6
If the DC topology is blocking, such as, e.g., DCell, Jellyfish, MDCube etc. [53], i.e., in particular in server-centric proposals, we cannot directly apply our algorithms, unless the topologies are augmented to be non-blocking. An extension of our optimal polynomial-time algorithms to general server-centric topologies is unlikely, as we have shown that already simple topologies beyond stars induce intractability.
7
This is not always the case for wide-area networks [30].
8
Note that a maximum matching algorithm is not optimal regarding path lengths in all topologies. However, when the distances between all nodes are identical in the static network part, a standard maximum matching approach is optimal in hybrid switch networks w.r.t. weighted path length [33].
9
See Alistarh et al. [5] w.r.t. the feasibility of 1,000 port optical switches in data centers.
10
In other words, we assume that the static networks can be abstracted as trees of depth one, due to them being, e.g., non-blocking, such as for fat-trees or Clos topologies in general.
11
However, each algorithm only utilized a single core.
Dan Alistarh, Hitesh Ballani, Paolo Costa, Adam C. Funnell, Joshua Benjamin, Philip M. Watts, and Benn Thomsen. 2015. A high-radix, low-latency optical switch for data centers. Computer Communication Review 45, 5 (2015), 367–368.
Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: Minimal near-optimal datacenter transport. In Proceedings of the SIGCOMM. ACM.
Chen Avin, Manya Ghobadi, Chen Griner, and Stefan Schmid. 2020. On the complexity of traffic traces and implications. In Proceedings of the ACM SIGMETRICS.
Chen Avin, Alexandr Hercules, Andreas Loukas, and Stefan Schmid. 2018. rDAN: Toward robust demand-aware network designs. Information Processing Letters 133 (2018), 5–9.
Chen Avin, Kaushik Mondal, and Stefan Schmid. 2019. Demand-aware network design with minimal congestion and route lengths. In Proceedings of the INFOCOM. IEEE.
Chen Avin and Stefan Schmid. 2018. Toward demand-aware networking: A theory for self-adjusting networks. Computer Communication Review 48, 5 (2018), 31–40.
Navid Hamed Azimi, Zafar Ayyub Qazi, Himanshu Gupta, Vyas Sekar, Samir R. Das, Jon P. Longtin, Himanshu Shah, and Ashish Tanwer. 2014. FireFly: A reconfigurable wireless data center fabric using free-space optics. In Proceedings of the SIGCOMM. ACM.
Theophilus Benson, Aditya Akella, and David A. Maltz. 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the Internet Measurement Conference. ACM.
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, and David A. Maltz. 2013. Per-packet load-balanced, low-latency routing for clos-based data center networks. In Proceedings of the CoNEXT. ACM.
Esra Ceylan, Klaus-Tycho Foerster, Stefan Schmid, and Katsiaryna Zaitsava. 2021. Demand-aware plane spanners of bounded degree. In Proceedings of the Networking. IEEE, 1–9.
Kai Chen, Ankit Singla, Atul Singh, Kishore Ramachandran, Lei Xu, Yueping Zhang, Xitao Wen, and Yan Chen. 2014. OSA: An optical switching architecture for data center networks with unprecedented flexibility. IEEE/ACM Transactions on Networking 22, 2 (2014), 498–511.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, Third Edition (3rd. ed.). The MIT Press.
Wenkai Dai, Michael Dinitz, Klaus-Tycho Foerster, and Stefan Schmid. 2022. Brief announcement: Minimizing congestion in hybrid demand-aware network topologies. In Proceedings of the 36th International Symposium on Distributed Computing. Christian Scheideler (Ed.), Leibniz International Proceedings in Informatics, Vol. 246, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 42:1–42:3.
Michael Dinitz and Benjamin Moseley. 2020. Scheduling for weighted flow and completion times in reconfigurable networks. In Proceedings of the INFOCOM. IEEE.
Advait Abhay Dixit, Pawan Prakash, Y. Charlie Hu, and Ramana Rao Kompella. 2013. On the impact of packet spraying in data center networks. In Proceedings of the INFOCOM. IEEE.
Nick G. Duffield, Pawan Goyal, Albert G. Greenberg, Partho Pratim Mishra, K. K. Ramakrishnan, and Jacobus E. van der Merwe. 1999. A flexible model for resource management in virtual private networks. In Proceedings of the SIGCOMM. ACM, 95–108.
Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. 2010. Helios: A hybrid electrical/optical switch architecture for modular data centers. In Proceedings of the SIGCOMM. ACM.
Thomas Fenz, Klaus-Tycho Foerster, and Stefan Schmid. 2021. On efficient oblivious wavelength assignments for programmable wide-area topologies. In Proceedings of the ANCS. ACM, 38–51.
Klaus-Tycho Foerster, Maciej Pacut, and Stefan Schmid. 2019. On the complexity of non-segregated routing in reconfigurable data center architectures. Computer Communication Review 49, 2 (2019), 2–8.
Klaus-T. Foerster, Manya Ghobadi, and Stefan Schmid. 2018. Characterizing the algorithmic complexity of reconfigurable data center architectures. In Proceedings of the ANCS. ACM.
Klaus-T. Foerster and Stefan Schmid. 2019. Survey of reconfigurable data center networks: Enablers, algorithms, complexity. SIGACT News 50, 2 (2019), 62–79.
Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Nikhil R. Devanur, Janardhan Kulkarni, Gireeja Ranade, Pierre-Alexandre Blanche, Houman Rastegarfar, Madeleine Glick, and Daniel C. Kilper. 2016. ProjecToR: Agile reconfigurable data center interconnect. In Proceedings of the SIGCOMM. ACM.
Michel X. Goemans, Nicholas J. A. Harvey, Satoru Iwata, and Vahab Mirrokni. 2009. Approximating submodular functions everywhere. In Proceedings of the SODA.
Jennifer Gossels, Gagan Choudhury, and Jennifer Rexford. 2019. Robust network design for IP/optical backbones. Journal of Optical Communications and Networking 11, 8 (2019), 478–490.
Albert G. Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. 2011. VL2: A scalable and flexible data center network. Communications of the ACM 54, 3 (2011), 95–104.
Chen Griner, Johannes Zerwas, Andreas Blenk, Manya Ghobadi, Stefan Schmid, and Chen Avin. 2021. Cerberus: The power of choices in datacenter topology design - A throughput perspective. Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, 3 (2021), 33 pages.
Matthew Nance Hall, Paul Barford, Klaus-Tycho Foerster, Manya Ghobadi, William Jensen, and Ramakrishnan Durairajan. 2021. Are WANs ready for optical topology programming?. In Proceedings of the OptSys@SIGCOMM. ACM, 28–33.
Matthew Nance Hall, Klaus-Tycho Foerster, Stefan Schmid, and Ramakrishnan Durairajan. 2021. A survey of reconfigurable optical networks. Optical Switching and Networking 41 (2021), 100621.
Daniel Halperin, Srikanth Kandula, Jitendra Padhye, Paramvir Bahl, and David Wetherall. 2011. Augmenting data center networks with multi-gigabit wireless links. In Proceedings of the SIGCOMM. ACM.
Abdelbaset S. Hamza, Jitender S. Deogun, and Dennis R. Alexander. 2016. Wireless communication in data centers: A survey. IEEE Communications Surveys and Tutorials 18, 3 (2016), 1572–1595.
Su Jia, Xin Jin, Golnaz Ghasemiesfeh, Jiaxin Ding, and Jie Gao. 2017. Competitive analysis for online scheduling in software-defined optical WAN. In Proceedings of the INFOCOM. IEEE.
Xin Jin, Yiran Li, Da Wei, Siming Li, Jie Gao, Lei Xu, Guangzhi Li, Wei Xu, and Jennifer Rexford. 2016. Optimizing bulk transfers with software-defined optical WAN. In Proceedings of the SIGCOMM. ACM, 87–100.
Christoforos Kachris and Ioannis Tomkos. 2012. A survey on optical interconnects for data centers. IEEE Communications Surveys and Tutorials 14, 4 (2012), 1021–1036.
Srikanth Kandula, Sudipta Sengupta, Albert G. Greenberg, Parveen Patel, and Ronnie Chaiken. 2009. The nature of data center traffic: Measurements and analysis. In Proceedings of the Internet Measurement Conference. ACM.
Simon Kassing, Asaf Valadarsky, Gal Shahaf, Michael Schapira, and Ankit Singla. 2017. Beyond fat-trees without antennae, mirrors, and disco-balls. In Proceedings of the SIGCOMM. ACM.
Brian Lebiednik, Aman Mangal, and Niharika Tiwari. 2016. A survey and evaluation of data center network topologies. CoRR abs/1605.01701 (2016). Retrieved from http://arxiv.org/abs/1605.01701
He Liu, Feng Lu, Alex Forencich, Rishi Kapoor, Malveeka Tewari, Geoffrey M. Voelker, George Papen, Alex C. Snoeren, and George Porter. 2014. Circuit switching under the radar with REACToR. In Proceedings of the NSDI. USENIX Association.
He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen, Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen, Michael Kaminsky, George Porter, and Alex C. Snoeren. 2015. Scheduling techniques for hybrid circuit/packet networks. In Proceedings of the CoNEXT. ACM.
Long Luo, Klaus-Tycho Foerster, Stefan Schmid, and Hongfang Yu. 2020. Deadline-aware multicast transfers in software-defined optical wide-area networks. IEEEJournal on Selected Areas in Communications 38, 7 (2020), 1584–1599.
Long Luo, Klaus-Tycho Foerster, Stefan Schmid, and Hongfang Yu. 2020. SplitCast: Optimizing multicast flows in reconfigurable datacenter networks. In Proceedings of the INFOCOM. IEEE.
William M. Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness, Alex C. Snoeren, and George Porter. 2020. Expanding across time to deliver bandwidth efficiency and low latency. In Proceedings of the NSDI. USENIX Association.
William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C. Snoeren, and George Porter. 2017. RotorNet: A scalable, low-complexity, optical datacenter network. In Proceedings of the SIGCOMM. ACM.
William M. Mellette, Alex C. Snoeren, and George Porter. 2016. P-FatTree: A multi-channel datacenter network topology. In Proceedings of the HotNets. ACM.
Mohammad Noormohammadpour and Cauligi S. Raghavendra. 2018. Datacenter traffic control: Understanding techniques and tradeoffs. IEEE Communications Surveys and Tutorials 20, 2 (2018), 1492–1525.
Maciej Pacut, Wenkai Dai, Alexandre Labbe, Klaus-Tycho Foerster, and Stefan Schmid. 2021. Improved scalability of demand-aware datacenter topologies with minimal route lengths and congestion. Performance Evaluation 152 (2021), 102238.
George Porter, Richard D. Strong, Nathan Farrington, Alex Forencich, Pang-Chen Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat. 2013. Integrating microsecond circuit switching into the data center. In Proceedings of the SIGCOMM. ACM.
Costin Raiciu, Sébastien Barré, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. 2011. Improving datacenter performance and robustness with multipath TCP. In Proceedings of the SIGCOMM. ACM.
Yongmao Ren, Yu Zhao, Pei Liu, Ke Dou, and Jun Li. 2014. A survey on TCP Incast in data center networks. International Journal of Communication Systems 27, 8 (2014), 1160–1172.
Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. 2015. Inside the social network’s (Datacenter) network. In Proceedings of the SIGCOMM. ACM, 123–137.
Neta Rozen-Schiff, Klaus-Tycho Foerster, Stefan Schmid, and David Hay. 2022. Chopin: Combining distributed and centralized schedulers for self-adjusting datacenter networks. In Proceedings of the OPODIS (LIPIcs’22). Schloss Dagstuhl - Leibniz-Zentrum für Informatik.
Siddhartha Sen, David Shue, Sunghwan Ihm, and Michael J. Freedman. 2013. Scalable, optimal flow routing in datacenters via local link balancing. In Proceedings of the CoNEXT. ACM.
Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Hong Liu, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2016. Jupiter rising: A decade of clos topologies and centralized control in Google’s datacenter network. Communications of the ACM 59, 9 (2016), 88–97.
Rachee Singh, Manya Ghobadi, Klaus-T. Foerster, Mark Filer, and Phillipa Gill. 2018. RADWAN: Rate adaptive wide area network. In Proceedings of the SIGCOMM. ACM.
Ankit Singla, Chi-Yao Hong, Lucian Popa, and Philip Brighten Godfrey. 2012. Jellyfish: Networking data centers randomly. In Proceedings of the NSDI. USENIX.
Asaf Valadarsky, Gal Shahaf, Michael Dinitz, and Michael Schapira. 2016. Xpander: Towards optimal-performance datacenters. In Proceedings of the CoNEXT. ACM.
Shaileshh Bojja Venkatakrishnan, Mohammad Alizadeh, and Pramod Viswanath. 2016. Costly circuits, submodular schedules and approximate Carathéodory theorems. In Proceedings of the SIGMETRICS. ACM.
Guohui Wang, David G. Andersen, Michael Kaminsky, Michael Kozuch, T. S. Eugene Ng, Konstantina Papagiannaki, Madeleine Glick, and Lily B. Mummert. 2009. Your data center is a router: The case for reconfigurable optical circuit switched paths. In Proceedings of the HotNets. ACM.
Guohui Wang, David G. Andersen, Michael Kaminsky, Konstantina Papagiannaki, T. S. Eugene Ng, Michael Kozuch, and Michael P. Ryan. 2010. c-Through: Part-time optics in data centers. In Proceedings of the SIGCOMM. ACM.
Damon Wischik, Costin Raiciu, Adam Greenhalgh, and Mark Handley. 2011. Design, implementation and evaluation of congestion control for multipath TCP. In Proceedings of the NSDI. USENIX Association.
Wenfeng Xia, Peng Zhao, Yonggang Wen, and Haiyong Xie. 2017. A survey on data center networking (DCN): Infrastructure and operations. IEEE Communications Surveys and Tutorials 19, 1 (2017), 640–656.
Yiting Xia, T. S. Eugene Ng, and Xiaoye Steven Sun. 2015. Blast: Accelerating high-performance data analytics applications by optical multicast. In Proceedings of the INFOCOM.
Yiting Xia, Xiaoye Steven Sun, Simbarashe Dzinamarira, Dingming Wu, Xin Sunny Huang, and T. S. Eugene Ng. 2017. A tale of two topologies: Exploring convertible data center network architectures with flat-tree. In Proceedings of the SIGCOMM. ACM.
Zhenjie Yang, Yong Cui, Shihan Xiao, Xin Wang, Minming Li, Chuming Li, and Yadong Liu. 2019. Achieving efficient routing in reconfigurable DCNs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, 3 (2019), 47:1–47:30.
Jiaqi Zheng, Qiming Zheng, Xiaofeng Gao, and Guihai Chen. 2019. Dynamic load balancing in hybrid switching data center networks with converters. In Proceedings of the ICPP. ACM.
Zhizhen Zhong, Manya Ghobadi, Alaa Khaddaj, Jonathan Leach, Yiting Xia, and Ying Zhang. 2021. ARROW: Restoration-aware traffic engineering. In Proceedings of the SIGCOMM. ACM, 560–579.