1. Introduction
Exponentially increasing mobile traffic accelerates the deployment of dense small cells operating in the 3 GHz spectrum under legacy macro cells, called a heterogeneous small cell network (HetNet), which offloads congested macro cells and ultimately enhances the quality of user experience (QoE). User equipments (UEs) can have dual connectivity to the macro eNB (MeNB) and small eNB (SeNB) for control/data bearer splitting or download boosting. Such SeNB deployment is costly when backhauling to a network gateway (an MeNB in this paper). Millimeter-wave (mmWave)-based backhauling can reduce deployment effort and provide gigabit data rates to UEs using the huge bandwidths, 9 and 10 GHz, available in the 60 GHz band and E-band, respectively. Many measurement campaigns and demonstrations at 28, 38, 60 and 73 GHz have already shown the feasibility of mmWave use for mobile communication [1,2,3].
To overcome the short communication range of mmWave links, caused by high pathloss and low penetration, beamforming based on directional antennas and amplifying repeaters is necessarily considered.
Figure 1 shows the HetNet equipped with a multi-hop backhaul mesh network for long-range backhauling over mmWave links, in which an SeNB unreachable by the MeNB can access the Internet through multi-hop relays of other SeNBs [4,5]. The mmWave-based backhaul mesh networks face several challenges, such as efficient radio resource management (RRM) [5,6], interference management [7,8], multi-hop routing [9], and energy saving [10].
Due to the increasing power consumption of densely deployed SeNBs and mmWave backhaul transmissions, various approaches to saving energy in mobile networks have been considered [11]; these include switching off small and macro cells [12,13,14] or adjusting cell size dynamically [15,16], where users of switched-off SeNBs are served by neighboring SeNBs using the remaining resources.
Especially for HetNets with mmWave-based backhauls, Chen et al. [17] introduced a user association and power allocation algorithm for energy-harvesting, self-backhauled SeNBs to maximize energy efficiency. Additionally, Mesodiakaki et al. [18] studied an energy- and spectrum-efficient user association problem considering mmWave backhauls. Hao et al. [19] investigated energy-efficient resource allocation in two-tier massive multiple-input multiple-output (mMIMO) HetNets with wireless backhauls.
Most previous works focus on radio resource allocation to increase spectral and energy efficiency in HetNets. However, in an mmWave backhaul mesh, the multi-hop routing mechanism determines energy saving, as SeNBs must remain switched on for relaying regardless of the presence of associated users. We establish a fluid model of user traffic in the mmWave-based backhaul mesh and solve the joint optimization problem that minimizes energy consumption while guaranteeing the demanded data rate of each UE [9]. This problem can be formulated as a mixed-integer linear program (MILP), which is known to be NP-hard. When we used the branch-and-cut algorithm of CPLEX to find an optimum in a given HetNet topology, it consumed more than 30 min of calculation time, which is infeasible because the HetNet topology changes dynamically due to UE mobility. For an online algorithm, previous works [17,18,19] considered heuristic or iterative algorithms, which cannot be guaranteed to find a near-optimal solution or can suffer from convergence delays. In this study, we consider a deep reinforcement learning (DRL) algorithm to find a feasible solution of the MILP problem in real time.
Reinforcement learning (RL) [20] has received much attention for dynamic systems, as it can provide a long-term solution that accounts for future rewards. Furthermore, deep learning techniques have recently been applied to overcome the curse of dimensionality as the size of the Markov decision process (MDP) grows in terms of state and action space [21,22,23,24,25]. RL based on a deep neural network (DNN) can provide a feasible online solution, since the feed-forward computation used for inference is simple compared to the backward computation required for training. Thus, many researchers now consider DRL algorithms to solve NP-hard problems in the wireless communication and networking field.
Recently, many studies applying DRL to wireless communication problems have been introduced, as surveyed in the related work section. Several works [26,27,28,29,30,31,32,33,34,35,36,37] used DRL to allocate radio resources, transmission power and channels to increase spectral efficiency; multiple access schemes have also been designed with DRL in [38,39,40,41]. For energy saving, several studies developed DRL algorithms for energy-efficient multi-hop routing protocols or peer-to-peer connectivity in ad hoc networks of satellites or UAVs [42,43], where individual mobile agents learn an optimal policy to maintain connectivity while saving limited power. Refs. [44,45,46] introduced energy-saving mechanisms using DRL, wherein an agent controls the transmission power, association and sleep mode of SeNBs in a HetNet without multi-hop backhauls. To the best of our knowledge, this is the first work that investigates DRL to find the Pareto front of a multi-objective optimization problem of energy saving and throughput maximization in a HetNet with an mmWave-based multi-hop backhaul mesh.
The key motivations of this study are enumerated below:
There has been no notable research on an energy-efficient multi-hop routing algorithm using DRL for the mmWave backhaul mesh of a dense HetNet;
A DRL-based algorithm can be used to find a Pareto front solution for the dual-objective optimization of energy saving and throughput maximization in the HetNet.
To solve our optimization problem, we adopt a proximal policy optimization (PPO)-based DRL algorithm [24], one of the most popular policy-based DRL algorithms, which typically shows fast and reliable convergence in the training phase. The PPO algorithm can provide an online policy for controlling backhaul transmission and SeNB power in HetNets; it is simple to implement yet comparable to the more complicated trust region policy optimization (TRPO) [23] in terms of performance. However, it is challenging for the PPO algorithm to find an optimum of the multi-objective problem if only the reward sum of conflicting objectives is given to the agent for training. Therefore, we consider a multi-objective reinforcement learning (MORL) approach [47] to find the Pareto front solutions.
Optimistic linear support (OLS) has been proposed for MORL [48], in which an outer loop iteratively calls a single-objective solver based on the deep Q-network as a subroutine. In this paper, we propose PPO-based deep optimistic linear support (PDOLS), where the PPO algorithm iteratively solves the objective problem scalarized by a specific weight vector over the rewards. In experiments, the proposed PDOLS searched the optimal corner weights for the multiple objectives efficiently and produced outcomes similar to those of the optimal weights obtained through repeated experiments. Additionally, the PDOLS achieved notable throughput and energy saving compared to the CPLEX results [9]; CPLEX achieves a 35% energy saving and a 14 Mbps data rate without blockage, while the PDOLS achieves an almost 28% energy saving and a 13.4 Mbps data rate. Such a performance reduction is small, considering that the CPLEX execution time and the DRL inference time are 30 min vs. 1 s. Furthermore, we improve the PDOLS with a scaled reward (PDOLS-SR) that adjusts the reward values according to the environment, which increases the probability of finding the optimal weight vector.
We highlight the key contributions of this study below:
We propose a PPO-based online algorithm for the bi-objective problem of energy minimization and throughput maximization;
We propose an integrated framework based on the PPO algorithm and OLS to find the Pareto front of the two objectives;
We demonstrate the feasibility of the proposed online solution based on DRL in a HetNet environment.
The remainder of the paper is organized as follows. We introduce recent works on DRL for wireless networking solutions in Section 2 and offer an overview of the DRL background in Section 3. In Section 4, we establish the multi-objective optimization model for energy saving and throughput maximization in HetNets. We propose the PPO and PDOLS algorithms for the multi-objective optimization problem in Section 5. Section 6 presents our experimental results regarding the performance of the learning algorithm and HetNet throughput. Finally, we discuss and conclude our study in Section 7.
2. Related Works
Previously, most NP-hard problems in the wireless communication and networking area were solved by linear approximation or heuristic algorithms, such as simulated annealing (SA), the genetic algorithm (GA), particle swarm optimization (PSO), etc. Recent successes of DNN techniques in computer vision and speech recognition show the possibility of applying large-scale feed-forward neural networks to wireless networking. Accordingly, the 1D and 2D convolutional neural networks (CNNs) popular in computer vision and image processing have been used for wireless channel estimation with MIMO [49,50,51], automatic modulation and coding schemes [52,53,54] and network intrusion detection [55,56,57,58].
In contrast to the above supervised deep learning, artificial intelligence for controlling the dynamics of a wireless networking system must be built naturally from past experience in the system. Such dynamic systems can be modelled as an MDP; at each step, a network agent acts based on the state and receives reward feedback for the action, such as successful transmission, packet loss, collision, power saving, etc. Using the collected experience data, a DRL algorithm can effectively find an optimal solution for the wireless networking system. The following studies have demonstrated the feasibility of using DRL algorithms for wireless communication and networking over the last several years (refer to the summary in Table 1).
Wang et al. [26] proposed a dynamic multi-channel access mechanism based on deep Q-learning, in which a node selects the channel with low interference that returns the maximum reward for the action. Zhong et al. [27,28] used the actor-critic algorithm to explore the sensing policy for dynamic channel access and considered a multi-agent model for distributed sensors in a partially observable environment. Naparstek et al. [29,30] also proposed DQN-based multi-agents which act independently based on Q-values. Li et al. [31] applied the DQN to channel sensing, and Liu et al. [32] proposed a hierarchical deep Q-network (h-DQN) model for cooperative channel sensing, which divides the original problem into separate sub-problems for multiple DRL agents.
Ali et al. [38] introduced a Q-learning-based MAC protocol for dense WLANs which learns the optimal policy from channel state and transmission action experience. Yu et al. [39] investigated a DRL-based MAC protocol for heterogeneous wireless networking called deep-reinforcement learning multiple access (DLMA). They established a new multi-dimensional RL framework based on Q-learning that maximizes sum throughput and provides proportional fairness, even when co-existing with TDMA- and ALOHA-like protocols. Al et al. [40] studied radio resource scheduling (RRS) in the cellular MAC layer using the DQN. Nisioti et al. [41] presented a MAC solution for sensor networks based on coordinated reinforcement learning, considering the dependencies among sensors to find the optimal actions.
Zhao et al. [59] studied user association and radio resource allocation in a HetNet. For a large action space, they considered a multi-agent RL approach and a dueling double deep Q-network (D3QN) to obtain an optimal policy with little computational complexity. Zhang et al. [60] proposed a DRL algorithm for the association between each IoT device and a cellular user to maximize the sum rate of all IoT devices in symbiotic radio networks (SRNs). Ding et al. [61] introduced a user association and power control scheme using a multi-agent DQN to ensure the UEs' quality of service (QoS) requirements.
He et al. [33] proposed an orchestration framework in vehicular networks with a novel DRL algorithm for the allocation of networking, caching and computing resources. Shi et al. [34] modelled a hierarchical DRL-based multi-DC (drone cell) trajectory planning and resource allocation scheme for high-mobility users. In [35,36], the authors also conducted resource allocation for uplink non-orthogonal multiple access (NOMA) systems using a DRL-based algorithm to solve the non-convex optimization problem. Rahimi et al. [37] likewise tried to increase scalability with a hierarchical DRL for joint user association and resource allocation in the NOMA system.
Liu et al. [43] introduced a novel DRL-based energy-efficient routing protocol called DRL-ER, which avoids battery energy imbalance across satellite constellations and guarantees a required end-to-end delay bound. Liu et al. [42] adopted DRL-based energy-efficient control for coverage and connectivity in UAV communication systems. Du et al. [62] reviewed and analyzed how to achieve green DRL for radio resource management (RRM). Dai et al. [63] utilized DRL to design an optimal computation offloading and resource allocation strategy for minimizing energy consumption. El et al. [44] solved the energy-delay trade-off (EDT) problem using DRL in a HetNet where small cells can switch to different sleep mode levels to save energy while maintaining QoS.
To the best of our knowledge, our study is the first to develop a PPO-based multi-objective algorithm that controls multi-hop routing and the switching on/off of SeNBs in HetNets, even though many previous works have applied DRL algorithms to other optimization problems.
3. Deep Reinforcement Learning (DRL)
This section provides a brief overview of reinforcement learning (RL) and DRL. RL is a popular machine learning algorithm which allows agents to learn optimal behavior through trial-and-error interactions with a dynamic environment. A key strategy of RL is utilizing statistics to obtain an optimal control decision (policy) in the form of an MDP. The MDP is modelled by the tuple $(S, A, P, R)$, wherein the state space is represented by $S$, the action space is represented by $A$, the state transition probability is $P(s' \mid s, a)$ for a taken action $a$ with a corresponding reward $R$, and the policy $\pi(s)$ as a function specifies an action $a$ in each state $s$. Therefore, an optimal policy, $\pi^{*}$, maximizes the expected reward over future $T$ steps, $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} R_{t}\right]$, where $\gamma$ is a discount factor ($0 \le \gamma < 1$) for the infinite-horizon discounted model.
For effective agent learning, the estimation of a state-value function for a state $s$ is critical; $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \mid s_t = s\right]$ at a time step $t$. Additionally, suppose that a certain action, $a$, is taken in the state $s$; then, an action-value Q-function can be defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \mid s_t = s, a_t = a\right]$. According to the Bellman optimality equation, the optimal value function, $V^{*}(s)$, can be decomposed recursively as $V^{*}(s) = \max_{a} Q^{*}(s, a)$, which tells us that the expected return from the best action is the same as the state value of an optimal policy.
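Expanding the expectation over next states explicitly (a standard identity from the RL literature [20], restated here for completeness), the Bellman optimality equations read
$$V^{*}(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big], \qquad Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^{*}(s', a'),$$
so the optimal state value is obtained by acting greedily with respect to $Q^{*}$.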
3.1. Deep Q-Learning
As the state and action spaces become larger and continuous, function approximation is mandatory for Q-learning instead of the legacy tabular form of actions and Q-values. Although the combination of RL and neural networks was considered long ago, it is only very recently that DRL algorithms based on deep neural networks (DNNs) have received much attention in place of linear function approximation [20,64]. DNNs represent functions of higher complexity by employing a deep hierarchical layer architecture composed of non-linear information processing units. Deep learning approximates such a mapping function, a form of statistical curve fitting, using labeled training datasets.
DRL utilizes the training process of a DNN based on collected datasets, which can improve learning speed and performance without MDP model information (the reward function $R$ and transition probability $P$ are unknown). DRL induces a policy based on a value function, $Q(s, a; \theta)$, approximated by the DNN, which is trained using batches of samples that the agent collects by interacting with the environment. In a sequence of discrete time steps, $t = 0, 1, 2, \ldots$, the agent selects an $\epsilon$-greedy action for the maximum reward given by $a_t = \arg\max_{a} Q(s_t, a; \theta)$; the $\epsilon$ provides the randomness needed to explore and avoid local minima.
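As a minimal illustration (not the implementation used in this paper), $\epsilon$-greedy selection over a DNN-approximated Q-function can be sketched in PyTorch as follows; the network shape and action count are placeholder assumptions:

```python
import random
import torch
import torch.nn as nn

# Hypothetical Q-network: maps an 8-dimensional state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

def select_action(state: torch.Tensor, epsilon: float) -> int:
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(4)          # random exploratory action
    with torch.no_grad():
        return int(q_net(state).argmax())   # greedy action: argmax_a Q(s, a)
```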
Mnih et al. introduced the deep Q-network (DQN) in [22], a seminal work on Q-function approximation based on DNNs. In particular, they addressed and solved two challenges of DRL. First, deep learning assumes that data samples are iid (independent and identically distributed), but in the MDP the next state, $s'$, is correlated with the current state, $s$. Second, the target model for training is non-stationary, as the model parameters $\theta$ are updated at every iteration. For this, the DQN adopts an experience-replay buffer for training and separates the main and target networks. The DQN updates the parameters $\theta$ of the main network by minimizing the temporal-difference error, $L(\theta) = \mathbb{E}\left[(y - Q(s, a; \theta))^{2}\right]$, where the target $y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ and the state-action value function, $Q(s, a; \theta)$, are given by the target and main network, respectively. The target network is periodically updated by the main network.
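This update can be sketched as follows; it is a schematic of the standard DQN loss [22], assuming replay-buffer minibatches and `main_net`/`target_net` objects that are not defined in this paper:

```python
import torch
import torch.nn.functional as F

def dqn_update(main_net, target_net, optimizer, batch, gamma=0.99):
    """One temporal-difference update on a replayed minibatch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    # Target y = r + gamma * max_a' Q(s', a'; theta_minus) from the frozen target network.
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    # Q(s, a; theta) from the main network, for the actions actually taken.
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)                 # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```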
3.2. Policy Gradient and Actor-Critic
The DQN is ill-suited to high-dimensional and continuous action spaces, which demand an iterative optimization process at every step. Additionally, discretizing continuous action values cannot avoid the curse of dimensionality due to the large number of resulting actions, and may lose important information about the action space through quantization.
Therefore, the policy gradient (PG) algorithm is mostly used for high-dimensional and continuous actions [65,66]; it adjusts the model parameters, $\theta$, of a policy function in the direction of the stochastic policy gradient (SPG), $\nabla_{\theta} J(\theta)$.
The PG algorithm [21] can be implemented with the actor-critic architecture, in which the actor stochastically updates the parameters $\theta$ of the policy function while the critic evaluates the policy and updates the action-value function approximator, $Q(s, a; w)$, in such a direction as to minimize the approximation error. As the dimension of the action space increases, the deterministic policy gradient (DPG), a special case of the SPG, becomes more efficient, since its gradient requires an expectation over the state space only, whereas the SPG integrates over both the state and action spaces.
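Since this paper adopts PPO in Section 5, the following is a minimal sketch of PPO's clipped surrogate loss [24] for such an actor-critic model; the log-probabilities and advantage estimates are assumed to have been computed elsewhere and are not part of the paper's code:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized; the negated mean is returned
    as a loss suitable for gradient descent)."""
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Element-wise minimum gives a pessimistic bound that prevents excessively
    # large policy updates, which is what keeps PPO training stable.
    return -torch.min(unclipped, clipped).mean()
```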
6. Experiment
In this section, we evaluate the performance of the algorithms proposed in the previous section in terms of energy saving and user throughput. We establish an experimental environment with 1 MeNB and 25 SeNBs that form a backhaul mesh network, as depicted in Figure 3, where the mmWave BH links (i.e., the gray dashed lines in Figure 3) connect the SeNBs to each other or to the MeNB for Internet access. Only 4 SeNBs can reach the MeNB directly, which thus limits the sum rate of all data flows to below the sum of their BH link capacities. Therefore, we assume that each UE, $u$, demands a maximum data rate of 14 Mbps in this experiment with 100 UEs and the 4 last-mile SeNBs, since those bottleneck BH links (i.e., the purple dotted line in Figure 3) allow 14 Mbps per UE. To support a greater UE data rate, we could increase the BH link bandwidth or place more SeNBs reachable by the MeNB gateway.
A total of 100 UEs are randomly dropped over the MeNB and SeNB coverage area, where the SeNBs are 100 m apart and their cell coverage exceeds 80 m. Accordingly, the UEs have more than one SeNB to associate with, in addition to the universal MeNB, depending on their location. Both the MeNB and SeNBs provide microwave link access, denoted by AN links in Figure 3. The access and BH links are configured as in Table 3 for our experiment. In our study, the training and model updates are performed interactively with the network simulator environment based on parameters specified in the 3GPP standard and related works [68,69].
We build the actor and critic networks using DNNs with 2 hidden layers (64 × 64 perceptrons) of a fully-connected neural network to estimate the policy and value, respectively. The actor network for the policy receives the state field as input and returns the action field as output, as defined in Section 5.2. On the other hand, the critic network for the value is designed differently for the PPO and PDOLS algorithms. Both receive the same state field as input, but the PPO-based critic returns only one value, while the PDOLS-based critic returns two values for the dual objectives. Detailed parameters of the DRL are given in Table 4. For this experiment, we used the PyTorch library on an Ubuntu 20.04 Linux server equipped with an Intel Core i7-9700KF CPU, a GeForce RTX 2080 GPU and 32 GB of RAM.
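A minimal PyTorch sketch of such actor and critic networks is shown below; the state and action dimensions are placeholders, and the only details taken from our setup are the two 64-perceptron fully-connected hidden layers and the one- vs. two-output critic heads:

```python
import torch.nn as nn

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Two fully-connected hidden layers of 64 perceptrons each."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, out_dim),
    )

state_dim, action_dim = 128, 32              # placeholder sizes, not from the paper

actor = mlp(state_dim, action_dim)           # policy head: state field -> action field
critic_ppo = mlp(state_dim, 1)               # PPO critic: a single scalarized value
critic_pdols = mlp(state_dim, 2)             # PDOLS critic: one value per objective
```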
First, we evaluate the performance of the PPO-based DRL algorithm in the HetNet environment in terms of learning speed and convergence. For this, we fix the weight vector over energy consumption and data rate and set the UE demand rate to 14 Mbps. Figure 4a shows the performance with learning rates varying from $1 \times 10^{-5}$ to $3 \times 10^{-4}$. The PPO algorithm shows good convergence of the reward as training iterations continue, regardless of the learning rate. The reward increases exponentially during the initial training iterations and saturates after 40 K training iterations. A higher learning rate accelerates the reward convergence, but it skips over the better local minimum and is trapped in another; when the learning rate increases from $1 \times 10^{-5}$ to $3 \times 10^{-4}$, the converged reward decreases from 0.104 to 0.0899. The losses for the value and policy can be seen in Figure 4b,c, respectively. Both decrease drastically as the training iterations continue. Policy learning avoids excessive updates owing to the clipping of the PPO, which keeps the policy loss comparable regardless of the learning rate. Additionally, the value loss follows the policy loss through the actor-critic interactions.
Figure 5 shows evaluations of learning performance with varying reward weights ($w_1$, $w_2$). For this experiment, we configure the learning rate as $1 \times 10^{-5}$, which converged to the highest reward. In Figure 5a, the rewards from energy consumption and user throughput converge at 50 K training iterations for a given reward weight. Figure 5b,c show that the energy saving (i.e., 1 − consumed energy/maximum energy) and the mean data rate converge at different iterations according to the reward weight; the reward convergence is achieved at an average of 80 K training iterations, about 21.5 min on our server for each weight value. To find the optimal solutions, iterative learning over all candidate weight vectors is needed. Therefore, the computation delay depends on the granularity of the weight values to explore; this experiment demands a total of 80 K · 7 iterations.
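This exhaustive baseline amounts to a grid sweep over the scalarization weight; the sketch below illustrates the idea with a seven-point grid consistent with the 80 K · 7 iterations above, where the training function and its returned dummy values are illustrative stand-ins, not results from this paper:

```python
def train_ppo(w1: float, iterations: int = 80_000) -> tuple[float, float]:
    """Stand-in for the real PPO training loop in the HetNet simulator; returns a
    dummy converged (energy, rate) reward pair so the sweep below actually runs."""
    return 0.3 + 0.2 * w1, 14.0 - 6.0 * w1    # illustrative numbers only

def scalarize(r_energy: float, r_rate_norm: float, w1: float) -> float:
    """Linear scalarization of the two objectives: w1 * energy + (1 - w1) * rate."""
    return w1 * r_energy + (1.0 - w1) * r_rate_norm

best_w1, best_reward = None, float("-inf")
for w1 in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:        # 7 candidate weights
    r_energy, r_rate = train_ppo(w1)                  # one full training run per weight
    reward = scalarize(r_energy, r_rate / 14.0, w1)   # rate normalized by the demand
    if reward > best_reward:
        best_w1, best_reward = w1, reward
print(f"best weight w1 = {best_w1}, scalarized reward = {best_reward:.3f}")
```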
System performance varies with the weight $w_1$ of the energy consumption from 0.2 to 0.8 and the weight $w_2 = 1 - w_1$ of the UE's data rate. When $w_1$ is set to 0.8, the maximum energy saving is achieved, while the UE's data rate takes its minimum value, because of their trade-off relationship. Conversely, the minimum energy saving at $w_1 = 0.2$ allows the maximum data rate. Consequently, the optimal weight for the maximum reward is found to be $w_1 = 0.6$ and $w_2 = 0.4$, which results in an energy saving of 0.27 and a UE data rate of 13.39 Mbps.
Next, we evaluate the PDOLS algorithm to find the optimal value and weight in the HetNet environment with a varying demand rate and number of UEs. In Figure 6a, the mean data rate satisfies all demand rates except 14 Mbps: 6, 8, 10, 12, and 13.39 Mbps. The energy saving of the HetNet is inversely proportional to the demand rate: 0.42, 0.37, 0.31, 0.30, and 0.23. For these values, the $w_1$ of the optimal weight is 0.79, 0.72, 0.65, 0.64 and 0.57, respectively. Figure 6b shows the change in active SeNBs during the learning procedure. Most of the 25 SeNBs are turned on at the beginning of learning, but after 80 K iterations, about 10–12 SeNBs are switched off, depending on the UE demand rate. For a higher demand rate, more SeNBs remain active to support the user traffic. Although the number of active SeNBs is the same for 10, 12, and 14 Mbps, energy consumption increases, especially for 14 Mbps in Figure 6a, as the power consumption of the active links increases in proportion to user traffic.
We evaluate the performance of the PDOLS again with different numbers of UEs, namely 40, 70 and 100, where the demand data rate is configured as 14 Mbps. Figure 7a shows that both the energy saving and the achieved mean data rate increase as the number of UEs decreases. Accordingly, the user demand rate is mostly satisfied, except for 100 UEs. The energy saving is 0.46, 0.38, and 0.2, respectively, for each number of UEs, and the corresponding numbers of active SeNBs are 6, 10, and 15, as shown in Figure 7b. Here, the $w_1$ of the optimal weight is found to be 0.8, 0.66, and 0.57 for each case. For 40 UEs, the number of active SeNBs is around 18 initially and decreases to 6 SeNBs, as the data flows of many UEs share the same multi-hop paths provided by the active SeNBs. Otherwise, isolated UEs that have no path through the SeNBs access the MeNB directly. Comparing the result for 100 UEs with that for 6 Mbps, we can conjecture that a higher number of UEs induces network-wide SeNB activation, which consumes more resource blocks (RBs) of the MeNB and more transmission power for a smaller number of serving UEs.
Figure 8a compares the performance of the proposed algorithms discussed in Section 5, where the number of UEs and the demand rate are configured as 100 and 14 Mbps, respectively. A heuristic algorithm that leads the UEs to associate with a less-loaded SeNB and use the shortest path to the MeNB gateway performs worse than the others, with an energy saving of 0.16 and a data rate of 9.14 Mbps. Meanwhile, the PPO and PDOLS show comparable results of 0.27 and 13.39 Mbps for the PPO and 0.23 and 13.79 Mbps for the PDOLS, where the optimal weight for the PPO is selected manually after iterative executions with different weight vectors, while the PDOLS algorithm automatically searches for the optimal weight values. The PDOLS-SR outperforms the other algorithms with 0.27 and 13.79 Mbps when the reward is scaled by 1/5.
Figure 8b shows the variation of the corner weight in the OLS framework of the PDOLS. In our experiment, the PDOLS-SR conducts the training process 11 times (11 steps in the figure) to find the optimal weight, while the PDOLS does so only 7 times (7 steps). The PDOLS-SR can scavenge and explore more corner weights to find a near-optimal weight close to the PPO weight of 0.6 (the red solid line). The optimal $w_1$ of the PDOLS-SR is 0.5872, while that of the PDOLS is 0.5683. Further downscaling of the reward, such as by 1/10 or 1/15, only increases the training time without notable performance enhancement.