
Article
Deep Reinforcement Learning-Based Coordinated
Beamforming for mmWave Massive MIMO Vehicular Networks
Pulok Tarafder and Wooyeol Choi *

Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea


* Correspondence: wyc@chosun.ac.kr

Abstract: As a critical enabler for beyond fifth-generation (B5G) technology, millimeter wave
(mmWave) beamforming has been studied for many years. Multiple-input multiple-output
(MIMO) systems, which are the baseline for beamforming operation, rely heavily on multiple antennas
to stream data in mmWave wireless communication systems. High-speed mmWave applications face
challenges such as blockage and latency overhead. In addition, the efficiency of the mobile systems is
severely impacted by the high training overhead required to discover the best beamforming vectors
in large antenna array mmWave systems. In order to mitigate the stated challenges, in this paper,
we propose a novel deep reinforcement learning (DRL) based coordinated beamforming scheme
where multiple base stations serve one mobile station (MS) jointly. The constructed solution then
uses a proposed DRL model and predicts the suboptimal beamforming vectors at the base stations
(BSs) out of possible beamforming codebook candidates. This solution enables a complete system
that facilitates highly mobile mmWave applications with dependable coverage, minimal training
overhead, and low latency. Numerical results demonstrate that our proposed algorithm remarkably
increases the achievable sum rate capacity for the highly mobile mmWave massive MIMO scenario
while ensuring low training and latency overhead.

Keywords: deep reinforcement learning; vehicular network; massive MIMO; beamforming; mmWave

1. Introduction
With the recent advancements in 5G, it is not ambitious to expect that 5G will enable
1000× more data traffic than the widely established current 4G standards [1,2]. Foreseeing
the rise in users and the increase in traffic demands, facilitating these massive numbers of users
and serving high-quality cellular networks require high-frequency waves. Recently, millimeter wave
(mmWave) communication has attracted significant interest in designing 5G wireless
communication systems owing to its advantages in reducing spectrum scarcity and enabling
high data speeds [3]. The mmWave frequency band ranges from 30 GHz to 300 GHz. However,
these higher frequencies travel very short distances due to their physical limitations in the
spectrum and demonstrate high path loss [4]. Consequently, higher frequencies require smaller
cells to overcome challenges such as path loss and blockage [5]. Massive multiple-input
multiple-output (mMIMO) systems can use hundreds of antennas simultaneously to propagate
signals in the same time-frequency resource and serve tens of users at the same time [6]. The
mMIMO techniques can be utilized to perform highly directional transmissions thanks to the
short wavelength of mmWave, which makes it physically feasible to equip many antennas at the
transceivers in a cellular network and can significantly improve network capacity [7,8]. Under a
fairly generic channel model that considers poor channel estimation, pilot contamination, path
loss, and terminal-specific antenna correlation, large-scale antenna systems significantly
increase the achievable upload and download rates [9]. In situations with rapid changes in
propagation, large-scale antenna systems can reliably offer high throughput on both the
forward and the reverse link connections [10].


Vehicles are getting more sensors as driving becomes increasingly automated, result-
ing in increasingly higher data rates. Beamforming in mMIMO makes it possible to serve
distant users with mmWave, even users that are not stationary. Therefore, mmWave mMIMO
communication is the practical method for large-bandwidth connected automobiles [11]. As a
result, mmWave mMIMO systems can serve mobile vehicles effectively, provided the proper
beam is selected.
mmWave communications and current microwave-based communication technologies
(e.g., 2.4 GHz and 5 GHz), the mmWave systems present difficulties, such as a high sen-
sitivity to shadowing and a significant signal attenuation [12]. In this paper, in order to
overcome these issues and allow mMIMO environments where a highly non-stationary
active user is present, we introduce a coordinated beamforming scheme utilizing deep
reinforcement learning (DRL) to select the optimal beam for a vehicular communication
system. First, a deep Q-network (DQN) algorithm is created to handle the beam selection
problem as a Markov decision process (MDP). Then, by ensuring that the limitations of the
beam selection matrix are met, our goal is to choose the best beams to maximize the sum
rate for the user served.

1.1. Related Works


There have been a few traditional approaches to beamforming or beam selection.
In [13], Gao et al. followed an exhaustive search approach for beamforming, which exhibits
very high system complexity. On the other hand, Pal et al. [14] followed a different approach
that iterates through the users and beams to determine the best possible beamforming matrices.
This approach also requires a high-complexity algorithm.
On the other hand, deep learning (DL) based approaches show promising results in
terms of application complexity and viability. Alkhateeb et al. [15] derived a high mobility
supported mmWave mMIMO-based DL-enabled coordinated beamforming scheme for
an outdoor scenario. To formulate their design, they utilized distributed base stations
(BSs) simultaneously to serve a mobile user. They predicted the optimal beams using the
traditional DL approach and compared the achievable rate performance of their DL method
with the optimal achievable rate of beamforming. Zhang et al. [16] proposed a multi-user
mMIMO coordinated beamforming scheme for heterogeneous networks (HetNets) focusing
on energy efficiency (EE) based on the convolutional neural network (CNN) approach.
They designed and solved a multi-user massive MIMO HetNet optimization problem to
maximize EE with lower complexity and computation delay. In order to accomplish end-to-end
autonomous beamforming, the authors of [17] introduced a constrained deep neural network-based
beamforming technique. This method uses a neural network in place of the beamforming
matrices used in conventional beamforming.
In [18], in-depth experiments for coordinated multipoint transmission at 73 GHz were
carried out in a downtown Brooklyn urban open square setting. The analysis showed that
serving a user jointly at the same time by many BSs can achieve a considerable coverage
improvement. Moreover, another work on BS coordination, in which a user is concurrently
served by many BSs, demonstrated that such coordination can yield a significant coverage
increase; Maamari et al. analyzed the performance of these heterogeneous mmWave cellular
networks in [19]. Gupta et al. in [20] investigated the case in which a user is served by at least
one line-of-sight (LOS) connection. The results
showed that the density of coordinating BSs should scale with the square of the blockage
density in order to maintain the same LOS connection. Although [18–20] established
how BS coordination significantly increased coverage, they lack the analysis of producing
coordinated beamforming vectors.
In order to enable high-speed, long-range, and reliable transmission in mmWave
60 GHz wireless personal area networks, Wang et al. [21] introduced a beamforming
approach applied in the media access control (MAC) layer on top of various physical layer
(PHY) designs. Ref. [11] suggested a new strategy to lower the overhead for beam alignment
by utilizing dedicated short-range communication (DSRC) and/or sensor information as
side information. Afterward, they provided detailed examples of how to leverage location
data from DSRC to lessen the overhead of beam alignment and tracking in mmWave vehicle-
to-everything (V2X) applications. Conversely, Ref. [22] proposed an algorithm to jointly
optimize the beamforming vectors and power allocation for reconfigurable intelligent
surface (RIS)-based applications. Lin et al. in [23] formulated solutions on the joint design
and optimization of beamforming for hybrid satellite-terrestrial relay networks with RIS
support, and in [24] proposed another methodology for joint beamforming for mmWave
non-orthogonal multiple access (NOMA). Furthermore, the author also investigated secure
energy-efficient beamforming in multibeam satellite systems in [25].
On the other hand, Va et al. [26] proposed a multipath fingerprint database using the
vehicle’s position (for example, as determined by GPS) to gain information on probable
pointing directions for accurate beam alignment. The power loss probability is a parameter
used in the method to measure misalignment precision and is used to enhance candidate
beam selection. Moreover, two candidate beam selection techniques are created, one of which
uses a heuristic, and the other aims to reduce the likelihood of misalignment. Cao et al. [27]
proposed a latency reduction scheme for vehicular network relay selection. In addition, Zhou
et al. [28] proposed a DQN-based algorithm to train and determine the optimal receiver beam
direction with the purpose of maximizing average received signal power.
However, there are various drawbacks to designing beamforming vectors using the stated
approaches, which rely solely on location data and received signal power. First,
narrow-beam systems may not function effectively with position-acquisition sensors such as
GPS because of their poor precision, which is typically in the range of meters. Second, these
technologies are unable to handle indoor applications since GPS sensors perform poorly
inside structures. In addition, the beamforming vectors depend on the environment’s
shape, obstructions, and so on. Furthermore, received signal power can experience severe
penetration power loss because of the vehicle’s metal body. In this paper, we aim to utilize
a DRL-based coordinated approach where we do not encounter the declared challenges
and exhibit better results.

1.2. Contribution
In this paper, we provide a novel DRL approach for highly mobile mmWave
communication architectures. As part of our suggested method, a coordinated beamforming
system is used, in which a number of BSs concurrently provide access to a single
non-stationary user. In this approach, a DRL network exclusively
utilizes beam patterns and learns how to anticipate the BSs beamforming vectors from the
signals obtained at the scattered BSs. The idea behind this is that the propagated waves
collectively acquired at the scattered BSs indicate a distinctive multi-path signature of
both the user position and its surroundings. There are several benefits to the suggested
approach. First, the suggested technique can accommodate not only LOS but also non-LOS
(NLOS) scenarios without the need for specialized position-acquiring devices, because
beamforming prediction is based on the uplink received signals rather than position data.
Second, only received pilots, which may be retrieved with minimal overhead training, are
needed for the determination of the best beams. Furthermore, because the DRL model
trains and responds to any environment, it does not need any training before deployment
in the suggested system. The proposed model also inherits coordination coverage and
reliability improvements since it is coupled with the coordinated beamforming mechanism.
Even though some DRL-based beamforming solutions exist, to the best of our knowledge,
no prior work addressed a coordinated beamforming solution by leveraging DRL where
multiple BSs serve one single mobile user jointly to achieve the highest possible data rate.
The contributions of the proposed beamforming scheme are summarized as follows:
• We develop a simple coordinated beamforming scheme in which several BSs employ RF
beamforming and are connected to a central cloud processing unit that performs the baseband
processing, serving one mobile user at a time. To increase the platform's effective
achievable rate, we define a training and design problem for the central baseband
processing and for the BSs' RF beamforming vectors. The trade-off between the beamforming
training overhead and the achievable sum rate using the proposed beamforming
vectors is taken into account when determining the effective achievable rate for highly
mobile mmWave systems.
• For the selected system, we construct a fundamental coordinated beamforming tech-
nique that relies on uplink training for creating the RF and baseband beamforming
vectors. The BSs choose their RF beamforming vectors from a predetermined codebook
as part of this baseline approach. The baseband beamforming is then designed by a
central processor to guarantee coherent combining at the user. We demonstrate
that this baseline beamforming technique achieves the best attainable rates in a few
special but crucial situations.
• We introduce the system operation and machine learning modeling of a unique combined
DRL and coordinated beamforming solution. In this approach, we incorporate a
reverse autoencoder as the neural network of our DRL model, owing to its capability to
handle raw data seamlessly and to reproduce the input data as closely as possible, and use
it to solve the coordinated beamforming problem. The main concept of the
suggested technique is to anticipate the RF beamforming vectors of the coordinating
BSs using just beam patterns, i.e., with very little training overhead. The proposed
approach also harvests the coverage and latency gains of coordinated beamforming with
minimal coordination overhead, making the method a viable solution for highly mobile
mmWave applications.

2. System Model
In this section, we discuss the chosen frequency-selective coordinated mmWave system
and channel models for our DRL-based coordinated beamforming scheme. For this designated
system and channel model, the DRL model optimizes the beam selection from a set of candidate
beams by utilizing the exploration and exploitation strategy of DRL.
We analyze the mmWave-enabled vehicular communication architecture shown in Figure 1,
where N BSs concurrently provide service to one mobile station (MS). Each BS is
equipped with M antennas and is linked to a central processing unit in the cloud. In the
interest of simplicity, we assume that each BS utilizes analog-only beamforming with networks
of phase shifters and has a single RF chain [29]. In this paper, we also assume that the MS is
equipped with only one antenna.
The signals are precoded using an N × 1 digital precoder $\mathbf{f}_k \in \mathbb{C}^{N \times 1}$ for each subcarrier k,
k = 1, · · · , K. The frequency-domain signals are then converted into the time domain using
N K-point inverse fast Fourier transforms (IFFTs). Afterward, each BS n applies a time-domain
analog beamforming vector and transmits the resulting signal. At the receiver end, the received
signal is converted back to the frequency domain using a K-point FFT, presuming perfect
synchronization of frequency and carrier offset. The received signal on the kth subcarrier at the
MS is denoted by

$$y_k = \sum_{n=1}^{N} \mathbf{h}_{k,n}^{T} \mathbf{x}_{k,n} + n_k, \tag{1}$$

where $\mathbf{x}_{k,n}$ is the transmitted complex baseband signal of the nth BS, $\mathbf{h}_{k,n}$ is the M × 1 channel
vector between the MS and the nth BS, and $n_k$ is the received additive white Gaussian noise
(AWGN), which is independent and identically distributed complex Gaussian with zero mean and
variance $\sigma^2$.
[Figure 1 depicts BSs 1–4 connected to a cloud processing unit and jointly serving one MS.]

Figure 1. Downlink mmWave mMIMO vehicular beamforming system.

We consider an L-cluster geometric wideband model for our mmWave cellular
channel [30–32]. Each cluster l, l = 1, · · · , L, is assumed to contribute one ray
with a temporal delay $\tau_l \in \mathbb{R}$ and azimuth/elevation angles of arrival (AoA) $\theta_l, \phi_l$. Let
$p(\tau)$ be a pulse-shaping function for $T_S$-spaced signaling evaluated at $\tau$ seconds, and let $\rho_n$
denote the path loss between the user and the nth BS [33]. The delay-d channel vector $\mathbf{h}_{d,n}$
between the user and the nth BS in this model can be expressed as

$$\mathbf{h}_{d,n} = \sqrt{\frac{M}{\rho_n}} \sum_{l=1}^{L} \alpha_l\, p(dT_S - \tau_l)\, \mathbf{a}_n(\theta_l, \phi_l), \tag{2}$$

where $\alpha_l$ is the complex gain of the lth path and $\mathbf{a}_n(\theta_l, \phi_l)$ is the array response vector of the
nth BS at azimuth angle $\theta_l$ and elevation angle $\phi_l$. Considering the delay-d channel in (2), the
frequency-domain channel vector $\mathbf{h}_{k,n}$ at subcarrier k can be formulated as

$$\mathbf{h}_{k,n} = \sum_{d=0}^{D-1} \mathbf{h}_{d,n} \exp\!\left(-j \frac{2\pi k}{K} d\right). \tag{3}$$

The adopted block-fading channel model $\{\mathbf{h}_{k,n}\}_{k=1}^{K}$ is assumed to remain constant
throughout the channel coherence time, which depends on the user mobility
and the channel multi-path components [34].
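To make the channel construction concrete, the following NumPy sketch builds the delay-domain taps of (2) and the per-subcarrier channels of (3) for one BS; the function name, the sinc pulse used for p(·), and the array shapes are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def frequency_domain_channel(alpha, tau, steering, rho_n, M, K, D, Ts):
    """Build the per-subcarrier channels h_{k,n} of Eqs. (2)-(3) for one BS n.

    alpha    : (L,) complex path gains alpha_l
    tau      : (L,) path delays tau_l in seconds
    steering : (L, M) array response vectors a_n(theta_l, phi_l), one row per path
    rho_n    : scalar path loss between the user and BS n
    """
    # Placeholder pulse-shaping function p(.) for T_S-spaced signaling (ideal sinc).
    p = lambda t: np.sinc(t / Ts)

    # Eq. (2): delay-domain taps h_{d,n}, d = 0, ..., D-1.
    h_delay = np.zeros((D, M), dtype=complex)
    for d in range(D):
        weights = alpha * p(d * Ts - tau)        # alpha_l * p(d*T_S - tau_l)
        h_delay[d] = np.sqrt(M / rho_n) * weights @ steering

    # Eq. (3): K-point DFT over the delay taps gives the subcarrier channels.
    k = np.arange(K)[:, None]
    d = np.arange(D)[None, :]
    dft = np.exp(-1j * 2 * np.pi * k * d / K)    # (K, D)
    return dft @ h_delay                          # (K, M); row k is h_{k,n}
```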

3. Coordinated Beamforming
In this section, we introduce a baseline DRL coordinated beamforming approach
for a highly mobile vehicular mmWave communication system, as shown in Figure 2. To
present the proposed solution, we first describe the problem formulation and then derive
the novel DRL-based approach for beamforming. In this section, we also present the
environment setup, dataset generation, simulation parameters, and performance analysis
for our proposed scheme.
[Figure 2 depicts a massive MIMO array transmitting mmWave beams toward receivers.]

Figure 2. Overview of mMIMO beamforming.

3.1. Problem Statement


For a vehicular mmWave-based 5G network, serving any user or MS is challenging
because of the dynamic and varying environment characteristics. When signal interference,
fading effects, and network congestion are considered, which we subsequently describe as
the environment dynamics [35], it becomes much more complicated to serve the receiver
end while maintaining enhanced mobile broadband (eMBB), massive machine-type communication
(mMTC), and ultra-reliable low-latency communication (URLLC) standards. Considering the
time-varying environment of wireless communication, a DRL-based beamforming scheme is most
appropriate. DL-based approaches struggle to show promising results when dealing with the
stated time-varying environments because they lack the functionality
of learning by good or bad actions. To achieve the highest level of sum rate, reduce the
overhead, and tackle the large RF beamforming vector arrays, an adaptive beam selection
approach such as DRL is best suited for this specific task. With this motivation, in this
paper, we exploit the DRL’s capability of tackling varying environments to maximize the
achievable data rate by selecting the optimal beam for mmWave vehicular networks in a
coordinated approach.
In this paper, considering a set of beamforming vectors $\{\mathbf{f}_n^{\mathrm{BF}}\}_{n=1}^{N}$, our focus is to
formulate a beam selection matrix to optimize the achievable downlink rate of the mmWave
vehicular beamforming system. The user's maximum achievable rate can be derived as

$$R = \frac{1}{K}\sum_{k=1}^{K} \log_2\left(1 + \mathrm{SNR}\left|\sum_{n=1}^{N} \mathbf{h}_{k,n}^{T} \mathbf{f}_n^{\mathrm{BF}}\right|^{2}\right). \tag{4}$$

3.2. DRL-Based Coordinated Beamforming Framework


We propose a DRL framework that utilizes DQN to train and optimize the beam
selection assignment. Typically, the DQN technique consists of an environment and an
agent using a deep neural network (DNN). The agent, which corresponds to the BS in this study,
engages with the environment before performing any action. In the beginning, the agent starts
exploring the environment, moving from one state to another. At that point, it needs more
information about the environment. As the agent explores the environment, it gathers
information and starts to take action by exploiting the environment with the help of
the reward function. In any timestep t, if the current state is St , the agent will receive
an immediate reward Rt assessing the performed action At using the DNN. The agent
also gets to take the next state St+1 as input from the environment in the same timestep.
Depending upon the performed At , the agent receives a reward Rt . If the action taken can
achieve a reasonable sum rate, then the agent will also receive a good Rt . The agent gains
knowledge of its surroundings and develops an ideal beam selection assignment strategy
by foreseeing future events. The DNN algorithm learns this policy π at each timestep as it
continues to move forward with the next timesteps. We formulate our state, action, and
reward functions as follows:
• State: We utilize the channel matrices for all the BSs as the state of our environment.
The complex channel matrices are constructed incorporating the bandwidth,
user position, noise figure, and noise power. If the environment has Z states, each
having V beams, then the Z × V state space can be represented as
$\mathcal{S} = \{\tilde{S}_1, \tilde{S}_2, \tilde{S}_3, \cdots, \tilde{S}_Z\}$.
• Action: The goal of the agent is to assign a serving beam from the action space $\mathcal{A}$.
At each episode, for a set of $\mathcal{S}$, the agent has to take $Z \in \mathcal{A}$ actions while maintaining
one action per V elements of $\mathcal{S}$. Out of the Z × V candidates, the target of the agent is
to choose a beam that maximizes the data rate.
• Reward: In our reward function, we first derive the data rate for each channel as follows:

$$R_r = \log_2\left(1 + \mathrm{SNR}\left|\sum_{n=1}^{N} \mathbf{h}_{k,n}^{T} \mathbf{f}_n^{\mathrm{BF}}\right|^{2}\right). \tag{5}$$

For every action the agent takes, we calculate the data rate of the chosen action and
feed it back as the reward value. Our aim is to acquire the highest possible cumulative
reward $R_{\max}$ as the agent obtains a reward for each action, according to

$$R_{\max} = \arg\max \sum_{k=1}^{K} \log_2\left(1 + \mathrm{SNR}\left|\sum_{n=1}^{N} \mathbf{h}_{k,n}^{T} \mathbf{f}_n^{\mathrm{BF}}\right|^{2}\right). \tag{6}$$
With this state, action, and reward function, we propose the DNN architecture as the
policy controller for the beam selection, as shown in Figure 3. The DNN takes the place of
the Q-table and calculates the Q-values for each environment state-action pair. Deriving
probabilities for each beam selection for each state space is the primary objective of the
DNN, and this probability can be defined by Q(S, A) of the DQN algorithm. We select the
best beam out of V = 64 candidate beams in coordination with 4 BSs.
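The following sketch shows one plausible way to wrap this state-action-reward formulation as an environment for the DQN agent; the class name, the handling of the joint action (one beam index per BS), and the episode termination rule are simplifying assumptions rather than the exact setup of this work.

```python
import numpy as np

class BeamSelectionEnv:
    """Toy environment for the beam-selection MDP of Section 3.2.

    states   : list of Z channel "state" matrices, each of shape (K, N, M)
    codebook : (V, M) candidate beamforming codebook shared by the BSs
    """
    def __init__(self, states, codebook, snr):
        self.states, self.codebook, self.snr = states, codebook, snr
        self.z = 0                                      # index of the current state

    def reset(self):
        self.z = 0
        return self.states[self.z]

    def step(self, beam_indices):
        """beam_indices: one codebook index per BS, i.e., the joint action."""
        h = self.states[self.z]                         # (K, N, M)
        f_bf = self.codebook[np.asarray(beam_indices)]  # (N, M)
        combined = np.einsum('knm,nm->k', h, f_bf)
        # Reward: data rate of the chosen action, cf. Eqs. (4)-(6).
        reward = np.mean(np.log2(1 + self.snr * np.abs(combined) ** 2))
        self.z += 1
        done = self.z >= len(self.states)
        next_state = self.states[self.z] if not done else self.states[-1]
        return next_state, reward, done
```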

[Figure 3 depicts the downsampling and upsampling paths of stacked linear layers with 64, 128, 256, and 512 neurons and a sigmoid output.]

Figure 3. Proposed DNN architecture.

3.3. Reverse Autoencoder


An autoencoder is a neural network that can be trained to reconstruct its input [36].
It is a particular kind of neural network that is primarily developed to compress and
meaningfully represent the input before decoding it back, so that the reconstructed input
is as similar to the original as possible [37]. Moreover, the autoencoder can handle
raw input data without any difficulty and is viewed as a component of the unsupervised
learning family of models [38]. The autoencoder consists of three main components: the encoder,
the code, and the decoder. For conventional autoencoders, the number of neurons decreases in
the deeper hidden layers; in a reverse autoencoder, however, it increases [39]. We resort
to the recently introduced reverse autoencoder for the DNN segment of our DQN model.
In the encoder, the input layer starts with $2^c$ neurons, and the following hidden layers have
$2^{c+p}$ neurons, where we start with c = 5 and p refers to the position of the layer. For the
code layer, we use c + p = 9. The decoder portion ends with an output layer and is the exact
mirror of the encoder portion. Because the layers are stacked one on top of the other, this
form of structure is referred to as a stacked autoencoder. Additionally, each layer
in the autoencoder has its own ReLU activation function.
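The PyTorch-style sketch below gives one plausible reading of this stacked reverse-autoencoder DNN, with widths growing from 2^5 to the 2^9 code layer and a mirrored decoder; the class name, state_dim, n_beams, and the interpretation of the sigmoid outputs as per-beam scores are assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class ReverseAutoencoderDNN(nn.Module):
    """Stacked 'reverse autoencoder' used as the DQN's policy/target network.

    Widths follow 2**(c+p) with c = 5, growing to the 2**9 = 512-neuron code
    layer and mirroring back down; every hidden layer uses ReLU and the output
    uses a sigmoid. `state_dim` and `n_beams` are placeholders.
    """
    def __init__(self, state_dim, n_beams=64):
        super().__init__()
        widths = [32, 64, 128, 256]                    # 2**5 ... 2**8
        enc, last = [], state_dim
        for w in widths:                               # encoder path: widths grow
            enc += [nn.Linear(last, w), nn.ReLU()]
            last = w
        enc += [nn.Linear(last, 512), nn.ReLU()]       # code layer, 2**9 neurons
        dec, last = [], 512
        for w in reversed(widths):                     # mirrored decoder path
            dec += [nn.Linear(last, w), nn.ReLU()]
            last = w
        self.net = nn.Sequential(*enc, *dec,
                                 nn.Linear(last, n_beams), nn.Sigmoid())

    def forward(self, state):                          # state: (batch, state_dim)
        return self.net(state)                         # per-beam selection scores
```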

4. Performance Evaluation
In this section, we evaluate the proposed DRL-based coordinated beamforming approach
in different case studies by comparing it with a traditional DL architecture [15]. In
a multicell mmWave mMIMO downlink scenario, a large uniform planar array (UPA) is
installed on each BS. In this paper, we select 4 BSs, each with a 32 × 8 UPA, resulting in
M = 256 antenna elements per BS.
For our methodology, we used the popular, publicly available DeepMIMO dataset [40]
generated with the Wireless InSite ray-tracing software [41]. The dataset contains the generated
beamforming vectors, or predetermined codebooks, denoted as $\mathbf{f}_n^{\mathrm{BF}}$. Along with $\mathbf{f}_n^{\mathrm{BF}}$, we also
use the corresponding channel matrices, denoted $\mathbf{h}_{k,n}$ in our proposed scheme, for defining the
states and rewards discussed in Section 3.2 and for selecting the optimal beamforming vectors.
We used an outdoor scenario of two streets and one intersection, with mmWave
communication operating at 60 GHz. We aim to serve the MS with the best beam in
coordination with 4 BSs. In this scenario, the 4 BSs are mounted on top of 4 lamp posts
to concurrently provide beam coverage for one MS. The lamp posts are spaced 60 m apart.
Each BS is installed at a 6 m elevation and has 32 × 8 antenna elements. The MS has a
single antenna on top of the vehicle. During
the uplink training, we assumed a transmit power of 30 dBm for the MS. The adopted
DeepMIMO parameters for dataset generation and the simulation parameters used in this
work are summarized in Table 1 and Table 2, respectively.

Table 1. Dataset parameters.

Parameters Values
Scenario O1_60
Active BS 3,4,5,6
Receivers R1000–R1300
Frequency band 60 GHz
Bandwidth 500 MHz
Number of OFDM subcarriers 1024
Subcarrier limit 64
Number of paths 5
BS antenna shape 1 × 32 × 8
Receiver antenna shape 1×1×1
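For readability, the Table 1 settings can be grouped into a plain configuration dictionary before invoking the dataset generator; the key names below are illustrative placeholders and do not correspond to the DeepMIMO generator's actual parameter names.

```python
# Illustrative grouping of the Table 1 settings; the key names are placeholders
# and do not correspond to the DeepMIMO generator's actual parameter names.
deepmimo_config = {
    "scenario": "O1_60",               # outdoor two-street scenario at 60 GHz
    "active_bs": [3, 4, 5, 6],         # the 4 coordinating lamp-post BSs
    "receivers": ("R1000", "R1300"),   # user-grid rows used as MS positions
    "frequency_band_ghz": 60,
    "bandwidth_mhz": 500,
    "ofdm_subcarriers": 1024,
    "subcarrier_limit": 64,
    "num_paths": 5,
    "bs_antenna_shape": (1, 32, 8),    # 32 x 8 UPA, M = 256 elements per BS
    "rx_antenna_shape": (1, 1, 1),     # single-antenna MS
}
```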
Table 2. Simulation parameters for the DRL model.

Parameters Values
Beams per BS distribution 16
Total beams 64
Transmit power 30 dBm
Learning rate (LR) 0.0005
Discount factor (γ) 0.999
Epsilon (e) [1, 0.1, 0.001]
Batch size 96
Number of episodes 250
Data instances 200

4.1. Training
The proposed DNN is gradually trained using a set of training data for each episode.
For every state space S, the state-action pair is formulated using the e-greedy policy in
accordance with the output probabilities of the DNN. An episode is considered complete when
the entire state space has been processed by the DNN. For every state space, the exploitation
policy [42], i.e., the policy for taking an action, can be represented as

$$a_t^l = \begin{cases} \arg\max\, Q(S_t, A_t^l) & \text{if } e < e_{th},\ e_{th} \in (0, 1] \\ \text{random action} \in [1, V] & \text{otherwise,} \end{cases} \tag{7}$$

$$\forall l = 1, 2, \dots, Z \in \mathcal{A}, \quad \forall t = 1, 2, \dots, \text{ins}.$$
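A minimal sketch of this exploitation policy is given below, assuming the policy DNN outputs one Q-value per candidate beam; the function and argument names are placeholders.

```python
import random
import torch

def select_beam(policy_dnn, state, epsilon, eps_th, n_beams=64):
    """Epsilon-greedy action selection following Eq. (7).

    Exploits the policy DNN when the current epsilon is below the threshold
    eps_th; otherwise one of the V = n_beams candidate beams is drawn at random.
    """
    if epsilon < eps_th:
        with torch.no_grad():
            q_values = policy_dnn(state)      # one Q-value estimate per beam
        return int(q_values.argmax().item())  # greedy beam index
    return random.randrange(n_beams)          # uniform random exploration
```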

After executing $a_t^l$, the agent will receive the reward according to (5) and the next state
space St+1 . Afterward, we first determine the loss and then tweak the DNN’s parameters
using back-propagation to train our model. We take an approximation of the optimal Q∗-
values for each state-action pair for St+1 from a separate DNN termed the target DNN [43]
in order to compute the loss. The policy DNN’s settings are used to initialize the target
DNN, which is identical to it. Consequently, for the target DNN input, we use the next
state space St+1 as the input, and finally, the agent chooses optimal Q∗ -values greedily
from the output. We add experience replay memory (ERM) to the DQN to help the optimal
policy converge more steadily [44]. The agent first explores its environment while saving its
current states, actions, rewards and next states (St , At , Rt , St+1 ) as a tuple in the ERM. The
agent then trains the policy DNN using a small batch of tuples from the ERM. Each training
set of data continues to be updated in the ERM. We summarize the system architecture and
working principles of our model in Figure 4 and Algorithm 1.
In the training phase of our model, we used Adam optimizer [45] with a learning
rate of 0.0005. The DRL model minimizes the error of our training in the DNN using the
Smooth L1 loss function [46,47]. If we have a batch of size B, the unreduced loss for two
data points (u, w) can be described as

$$\mathrm{loss}(u, w) = \{\mathrm{loss}_1, \dots, \mathrm{loss}_B\}^{T}, \tag{8}$$

where in any loss instance b ∈ B,


$$\mathrm{loss}_b = \begin{cases} 0.5\,(u_b - w_b)^2 / \beta & \text{if } |u_b - w_b| < \beta \\ |u_b - w_b| - 0.5\,\beta & \text{otherwise.} \end{cases} \tag{9}$$
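For reference, (9) is the element-wise form of PyTorch's SmoothL1Loss; the short sketch below shows the explicit expression alongside the built-in criterion, with the β value and tensor sizes chosen only for illustration.

```python
import torch
import torch.nn as nn

# Element-wise form of Eq. (9); matches torch.nn.SmoothL1Loss with the same beta.
def smooth_l1(u, w, beta=1.0):
    diff = (u - w).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

# Reduced (mean) form used during training, e.g., predicted vs. target Q-values.
criterion = nn.SmoothL1Loss(beta=1.0)
loss = criterion(torch.randn(96), torch.randn(96))
```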
Algorithm 1 Proposed deep Q-learning algorithm

1: Initialize policy and target DQNs with random weights w, w′
2: Initialize e
3: for each episode do
4:   for each instance do
5:     Select a channel matrix and add it to action space At for present state space St
6:     Observe immediate reward Rt and next state space St+1
7:     Put (St, At, Rt, St+1) → ERM
8:     Form a random mini batch of (St, At, Rt, St+1) samples from the ERM
9:     for each tuple in the mini batch do
10:      Calculate Q-values
11:      Approximate Q∗-values using the target DNN
12:      Compute the loss from Q and Q∗
13:      Optimize w of the policy DNN with the Adam optimizer
14:  w′ ← w after all time steps
Ensure: Rr ≈ Rmax
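To complement Algorithm 1, the following PyTorch-style sketch illustrates one optimization step with the experience replay memory and the target DNN; the function name, tensor shapes, and the placement of the target-network update are assumptions about this work's implementation.

```python
import random
import torch
import torch.nn as nn

def dqn_training_step(policy_dnn, target_dnn, optimizer, erm,
                      batch_size=96, gamma=0.999):
    """One optimization step of Algorithm 1.

    erm is the experience replay memory: a list of (state, action, reward,
    next_state) tuples stored as tensors (actions as long scalars).
    """
    if len(erm) < batch_size:
        return None
    batch = random.sample(erm, batch_size)
    states, actions, rewards, next_states = map(torch.stack, zip(*batch))

    # Q-values of the actions actually taken, from the policy DNN (step 10).
    q = policy_dnn(states).gather(1, actions.view(-1, 1)).squeeze(1)

    # Q*-values approximated greedily from the target DNN (step 11).
    with torch.no_grad():
        q_star = rewards + gamma * target_dnn(next_states).max(dim=1).values

    loss = nn.SmoothL1Loss()(q, q_star)        # step 12, Eqs. (8)-(9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # step 13, Adam update of w
    return loss.item()

# After all time steps of an episode (step 14), copy the policy weights into
# the target DNN: target_dnn.load_state_dict(policy_dnn.state_dict())
```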

[Figure 4 depicts the agent, comprising the policy DNN and target DNN (each with the downsampling/upsampling linear layers of Figure 3), the experience replay memory with mini-batch sampling, the loss calculation and optimization blocks, and the environment of coordinated BSs with the cloud processing unit, exchanging states, actions, and rewards.]

Figure 4. Proposed DRL framework.

4.2. Performance Analysis


In this subsection, we evaluate the achieved performance in terms of sum rate and compare it
with the traditional DL approach. Figure 5 presents the performance analysis of our proposed
model using three performance metrics: the effective achievable rate of our DRL approach, that
of conventional DL, and the optimal data rate. The use of the reverse autoencoder in the DRL
model does add computation and time complexity in the learning phase of the model; however,
the delays we verified in our simulations are slight in the considered system environment. In
addition, we confirmed that the performance degradation due to these delays is insignificant in
the simulation results. It is clear that our proposed DRL outperforms the traditional DL model
by a large margin and demonstrates suboptimal performance.
In this figure, we did not consider any beam training or latency overhead. For vehicular
mmWave communication, when the user is mobile, one of the most significant sources of
communication overhead is velocity, because the connectivity between the BS and the user is
affected by the velocity. A fast-moving user needs fast beam switching from the BS; otherwise,
because of the delay, the user might not be served in time by the BS as it moves away from its
current position.


[Figure 5 plots the achievable rate (bps/Hz) versus the predicted beams for the optimal achievable rate, the proposed DRL-based suboptimal achievable rate, and the DL-based achievable rate.]

Figure 5. A comparison of effective achievable rates without overhead consideration.

In Figure 6, we compare the DRL- and DL-based beamforming performance with the
optimal beamforming performance while incorporating overhead. More specifically, we
compared our DRL-based achievable sum rate for all three considered user speeds side by side.
The performance was very consistent throughout the plot, and the decline in the achievable rate
due to the increased overhead was negligible. In this stage, we first consider the 64-beam
training overhead with the coherence time at 40 kmph. It is visible that, even
though our suboptimal performance experienced a slight decrease, the DRL beamforming
achievable rate is still significantly higher than the DL approach. We also compared the
achievable rate versus different user positions at 80 kmph and 120 kmph in the same
Figure 6. The results followed similar trends. Our DRL-based approach outperformed
the DL approach by a large margin and demonstrated suboptimal performance. As the
user position moved, the achievable rate saw a slight but steady decrease over the period.
However, for the traditional DL-based approach, the performance was inconsistent.
We also compare the performance of our proposed DRL scheme by varying SNR as
shown in Figure 7, which demonstrates how the performance of our model varies at two
SNR levels: a low SNR of 10 dB and a high SNR of 30 dB. The previous results, obtained at an
SNR of 38.65 dB, showed higher rates. We confirmed that after the SNR
was reduced to 30 dB, the initial performance dropped by 14.27% in terms of the average
sum rate for our DRL method. In addition, we have illustrated the performance of our DRL
model at an SNR of 10 dB in this figure. It is noticeable that, for another 20 dB of SNR drop,
the performance declined by another 38.51%.

[Figure 6 plots the average achievable rate (bps/Hz) at the considered user speeds (kmph) for the optimal rate, the proposed DRL-based rate, and the DL-based rate.]

Figure 6. A comparison of effective achievable rate including overhead consideration.


[Figure 7 plots the average achievable rate (bps/Hz) at low SNR (10 dB) and high SNR (30 dB) for the optimal rate and the proposed DRL-based rate at the considered speeds (kmph).]

Figure 7. A comparison of effective achievable rate including overhead consideration at high and
low SNR.

Furthermore, in Figure 8, we illustrate the convergence of our proposed algorithm in terms of
the training loss over the time steps t. After approximately 3.2 × 10^6 iterations, we confirmed
that our model converged successfully.
Overall, the performance of our model significantly rises as the SNR increases. Our pro-
posed DRL architecture is robust and flexible in various conditions, such as different SNRs
and different velocities.
[Figure 8 plots the loss per iteration of the deep Q-network against the number of iterations (×10^6).]

Figure 8. Loss convergence plot for the proposed DQN-based coordinated beamforming.

5. Conclusions
In this paper, we propose a sub-optimal beam selection scheme with DRL that enables
highly mobile applications in mmWave mMIMO systems. The key idea is to utilize the
powerful exploration-exploitation strategy of DRL to derive the optimal beam selection
policy by learning the mapping from the omni-received uplink pilots to the sub-optimal beams.
Our proposed scheme guarantees achievable sum rate performance close to optimal while
requiring only a small training and beam overhead. In addition, the proposed scheme ensures
reliable coverage and shorter latency while steering the beams toward the highly mobile
mmWave user.

Author Contributions: Conceptualization, W.C.; methodology, P.T. and W.C.; software, P.T.; valida-
tion, P.T. and W.C.; formal analysis, P.T. and W.C.; investigation, P.T.; resources, W.C.; data curation,
P.T. and W.C.; writing—original draft preparation, P.T.; writing—review and editing, W.C.; visualiza-
tion, P.T.; supervision, W.C.; project administration, W.C.; funding acquisition, W.C. All authors have
read and agreed to the published version of the manuscript.
Funding: This work was supported by the research fund from Chosun University, 2022.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Saghezchi, F.B.; Rodriguez, J.; Mumtaz, S.; Radwan, A.; Lee, W.C.; Ai, B.; Islam, M.T.; Akl, S.; Taha, A.E.M. Drivers for 5G: The
‘pervasive connected world’. In Fundamentals of 5G Mobile Networks; John Wiley & Sons: Hoboken, NJ, USA, 2015; pp. 1–27.
2. Chen, T.; Matinmikko, M.; Chen, X.; Zhou, X.; Ahokangas, P. Software defined mobile networks: Concept, survey, and research
directions. IEEE Commun. Mag. 2015, 53, 126–133. [CrossRef]
3. Boccardi, F.; Heath, R.W.; Lozano, A.; Marzetta, T.L.; Popovski, P. Five disruptive technology directions for 5G. IEEE Commun.
Mag. 2014, 52, 74–80. [CrossRef]
4. Busari, S.A.; Huq, K.M.S.; Mumtaz, S.; Dai, L.; Rodriguez, J. Millimeter-Wave Massive MIMO Communication for Future Wireless
Systems: A Survey. IEEE Commun. Surv. Tutorials 2018, 20, 836–869. [CrossRef]
5. Li, Q.; Niu, H.; Papathanassiou, A.T.; Wu, G. 5G Network Capacity: Key Elements and Technologies. IEEE Veh. Technol. Mag.
2014, 9, 71–78. [CrossRef]
6. Larsson, E.G.; Edfors, O.; Tufvesson, F.; Marzetta, T.L. Massive MIMO for next generation wireless systems. IEEE Commun. Mag.
2014, 52, 186–195. [CrossRef]
7. Ghosh, A.; Thomas, T.A.; Cudak, M.C.; Ratasuk, R.; Moorut, P.; Vook, F.W.; Rappaport, T.S.; MacCartney, G.R.; Sun, S.; Nie, S.
Millimeter-Wave Enhanced Local Area Systems: A High-Data-Rate Approach for Future Wireless Networks. IEEE J. Sel. Areas
Commun. 2014, 32, 1152–1163. [CrossRef]
8. Rusek, F.; Persson, D.; Lau, B.K.; Larsson, E.G.; Marzetta, T.L.; Edfors, O.; Tufvesson, F. Scaling Up MIMO: Opportunities and
Challenges with Very Large Arrays. IEEE Signal Process. Mag. 2013, 30, 40–60. [CrossRef]
9. Hoydis, J.; ten Brink, S.; Debbah, M. Massive MIMO in the UL/DL of Cellular Networks: How Many Antennas Do We Need?
IEEE J. Sel. Areas Commun. 2013, 31, 160–171. [CrossRef]
10. Marzetta, T.L. Noncooperative Cellular Wireless with Unlimited Numbers of Base Station Antennas. IEEE Trans. Wirel. Commun.
2010, 9, 3590–3600. [CrossRef]
11. Choi, J.; Va, V.; Gonzalez-Prelcic, N.; Daniels, R.; Bhat, C.R.; Heath, R.W. Millimeter-Wave Vehicular Communication to Support
Massive Automotive Sensing. IEEE Commun. Mag. 2016, 54, 160–167. [CrossRef]
12. Tarafder, P.; Choi, W. MAC Protocols for mmWave Communication: A Comparative Survey. Sensors 2022, 22, 3853. [CrossRef]
[PubMed]
13. Gao, X.; Dai, L.; Chen, Z.; Wang, Z.; Zhang, Z. Near-Optimal Beam Selection for Beamspace MmWave Massive MIMO Systems.
IEEE Commun. Lett. 2016, 20, 1054–1057. [CrossRef]
14. Pal, R.; Srinivas, K.V.; Chaitanya, A.K. A Beam Selection Algorithm for Millimeter-Wave Multi-User MIMO Systems. IEEE
Commun. Lett. 2018, 22, 852–855. [CrossRef]
15. Alkhateeb, A.; Alex, S.; Varkey, P.; Li, Y.; Qu, Q.; Tujkovic, D. Deep Learning Coordinated Beamforming for Highly-Mobile
Millimeter Wave Systems. IEEE Access 2018, 6, 37328–37348. [CrossRef]
16. Zhang, Y.; Zhang, B.; Wang, H.; Zhang, T.; Qian, Y. Deep Learning-based Coordinated Beamforming for Massive MIMO-Enabled
Heterogeneous Networks. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain,
7–11 December 2021; pp. 1–6. [CrossRef]
17. Tao, J.; Wang, Q.; Luo, S.; Chen, J. Constrained Deep Neural Network Based Hybrid Beamforming for Millimeter Wave Massive
MIMO Systems. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai,
China, 20–24 May 2019; pp. 1–6. [CrossRef]
18. MacCartney, G.R.; Rappaport, T.S.; Ghosh, A. Base Station Diversity Propagation Measurements at 73 GHz Millimeter-Wave for
5G Coordinated Multipoint (CoMP) Analysis. In Proceedings of the 2017 IEEE Globecom Workshops (GC Wkshps), Singapore,
4–8 December 2017; pp. 1–7. [CrossRef]
19. Maamari, D.; Devroye, N.; Tuninetti, D. Coverage in mmWave Cellular Networks with Base Station Co-Operation. IEEE Trans.
Wirel. Commun. 2016, 15, 2981–2994. [CrossRef]
20. Gupta, A.K.; Andrews, J.G.; Heath, R.W. Macrodiversity in cellular networks with random blockages. IEEE Trans. Wirel. Commun.
2017, 17, 996–1010. [CrossRef]
21. Wang, J.; Lan, Z.; Woo Pyo, C.; Baykas, T.; Sean Sum, C.; Rahman, M.; Gao, J.; Funada, R.; Kojima, F.; Harada, H.; et al. Beam
codebook based beamforming protocol for multi-Gbps millimeter-wave WPAN systems. IEEE J. Sel. Areas Commun. 2009, 27,
1390–1399. [CrossRef]
22. Niu, H.; Lin, Z.; Chu, Z.; Zhu, Z.; Xiao, P.; Nguyen, H.X.; Lee, I.; Al-Dhahir, N. Joint beamforming design for secure RIS-assisted
IoT networks. IEEE Internet Things J. 2022, 10, 1628–1641. [CrossRef]
23. Lin, Z.; Niu, H.; An, K.; Wang, Y.; Zheng, G.; Chatzinotas, S.; Hu, Y. Refracting RIS-aided hybrid satellite-terrestrial relay
networks: Joint beamforming design and optimization. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3717–3724. [CrossRef]
24. Lin, Z.; Lin, M.; Wang, J.B.; De Cola, T.; Wang, J. Joint beamforming and power allocation for satellite-terrestrial integrated
networks with non-orthogonal multiple access. IEEE J. Sel. Top. Signal Process. 2019, 13, 657–670. [CrossRef]
25. Lin, Z.; An, K.; Niu, H.; Hu, Y.; Chatzinotas, S.; Zheng, G.; Wang, J. SLNR-based secure energy efficient beamforming in
multibeam satellite systems. IEEE Trans. Aerosp. Electron. Syst. 2022. [CrossRef]
26. Va, V.; Choi, J.; Shimizu, T.; Bansal, G.; Heath, R.W. Inverse Multipath Fingerprinting for Millimeter Wave V2I Beam Alignment.
IEEE Trans. Veh. Technol. 2018, 67, 4042–4058. [CrossRef]
27. Cao, D.; Zheng, B.; Ji, B.; Lei, Z.; Feng, C. A robust distance-based relay selection for message dissemination in vehicular network.
Wirel. Netw. 2020, 26, 1755–1771. [CrossRef]
28. Zhou, X.; Zhang, X.; Chen, C.; Niu, Y.; Han, Z.; Wang, H.; Sun, C.; Ai, B.; Wang, N. Deep Reinforcement Learning Coordinated
Receiver Beamforming for Millimeter-Wave Train-Ground Communications. IEEE Trans. Veh. Technol. 2022, 71, 5156–5171.
[CrossRef]
29. Heath, R.W.; González-Prelcic, N.; Rangan, S.; Roh, W.; Sayeed, A.M. An Overview of Signal Processing Techniques for Millimeter
Wave MIMO Systems. IEEE J. Sel. Top. Signal Process. 2016, 10, 436–453. [CrossRef]
30. Rappaport, T.S.; Sun, S.; Mayzus, R.; Zhao, H.; Azar, Y.; Wang, K.; Wong, G.N.; Schulz, J.K.; Samimi, M.; Gutierrez, F. Millimeter
Wave Mobile Communications for 5G Cellular: It Will Work! IEEE Access 2013, 1, 335–349. [CrossRef]
31. Akdeniz, M.R.; Liu, Y.; Samimi, M.K.; Sun, S.; Rangan, S.; Rappaport, T.S.; Erkip, E. Millimeter Wave Channel Modeling and
Cellular Capacity Evaluation. IEEE J. Sel. Areas Commun. 2014, 32, 1164–1179. [CrossRef]
32. Samimi, M.K.; Rappaport, T.S. Ultra-wideband statistical channel model for non line of sight millimeter-wave urban channels.
In Proceedings of the 2014 IEEE Global Communications Conference, Austin, TX, USA, 8–12 December 2014; pp. 3483–3489.
[CrossRef]
33. Schniter, P.; Sayeed, A. Channel estimation and precoder design for millimeter-wave communications: The sparse way. In
Proceedings of the 2014 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2–5 November
2014; pp. 273–277. [CrossRef]
34. Va, V.; Choi, J.; Heath, R.W. The Impact of Beamwidth on Temporal Channel Variation in Vehicular Channels and Its Implications.
IEEE Trans. Veh. Technol. 2017, 66, 5014–5029. [CrossRef]
35. Sana, M.; De Domenico, A.; Yu, W.; Lostanlen, Y.; Calvanese Strinati, E. Multi-Agent Reinforcement Learning for Adaptive User
Association in Dynamic mmWave Networks. IEEE Trans. Wirel. Commun. 2020, 19, 6520–6534. [CrossRef]
36. McClelland, J.L.; Rumelhart, D.E.; PDP Research Group. Volume 2: Explorations in the Microstructure of Cognition: Psychological
and Biological Models. In Parallel Distributed Processing; MIT Press: Cambridge, MA, USA, 1987; Volume 2.
37. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. arXiv 2020, arXiv:2003.05991.
38. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
39. Ling, C.; Cao, G.; Cao, W.; Wang, H.; Ren, H. IAE-ClusterGAN: A new Inverse autoencoder for Generative Adversarial Attention
Clustering network. Neurocomputing 2021, 465, 406–416. [CrossRef]
40. Alkhateeb, A. DeepMIMO: A Generic Deep Learning Dataset for Millimeter Wave and Massive MIMO Applications. In
Proceedings of the Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 10–15 February 2019; pp. 1–8.
41. Remcom. Wireless InSite. Available online: http://www.remcom.com/wireless-insite (accessed on 1 October 2022).
42. Rezwan, S.; Choi, W. Priority-based joint resource allocation with deep q-learning for heterogeneous NOMA systems. IEEE
Access 2021, 9, 41468–41481. [CrossRef]
43. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approxima-
tion. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063.
44. Zhang, S.; Sutton, R.S. A deeper look at experience replay. arXiv 2017, arXiv:1712.01275.
45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
46. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; pp. 1440–1448.
47. SmoothL1Loss—PyTorch 1.13 Documentation. Available online: https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html (accessed on 1 October 2022).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
