Sensors 2023, 23, 2772
Article
Deep Reinforcement Learning-Based Coordinated
Beamforming for mmWave Massive MIMO Vehicular Networks
Pulok Tarafder and Wooyeol Choi *
Abstract: As a critical enabler for beyond fifth-generation (B5G) technology, millimeter wave
(mmWave) beamforming has been studied for many years. Multiple-input multiple-output
(MIMO) systems, which form the baseline for beamforming operation, rely heavily on multiple antennas
to stream data in mmWave wireless communication systems. High-speed mmWave applications face
challenges such as blockage and latency overhead. In addition, the efficiency of the mobile systems is
severely impacted by the high training overhead required to discover the best beamforming vectors
in large antenna array mmWave systems. In order to mitigate the stated challenges, in this paper,
we propose a novel deep reinforcement learning (DRL) based coordinated beamforming scheme
where multiple base stations serve one mobile station (MS) jointly. The constructed solution then
uses a proposed DRL model and predicts the suboptimal beamforming vectors at the base stations
(BSs) out of possible beamforming codebook candidates. This solution enables a complete system
that facilitates highly mobile mmWave applications with dependable coverage, minimal training
overhead, and low latency. Numerical results demonstrate that our proposed algorithm remarkably
increases the achievable sum rate capacity for the highly mobile mmWave massive MIMO scenario
while ensuring low training and latency overhead.
Keywords: deep reinforcement learning; vehicular network; massive MIMO; beamforming; mmWave
Citation: Tarafder, P.; Choi, W. Deep Reinforcement Learning-Based Coordinated Beamforming for mmWave Massive MIMO Vehicular Networks. Sensors 2023, 23, 2772. https://doi.org/10.3390/s23052772

Academic Editor: Paolo Bellavista

Received: 5 February 2023; Revised: 26 February 2023; Accepted: 27 February 2023; Published: 3 March 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

With the recent advancements in 5G, it is not ambitious to expect that 5G will enable
1000× more data traffic than the widely established current 4G standards [1,2]. Foreseeing
the rise in traffic demands, facilitating these massive numbers of users and serving them with
high-quality cellular networks require high-frequency waves. Recently, millimeter wave
(mmWave) communication has attracted significant interest in the design of 5G wireless
communication systems owing to its advantages in reducing spectrum scarcity and enabling
high data speeds [3]. The mmWave frequency band lies between 30 GHz
and 300 GHz. However, these higher frequencies travel very short distances due to their
physical limitations in the spectrum and exhibit high path loss [4]. Consequently,
higher frequencies require smaller cellular cells to overcome challenges such as path
loss and blockage [5]. Massive multiple-input multiple-output (mMIMO) can use
hundreds of antennas simultaneously to propagate signals in the same time-frequency
resource and serve tens of users at the same time [6]. mMIMO techniques can be
utilized to perform highly directional transmissions thanks to the short wavelength of
mmWave, which makes it physically feasible to equip many antennas at the transceiver
in a cellular network and can significantly improve network capacity [7,8]. Under a fairly
generic channel model that considers poor channel estimation, pilot contamination, path
loss, and terminal-specific antenna correlation, large-scale antenna systems significantly
increase the achievable upload and download rates [9]. In situations with rapid changes in
propagation, large-scale antenna systems can reliably offer high throughput on both the
forward and the reverse link connections [10].
Vehicles are getting more sensors as driving becomes increasingly automated, resulting
in increasingly higher data rates. Beamforming in mMIMO makes it possible to serve
distant users with mmWave, even users that are not stationary. Therefore, the practical
method for large-bandwidth connected automobiles is mmWave mMIMO communication [11].
As a result, mmWave mMIMO systems can serve mobile vehicles effectively,
provided the proper beam is selected. Due to the fundamental differences between
mmWave communications and current microwave-based communication technologies
(e.g., 2.4 GHz and 5 GHz), the mmWave systems present difficulties, such as a high sen-
sitivity to shadowing and a significant signal attenuation [12]. In this paper, in order to
overcome these issues and allow mMIMO environments where a highly non-stationary
active user is present, we introduce a coordinated beamforming scheme utilizing deep
reinforcement learning (DRL) to select the optimal beam for a vehicular communication
system. First, a deep Q-network (DQN) algorithm is created to handle the beam selection
problem as a Markov decision process (MDP). Then, by ensuring that the limitations of the
beam selection matrix are met, our goal is to choose the best beams to maximize the sum
rate for the user served.
side information. Afterward, they provided detailed examples of how to leverage location
data from DSRC to lessen the overhead of beam alignment and tracking in mmWave vehicle-
to-everything (V2X) applications. Conversely, Ref. [22] proposed an algorithm to jointly
optimize the beamforming vectors and power allocation for reconfigurable intelligent
surface (RIS)-based applications. Lin et al. in [23] formulated solutions on the joint design
and optimization of beamforming for hybrid satellite-terrestrial relay networks with RIS
support, and in [24] proposed another methodology for joint beamforming for mmWave
non-orthogonal multiple access (NOMA). Furthermore, the author also investigated secure
energy-efficient beamforming in multibeam satellite systems in [25].
On the other hand, Va et al. [26] proposed a multipath fingerprint database using the
vehicle’s position (for example, as determined by GPS) to gain information on probable
pointing directions for accurate beam alignment. The power loss probability is a parameter
used in the method to measure misalignment precision and is used to enhance candidate
beam selection. Moreover, two candidate beam selection techniques are created, one of which
uses a heuristic, and the other aims to reduce the likelihood of misalignment. Cao et al. [27]
proposed a latency reduction scheme for vehicular network relay selection. In addition, Zhou
et al. [28] proposed a DQN-based algorithm to train and determine the optimal receiver beam
direction with the purpose of maximizing average received signal power.
However, there are various drawbacks to designing beamforming vectors by utilizing
the stated approaches, such as solely based on location data and received signal power. First,
narrow-beam systems may not function effectively with position-acquisition sensors like
GPS because of their poor precision, which is typically in the range of meters. Second, these
technologies are unable to handle indoor applications since GPS sensors perform poorly
inside structures. In addition, the beamforming vectors depend on the environment’s
shape, obstructions, and so on. Furthermore, received signal power can experience severe
penetration power loss because of the vehicle’s metal body. In this paper, we aim to utilize
a DRL-based coordinated approach where we do not encounter the declared challenges
and exhibit better results.
1.2. Contribution
In this paper, we provide a novel DRL approach for highly mobile mmWave
communication architectures. As part of our suggested
method, a coordinated beamforming system is used, in which a number of BSs concurrently
provide access to a single non-stationary user. In this approach, a DRL network exclusively
utilizes beam patterns and learns how to anticipate the BSs' beamforming vectors from the
signals obtained at the scattered BSs. The idea behind this is that the propagated waves
collectively acquired at the scattered BSs indicate a distinctive multi-path signature of
both the user position and its surroundings. There are several benefits to the suggested
approach. First, the suggested technique can accommodate not only line-of-sight (LOS) but also non-LOS
(NLOS) scenarios without the need for specialized position-acquiring devices, because
beamforming prediction is based on the uplink received signals rather than position data.
Second, only received pilots, which can be retrieved with minimal training overhead, are
needed to determine the best beams. Furthermore, because the DRL model
trains and responds to any environment, it does not need any training before deployment
in the suggested system. The proposed model also inherits coordination coverage and
reliability improvements since it is coupled with the coordinated beamforming mechanism.
Even though some DRL-based beamforming solutions exist, to the best of our knowledge,
no prior work addressed a coordinated beamforming solution by leveraging DRL where
multiple BSs serve one single mobile user jointly to achieve the highest possible data rate.
The contributions of the proposed beamforming scheme are summarized as follows:
2. System Model
In this section, we discuss the frequency-selective coordinated mmWave system
and channel models chosen for our DRL-based coordinated beamforming scheme. For this
designated system and channel model, the DRL model optimizes the beam selection from a
set of candidate beams by utilizing the exploration and exploitation strategies of DRL.
We analyze a mmWave-enabled vehicular communication architecture shown in Figure 1,
where N BSs concurrently provide service to one mobile station (MS). Each BS is
equipped with M antennas and is linked to a central processing unit
in the cloud. In the interest of simplicity, we assume that each BS utilizes analog-only
beamforming with networks of phase shifters and has a single RF chain [29]. In this paper,
we also assume that the MS is equipped with only one antenna.
The signals are precoded using an N × 1 digital precoder f_k ∈ C^{N×1} for subcarrier k,
k = 1, · · · , K. The frequency-domain signals are then converted into the time domain using
N K-point inverse fast Fourier Transforms (IFFTs). Afterward, each BS n performs a time-
domain analog beamforming and then transmits the resulting signal. At the receiver end,
the received signal is converted to the frequency domain using a K-point FFT, presuming
perfect synchronization of frequency and carrier offset. The received signal at the kth subcarrier
at the MS is given by
y_k = ∑_{n=1}^{N} h_{k,n}^T x_{k,n} + n_k,   (1)
where x_{k,n} is the transmitted complex baseband signal, h_{k,n} is the M × 1 channel vector
between the MS and the nth BS, and n_k is the received noise, modeled as independent and
identically distributed (i.i.d.) complex additive white Gaussian noise (AWGN) with zero
mean and variance σ².
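As a minimal sketch of this signal model, the following code evaluates Equation (1) for one subcarrier; the array sizes, noise level, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the received-signal model in Equation (1) (hypothetical sizes):
# the MS observes the superposition of the N BSs' transmissions plus noise.
def received_signal(h, x, noise_std, rng):
    """h, x: (N, M) per-BS channel and transmitted signal -> scalar y_k."""
    signal = np.sum(np.einsum('nm,nm->n', h, x))        # sum_n h_{k,n}^T x_{k,n}
    noise = noise_std * (rng.normal() + 1j * rng.normal()) / np.sqrt(2)
    return signal + noise

rng = np.random.default_rng(3)
N, M = 4, 8
h = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
x = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
y = received_signal(h, x, noise_std=0.1, rng=rng)       # noisy scalar observation
```

With `noise_std=0.0` the output reduces exactly to the noiseless superposition term of Equation (1).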
Figure 1. System model: coordinated BSs connected to a cloud processing unit jointly serving one MS.
where α_l is the gain and a_n(θ_l, φ_l) is the array response vector (θ_l: azimuth angle, φ_l: elevation
angle) of the nth BS. Considering the delay-d channel in (2), for subcarrier k, our frequency-domain
channel vector h_{k,n} can be formulated as
h_{k,n} = ∑_{d=0}^{D−1} h_{d,n} exp(−j (2πk/K) d).   (3)
Our adopted block-fading channel model {h_{k,n}}_{k=1}^{K} is assumed to remain constant
throughout the channel coherence time, which depends on the user mobility
and the channel multi-path components [34].
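The delay-to-frequency conversion in Equation (3) can be sketched as follows; the tap count D, antenna count M, and subcarrier count K used here are illustrative assumptions.

```python
import numpy as np

# Sketch: convert a delay-domain channel h_{d,n} into its frequency-domain
# representation h_{k,n} per Equation (3), for one BS with M antennas.
def delay_to_frequency(h_delay, K):
    """h_delay: (D, M) array of delay-tap channel vectors; returns (K, M)."""
    D, _ = h_delay.shape
    d = np.arange(D)
    k = np.arange(K)[:, None]                      # subcarrier indices
    dft = np.exp(-1j * 2 * np.pi * k * d / K)      # (K, D) partial DFT matrix
    return dft @ h_delay                           # h_k = sum_d h_d e^{-j2πkd/K}

rng = np.random.default_rng(0)
h_delay = (rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))) / np.sqrt(2)
h_freq = delay_to_frequency(h_delay, K=64)         # (64, 8) per-subcarrier channel
```

Equation (3) is a partial DFT over the delay taps, so the result matches an FFT of the zero-padded tap sequence.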
3. Coordinated Beamforming
In this section, we introduce a baseline DRL coordinated beamforming approach
for a highly mobile vehicular mmWave communication system, as shown in Figure 2. To
present the proposed solution, we first describe the problem formulation and then derive
the novel DRL-based approach for beamforming. We also present the
environment setup, dataset generation, simulation parameters, and performance analysis
for our proposed scheme.
Figure 2. Coordinated massive MIMO mmWave beams serving the receivers.
R = ∑_{k=1}^{K} log2(1 + SNR |∑_{n=1}^{N} h_{k,n}^T f_n^{BF}|²).   (4)
Depending upon the performed action A_t, the agent receives a reward R_t. If the taken action
achieves a reasonable sum rate, the agent receives a high R_t. The agent gains
knowledge of its surroundings and develops an ideal beam selection assignment strategy
by foreseeing future events. The DNN algorithm learns this policy π at each timestep as it
continues to move forward with the next timesteps. We formulate our state, action, and
reward functions as follows:
• State: We utilize the channel matrices for all the BSs as the state of our environ-
ment. The complex channel matrices are constructed incorporating the bandwidth,
user position, noise figure, and noise power. If the environment has Z states, each
having V beams, then the state space of size Z × V can be represented as
S = {S̃_1, S̃_2, S̃_3, · · · , S̃_Z}.
• Action: The goal of the agent is to assign a beam for serving from the action space A.
At each episode for a set of S, the agent takes Z ∈ A actions, one action
per V elements of S. Out of the Z × V candidates, the target of the agent is
to choose the beam that maximizes the data rate.
• Reward: In our reward function, we first derive the data rate for each channel as
follows.
R_r = log2(1 + SNR |∑_{n=1}^{N} h_{k,n}^T f_n^{BF}|²).   (5)
For every action the agent takes, we calculate the data rate of the chosen action and
feed it as the reward value. Our aim is to acquire the highest possible cumulative
reward Rmax as it obtains reward for each action, according to
R_max = arg max ∑_{k=1}^{K} log2(1 + SNR |∑_{n=1}^{N} h_{k,n}^T f_n^{BF}|²).   (6)
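The reward computation of Equations (5) and (6) can be sketched as below for a fixed choice of beams; the channel and beam shapes, the SNR value, and the unit-modulus phase-shifter beams are illustrative assumptions.

```python
import numpy as np

# Sketch of the sum-rate reward: for chosen per-BS beamforming vectors f_n,
# accumulate the per-subcarrier rates of Equation (5) over all K subcarriers.
def sum_rate(h, f, snr):
    """h: (K, N, M) channel, f: (N, M) beams, snr: linear SNR -> sum rate."""
    # Coherent combination over the N coordinated BSs for each subcarrier k
    combined = np.einsum('knm,nm->k', h, f)            # sum_n h_{k,n}^T f_n
    return np.sum(np.log2(1.0 + snr * np.abs(combined) ** 2))

rng = np.random.default_rng(1)
K, N, M = 64, 4, 8
h = (rng.normal(size=(K, N, M)) + 1j * rng.normal(size=(K, N, M))) / np.sqrt(2)
f = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(N, M))) / np.sqrt(M)  # analog beams
rate = sum_rate(h, f, snr=10.0)                        # reward fed to the agent
```

Maximizing this quantity over the codebook candidates corresponds to the arg max in Equation (6).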
With this state, action, and reward function, we propose the DNN architecture as the
policy controller for the beam selection, as shown in Figure 3. The DNN takes the place of
the Q-table and calculates the Q-values for each environment state-action pair. Deriving
probabilities for each beam selection for each state space is the primary objective of the
DNN, and this probability can be defined by Q(S, A) of the DQN algorithm. We select the
best beam out of V = 64 candidate beams in coordination with 4 BSs.
Figure 3. The proposed DNN architecture with downsampling and upsampling paths of linear layers (64, 128, 256, and 512 neurons) and a sigmoid output layer.
meaningfully represent the input before decoding it back so that the reconstructed input
is as similar to the original as possible [37]. Moreover, the autoencoder can handle the
raw input data without any difficulties and is viewed as a component of the unsupervised
learning model [38]. The autoencoder consists of three main components: the encoder, the code,
and the decoder. In addition, for autoencoders, the number of neurons decreases as we go
deeper down the hidden layers; in a reverse autoencoder, it increases [39]. We resort
to the newly introduced reverse autoencoder for the DNN segment of our DQN model.
In the encoder, the input layer starts with 2^c neurons, and the following hidden layers have
2^{c+p} neurons. In this paper, we start the hidden layers with c = 5, where p
refers to the position of the layer. For the code layer, we use c + p = 9.
The decoder portion ends with an output layer and is the exact mirror of the
encoder portion. Because the layers are stacked one on top of the other, this
form of structure is referred to as a stacked autoencoder. Additionally, each layer
in the autoencoder has its own ReLU activation function.
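Under our reading of the text (widths as powers of two, input width 2^c with c = 5, code layer 2^9 = 512), the layer widths of the reverse autoencoder can be enumerated as follows; the function name is hypothetical.

```python
# Sketch: enumerate the neuron counts of the reverse autoencoder described
# above, from the input layer through the code layer and back to the output.
def reverse_autoencoder_widths(c=5, code_exp=9):
    encoder = [2 ** c] + [2 ** (c + p) for p in range(1, code_exp - c + 1)]
    decoder = encoder[-2::-1]                  # mirror of the encoder
    return encoder + decoder                   # widths from input to output

widths = reverse_autoencoder_widths()
# → [32, 64, 128, 256, 512, 256, 128, 64, 32]
```

The widths grow toward the 512-neuron code layer and then mirror back, the opposite of a conventional autoencoder's bottleneck shape.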
4. Performance Evaluation
In this section, we evaluate the proposed DRL-based coordinated beamforming ap-
proach in different case studies by comparing it with traditional DL architecture [15]. In
a multicell mmWave mMIMO downlink scenario, a large uniform planar array (UPA) is
installed on each BS. In this paper, we select 4 BSs, each with a 32 × 8 UPA, resulting in
M = 256 antenna elements per BS.
For our methodology, we used the popular, publicly available DeepMIMO dataset [40]
generated by the Wireless InSite ray-tracing software [41]. The dataset contains the generated beamforming
vectors, or predetermined codebooks, denoted as f_n^BF. These generated codebooks define the
candidate beamforming vectors. Along with f_n^BF, we also use the corresponding channel
matrices, denoted as h_{k,n} in our proposed scheme, for defining states and rewards as
discussed in Section 3.2 and for selecting the optimal beamforming vectors.
We used the outdoor scenario of two streets and one intersection with a mmWave
communication operating at 60 GHz. We aim to serve the MS with the best beam, coordi-
nating with 4 BSs. For this adopted scenario, 4 BSs are equipped on the top of 4 lamp posts
to concurrently provide beam coverage for one MS. The lamp posts are located
60 m apart, side by side. Every BS is installed at a 6 m elevation and has 32 × 8 antenna
elements. The MS is equipped with a single antenna on top of the vehicle. During
the uplink training, we assumed a transmit power of 30 dBm for the MS. The adopted
DeepMIMO parameters for dataset generation and the simulation parameters used in this
work are summarized in Table 1 and Table 2, respectively.
Table 1. Adopted DeepMIMO parameters for dataset generation.

Parameters Values
Scenario O1_60
Active BS 3,4,5,6
Receivers R1000–R1300
Frequency band 60 GHz
Bandwidth 500 MHz
Number of OFDM subcarriers 1024
Subcarrier limit 64
Number of paths 5
BS antenna shape 1 × 32 × 8
Receiver antenna shape 1×1×1
Table 2. Simulation parameters used in this work.

Parameters Values
Beams per BS distribution 16
Total beams 64
Transmit power 30 dBm
Learning rate (LR) 0.0005
Discount factor (γ) 0.999
Epsilon (e) [1, 0.1, 0.001]
Batch size 96
Number of episodes 250
Data instances 200
4.1. Training
The proposed DNN is gradually trained using a set of training data in each episode.
For every state space S, the state-action pair is formulated using the ε-greedy policy in
accordance with the output probabilities of the DNN. An episode is considered complete when
the entire state space has been processed by the DNN. For every state space, the exploitation
policy [42], or the policy for taking an action, can be represented as
a_t^l = argmax Q(S_t, A_t^l) if e < e_th with e_th ∈ (0, 1], and a random action from [1, V] otherwise,   (7)

∀l = 1, 2, . . . , Z ∈ A, ∀t = 1, 2, . . . , ins.
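The exploitation policy of Equation (7) can be sketched as follows; the beam count, epsilon values, and 0-based beam indexing are illustrative assumptions.

```python
import numpy as np

# Sketch of the epsilon-greedy policy in Equation (7): exploit the DNN's
# Q-values when epsilon is below the threshold, otherwise explore randomly.
def epsilon_greedy(q_values, epsilon, eps_th, rng):
    """q_values: (V,) Q-values for the V candidate beams -> chosen beam index."""
    if epsilon < eps_th:
        return int(np.argmax(q_values))        # exploit: best predicted beam
    return int(rng.integers(0, len(q_values))) # explore: random beam in [0, V)

rng = np.random.default_rng(2)
q = rng.normal(size=64)                        # V = 64 candidate beams
beam = epsilon_greedy(q, epsilon=0.05, eps_th=0.1, rng=rng)  # exploits here
```

Annealing epsilon over episodes (the paper lists values 1, 0.1, and 0.001 in Table 2) shifts the agent from exploration toward exploitation.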
After executing alt , the agent will receive the rewards according to (5) and the next state
space St+1 . Afterward, we first determine the loss and then tweak the DNN’s parameters
using back-propagation to train our model. To compute the loss, we take an approximation of
the optimal Q*-values for each state-action pair of S_{t+1} from a separate DNN termed the
target DNN [43]. The target DNN is identical to the policy DNN and is initialized with its
parameters. Consequently, we use the next state space S_{t+1} as the target DNN input,
and finally, the agent chooses the optimal Q*-values greedily
from the output. We add experience replay memory (ERM) to the DQN to help the optimal
policy converge more steadily [44]. The agent first explores its environment while saving its
current states, actions, rewards and next states (St , At , Rt , St+1 ) as a tuple in the ERM. The
agent then trains the policy DNN using a small batch of tuples sampled from the ERM, which is
continuously updated with new experience tuples. We summarize the system architecture and
working principles of our model in Figure 4 and Algorithm 1.
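The ERM described above can be sketched as a bounded buffer with uniform mini-batch sampling; the capacity and batch size here are illustrative assumptions.

```python
import random
from collections import deque

# Sketch of the experience replay memory (ERM): a bounded buffer of
# (S_t, A_t, R_t, S_{t+1}) tuples with uniform mini-batch sampling.
class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest tuples evicted first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch for the policy DNN update
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

erm = ReplayMemory(capacity=100)
for step in range(50):
    erm.push(step, step % 64, float(step), step + 1)
batch = erm.sample(8)                          # mini-batch of experience tuples
```

Sampling uniformly from the buffer decorrelates consecutive experiences, which is what stabilizes the convergence of the optimal policy [44].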
In the training phase of our model, we used the Adam optimizer [45] with a learning
rate of 0.0005. The DRL model minimizes the training error in the DNN using the
Smooth L1 loss function [46,47]. For a batch of size B, the unreduced loss for two
data points (u, w) can be described as l(u, w) = {l_1, . . . , l_B}^T, where
l_i = 0.5 (u_i − w_i)² if |u_i − w_i| < 1 and l_i = |u_i − w_i| − 0.5 otherwise.
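A minimal numpy sketch of the Smooth L1 loss, assuming the standard threshold of 1 used in common deep learning libraries:

```python
import numpy as np

# Sketch of the Smooth L1 (Huber-style) loss used to train the DQN:
# quadratic for small errors, linear for large ones (threshold beta = 1).
def smooth_l1(u, w):
    diff = np.abs(u - w)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)  # unreduced loss

u = np.array([0.0, 1.0, 3.0])
w = np.array([0.5, 1.0, 0.0])
loss = smooth_l1(u, w)                         # → [0.125, 0.0, 2.5]
```

The linear branch keeps gradients bounded for large temporal-difference errors, which is why this loss is favored over plain MSE in DQN training.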
Figure 4. System architecture of the proposed model: the agent's policy DNN and target DNN (downsampling/upsampling linear layers with sigmoid outputs), the experience replay memory supplying mini-batches for loss calculation and optimization, and the environment in which the coordinated BSs serve the MS.
However, the delays we verified in our simulations are slight in the considered system
environment. In addition, we confirmed that the performance degradation due to the delays
is insignificant in the simulation results. It is clear that our proposed DRL outperforms
the traditional DL model by a large margin and demonstrates suboptimal performance.
In this figure, we did not consider any beam training or latency overhead. For vehicular
mmWave communication with a mobile user, one of the most significant sources of communication
overhead is velocity, because the connectivity between the BS and the user is affected by it.
Fast-moving users require fast beam switching at the BS; otherwise, because of the delay,
the user might not be served in time by the BS as it moves away
from its current position.
[Figure: achievable rate comparison. Legend: Optimal achievable rate; Proposed DRL.]