Adaptive Reinforcement Learning-Based Routing Protocol For Wireless Multihop Networks

Abstract – This paper presents research on the development of an adaptive packet routing scheme for wireless multihop networks, based on a reinforcement learning optimization algorithm. A brief overview of classical approaches to data routing in multihop networks is provided, emphasizing the main drawbacks of such algorithms, caused by the ineffective hop-count routing metric used in traditional multihop routing. Then, an approach based on reinforcement learning theory is presented, which has the potential to select more effective routes by relying on feedback information from neighboring nodes. An algorithm based on a reinforcement learning optimization function is proposed, and additional functions are introduced for initial route weight distribution and dynamic route selection probability, depending on the current packet loss ratio (PLR) and received signal strength indicator (RSSI) factors. The elaborated adaptive routing scheme has been tested in a real wireless multihop topology, where a programming implementation of the proposed algorithm – the RLRP protocol – showed better routing performance in terms of PLR and RRT (Route Recovery Time), compared to a traditional improved proactive scheme of wireless multihop routing implemented in the widely used B.A.T.M.A.N. (Better Approach to Mobile Ad hoc Networking) protocol.

Index Terms – RL-routing, adaptive routing, RLRP routing protocol, ad hoc networks, MANET, wireless multihop networks, wireless mesh networks.

I. INTRODUCTION

THE TOPIC of wireless multihop networks is becoming more and more important in the telecommunication industry as well as among researchers, especially in the context of the Internet of Things (IoT) and Industry 4.0 (Industrial IoT) paradigms. These kinds of data transmission networks show a huge potential to fulfill the demand for many industrial and consumer services, such as the FireChat messenger [1] or an intelligent street illumination system – the SmartLighting project [2], [3].

Wireless multihop networks are often grouped under the general name of MANET, which stands for Mobile Ad hoc Networks and includes many variations – e.g. Wireless Sensor Networks (WSNs), Wireless Mesh Networks (WMNs), etc.

Currently, real implementations of MANET networks have some performance drawbacks, primarily caused by the unreliable transmission medium under decentralized and sporadic traffic profiles. For instance, medium access control mechanisms, usually located on the L2 layer of a MANET protocol implementation, have to handle a considerable number of issues coming from radio resource allocation algorithms, high bit- and packet-error probabilities, collisions and interference. All of these input conditions are handled on the L2 layer of IEEE 802.11, 802.15.4, 802.15.1 and many other communication standards, or by custom L2 MANET implementations.

This paper focuses on effective data routing algorithms, which are built on the L3 layer, on top of the already chosen and implemented L2 standards. Widely used implementations of routing algorithms for wireless multihop networks can be found in the OLSR [4] and B.A.T.M.A.N. [5] routing protocols. Some classical routing approaches for MANETs are also described in the AODV, DSDV [5] and DSR [7] algorithms. These algorithms are briefly overviewed in the corresponding section of this article.

It should also be mentioned that various techniques and algorithms from the Machine Learning and Artificial Intelligence fields are being applied more and more commonly to specific telecommunication tasks, including the problem of effective data routing in wireless multihop networks. For instance, a number of research works cover this topic by introducing Reinforcement Learning-based and other Machine Learning-based concepts to the routing problem in WSNs, WMNs and MANETs in general; they are described in the Related Work section. Most of them concentrate on providing a generalized concept of adaptive routing with machine learning algorithms, skipping the important implementation and testing phases of real network protocol development.

The adaptive Reinforcement Learning-based routing algorithm presented in this paper is focused on embedding into real-life MANET applications. It introduces a reward distribution function, a feedback scheme for adjusting the reward value of the chosen next-hop neighboring node, as well as functions for dynamic neighbor selection. It also proposes a new, combined routing metric based on the classical hop-count value, the current PLR value and the RSSI (Received Signal Strength Indicator) of an incoming packet.

A well-tested programming implementation of the proposed adaptive routing algorithm is presented as well under the name of RLRP – Reinforcement Learning Routing Protocol. It has been extensively tested in a real wireless multihop network and showed better performance results in terms of PLR (Packet Loss Rate) and RRT (Route Recovery Time) per selected route, compared to the traditional multihop routing approach realized in the widely used B.A.T.M.A.N. protocol.
II. RELATED WORK

…

…network nodes increases, therefore lowering the overall useful network throughput. Examples of proactive routing schemes are realized in DSDV [5]. Improved versions of proactive schemes were successfully implemented in the OLSR [4] and B.A.T.M.A.N. routing protocols [5].
III. OVERVIEW OF REINFORCEMENT LEARNING ALGORITHM

Reinforcement Learning (RL) is a part of machine learning theory that introduces the notions of an agent, an environment and a reward, which are meant to optimize a given task within certain criteria by actively interacting with the environment through the given actions.

In the fundamental work of R. Sutton devoted to RL theory [15], a generalized process of RL interaction is described (Fig. 3): an agent has a set of actions A, from which it chooses in order to interact with the given environment. The environment reacts to the action performed by the agent and sends back feedback in the form of a reward; thus, the environment reinforces the agent with additional knowledge about itself. This new knowledge is then used by the agent to adapt its future action selection towards the environment, which is usually controlled by the introduced estimation value Q, updated on every step:

$Q_t = Q_{t-1} + \alpha \cdot (r_t - Q_{t-1})$,

where:
$Q_{t-1}$ – reward estimation value on the previous step;
$Q_t$ – reward estimation value on the current step;
$r_t$ – reward value for an action taken on the current step;
$\alpha$ – step size parameter;
$t$ – current step number.

Generalizing the problems that RL theory solves, two main RL tasks can be highlighted:

1) How to update the set of estimation values Q after the reward is received?

The simplest way is to update the current estimation value according to the sample-average method – i.e., by keeping the running arithmetic average of the estimation value and updating it with each received reward value.

2) How to select an action from the set A, given the sets P and S?
The most common methods for action selection are the greedy, ε-greedy and softmax methods. The greedy method always selects the action with the maximum estimation value. The ε-greedy method introduces a small probability ε with which the selected action differs from the one with the maximum estimation value. The softmax method dynamically changes the selection probabilities of the actions according to a pre-defined probability function – e.g. a Gibbs-Boltzmann distribution, as mentioned in [15].
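To make task 1 concrete, the following minimal Python sketch keeps a running sample-average estimate and updates it with each received reward (an illustration only; the function and variable names are not taken from the RLRP sources):

def update_estimate(q_prev, reward, step):
    # Sample-average update: Q_t = Q_{t-1} + (1/t) * (r_t - Q_{t-1}).
    # A constant step size alpha instead of 1/step would weight recent
    # rewards more heavily, which suits non-stationary network conditions.
    return q_prev + (1.0 / step) * (reward - q_prev)

q = 0.0
for t, r in enumerate([50.0, 40.0, 60.0], start=1):  # example reward samples
    q = update_estimate(q, r, t)
print(q)  # 50.0 – the arithmetic average of the received rewards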
…

$T_{wait} = T_{RTT} + T_{ACK} + \Delta T$,   (5)

where:
$T_{RTT}$ – estimated round trip time value along a direct link between sender and receiver;
$T_{ACK}$ – ACK generation delay on the receiver side;
$\Delta T$ – transmission time variation over a direct link, which depends on the lower L2 protocol implementation – i.e., CSMA/CA timeouts, L2 ARQ timeouts, etc.

Fig. 5. Packet forwarding scheme with ACK delay timeout.

In the developed packet forwarding scheme, at Stage 2 the following reward generation rule at the receiver side is proposed:

$r = \bar{Q} = \frac{\sum_{i} Q_i}{|N|}, \quad \bar{Q} \neq 0$,
$r = 10 \cdot (\ldots), \quad \bar{Q} = 0$,   (6)

where:
$\bar{Q}$ – average estimation value at the receiver node towards the given destination address, calculated from the current route table information of the receiver node;
$Q_i$ – estimation value towards the given destination IP via the i-th route in the table;
$N$ – total number of estimation values.

The reward values range from 0 to 100. The initial negative reward value is equal to -1.
The proposed method for next-hop neighbor selection uses the softmax method, based on the Gibbs-Boltzmann distribution [15]:

$P_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b} e^{Q_t(b)/\tau}}$,   (7)

where:
$P_t(a)$ – selection probability of action a on step t;
$Q_t(a)$ – estimated reward value for selecting action a on step t;
$Q_t(b)$ – estimated reward value for selecting an alternative action b on step t;
$\tau$ – positive temperature parameter.

In the context of the given packet forwarding task, the temperature parameter τ defines how likely the neighbor with the maximal estimated reward value is chosen, i.e. it defines the selection probability of the most attractive node according to its estimation value. The parameter τ is proposed to be changed dynamically, depending on the current packet loss rate (PLR) on the selected route, introducing the following τ(PLR) function:

$\tau(x) = 10, \quad x \le 1$,
$\tau(x) = 10 \cdot k \cdot (x - 1) + 10, \quad x > 1$,   (8)

where:
$x = \frac{N_{lost}}{N_{total}} \cdot 100$ – packet loss rate, varying in the range [0, 100];
$\tau$ – temperature parameter from the Gibbs-Boltzmann distribution;
$k$ – growth coefficient, equal to 0.5 by default.
An example of the behavior of the route selection probability function, depending on τ(x), is presented in Fig. 6. In the given example, a source node has 5 direct neighbors and has to forward an incoming packet towards the given destination node.

After the path discovery stage, when the initial weights towards the source and destination nodes are established, the source node has a list of estimated rewards to all direct neighbors towards the destination – Q(n). In the given example with 5 direct neighbors, the list size is equal to 5. Assume that after the initialization the list of weights contains the following values, in the dictionary data format of the Python programming language:

Q_n = {n1: 50.0, n2: 33.3, n3: 11.1, n4: 44.0, n5: 51.0}

Using the mentioned Gibbs-Boltzmann distribution, the following action selection probabilities list P_t is calculated. At step 0, the temperature parameter τ(x) is equal to 10, assuming that the initial PLR value is 0:

P_0 = {n1: 0.35, n2: 0.07, n3: 0.01, n4: 0.2, n5: 0.37}
Fig. 6. Selection probability distribution of next-hop neighbors, depending on PLR.

Thus, neighbor 5 will be selected most of the time – with 37 % selection probability at the initial step 0.

With the ongoing process of neighbor selections and subsequent packet forwarding, the PLR value is assumed to vary in an unpredictable pattern, thus affecting the reliability of the established routes. In such scenarios, the temperature parameter τ is modified according to formula (8). E.g., during the packet forwarding phase, at step t, the estimated PLR value changes from 0 to 20 percent. Accordingly, the τ parameter takes the new value:

τ = τ(PLR) = 10 * 0.5 * (20 - 1) + 10 = 105

The new selection probabilities list P_t(a) at step t is then updated:

P_t = {n1: 0.22, n2: 0.19, n3: 0.15, n4: 0.21, n5: 0.22}

It can be noticed that at the new step t the selection probabilities of neighbors 1 and 5 have decreased, while the chances of selecting neighbors 2 and 3 have increased significantly. This modification of the selection probabilities implies a selection of previously less attractive routes, since the overall channel reliability has decreased drastically (from 0 to 20 %). This allows a much more flexible route selection process under unreliable communication conditions, making sure that the alternative routes are explored more frequently.

In the given example, with the initial estimated reward values equal to Q(n), the dependency between the neighbor selection probability and the PLR value has the form presented in Fig. 4.
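The example above can be reproduced with a short Python sketch of formulas (7) and (8) (an illustration only; the function names are not taken from the RLRP sources). It yields the P_t list above exactly and values very close to P_0, the small deviations coming from rounding:

import math

def tau(plr, k=0.5):
    # Formula (8): the temperature grows linearly once PLR exceeds 1 %.
    return 10.0 if plr <= 1 else 10.0 * k * (plr - 1) + 10.0

def selection_probabilities(q_values, t):
    # Formula (7): Gibbs-Boltzmann (softmax) selection probabilities.
    weights = {n: math.exp(q / t) for n, q in q_values.items()}
    total = sum(weights.values())
    return {n: round(w / total, 2) for n, w in weights.items()}

Q_n = {'n1': 50.0, 'n2': 33.3, 'n3': 11.1, 'n4': 44.0, 'n5': 51.0}
print(selection_probabilities(Q_n, tau(0)))   # close to P_0 above
print(selection_probabilities(Q_n, tau(20)))  # matches P_t above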
For the packet forwarding task, the set of states S and the set of actions A are defined as:

$S = \{s_{del}, s_{lost}\}$,   (9)

where:
$s_{del}$ – the packet is delivered successfully to a next hop;
$s_{lost}$ – the packet is lost during transmission;

$A = \{a_{fwd}, a_{drop}\}$,   (10)

where:
$a_{fwd}$ – select a next-hop node and forward the packet there;
$a_{drop}$ – remove the packet.

Corresponding to the sets S and A, the following state transition probabilities are defined:
$\alpha$ – transition probability from $s_{del}$ to $s_{del}$;
$1 - \alpha$ – transition probability from $s_{del}$ to $s_{lost}$;
$\beta$ – transition probability from $s_{lost}$ to $s_{lost}$;
$1 - \beta$ – transition probability from $s_{lost}$ to $s_{del}$.

The set R has the following entries:

$r \in \{r_{s}, r_{f}, r_{fn}, r_{r}\}$,   (11)

where:
$r_{s}$ – reward for a successful packet transmission, calculated by formula (6);
$r_{f}$ – reward for a single unsuccessful packet transmission, which has a fixed value: $r_{f} = -1$, $n = 1$;
$r_{fn}$ – reward for subsequent unsuccessful packet transmissions, calculated by an inverted exponential law, which allows faster switching to alternative neighbors, according to the following formula:

$r_{fn} = f(n) = -1 \cdot e^{n}, \quad n \ge 2$;   (12)

$r_{r}$ – reward for a successful packet transmission after a previous failure. This reward value is equal to the positive reward $r_{s}$.

Fig. 7 demonstrates the state transition probabilities graph, depending on the given selected actions. The circles correspond to the states, while the arrows relate to the selected actions leading to the end state. Each arrow has a caption in the format {A, s', P}, where s' is the next state. The sum of transition probabilities for each action is equal to 1.
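A minimal Python sketch of the reward assignment implied by (11) and (12) is given below (an illustration only: the positive reward comes from the receiver-side estimate of formula (6), and the exact exponential growth law of the repeated-failure penalty is an assumption here):

import math

def negative_reward(n_fails):
    # A single failure is penalized with the fixed value -1 (n = 1);
    # subsequent failures are penalized by an exponentially growing value
    # (assumed e-based), so repeatedly failing neighbors are abandoned faster.
    if n_fails <= 1:
        return -1.0
    return -1.0 * math.exp(n_fails)

def reward_for_attempt(delivered, n_fails, receiver_estimate):
    # Successful transmissions (including the first one after a failure)
    # receive the positive reward reported by the receiver, formula (6).
    if delivered:
        return receiver_estimate
    return negative_reward(n_fails)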
$R(s, a) = \begin{pmatrix} 0 & -1 \\ \ldots & 0 \end{pmatrix}$   (15)
The values of the defined state transition probabilities α and β can be estimated experimentally, using the following relations:

$\alpha = \frac{N_{del}}{N_{total}}$,   (16)

$\beta = \frac{N_{lost}}{N_{total}}$,   (17)

where:
$N_{del}$ – number of subsequent packets successfully delivered;
$N_{lost}$ – number of subsequently lost packets;
$N_{total}$ – total number of transmitted packets.
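As an illustration of (16) and (17), the following short sketch estimates the transition probabilities from observed delivery statistics (the variable names are illustrative, not taken from the RLRP sources):

def estimate_transition_probabilities(n_delivered, n_lost, n_total):
    # Formulas (16) and (17): alpha and beta are estimated experimentally as
    # the fractions of delivered and lost packets among all transmissions.
    alpha = n_delivered / n_total
    beta = n_lost / n_total
    return alpha, beta

print(estimate_transition_probabilities(90, 10, 100))  # (0.9, 0.1)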
VI. PROGRAMMING IMPLEMENTATION

A programming implementation of the developed RL-based adaptive routing scheme has been created in the form of an independent routing protocol, RLRP – Reinforcement Learning Routing Protocol [16].

The developed protocol implementation is based on the standard Linux TCP/IP stack with IPv4 and IPv6 addressing support. Moreover, the RLRP protocol is independent from the L4 and L2/L1 layers of the OSI model, making it universal to employ in wireless multihop networks based on Linux SoC hardware [18].

The protocol provides two networking interfaces – one for communication with the upper application layer, based on TCP/UDP transport, and the other one for physical communication with the neighboring network nodes. At the start-up of the RLRP routing daemon under a Linux OS environment, a virtual tun [19] interface is created under the name adhoc0. After the corresponding IPv4/IPv6 addresses are assigned to the adhoc0 interface by means of the Linux OS network stack (usually initiated by the upper network application), the corresponding network application can start communication over the network using the standard socket interfaces [20] of Linux OS. As for the real physical interface used for physical transmission of the generated packets towards the wireless neighbors, any available interface can be used, such as wlan0, bt0, eth0, etc.
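The creation of the virtual interface can be illustrated with the following Python sketch, which opens a TUN device through the standard Linux TUN/TAP driver [19] (a simplified illustration requiring root privileges; the actual RLRP start-up code in [16] may differ, and the interface still has to be brought up and assigned an address by the network stack):

import fcntl
import os
import struct

TUNSETIFF = 0x400454ca   # ioctl request for configuring a tun/tap device
IFF_TUN = 0x0001         # TUN mode: raw L3 packets, no Ethernet header
IFF_NO_PI = 0x1000       # do not prepend extra protocol information bytes

# Create the virtual adhoc0 interface used by the upper-layer applications.
tun_fd = os.open('/dev/net/tun', os.O_RDWR)
ifreq = struct.pack('16sH', b'adhoc0', IFF_TUN | IFF_NO_PI)
fcntl.ioctl(tun_fd, TUNSETIFF, ifreq)

# Every read returns one raw IP packet generated by the applications; the
# routing daemon processes it and forwards it via the chosen physical
# interface (wlan0, bt0, eth0, ...).
packet = os.read(tun_fd, 2048)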
Fig. 8 shows the location of the RLRP protocol header in the OSI model. The protocol receives incoming data from the upper application layer (including the TCP/UDP fields) via the adhoc0 interface, processes it (determines the destination IP address) and inserts its own RLRP header in between the L3 and L2 headers.

Fig. 8. RLRP protocol header in TCP/IP stack.

The size of the RLRP header depends on the received packet type – e.g. TCP/UDP unicast, broadcast, or an RLRP service message. The complete list of RLRP headers is described in more detail in [16].

The generalized structure of the RLRP protocol implementation, as well as a more detailed description of each programming module, can also be found in [16].

VII. EXPERIMENTAL SETUP

The experimental setup included a real wireless multi-hop network topology, built for testing the developed routing scheme in the form of the RLRP program implementation. As a reference protocol, used for comparing the performance results with RLRP under the same testing conditions, the B.A.T.M.A.N. routing protocol was used.

The testing topology is illustrated in Fig. 9. The source node is indirectly connected to the destination node via a set of intermediate nodes, providing the initial conditions for the routing forwarding task. Each node in the testing environment consisted of a Linux SoC device [18], equipped with a wireless network interface under the 802.11 standard, working in Ad-hoc mode.

Fig. 9. Experimental network topology.
Fig. 11. Route Recovery Time values of RLRP and B.A.T.M.A.N. (in seconds).

…failure event.

These characteristics were measured using the ping network utility [21], which was the upper-level application for the tested routing protocols, generating the ICMP traffic towards the destination. The first two characteristics (RTT and PLR) were automatically measured by the ping utility, while the RRT value was measured separately via a dedicated application script, which had the following logic. When the tested routing protocol had established a route towards the destination via one of the intermediate nodes, ICMP request/reply traffic started to flow through the route.
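The core of that measurement logic can be sketched in Python as follows (a hypothetical reconstruction, not the actual test script; the destination address and timing parameters are illustrative). The script keeps probing the destination with single ICMP echo requests and reports the time between the first lost reply after a failure event and the first reply received over the recovered route:

import subprocess
import time

def measure_rrt(destination, probe_interval=0.2):
    # Route Recovery Time: elapsed time from the first lost echo reply
    # after a failure event until replies start arriving again.
    failure_start = None
    while True:
        ok = subprocess.call(['ping', '-c', '1', '-W', '1', destination],
                             stdout=subprocess.DEVNULL) == 0
        if not ok and failure_start is None:
            failure_start = time.time()          # route has just failed
        elif ok and failure_start is not None:
            return time.time() - failure_start   # route has recovered
        time.sleep(probe_interval)

print(measure_rrt('10.0.0.2'))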
…decrease the time processing overhead and, therefore, lower the average RTT value, making it closer to the other routing protocol implementations written in the C programming language, such as B.A.T.M.A.N.

Moreover, the feedback mechanism should also be a point of optimization, making it flexible with respect to the incoming user traffic intensity. Overall, these optimization actions will improve the RTT and throughput characteristics of the routed connection.

It should also be noticed that the developed adaptive scheme can be used as a more generalized approach for solving other tasks close to the wireless multi-hop routing task. For example, the proposed solution might be applied in conjunction with algorithms for available bandwidth estimation for more effective route selection across fat-pipe WAN networks [22]. Another possible application of the developed routing scheme is in conjunction with stationary mobile networks, for purposes of remote mobile monitoring [23].

X. CONCLUSION

This paper presented an application of RL-based algorithms to the routing task in wireless multihop topologies. As a result, a flexible, reliable, adaptive packet forwarding scheme has been developed, which showed significantly better results in PLR and RRT values compared to the classical routing approach widely used in current ad hoc multi-hop networks.

The developed RL-based adaptive routing scheme also proposes additional mechanisms for initial route weight distribution, as well as functions for dynamic variation of the negative and positive rewards generated for the chosen packet forwarding action. Based on the softmax selection method, a next-hop node selection probability function has been elaborated as well.

The developed programming implementation of the proposed scheme – the RLRP routing protocol [16] – has also been tested under real wireless multihop topologies [2], [3].

REFERENCES

…

[8] C. Perkins, E. Belding-Royer, and S. Das, "Ad hoc On-Demand Distance Vector (AODV) Routing," IETF RFC 3561, Jul. 2002.
[9] IEEE 802.11s, IEEE Standard for Information Technology – Telecommunications and Information Exchange Between Systems – Local and Metropolitan Area Networks – Specific Requirements – Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications – Amendment 10: Mesh Networking, IEEE Std., 2011.
[10] L. Peshkin and V. Savova, "Reinforcement learning for adaptive routing," in Proc. 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 2, IEEE, 2002, pp. 1825–1830.
[11] J. Dowling, E. Curran, R. Cunningham, and V. Cahill, "Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing," IEEE Transactions on Systems, Man and Cybernetics, Part A, vol. 35, no. 3, pp. 360–372, 2005.
[12] Z. Qin, Z. Jia, and X. Chen, "Fuzzy Dynamic Programming based Trusted Routing Decision in Mobile Ad Hoc Networks," in The Fifth IEEE International Symposium on Embedded Computing (SEC '08), Beijing, China, 2008, pp. 180–185.
[13] P. Nurmi, "Reinforcement Learning for Routing in Ad Hoc Networks," in Proc. 5th Intl. Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, IEEE Computer Society, Los Alamitos, 2007.
[14] R. Desai and B. Patil, "Cooperative reinforcement learning approach for routing in ad hoc networks," in Pervasive Computing (ICPC), 2015 International Conference on, IEEE, 2015, pp. 1–5.
[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[16] https://github.com/dugdmitry/adhoc_routing/wiki.
[17] J. N. Davies, V. Grout, and R. Picking, "Prediction of Wireless Network Signal Strength within a Building," in Proceedings of the Seventh International Network Conference (INC), pp. 193–207, 2008.
[18] BeagleBoard.org, Beagleboard-black. [Online]. Available: http://beagleboard.org/black.
[19] Universal TUN/TAP device driver, Copyright (C) 1999-2000 Maxim Krasnyansky. https://www.kernel.org/doc/Documentation/networking/tuntap.txt.
[20] Linux Programmer's Manual, socket(2). http://man7.org/linux/man-pages/man2/socket.2.html.
[21] M. Muuss, "The Story of the PING Program," U.S. Army Research Laboratory. Archived from the original on 8 September 2010. Retrieved 8 September 2010.
[22] D. Dugaev, D. Kachan, and I. Fedotova, "Concept of traffic routing in mobile ad-hoc networks based on highly accurate available bandwidth estimations," Vestnik SibSUTIS, no. 4, 2015.
[23] D. Dugaev, "Evaluating Uplink Connection Establishing Time in LTE Networks," T-COMM – Telecommunication and Transport, Moscow, no. 5, November 2013 (in Russian).