A Cooperative Q-learning Approach for Online
Power Allocation in Femtocell Networks
Hussein Saad∗, Amr Mohamed†, and Tamer ElBatt∗
∗ Wireless Intelligence Network Center (WINC), Nile University, Cairo, Egypt.
hussein.saad@nileu.edu.eg, telbatt@nileuniversity.edu.eg
† Computer Science and Engineering Department, Qatar University, P.O. Box 2713, Doha, Qatar.
amrm@qu.edu.qa
This work was made possible by NPRP 5 - 782 - 2 - 322 from the Qatar National Research Fund (a member of Qatar Foundation).
Abstract—In this paper, we address the problem of distributed interference management for cognitive femtocells that share the same frequency range with macrocells, using distributed multi-agent Q-learning. We formulate and solve three problems representing three different Q-learning algorithms: namely, centralized, femto-based distributed and subcarrier-based distributed power control using Q-learning (CPC-Q, FBDPC-Q and SBDPC-Q). CPC-Q, although not of practical interest, characterizes the global optimum. Each of FBDPC-Q and SBDPC-Q works in two different learning paradigms: Independent (IL) and Cooperative (CL). The former is considered the simplest form of applying Q-learning in multi-agent scenarios, where all the femtocells learn independently. The latter is the proposed scheme, in which femtocells share partial information during the learning process in order to strike a balance between practical relevance and performance. In terms of performance, the simulation results show that the CL paradigm outperforms the IL paradigm and achieves an aggregate femtocell capacity that is very close to the optimal one. For the practical relevance issue, we evaluate the robustness and scalability of SBDPC-Q, in real time, by deploying new femtocells in the system during the learning process, where we show that SBDPC-Q in the CL paradigm is scalable to a large number of femtocells and more robust to the network dynamics compared to the IL paradigm.
I. INTRODUCTION
Femtocells have recently been proposed as a promising solution to the indoor coverage problem. Although femtocells offer significant benefits to both the operator and the user, several challenges have to be solved to fully reap these benefits. One of the most daunting challenges is their interference on macro-users and other femtocells [1], [2]. Typically, femtocells are installed by the end user and hence, their number and positions are random and unknown to the network operator a priori. Adding to this the typical dynamics of the wireless environment, a centralized approach to handling the interference problem is not feasible, which, in turn, calls for distributed interference management strategies.
Based on these observations, in this paper we focus on closed access femtocells [3] operating in the same bandwidth as macrocells (i.e., cognitive femtocells), where the femtocells are the secondary users that try to perform power control to maximize their own performance while maintaining the macrocell capacity at a certain level. In order to handle
the interference generated by the femtocells on the macrocell users, we use a distributed reinforcement learning [4] technique called multi-agent Q-learning [5], [6]. In our context, a prior model of the environment cannot be obtained due to 1) the unplanned placement of the femtocells, and 2) the typical dynamics of the wireless environment. In such a context, Q-learning offers significant advantages, achieving optimal decision policies through real-time learning of the environment [7].
In the literature, Q-learning has been used several times to perform power allocation in femtocell networks. In [8], the authors used independent learning (IL) Q-learning to perform power allocation in order to control the interference generated by the femtocells on the macrocell user. In [7], the authors introduced a new concept called docitive femtocells, where a new femtocell can speed up its learning process by learning the policies acquired by the already deployed femtocells, instead of learning from scratch. The policies are shared through Q-table exchange between the femtocells. However, after the Q-tables are exchanged, all the femtocells take their actions (powers) independently, which may generate oscillating behavior in the system. In [9], we developed a distributed power allocation algorithm called distributed power control using Q-learning (DPC-Q). In DPC-Q, two different learning paradigms were proposed: independent learning (IL) and cooperative learning (CL). It was shown that both paradigms achieve convergence. Moreover, the CL paradigm outperforms the IL one by achieving higher aggregate femtocell capacity and better fairness (in terms of capacity) among the learning femtocells.
However, in [9] we did not evaluate the performance of DPC-Q against the network dynamics, especially after convergence. Also, we did not have any benchmarking algorithm against which to compare the performance of DPC-Q. Thus, the contributions of this paper can be summarized as follows:
• we propose two new Q-learning based power allocation algorithms: namely, centralized power control using Q-learning (CPC-Q) and femto-based distributed power control using Q-learning (FBDPC-Q). CPC-Q is used for benchmarking purposes, where a central controller, which has all the information about the system (channel gains of all femtocells, system noise, etc.), is responsible for calculating the optimal powers that the femtocells should use. FBDPC-Q is proposed because: 1) it gives
the operator the flexibility to work on a global basis (e.g., aggregate femtocell capacity instead of subcarrier-based femtocell capacity as in SBDPC-Q), and 2) it makes SBDPC-Q comparable to CPC-Q.
• we evaluate the robustness and scalability of SBDPC-Q, in both the IL and CL paradigms, against two of the dynamics that typically exist in the wireless environment: namely, the random activity of femtocells (when new femtocells are deployed in the system during the learning process) and the density of the femtocells in the macrocell coverage area (the number of femtocells interfering on the macro users).
• we compare our proposed SBDPC-Q in both the IL and CL paradigms to the idea of docitive femtocells presented in [7].
The rest of this paper is organized as follows. In Section II, the system model is described. Section III presents a brief background on multi-agent Q-learning. In Section IV, the proposed Q-learning based power allocation algorithms are presented. The simulation scenario and the results are discussed in Section V. Finally, conclusions are drawn in Section VI.
II. SYSTEM MODEL
We consider a wireless network composed of one macrocell (with a single transmit and receive antenna at the Macro Base Station (MBS)) that coexists with Nf femtocells, each with a single transmit and receive antenna at the Femto Base Station (FBS). The Nf femtocells are placed indoors within the macrocell coverage area. Both the MBS and the FBSs transmit over the same K subcarriers, where orthogonal downlink transmission is assumed. Um = 1 macro user and Uf = 1 femto user are located randomly inside the macro and femto cells, respectively. Femtocells within the same range can share partial information during the learning process to enhance their performance.
$p_o^{(k)}$ and $p_n^{(k)}$ denote the transmission powers of the MBS and FBS n on subcarrier k, respectively. Moreover, the maximum transmission powers for the MBS and any FBS n are $P_{max}^{m}$ and $P_{max}^{f}$ respectively, where $\sum_{k=1}^{K} p_o^{(k)} \leq P_{max}^{m}$ and $\sum_{k=1}^{K} p_n^{(k)} \leq P_{max}^{f}$.
The system performance is analyzed in terms of the capacity measured in bits/sec/Hz. The capacity achieved by the MBS at its associated user on subcarrier k is:
$$C_o^{(k)} = \log_2\left(1 + \frac{h_{oo}^{(k)} p_o^{(k)}}{\sum_{n=1}^{N_f} h_{no}^{(k)} p_n^{(k)} + \sigma^2}\right) \quad (1)$$
where $h_{oo}^{(k)}$ indicates the channel gain between the transmitting MBS and its associated user on subcarrier k; $h_{no}^{(k)}$ indicates the channel gain between FBS n transmitting on subcarrier k and the macro user. Finally, $\sigma^2$ indicates the noise power. The capacity achieved by FBS n at its associated user on subcarrier k is:
$$C_n^{(k)} = \log_2\left(1 + \frac{h_{nn}^{(k)} p_n^{(k)}}{\sum_{n'=1, n' \neq n}^{N_f} h_{n'n}^{(k)} p_{n'}^{(k)} + h_{on}^{(k)} p_o^{(k)} + \sigma^2}\right) \quad (2)$$
where $h_{nn}^{(k)}$ indicates the channel gain between FBS n transmitting on subcarrier k and its associated user; $h_{n'n}^{(k)}$ indicates the channel gain between FBS n' transmitting on subcarrier k and the femto user associated to FBS n.
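To make the interference coupling in (1) and (2) concrete, the following is a minimal Python sketch of the per-subcarrier capacity computations. The array shapes and variable names are illustrative assumptions, not part of the system model.

```python
import numpy as np

def macro_capacity(k, h_oo, h_no, p_o, p_n, noise_power):
    """Eq. (1): capacity of the MBS at its user on subcarrier k.
    h_no[n, k] is the gain from FBS n to the macro user; p_n[n, k] is FBS n's power."""
    interference = np.sum(h_no[:, k] * p_n[:, k])          # sum over all femtocells
    sinr = (h_oo[k] * p_o[k]) / (interference + noise_power)
    return np.log2(1.0 + sinr)

def femto_capacity(n, k, h_nn, h_cross, h_on, p_o, p_n, noise_power):
    """Eq. (2): capacity of FBS n at its user on subcarrier k.
    h_cross[m, n, k] is the gain from FBS m to the user served by FBS n."""
    others = sum(h_cross[m, n, k] * p_n[m, k]
                 for m in range(p_n.shape[0]) if m != n)    # co-tier interference
    sinr = (h_nn[n, k] * p_n[n, k]) / (others + h_on[n, k] * p_o[k] + noise_power)
    return np.log2(1.0 + sinr)
```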
III. MULTI-AGENT Q-LEARNING
The scenario of distributed cognitive femtocells can be mathematically formulated using stochastic games [10], where the learning process of each femtocell is described by a task defined by the quintuple $\{N, S, A, P, R(s, a)\}$, where:
• $N = \{1, 2, \cdots, N_f\}$ is the set of agents (i.e., femtocells).
• $S = \{s_1, s_2, \cdots, s_m\}$ is the set of possible states that each agent can occupy, where m is the number of possible states.
• $A = \{a_1, a_2, \cdots, a_l\}$ is the set of possible actions that each agent can perform for each task, where l is the number of possible actions.
• P is the probabilistic transition function that defines the probability that an agent transits from one state to another, given the joint action performed by all agents.
• R(s, a) is the reward function that determines the reward fed back to an agent n by the environment when the joint action a is performed in state $s \in S$.
In the distributed cognitive femtocells scenario, P cannot be deduced due to the dynamics of the wireless environment. One of the best-known techniques that calculates optimal policies without any prior model of the environment is Q-learning. Q-learning assigns to each task of each agent a Q-table whose entries are known as Q-values $Q(s_m, a_l)$, for each state $s_m \in S$ and action $a_l \in A$. Thus, the dimension of this table is $m \times l$. The Q-value $Q(s_m, a_l)$ is defined to be the expected discounted reward over an infinite time when action $a_l$ is performed in state $s_m$ and an optimal policy is followed thereafter [5]. The learning process of each agent n at time t can be described as follows: 1) the agent senses the environment and observes its current state $s_m^n \in S$; 2) based on $s_m^n$, the agent selects its action $a_l^n$ randomly with probability $\epsilon$, or according to $a_l^n = \arg\max_{a \in A} Q_n^t(s_m^n, a)$ with probability $1 - \epsilon$, where $Q_n^t(s_m^n, a)$ is the row of the Q-table of agent n that corresponds to state $s_m^n$ at time t, and $\epsilon$ is an exploration parameter that guarantees that all the state-action pairs of the Q-table are visited at least once; 3) the environment makes a transition to a new state $s_{m'}^n \in S$ and the agent receives a reward $r_n^t = R(s_m^n, a)$ due to this transition; 4) the Q-value is updated using equation (3) and the process is repeated.
$$Q_n^{t+1}(s_m^n, a_l^n) := (1 - \alpha) Q_n^t(s_m^n, a_l^n) + \alpha \left( r_n^t + \gamma \max_{a' \in A} Q_n^t(s_{m'}^n, a') \right) \quad (3)$$
where $\alpha$ is called the learning rate and $0 \leq \gamma \leq 1$ is the discount factor that determines how much effect future rewards have on the decisions at each moment. It should be noted that the reward $r_n^t$ depends on the joint action a of all agents, not on the individual action $a_l$. This is the main difference between the multi-agent scenario described here and the single-agent one (when $N_f = 1$). In the single-agent case, one of the conditions needed to guarantee that the Q-values converge to the optimal ones is that the reward of the agent must depend only on its individual actions (i.e., the reward function is stationary for each state-action pair) [5], [11]. However, in the multi-agent scenario, the reward function is not stationary from the agent's point of view, since it now depends on the actions of other agents. Thus, the convergence proof used for the single-agent case cannot be used in the multi-agent one.
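As a concrete illustration of the single-agent learning loop described above (ε-greedy action selection and the update in (3)), the following is a minimal Python sketch. The class layout and function names are our own illustration; the default parameter values follow the simulation settings in Section V, and the paper's own experiments were run in MATLAB.

```python
import numpy as np

class QAgent:
    """One learning task: an m-by-l Q-table, epsilon-greedy selection, update (3)."""
    def __init__(self, num_states, num_actions, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.Q = np.zeros((num_states, num_actions))   # Q-values initialized to zero
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, state):
        # Explore with probability epsilon, otherwise act greedily on the Q-row.
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(self.Q.shape[1]))
        return int(np.argmax(self.Q[state]))

    def update(self, state, action, reward, next_state):
        # Equation (3): Q <- (1 - alpha) Q + alpha (r + gamma * max_a' Q(s', a')).
        target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] = (1 - self.alpha) * self.Q[state, action] + self.alpha * target
```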
IV. POWER ALLOCATION USING Q-LEARNING
In this section, the three proposed Q-learning based power allocation algorithms are presented.
A. Subcarrier-based Distributed Power Control Using Q-learning (SBDPC-Q)
SBDPC-Q is a distributed algorithm in which multiple agents (i.e., femtocells) aim at learning a sub-optimal decision policy (i.e., power allocation) by repeatedly interacting with the environment. The SBDPC-Q algorithm is proposed in two different learning paradigms:
• Independent learning (IL): In this paradigm, each agent learns independently from other agents (i.e., it ignores other agents' actions and considers them part of the environment). Although this may lead to oscillations and convergence problems, the IL paradigm has shown good results in many applications [8].
• Cooperative learning (CL): In this paradigm, each agent shares a portion of its Q-table with all other cooperating agents, aiming at enhancing the femtocells' performance. (We assume that the shared portion of the Q-table is carried in the control bits of the packets exchanged between the femtocells; the details of the exact protocol lie outside the scope of this paper.) CL is performed as follows: each agent shares the row of its Q-table that corresponds to its current state with all other cooperating agents (i.e., femtocells in the same range). Then, each agent n selects its action $a_l^n$ according to the following equation:
$$a_l^n = \arg\max_{a \in A} \left( \sum_{n'=1}^{N_f} Q_{n'}(s^{n'}, a) \right) \quad (4)$$
The main idea behind this strategy is explained in detail in [9]. In terms of overhead, if the number of femtocells is $N_f$, then the total overhead needed is $N_f (N_f - 1)$ messages (each of size l) per unit time (i.e., the overhead is quadratic in the number of cooperating femtocells).
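The cooperative selection rule in (4) can be sketched as follows in Python; it simply sums, action-wise, the Q-rows received from the cooperating femtocells before taking the arg max. The message format and helper names are assumptions for illustration only.

```python
import numpy as np

def cooperative_action(own_q_row, shared_q_rows):
    """Eq. (4): pick the action maximizing the sum of the cooperating agents' Q-rows.
    own_q_row: length-l Q-row for this femtocell's current state.
    shared_q_rows: Q-rows received from the other cooperating femtocells."""
    total = np.asarray(own_q_row, dtype=float).copy()
    for row in shared_q_rows:
        total += np.asarray(row, dtype=float)
    return int(np.argmax(total))

# Per learning iteration, each femtocell broadcasts its current Q-row to the
# others (N_f * (N_f - 1) messages of size l in total) and then acts cooperatively.
```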
The agents, states, actions, and reward functions used for the SBDPC-Q algorithm are defined as follows (a short sketch of the state and reward computations is given after this list):
• Agents: $FBS_n, \forall 1 \leq n \leq N_f$
• States: At time t, for femtocell n on subcarrier k, the state is defined as $s_t^{n,k} = \{I_t^k, P_t^n\}$, where $I_t^k \in \{0, 1\}$ indicates the level of interference measured at the macro-user on subcarrier k at time t:
$$I_t^k = \begin{cases} 1, & C_o^{(k)} < \Gamma_o \\ 0, & C_o^{(k)} \geq \Gamma_o \end{cases} \quad (5)$$
where $\Gamma_o$ is the target capacity determining the QoS performance of the macrocell. We assume that the macrocell reports the value of $C_o^{(k)}$ to all FBSs through the backhaul connection.
$P_t^n$ defines the power levels used to quantize the total power FBS n is using for transmission at time t:
$$P_t^n = \begin{cases} 0, & \sum_{k=1}^{K} p_t^{n,k} < (P_{max}^f - A1) \\ 1, & (P_{max}^f - A2) \leq \sum_{k=1}^{K} p_t^{n,k} \leq P_{max}^f \\ 2, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (6)$$
where A1 and A2 are arbitrarily selected thresholds (several values of A1 and A2, as well as more power levels, were tried in the simulations and the performance gain between these values was marginal).
• Actions: The action here is scalar, where the set of actions available to each FBS is defined as the set of possible powers that an FBS can use for transmission on each subcarrier. In the simulations, a range from -20 dBm to $P_{max}^f$ with a step of 2 dBm is used.
• Reward Functions: The reward fed back to agent n on subcarrier k at time t is defined as:
$$r_t^{n,k} = \begin{cases} e^{-(C_o^{(k)} - \Gamma_o)^2} - e^{-C_n^{(k)}}, & \sum_{k=1}^{K} p_t^{n,k} \leq P_{max}^f \\ -2, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (7)$$
The rationale behind this reward function is that each femtocell aims at maximizing its own capacity while: 1) maintaining the capacity of the macrocell around the target capacity $\Gamma_o$ (convergence is assumed to be within a range of ±1 bits/sec/Hz from $\Gamma_o$), and 2) not exceeding the allowed $P_{max}^f$.
This reward function was compared to the reward function defined in [9]:
$$r_t^{n,k} = \begin{cases} e^{-(C_o^{(k)} - \Gamma_o)^2}, & \sum_{k=1}^{K} p_t^{n,k} \leq P_{max}^f \\ -1, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (8)$$
where it was shown that both reward functions maintain the capacity of the macrocell within the convergence range. However, reward function (7) was able to achieve higher aggregate femtocell capacity. In this paper, we show another advantage of reward function (7): it learns (explores or reacts to network dynamics) better than reward function (8), even when the exploration parameter $\epsilon$ is not used. This mainly depends on the initial value of the Q-values. In this paper, we initially set all the Q-values to zero. Thus, when $\epsilon$ is not used, reward function (8) will always feed the agent back with a positive reward (given that $P_{max}^f$ is not exceeded). Thus, if agent n was initially in state $s_t^{n,k}$ on subcarrier k and took action $p_t^{n,k}$, the Q-value of this action $Q_n(s^{n,k}, p^{n,k})$ will be updated using a positive-valued reward; this Q-value will therefore increase with time, and agent n will keep using the same action forever (since the action is chosen according to the maximum Q-value). Thus, using $\epsilon$ with reward function (8) is a must in order to obtain better exploration behavior. On the other hand, reward function (7) may feed the agent back with positive- or negative-valued rewards ($e^{-(C_o^{(k)} - \Gamma_o)^2}$ could be smaller than $e^{-C_n^{(k)}}$). Thus, given the same initial conditions, agent n could receive a negative-valued reward after taking action $p_t^{n,k}$, leading to a decrease of its Q-value with time. Once the Q-value decreases below zero, the agent will take another action whose Q-value is greater than the decreased one. Thus, reward function (7) learns (explores) better than reward function (8).
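The following Python sketch illustrates, under assumed inputs, how a femtocell could form its per-subcarrier state (5)-(6) and the reward R1 in (7). The variable names are illustrative, and all quantities are assumed to be expressed in consistent units.

```python
import numpy as np

def subcarrier_state(c_o_k, gamma_o, total_power, p_max_f, a1, a2):
    """State (5)-(6): macro interference indicator plus quantized total-power level."""
    interference_flag = 1 if c_o_k < gamma_o else 0        # Eq. (5)
    if total_power > p_max_f:                               # Eq. (6)
        power_level = 2
    elif total_power < p_max_f - a1:
        power_level = 0
    else:
        power_level = 1   # (p_max_f - a2) <= total_power <= p_max_f when a1 == a2
    return (interference_flag, power_level)

def reward_r1(c_o_k, c_n_k, gamma_o, total_power, p_max_f):
    """Reward R1 in (7): track the macrocell target while maximizing own capacity."""
    if total_power > p_max_f:
        return -2.0
    return np.exp(-(c_o_k - gamma_o) ** 2) - np.exp(-c_n_k)
```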
In this paper, we also evaluate the robustness and scalability of SBDPC-Q in both the IL and CL paradigms. We believe that the CL paradigm is much more robust and scalable against the network dynamics compared to the IL paradigm. The reason is that, after sharing the row of the Q-table, each femtocell knows the states that all other cooperating femtocells are occupying, and since a state at a certain moment can be defined as how the agent sees the environment at that moment, each femtocell can implicitly know 1) how all other femtocells can react to the network dynamics, and 2) what actions other femtocells are going to take. However, if the femtocells took their actions independently (i.e., the IL paradigm), even after knowing each other's states, oscillating behaviors that may not reach convergence could be generated. One way to overcome this problem is to force the femtocells to make use of the shared information while taking their actions (i.e., taking the actions cooperatively according to equation (4)). This decreases the oscillations in the system, making the femtocells more robust towards the increase in the number of deployed femtocells, and towards the sudden effect caused by any newly deployed femtocell.
B. Femto-based Distributed Power Control Using Q-learning (FBDPC-Q)
FBDPC-Q is a multi-agent algorithm whose states, actions, and reward functions are defined for each femtocell over all subcarriers. FBDPC-Q gives the operator the flexibility to work on a global basis (e.g., aggregate femtocell capacity instead of subcarrier-based femtocell capacity as in SBDPC-Q). Thus, it makes comparing the performance of the SBDPC-Q algorithm to the global optimal values easier. Like SBDPC-Q, FBDPC-Q works in both the IL and CL paradigms. The agents, states, actions and reward functions used for the FBDPC-Q algorithm are defined as follows:
• Agents: $FBS_n, \forall 1 \leq n \leq N_f$
• States: At time t, the state is defined as $s_t = \{I_t\}$, where $I_t \in \{0, 1\}$ indicates the level of interference measured at the macro-user over all subcarriers at time t:
$$I_t = \begin{cases} 1, & C_o < \beta^o \\ 0, & C_o \geq \beta^o \end{cases} \quad (9)$$
where $C_o = \sum_{k=1}^{K} C_o^{(k)}$ is the aggregate macrocell capacity and $\beta^o$ is the target aggregate macrocell capacity.
• Actions: For FBS n, the set of actions is defined as a set of vectors, where each vector represents the powers FBS n uses on all subcarriers.
• Reward Functions: Reward function (7) can be redefined as:
$$r_t^n = e^{-(C_o - \beta^o)^2} - e^{-C_n} \quad (10)$$
where $C_n = \sum_{k=1}^{K} C_n^{(k)}$ is the aggregate capacity of FBS n. Note that since FBDPC-Q is not subcarrier based, a power vector in which $P_{max}^f$ is exceeded will never be assigned to any FBS. Thus, there is no need for a negative reward here as in the SBDPC-Q case. The same goes for CPC-Q.
C. Centralized Power Control Using Q-learning (CPC-Q)
CPC-Q is a centralized power control algorithm used to evaluate the performance of our proposed FBDPC-Q algorithm. CPC-Q can be regarded as the single-agent version of FBDPC-Q, and hence, its convergence to the optimal Q-values, and thus to the optimal powers, is guaranteed. However, using a centralized controller is not feasible in terms of overhead in multi-agent scenarios. Thus, CPC-Q works only for small scale problems. The agent, states, actions and reward functions used for CPC-Q are defined as follows:
• Agents: A centralized controller.
• States: The same as FBDPC-Q.
• Actions: For the central controller, the set of actions is defined as a set of matrices, where each matrix represents the powers of all femtocells over all subcarriers. However, the size of this set grows exponentially with both the number of femtocells and the number of subcarriers. Thus, forming the matrices (all possible actions) from a large set of powers such as the one used in SBDPC-Q is infeasible2; a small counting sketch is given after this list.
• Reward Functions: Since CPC-Q is global, reward function (7) can be redefined as:
$$r_t = e^{-(C_o - \beta^o)^2} - e^{-C_{femto}} \quad (11)$$
where $C_{femto}$ is defined as $C_{femto} = \sum_{n=1}^{N_f} \sum_{k=1}^{K} C_n^{(k)}$.
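To illustrate why CPC-Q quickly becomes infeasible while FBDPC-Q remains manageable, the short Python snippet below counts the action-set sizes under the power set {0, 6, 12} dBm quoted in the simulations. The dBm-to-mW conversion and the rule that vectors exceeding $P_{max}^f$ are never assigned follow our reading of the constraints stated above; the counting itself is our illustration, not part of the paper.

```python
from itertools import product

def feasible_vectors(power_levels_dbm, k, p_max_dbm):
    """Per-femtocell action vectors over K subcarriers whose total power (in mW)
    does not exceed P_max^f, as assumed for FBDPC-Q and CPC-Q."""
    to_mw = lambda dbm: 10 ** (dbm / 10.0)
    p_max_mw = to_mw(p_max_dbm)
    return [v for v in product(power_levels_dbm, repeat=k)
            if sum(to_mw(p) for p in v) <= p_max_mw]

# Powers {0, 6, 12} dBm, K = 3 subcarriers, P_max^f = 15 dBm as in the simulations.
vectors = feasible_vectors([0, 6, 12], 3, 15)
print(len(vectors))                      # per-femtocell |A| for FBDPC-Q
for n_f in (2, 3, 4, 5):
    print(n_f, len(vectors) ** n_f)      # joint |A| for CPC-Q grows exponentially
```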
Finally, for the rest of the paper, reward functions (7), (11) and (10) will be referred to as R1, while reward function (8) will be referred to as R0. The three proposed algorithms are compared qualitatively in Table I.
TABLE I. TAXONOMY OF THE PROPOSED ALGORITHMS.
Complexity — SBDPC-Q/IL: action is scalar; SBDPC-Q/CL: action is scalar; FBDPC-Q: |A| grows exponentially with K; CPC-Q: |A| grows exponentially with Nf and K.
Reaction to network dynamics — SBDPC-Q/IL: inefficient & non-robust; SBDPC-Q/CL: efficient & robust; FBDPC-Q: CL is more efficient & robust than IL; CPC-Q: -.
Scalability — SBDPC-Q/IL: inefficient at large Nf; SBDPC-Q/CL: efficient at large Nf; FBDPC-Q: CL is more scalable than IL; CPC-Q: infeasible at large Nf.
Speed of convergence — SBDPC-Q/IL: medium convergence; SBDPC-Q/CL: fast convergence; FBDPC-Q: CL is faster than IL; CPC-Q: slow convergence since |A| is huge.
Overhead — SBDPC-Q/IL: none; SBDPC-Q/CL: Nf^2 - Nf messages, each of size |A|; FBDPC-Q: CL has larger overhead than IL; CPC-Q: huge.
2 In the simulations, the set of powers used to form the matrices and the
vectors in CPC-Q and FBDPC-Q respectively is: {0, 6, 12} dBm.
V. PERFORMANCE EVALUATION
A. Simulation Scenario
We consider a wireless network consisting of one macrocell serving Um = 1 macro user, underlaid with Nf femtocells. Each femtocell serves Uf = 1 femto-user, which is randomly located in the femtocell coverage area. All of the macro and femto cells share the same frequency band composed of K subcarriers, where orthogonal downlink transmission is assumed. In the simulations, K changes according to the algorithm used: for SBDPC-Q, K = 6, while for both CPC-Q and FBDPC-Q, K = 3. The channel gain between any transmitter i and any receiver j on subcarrier k is assumed to be path-loss dominated and is given by:
$$h_{ij}^{(k)} = d_{ij}^{(-PL)} \quad (12)$$
where $d_{ij}$ is the physical distance between transmitter i and receiver j, and PL is the path loss exponent. In the simulations, PL = 2 is used. The distances are calculated according to the following assumptions: 1) the maximum distance between the MBS and its associated user is set to 1000 meters, 2) the maximum distance between the MBS and a femto-user is set to 800 meters, 3) the maximum distance between an FBS and its associated user is set to 80 meters, 4) the maximum distance between an FBS and another femtocell's user is set to 300 meters, and 5) the maximum distance between an FBS and the macro-user is set to 800 meters.
We used MATLAB on a cluster computing facility with 300 cores to simulate this scenario. In the simulations, we set the noise power $\sigma^2$ to $10^{-7}$, the maximum transmission power of the macrocell $P_{max}^m$ to 43 dBm, the maximum transmission power of each femtocell $P_{max}^f$ to 15 dBm, each of the thresholds A1 and A2 to 5 dBm, the learning rate $\alpha$ to 0.5, the discount factor $\gamma$ to 0.9 and the exploration parameter $\epsilon$ to 0.1 [7], [9].
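As a rough sketch of how such a scenario could be generated (random distances within the stated maxima and path-loss gains per (12)), consider the following Python snippet. The uniform distance draws and array layout are our own simplifications, not the authors' MATLAB setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def pathloss_gain(distance_m, pl_exponent=2.0):
    """Eq. (12): path-loss-dominated channel gain h_ij = d_ij^(-PL)."""
    return distance_m ** (-pl_exponent)

def draw_scenario(n_f, k):
    """Draw distances uniformly up to the stated maxima and build per-subcarrier gains."""
    d_mbs_user = rng.uniform(1.0, 1000.0)               # MBS to its macro user
    d_fbs_own = rng.uniform(1.0, 80.0, size=n_f)        # each FBS to its own user
    d_fbs_macro = rng.uniform(1.0, 800.0, size=n_f)     # each FBS to the macro user
    h_oo = np.full(k, pathloss_gain(d_mbs_user))        # same gain on every subcarrier
    h_no = np.tile(pathloss_gain(d_fbs_macro)[:, None], (1, k))
    h_nn = np.tile(pathloss_gain(d_fbs_own)[:, None], (1, k))
    return h_oo, h_no, h_nn

h_oo, h_no, h_nn = draw_scenario(n_f=5, k=6)   # e.g., SBDPC-Q uses K = 6
```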
B. Numerical Results
Figure 1(a) shows the aggregate femtocells capacity (as a function of the number of femtocells) using CPC-Q and FBDPC-Q with R1 in both the IL and CL paradigms. It can be observed that CL is much better than IL, and that the aggregate capacity gain of CPC-Q over FBDPC-Q in the CL case is marginal. Since CPC-Q is considered the single-agent version of FBDPC-Q, it should converge to the global optimal values. This is shown in the figure for a small number of femtocells (Nf = 1 and 2). The optimal values are calculated using exhaustive search over all possible actions, where the optimal value is defined as the maximum aggregate capacity the system can achieve while maintaining the capacity of the macrocell in the convergence range (±1 bits/sec/Hz from $\beta^o$). However, starting from Nf = 3, CPC-Q begins to be infeasible since the size of the possible actions set A becomes very large (at Nf = 5: |A| = 3,200,000). So, besides the computational problems, the condition of visiting all state-action pairs becomes infeasible. Thus, obtaining the optimal value is also not feasible (which is why we stopped CPC-Q at Nf = 4).
Fig. 1. Aggregate femtocells capacity using CPC-Q and FBDPC-Q with R1 in both IL and CL paradigms. (a) Aggregate femtocells capacity versus the number of femtocells (curves: Exhaustive Search, CPC-Q, FBDPC-Q_R1_IL, FBDPC-Q_R1_CL). (b) Aggregate femtocells capacity versus learning iterations (curves: Exhaustive Search, FBDPC-Q_R1_IL_share, FBDPC-Q_R1_IL_scratch, FBDPC-Q_R1_CL_share, FBDPC-Q_R1_CL_scratch).
Note also that we stopped the exhaustive search at Nf = 5
due to complexity and memory problems, while FBDPC-Q is
shown at Nf = 6 and 7 just to illustrate the continuity of our
algorithm.
Figures 2 and 3 show the robustness of the proposed SBDPC-Q algorithm. In these figures, we started with Nf = 5, then added a new femtocell after every 4000 iterations3 to reach Nf = 29 at the 96000th iteration. Finally, we added another femtocell at the 99000th iteration. The figures show how SBDPC-Q using the CL paradigm is more robust to the deployment of new femtocells compared to the IL paradigm. Moreover, in these figures we compare the performance of SBDPC-Q to the docitive idea presented in [7]. We investigated two cases: 1) the already deployed femtocells share their Q-tables with the new femtocells when they first join the system (suffixed with "share" in the figures), and 2) the newly deployed femtocells start with zero-initialized Q-tables (suffixed with "scratch" in the figures). Figure 2 shows the macrocell convergence on a certain subcarrier using SBDPC-Q with R1 in both the IL and CL paradigms, where it can be observed that the CL paradigm maintains the macrocell capacity within the range of convergence (6 ± 1 bits/sec/Hz) and reacts well to the effect of the newly deployed femtocells, without the need for a new learning phase every time a new femtocell is deployed, which is a very interesting observation. It can also be observed that our proposed CL paradigm converges to the same value regardless of whether the already deployed femtocells shared their Q-tables with the new ones or not. So, sharing could be skipped, thus decreasing the overall overhead. On the other hand, the IL paradigm showed a very bad reaction to the network dynamics, where 1) convergence is not attained (i.e., an oscillating behavior is generated), and 2) as Nf increases, the IL paradigm may push the macrocell capacity out of the convergence range when the network becomes more dense. Thus, CL is more scalable than IL. However, it can be noticed that the docitive idea is useful in the IL
paradigm, where sharing the Q-tables of the already deployed femtocells with the new ones is much better (in terms of the value around which the macrocell capacity oscillates) than beginning with zero-initialized (scratch) Q-tables. In terms of speed of convergence, it can be noticed that, although the learning process may need a large number of iterations initially, CL decreases the dynamics of the learning process and hence makes it faster. This can be noticed from the figure, where CL converged at almost the 300th iteration, which is much earlier than the IL paradigm. Also, after the deployment of each new femtocell, CL took less than 10 iterations only - around 0.01 seconds - to re-achieve convergence.
Fig. 2. Convergence of the macrocell capacity over the Q-iterations on a certain subcarrier, where Nf was initially 5 and then incremented until Nf = 30 (curves: SBDPC-Q_R1_IL_scratch, SBDPC-Q_R1_CL_scratch, SBDPC-Q_R1_IL_share, SBDPC-Q_R1_CL_share).
Figure 3 shows the aggregate femtocells capacity over the learning iterations. It can be noticed that, using the CL paradigm, the aggregate capacity increases as more femtocells are deployed in the network, while in the IL paradigm, since convergence is already not maintained, the aggregate capacity has a sporadic behavior, which indicates clearly that IL is not efficient in reacting to the network dynamics. However, in the CL paradigm, from the 64000th to the 72000th iteration (640th to 720th according to the figure's scale), it can be noticed that the aggregate capacity decreases. The reason is that as more femtocells are deployed, the network becomes very dense, and since the CL paradigm makes the cooperating femtocells use the same powers, this may force the macrocell capacity to violate the range of convergence. Thus, all the femtocells have to decrease the power used in order to maintain the macrocell capacity within the range of convergence again, leading to the decrease of their aggregate capacity. Note that at the 64000th and 72000th iterations, $\epsilon$ is already removed, which proves that R1 learns well even when $\epsilon$ is removed.
Fig. 3. Aggregate femtocells capacity over the Q-iterations, where Nf was initially 5 and then incremented until Nf = 30 (curves: SBDPC-Q_R1_IL_scratch, SBDPC-Q_R1_CL_scratch, SBDPC-Q_R1_IL_share, SBDPC-Q_R1_CL_share).
Finally, in order to compare the aggregate capacity the CL paradigm achieves, after the incremental deployment of femtocells, to the ideal value, we used the small scale problem again. This is shown in Figure 1(b), where we started with Nf = 2 and added an extra femtocell at the 8000th, 12000th and 13500th iterations4. Again, it can be observed that CL achieves an aggregate capacity that is very close to the optimal one, while the IL paradigm is far from it.
VI. CONCLUSION
In this paper, three Q-learning based power allocation algorithms for the cognitive femtocells scenario are presented: namely, SBDPC-Q, FBDPC-Q and CPC-Q. Although SBDPC-Q was presented in previous work as DPC-Q, in this paper it is extended, in both of its learning paradigms (IL and CL), to evaluate its performance, robustness and scalability. In terms of performance, SBDPC-Q is extended to FBDPC-Q and then compared to CPC-Q, where the simulations showed that the CL paradigm outperforms the IL one and achieves an aggregate femtocell capacity that is very close to the optimal one. In terms of robustness, the CL paradigm was found to be much more robust against the deployment of new femtocells during the learning process, where the results showed that the CL paradigm outperforms the IL paradigm in: 1) maintaining convergence, 2) learning better (i.e., reacting better to the network dynamics), especially when a suitable reward function such as the one defined in the simulations is used, 3) converging to the target capacity regardless of whether the old femtocells share their experience (i.e., Q-tables) with the newly deployed ones or not, and 4) speeding up the convergence. Finally, in terms of scalability, the CL paradigm reacted better to the network dynamics and maintained convergence, even when the number of femtocells is large.
REFERENCES
[1] V. Chandrasekhar, J. Andrews, and A. Gatherer, "Femtocell networks: a survey," IEEE Communications Magazine, vol. 46, September 2008.
[2] S. Saunders, S. Carlaw, A. Giustina et al., Femtocells: Opportunities and Challenges for Business and Technology. Great Britain: John Wiley and Sons Ltd, 2009.
[3] P. Xia, V. Chandrasekhar, and J. G. Andrews, "Open vs closed access femtocells in the uplink," CoRR, vol. abs/1002.2964, 2010.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[5] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, 1992.
[6] J. R. Kok and N. Vlassis, "Collaborative multiagent reinforcement learning by payoff propagation," J. Mach. Learn. Res., vol. 7, December 2006.
[7] A. Galindo-Serrano, L. Giupponi, and M. Dohler, "Cognition and docition in OFDMA-based femtocell networks," in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), May 2010.
[8] A. Galindo-Serrano and L. Giupponi, "Distributed Q-learning for interference control in OFDMA-based femtocell networks," in Proceedings of the 71st IEEE Vehicular Technology Conference, 2010.
[9] H. Saad, A. Mohamed, and T. ElBatt, "Distributed cooperative Q-learning for power allocation in cognitive femtocell networks," in Proceedings of the IEEE 76th Vehicular Technology Conference, Sep. 2012.
[10] A. Burkov and B. Chaib-draa, "Labeled initialized adaptive play Q-learning for stochastic games," in Proceedings of the AAMAS'07 Workshop on Adaptive and Learning Agents (ALAg'07), May 2007.
[11] F. S. Melo, "Convergence of Q-learning: A simple proof," Institute of Systems and Robotics, Tech. Rep., 2001.
3 In Figures 2 and 3, $\epsilon$ is removed at the 50000th iteration and the figures were drawn with a step of 100 in order to achieve better resolution.
4 In Figure 1(b), $\epsilon$ is removed at the 12000th iteration and the figure was drawn with a step of 10 in order to achieve better resolution.