A Cooperative Q-learning Approach for Online Power Allocation in Femtocell Networks

Hussein Saad∗, Amr Mohamed† and Tamer ElBatt∗
∗Wireless Intelligence Network Center (WINC), Nile University, Cairo, Egypt. hussein.saad@nileu.edu.eg, telbatt@nileuniversity.edu.eg
†Computer Science and Engineering Department, Qatar University, P.O. Box 2713, Doha, Qatar. amrm@qu.edu.qa
This work was made possible by NPRP 5 - 782 - 2 - 322 from the Qatar National Research Fund (a member of Qatar Foundation).

Abstract—In this paper, we address the problem of distributed interference management for cognitive femtocells that share the same frequency range with macrocells, using distributed multi-agent Q-learning. We formulate and solve three problems representing three different Q-learning algorithms: centralized, femto-based distributed, and subcarrier-based distributed power control using Q-learning (CPC-Q, FBDPC-Q and SBDPC-Q). CPC-Q, although not of practical interest, characterizes the global optimum. Each of FBDPC-Q and SBDPC-Q works in two different learning paradigms: Independent Learning (IL) and Cooperative Learning (CL). The former is the simplest form of Q-learning in multi-agent scenarios, where all the femtocells learn independently. The latter is the proposed scheme, in which femtocells share partial information during the learning process in order to strike a balance between practical relevance and performance. In terms of performance, the simulation results show that the CL paradigm outperforms the IL paradigm and achieves an aggregate femtocell capacity that is very close to the optimal one. Regarding practical relevance, we evaluate the robustness and scalability of SBDPC-Q, in real time, by deploying new femtocells in the system during the learning process, and we show that SBDPC-Q in the CL paradigm scales to a large number of femtocells and is more robust to network dynamics than the IL paradigm.

I. INTRODUCTION

Femtocells have recently been proposed as a promising solution to the indoor coverage problem. Although femtocells offer significant benefits to both the operator and the user, several challenges have to be solved to fully reap these benefits. One of the most daunting challenges is their interference on macro-users and other femtocells [1], [2]. Typically, femtocells are installed by the end user and hence their number and positions are random and unknown to the network operator a priori. Adding to this the typical dynamics of the wireless environment, a centralized approach to handling the interference problem is not feasible, which in turn calls for distributed interference management strategies. Based on these observations, in this paper we focus on closed-access femtocells [3] operating in the same bandwidth as macrocells (i.e. cognitive femtocells), where the femtocells are the secondary users that try to perform power control to maximize their own performance while maintaining the macrocell capacity at a certain level. In order to handle the interference generated by the femtocells on the macrocell users, we use a distributed reinforcement learning [4] technique called multi-agent Q-learning [5], [6]. In our context, a prior model of the environment cannot be obtained due to 1) the unplanned placement of the femtocells, and 2) the typical dynamics of the wireless environment. In such a context, Q-learning offers significant advantages, achieving optimal decision policies through real-time learning of the environment [7].
In the literature, Q-learning has been used several times to perform power allocation in femtocell networks. In [8], the authors used independent learning (IL) Q-learning to perform power allocation in order to control the interference generated by the femtocells on the macrocell user. In [7], the authors introduced a new concept called docitive femtocells, where a newly deployed femtocell can speed up its learning process by learning the policies acquired by the already deployed femtocells instead of learning from scratch. The policies are shared by exchanging Q-tables between the femtocells. However, after the Q-tables are exchanged, all the femtocells take their actions (powers) independently, which may generate an oscillating behavior in the system. In [9], we developed a distributed power allocation algorithm called distributed power control using Q-learning (DPC-Q). In DPC-Q, two different learning paradigms were proposed: independent learning (IL) and cooperative learning (CL). It was shown that both paradigms achieve convergence. Moreover, the CL paradigm outperforms the IL one by achieving higher aggregate femtocell capacity and better fairness (in terms of capacity) among the learning femtocells. However, in [9] we did not evaluate the performance of DPC-Q against network dynamics, especially after convergence. Also, we did not have any benchmark algorithm against which to compare the performance of DPC-Q. Thus, the contributions of this paper can be summarized as follows:
• We propose two new Q-learning based power allocation algorithms: centralized power control using Q-learning (CPC-Q) and femto-based distributed power control using Q-learning (FBDPC-Q). CPC-Q is used for benchmarking purposes, where a central controller, which has all the information about the system (channel gains of all femtocells, system noise, etc.), is responsible for calculating the optimal powers that the femtocells should use. FBDPC-Q is proposed because: 1) it gives the operator the flexibility to work on a global basis (e.g. aggregate femtocell capacity instead of the per-subcarrier femtocell capacity used in SBDPC-Q), and 2) it makes SBDPC-Q comparable to CPC-Q.
• We evaluate the robustness and scalability of SBDPC-Q, in both the IL and CL paradigms, against two of the dynamics that typically exist in the wireless environment: the random activity of femtocells (new femtocells deployed in the system during the learning process) and the density of femtocells in the macrocell coverage area (the number of femtocells interfering on the macro users).
• We compare our proposed SBDPC-Q, in both the IL and CL paradigms, to the idea of docitive femtocells presented in [7].
The rest of this paper is organized as follows. In Section II, the system model is described. Section III presents a brief background on multi-agent Q-learning. In Section IV, the proposed Q-learning based power allocation algorithms are presented. The simulation scenario and the results are discussed in Section V. Finally, conclusions are drawn in Section VI.

II. SYSTEM MODEL

We consider a wireless network composed of one macrocell (with a single transmit and receive antenna Macro Base Station (MBS)) that coexists with Nf femtocells, each with a single transmit and receive antenna Femto Base Station (FBS). The Nf femtocells are placed indoors within the macrocell coverage area. Both the MBS and the FBSs transmit over the same K subcarriers, where orthogonal downlink transmission is assumed.
Um = 1 macro user and Uf = 1 femto user are located randomly inside the macrocell and each femtocell, respectively. Femtocells within the same range can share partial information during the learning process to enhance their performance. $p_o^{(k)}$ and $p_n^{(k)}$ denote the transmission powers of the MBS and of FBS n on subcarrier k, respectively. Moreover, the maximum transmission powers of the MBS and of any FBS n are $P_{max}^m$ and $P_{max}^f$ respectively, where $\sum_{k=1}^{K} p_o^{(k)} \le P_{max}^m$ and $\sum_{k=1}^{K} p_n^{(k)} \le P_{max}^f$. The system performance is analyzed in terms of the capacity measured in bits/sec/Hz. The capacity achieved by the MBS at its associated user on subcarrier k is:

$$C_o^{(k)} = \log_2\left(1 + \frac{h_{oo}^{(k)} p_o^{(k)}}{\sum_{n=1}^{N_f} h_{no}^{(k)} p_n^{(k)} + \sigma^2}\right) \quad (1)$$

where $h_{oo}^{(k)}$ denotes the channel gain between the transmitting MBS and its associated user on subcarrier k, $h_{no}^{(k)}$ denotes the channel gain between FBS n transmitting on subcarrier k and the macro user, and $\sigma^2$ denotes the noise power. The capacity achieved by FBS n at its associated user on subcarrier k is:

$$C_n^{(k)} = \log_2\left(1 + \frac{h_{nn}^{(k)} p_n^{(k)}}{\sum_{n'=1, n' \neq n}^{N_f} h_{n'n}^{(k)} p_{n'}^{(k)} + h_{on}^{(k)} p_o^{(k)} + \sigma^2}\right) \quad (2)$$

where $h_{nn}^{(k)}$ denotes the channel gain between FBS n transmitting on subcarrier k and its associated user, and $h_{n'n}^{(k)}$ denotes the channel gain between FBS n' transmitting on subcarrier k and the femto user associated with FBS n.
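As an illustration, the sketch below evaluates the capacity expressions (1) and (2) for a given power allocation; the array shapes and the names (h_oo, h_no, h_ff, h_of, p_o, p_f) are our own choices for this sketch and are not part of the paper.

```python
import numpy as np

def macro_capacity(k, h_oo, h_no, p_o, p_f, noise=1e-7):
    """Eq. (1): macrocell capacity on subcarrier k.
    h_oo[k]: gain MBS -> macro user; h_no[n, k]: gain FBS n -> macro user;
    p_o[k]: MBS power; p_f[n, k]: power of FBS n on subcarrier k."""
    interference = np.sum(h_no[:, k] * p_f[:, k])
    return np.log2(1.0 + h_oo[k] * p_o[k] / (interference + noise))

def femto_capacity(n, k, h_ff, h_of, p_o, p_f, noise=1e-7):
    """Eq. (2): capacity of FBS n on subcarrier k.
    h_ff[i, j, k]: gain FBS i -> user of FBS j; h_of[n, k]: gain MBS -> user of FBS n."""
    Nf = p_f.shape[0]
    cross = sum(h_ff[m, n, k] * p_f[m, k] for m in range(Nf) if m != n)
    sinr = h_ff[n, n, k] * p_f[n, k] / (cross + h_of[n, k] * p_o[k] + noise)
    return np.log2(1.0 + sinr)
```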
III. MULTI-AGENT Q-LEARNING

The scenario of distributed cognitive femtocells can be mathematically formulated using stochastic games [10], where the learning process of each femtocell is described by a task defined by the quintuple {N, S, A, P, R(s, a)}, where:
• N = {1, 2, ..., Nf} is the set of agents (i.e. femtocells).
• S = {s1, s2, ..., sm} is the set of possible states that each agent can occupy, where m is the number of possible states.
• A = {a1, a2, ..., al} is the set of possible actions that each agent can perform for each task, where l is the number of possible actions.
• P is the probabilistic transition function that defines the probability that an agent transits from one state to another, given the joint action performed by all agents.
• R(s, a) is the reward function that determines the reward fed back to an agent n by the environment when the joint action a is performed in state s ∈ S.
In the distributed cognitive femtocells scenario, P cannot be deduced due to the dynamics of the wireless environment. Q-learning is one of the best-known techniques for calculating optimal policies without any prior model of the environment. Q-learning assigns to each task of each agent a Q-table whose entries are known as Q-values Q(sm, al), for each state sm ∈ S and action al ∈ A. Thus, the dimension of this table is m × l. The Q-value Q(sm, al) is defined as the expected discounted reward over an infinite time horizon when action al is performed in state sm and an optimal policy is followed thereafter [5]. The learning process of each agent n at time t can be described as follows: 1) the agent senses the environment and observes its current state $s_m^n \in S$; 2) based on $s_m^n$, the agent selects its action $a_l^n$ randomly with probability ǫ, or according to $a_l^n = \arg\max_{a \in A} Q_n^t(s_m^n, a)$ with probability 1 − ǫ, where $Q_n^t(s_m^n, a)$ is the row of the Q-table of agent n that corresponds to state $s_m^n$ at time t, and ǫ is an exploration probability that guarantees that all the state-action pairs of the Q-table are visited at least once; 3) the environment makes a transition to a new state $s_{m'}^n \in S$ and the agent receives a reward $r_n^t = R(s_m^n, a)$ due to this transition; 4) the Q-value is updated using equation (3) and the process is repeated (a minimal code sketch of this loop is given at the end of this section).

$$Q_n^{t+1}(s_m^n, a_l^n) := (1 - \alpha) Q_n^t(s_m^n, a_l^n) + \alpha \left( r_n^t + \gamma \max_{a' \in A} Q_n^t(s_{m'}^n, a') \right) \quad (3)$$

where α is the learning rate and 0 ≤ γ ≤ 1 is the discount factor that determines how much effect future rewards have on the decisions at each moment. It should be noticed that the reward $r_n^t$ depends on the joint action a of all agents, not only on the individual action $a_l$. This is the main difference between the multi-agent scenario described here and the single-agent one (when Nf = 1). In the single-agent case, one of the conditions needed to guarantee that the Q-values converge to the optimal ones is that the reward of the agent must depend only on its individual actions (i.e. the reward function is stationary for each state-action pair) [5], [11]. However, in the multi-agent scenario, the reward function is not stationary from the agent's point of view, since it now depends on the actions of the other agents. Thus, the convergence proof used for the single-agent case cannot be applied to the multi-agent one.
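The following is a minimal sketch of the single-agent learning loop described above (ǫ-greedy selection plus the update in (3)); the table indexing and the environment interface (env_step) are illustrative assumptions, not part of the paper. The default parameters match the values used later in the simulations (α = 0.5, γ = 0.9, ǫ = 0.1).

```python
import numpy as np

def select_action(Q_row, epsilon, rng):
    """ǫ-greedy selection: explore with probability ǫ, otherwise pick the argmax of the Q-row."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """Equation (3): Q(s,a) := (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + gamma * np.max(Q[s_next]))

# Illustrative loop (env_step is a hypothetical environment interface).
# rng = np.random.default_rng(0)
# Q = np.zeros((num_states, num_actions))
# s = initial_state
# for t in range(num_iterations):
#     a = select_action(Q[s], epsilon=0.1, rng=rng)
#     s_next, r = env_step(s, a)
#     q_update(Q, s, a, r, s_next)
#     s = s_next
```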
IV. POWER ALLOCATION USING Q-LEARNING

In this section, the three proposed Q-learning based power allocation algorithms are presented.

A. Subcarrier-based Distributed Power Control Using Q-learning (SBDPC-Q)

SBDPC-Q is a distributed algorithm where multiple agents (i.e. femtocells) aim at learning a sub-optimal decision policy (i.e. power allocation) by repeatedly interacting with the environment. The SBDPC-Q algorithm is proposed in two different learning paradigms:
• Independent learning (IL): In this paradigm, each agent learns independently from the other agents (i.e. it ignores the other agents' actions and considers them part of the environment). Although this may lead to oscillations and convergence problems, the IL paradigm has shown good results in many applications [8].
• Cooperative learning (CL): In this paradigm, each agent shares a portion of its Q-table with all other cooperating agents¹, aiming at enhancing the femtocells' performance. CL is performed as follows: each agent shares the row of its Q-table that corresponds to its current state with all other cooperating agents (i.e. femtocells in the same range). Then, each agent n selects its action $a_l^n$ according to the following equation (a sketch of this cooperative selection is given at the end of this subsection):

$$a_l^n = \arg\max_{a \in A} \left( \sum_{1 \le n' \le N_f} Q_{n'}(s^{n'}, a) \right) \quad (4)$$

The main idea behind this strategy is explained in detail in [9]. In terms of overhead, if the number of femtocells is Nf, then the total overhead needed is Nf·(Nf − 1) messages (each of size l) per unit time, i.e. the overhead is quadratic in the number of cooperating femtocells.
The agents, states, actions, and reward functions used for the SBDPC-Q algorithm are defined as follows:
• Agents: FBS n, ∀ 1 ≤ n ≤ Nf.
• States: At time t, for femtocell n on subcarrier k, the state is defined as $s_t^{n,k} = \{I_t^k, P_t^n\}$, where $I_t^k \in \{0, 1\}$ indicates the level of interference measured at the macro user on subcarrier k at time t:

$$I_t^k = \begin{cases} 1, & C_o^{(k)} < \Gamma_o \\ 0, & C_o^{(k)} \ge \Gamma_o \end{cases} \quad (5)$$

where $\Gamma_o$ is the target capacity determining the QoS performance of the macrocell. We assume that the macrocell reports the value of $C_o^{(k)}$ to all FBSs through the backhaul connection. $P_t^n$ defines the power level used to quantize the total power FBS n is using for transmission at time t:

$$P_t^n = \begin{cases} 0, & \sum_{k=1}^{K} p_t^{n,k} < (P_{max}^f - A1) \\ 1, & (P_{max}^f - A2) \le \sum_{k=1}^{K} p_t^{n,k} \le P_{max}^f \\ 2, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (6)$$

where A1 and A2 are arbitrarily selected thresholds (several values for A1 and A2, as well as more power levels, were tried in the simulations, and the performance gain between these values was marginal).
• Actions: The action here is scalar, where the set of actions available to each FBS is defined as the set of possible powers that an FBS can use for transmission on each subcarrier. In the simulations, a range from −20 dBm to $P_{max}^f$ dBm with a step of 2 dBm is used.
• Reward Functions: The reward fed back to agent n on subcarrier k at time t is defined as:

$$r_t^{n,k} = \begin{cases} e^{-(C_o^{(k)} - \Gamma_o)^2} - e^{-C_n^{(k)}}, & \sum_{k=1}^{K} p_t^{n,k} \le P_{max}^f \\ -2, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (7)$$

The rationale behind this reward function is that each femtocell aims at maximizing its own capacity while: 1) maintaining the capacity of the macrocell around the target capacity $\Gamma_o$ (convergence is assumed to be within a range of ±1 bits/sec/Hz from $\Gamma_o$), and 2) not exceeding the allowed $P_{max}^f$. This reward function was compared to the reward function defined in [9]:

$$r_t^{n,k} = \begin{cases} e^{-(C_o^{(k)} - \Gamma_o)^2}, & \sum_{k=1}^{K} p_t^{n,k} \le P_{max}^f \\ -1, & \sum_{k=1}^{K} p_t^{n,k} > P_{max}^f \end{cases} \quad (8)$$

where it was shown that both reward functions maintain the capacity of the macrocell within the convergence range; however, reward function (7) achieves higher aggregate femtocell capacity. In this paper, we show another advantage of reward function (7): it learns (explores and reacts to network dynamics) better than reward function (8) even when the exploration parameter ǫ is not used. This mainly depends on the initial values of the Q-values. In this paper, we initialize all Q-values to zero. Thus, when ǫ is not used, reward function (8) always feeds the agent back a positive reward (given that $P_{max}^f$ is not exceeded). Hence, if agent n was initially in state $s_t^{n,k}$ on subcarrier k and took action $p_t^{n,k}$, the Q-value of this action, $Q_n(s_t^{n,k}, p_t^{n,k})$, is updated using a positive reward, so this Q-value increases with time and agent n keeps using the same action forever (since the action is chosen according to the maximum Q-value). Thus, using ǫ with reward function (8) is a must in order to obtain good exploration behavior. On the other hand, reward function (7) may feed the agent back positive or negative rewards ($e^{-(C_o^{(k)} - \Gamma_o)^2}$ could be smaller than $e^{-C_n^{(k)}}$). Thus, given the same initial conditions, agent n could receive a negative reward after taking action $p_t^{n,k}$, leading to a decrease of its Q-value with time. Once the Q-value decreases below zero, the agent takes another action whose Q-value is greater than the decreased one. Thus, reward function (7) learns (explores) better than reward function (8).
In this paper, we also evaluate the robustness and scalability of SBDPC-Q in both the IL and CL paradigms. We believe that the CL paradigm is much more robust and scalable against network dynamics than the IL paradigm. The reason is that after sharing the rows of the Q-tables, each femtocell knows the states that all other cooperating femtocells occupy, and since a state at a certain moment can be defined as how the agent sees the environment at that moment, each femtocell can implicitly know 1) how all other femtocells will react to the network dynamics, and 2) what actions the other femtocells are going to take. However, if the femtocells take their actions independently (i.e. the IL paradigm), even after knowing each other's states, oscillating behaviors that may not reach convergence can be generated. One way to overcome this problem is to force the femtocells to make use of the shared information when taking their actions (i.e. taking the actions cooperatively, equation (4)). This decreases the oscillations in the system, making the femtocells more robust to the increase in the number of deployed femtocells and to the sudden effect caused by any newly deployed femtocell.

¹We assume that the shared row of the Q-table is put in the control bits of the packets transmitted between the femtocells. The details of the exact protocol lie outside the scope of this paper.
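To make the cooperative selection step (4) and the per-subcarrier reward (7) concrete, a minimal sketch is given below; the way the shared rows are passed around (a list of arrays) and the helper names are our own assumptions about the implementation.

```python
import numpy as np

def cooperative_action(own_row, shared_rows):
    """Eq. (4): sum the Q-rows shared by the cooperating femtocells
    (each row corresponds to the sender's current state) and pick the argmax."""
    total = np.array(own_row, dtype=float)
    for row in shared_rows:          # one row per cooperating femtocell
        total += np.asarray(row, dtype=float)
    return int(np.argmax(total))

def sbdpc_reward(C_o_k, C_n_k, total_power, gamma_o, p_max_f):
    """Eq. (7), i.e. R1: reward of femtocell n on subcarrier k."""
    if total_power > p_max_f:
        return -2.0
    return np.exp(-(C_o_k - gamma_o) ** 2) - np.exp(-C_n_k)
```

Note that because the reward in (7) can be negative, a zero-initialized Q-value can be driven below zero, which is exactly the exploration argument made above.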
B. Femto-based Distributed Power Control Using Q-learning (FBDPC-Q)

FBDPC-Q is a multi-agent algorithm whose states, actions, and reward functions are defined for each femtocell over all subcarriers. FBDPC-Q gives the operator the flexibility to work on a global basis (e.g. aggregate femtocell capacity instead of the per-subcarrier femtocell capacity used in SBDPC-Q). Thus, it makes it easier to compare the performance of the SBDPC-Q algorithm to the globally optimal values. Like SBDPC-Q, FBDPC-Q works in both the IL and CL paradigms. The agents, states, actions, and reward functions used for the FBDPC-Q algorithm are defined as follows:
• Agents: FBS n, ∀ 1 ≤ n ≤ Nf.
• States: At time t, the state is defined as $s_t = \{I_t\}$, where $I_t \in \{0, 1\}$ indicates the level of interference measured at the macro user over all subcarriers at time t:

$$I_t = \begin{cases} 1, & C_o < \beta_o \\ 0, & C_o \ge \beta_o \end{cases} \quad (9)$$

where $C_o = \sum_{k=1}^{K} C_o^{(k)}$ is the aggregate macrocell capacity and $\beta_o$ is the target aggregate macrocell capacity.
• Actions: For FBS n, the set of actions is defined as a set of vectors, where each vector represents the powers FBS n uses on all subcarriers (an enumeration sketch is given after this subsection).
• Reward Functions: Reward function (7) can be redefined as:

$$r_t^n = e^{-(C_o - \beta_o)^2} - e^{-C_n} \quad (10)$$

where $C_n = \sum_{k=1}^{K} C_n^{(k)}$ is the aggregate capacity of FBS n. Note that since FBDPC-Q is not subcarrier based, a power vector in which $P_{max}^f$ is exceeded will never be assigned to any FBS. Thus, there is no need for a negative reward here, unlike in the SBDPC-Q case. The same holds for CPC-Q.
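The sketch below enumerates the FBDPC-Q action vectors from the small power set {0, 6, 12} dBm that the simulations use for FBDPC-Q and CPC-Q, keeping only vectors that respect $P_{max}^f$; it also computes the aggregate reward (10). Treating the dBm levels as absolute powers (0 dBm = 1 mW) is our own interpretation, and the helper names are assumptions for this sketch. The enumeration illustrates why |A| grows exponentially with K (and, for CPC-Q, with Nf·K).

```python
import itertools
import numpy as np

# Power set used for FBDPC-Q/CPC-Q in the simulations (dBm), per the paper's footnote.
POWER_LEVELS_DBM = [0, 6, 12]

def dbm_to_watt(p_dbm):
    return 10 ** ((p_dbm - 30) / 10.0)

def fbdpc_actions(K, p_max_f_dbm=15):
    """Enumerate FBDPC-Q action vectors: one power level per subcarrier,
    keeping only vectors whose total power respects P_max^f.
    The unfiltered set has len(POWER_LEVELS_DBM)**K elements, i.e. exponential in K."""
    p_max_w = dbm_to_watt(p_max_f_dbm)
    return [combo for combo in itertools.product(POWER_LEVELS_DBM, repeat=K)
            if sum(dbm_to_watt(p) for p in combo) <= p_max_w]

def fbdpc_reward(C_o, C_n, beta_o):
    """Eq. (10): aggregate reward of femtocell n."""
    return np.exp(-(C_o - beta_o) ** 2) - np.exp(-C_n)

# Example: for K = 3 this yields 20 feasible vectors per femtocell.
# print(len(fbdpc_actions(3)))
```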
C. Centralized Power Control Using Q-learning (CPC-Q)

CPC-Q is a centralized power control algorithm used to evaluate the performance of our proposed FBDPC-Q algorithm. CPC-Q can be regarded as the single-agent version of FBDPC-Q, and hence its convergence to the optimal Q-values, and thus to the optimal powers, is guaranteed. However, using a centralized controller is not feasible in terms of overhead in multi-agent scenarios. Thus, CPC-Q works only for small-scale problems. The agent, states, actions, and reward functions used for CPC-Q are defined as follows:
• Agents: A centralized controller.
• States: The same as FBDPC-Q.
• Actions: For the central controller, the set of actions is defined as a set of matrices, where each matrix represents the powers of all femtocells over all subcarriers. However, the size of this set grows exponentially with both the number of femtocells and the number of subcarriers. Thus, forming the matrices (all possible actions) from a large set of powers such as the one used in SBDPC-Q would be infeasible².
• Reward Functions: Since CPC-Q is global, reward function (7) can be redefined as:

$$r_t = e^{-(C_o - \beta_o)^2} - e^{-C_{femto}} \quad (11)$$

where $C_{femto} = \sum_{n=1}^{N_f} \sum_{k=1}^{K} C_n^{(k)}$.
Finally, for the rest of the paper, reward functions (7), (10) and (11) will be referred to as R1, while reward function (8) will be referred to as R0. The three proposed algorithms are compared qualitatively in Table I.

TABLE I. TAXONOMY OF THE PROPOSED ALGORITHMS
Complexity: SBDPC-Q/IL: the action is scalar; SBDPC-Q/CL: the action is scalar; FBDPC-Q: |A| grows exponentially with K; CPC-Q: |A| grows exponentially with Nf and K.
Reaction to network dynamics: SBDPC-Q/IL: inefficient and non-robust; SBDPC-Q/CL: efficient and robust; FBDPC-Q: CL is more efficient and robust than IL; CPC-Q: not applicable.
Scalability: SBDPC-Q/IL: inefficient at large Nf; SBDPC-Q/CL: efficient at large Nf; FBDPC-Q: CL is more scalable than IL; CPC-Q: infeasible at large Nf.
Speed of convergence: SBDPC-Q/IL: medium; SBDPC-Q/CL: fast; FBDPC-Q: CL is faster than IL; CPC-Q: slow, since |A| is huge.
Overhead: SBDPC-Q/IL: none; SBDPC-Q/CL: Nf² − Nf messages, each of size |A|; FBDPC-Q: CL has larger overhead than IL; CPC-Q: huge.

²In the simulations, the set of powers used to form the matrices and the vectors in CPC-Q and FBDPC-Q, respectively, is {0, 6, 12} dBm.

V. PERFORMANCE EVALUATION

A. Simulation Scenario

We consider a wireless network consisting of one macrocell serving Um = 1 macro user, underlaid with Nf femtocells. Each femtocell serves Uf = 1 femto user, which is randomly located in the femtocell coverage area. All of the macro and femto cells share the same frequency band composed of K subcarriers, where orthogonal downlink transmission is assumed. In the simulations, K changes according to the algorithm used: for SBDPC-Q, K = 6, while for both CPC-Q and FBDPC-Q, K = 3. The channel gain between any transmitter i and any receiver j on subcarrier k is assumed to be path-loss dominated and is given by:

$$h_{ij}^{(k)} = d_{ij}^{-PL} \quad (12)$$

where $d_{ij}$ is the physical distance between transmitter i and receiver j, and PL is the path-loss exponent. In the simulations, PL = 2 is used. The distances are calculated according to the following assumptions: 1) the maximum distance between the MBS and its associated user is set to 1000 meters, 2) the maximum distance between the MBS and a femto user is set to 800 meters, 3) the maximum distance between an FBS and its associated user is set to 80 meters, 4) the maximum distance between an FBS and another femtocell's user is set to 300 meters, and 5) the maximum distance between an FBS and the macro user is set to 800 meters. We used MATLAB on a cluster computing facility with 300 cores to simulate this scenario, where we set the noise power σ² to 10⁻⁷, the maximum transmission power of the macrocell $P_{max}^m$ to 43 dBm, the maximum transmission power of each femtocell $P_{max}^f$ to 15 dBm, each of the power-level thresholds A1 and A2 to 5 dBm, the learning rate α to 0.5, the discount factor γ to 0.9, and the exploration parameter ǫ to 0.1 [7], [9].
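Under the stated assumptions (path-loss exponent 2 and the distance bounds above), a small sketch of the channel-gain setup of equation (12) could look as follows; sampling the distances uniformly within the stated maxima is our own guess for how positions are drawn, not something the paper specifies.

```python
import numpy as np

NOISE_POWER = 1e-7        # sigma^2 from the simulations
PATH_LOSS_EXPONENT = 2    # PL = 2 in the simulations

def channel_gain(distance_m, pl=PATH_LOSS_EXPONENT):
    """Eq. (12): path-loss dominated channel gain h_ij = d_ij^(-PL)."""
    return distance_m ** (-pl)

def draw_gains(num_femto, rng):
    """Draw illustrative channel gains using the maximum-distance assumptions of the paper.
    Uniform sampling of the distances (and the 1 m / 10 m lower bounds) is our own assumption."""
    d_fbs_user = rng.uniform(1, 80, size=num_femto)      # FBS -> its own user (<= 80 m)
    d_fbs_macro = rng.uniform(10, 800, size=num_femto)   # FBS -> macro user (<= 800 m)
    d_mbs_macro = rng.uniform(10, 1000)                  # MBS -> macro user (<= 1000 m)
    return (channel_gain(d_fbs_user),
            channel_gain(d_fbs_macro),
            channel_gain(d_mbs_macro))

# Example: gains for 5 femtocells.
# rng = np.random.default_rng(0)
# h_ff, h_fm, h_mm = draw_gains(5, rng)
```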
B. Numerical Results

Figure 1(a) shows the aggregate femtocell capacity (as a function of the number of femtocells) using CPC-Q and FBDPC-Q with R1 in both the IL and CL paradigms. It can be observed that CL is much better than IL, and the figure shows that the aggregate capacity gain of CPC-Q over FBDPC-Q in the CL case is marginal. Since CPC-Q is the single-agent version of FBDPC-Q, it should converge to the globally optimal values. This is shown in the figure for small numbers of femtocells (Nf = 1 and 2). The optimal values are calculated using exhaustive search over all possible actions, where the optimal value is defined as the maximum aggregate capacity the system can achieve while maintaining the capacity of the macrocell within the convergence range (±1 bits/sec/Hz from $\beta_o$). However, starting from Nf = 3, CPC-Q begins to be infeasible, since the size of the possible action set A becomes very large (at Nf = 5, |A| = 3,200,000). So, besides the computational problems, the condition of visiting all state-action pairs becomes infeasible. Thus, obtaining the optimal value is also not feasible (which is why we stopped CPC-Q at Nf = 4).

Fig. 1. Aggregate femtocells capacity using CPC-Q and FBDPC-Q with R1 in both IL and CL paradigms: (a) aggregate femtocells capacity versus the number of femtocells; (b) aggregate femtocells capacity versus learning iterations.

Note also that we stopped the exhaustive search at Nf = 5 due to complexity and memory problems, while FBDPC-Q is shown at Nf = 6 and 7 just to illustrate the continuity of our algorithm. Figures 2 and 3 show the robustness of the proposed SBDPC-Q algorithm. In these figures, we started with Nf = 5, then added a new femtocell every 4000 iterations³ to reach Nf = 29 at the 96000th iteration. Finally, we added another femtocell at the 99000th iteration. The figures show that SBDPC-Q using the CL paradigm is more robust to the deployment of new femtocells than the IL paradigm. Moreover, in these figures we compare the performance of SBDPC-Q to the docitive idea presented in [7]. We investigated two cases: 1) the already deployed femtocells share their Q-tables with the new femtocells when they first join the system (suffixed with "share" in the figures), and 2) the newly deployed femtocells start with zero-initialized Q-tables (suffixed with "scratch" in the figures). Figure 2 shows the macrocell convergence on a certain subcarrier using SBDPC-Q with R1 in both the IL and CL paradigms, where it can be observed that the CL paradigm maintains the macrocell capacity within the convergence range (6 ± 1 bits/sec/Hz) and reacts well to the effect of the newly deployed femtocells, without the need for a new learning phase every time a new femtocell is deployed, which is a very interesting observation. It can also be observed that our proposed CL paradigm converges to the same value regardless of whether the already deployed femtocells share their Q-tables with the new ones or not. So, sharing could be ignored, thus decreasing the overall overhead. On the other hand, the IL paradigm reacts very badly to the network dynamics, where 1) convergence is not attained (i.e. an oscillating behavior is generated), and 2) as Nf increases, the IL paradigm may push the macrocell capacity out of the convergence range when the network becomes denser. Thus, CL is more scalable than IL. However, it can be noticed that the docitive idea is useful in the IL paradigm, where sharing the Q-tables of the already deployed femtocells with the new ones is much better (in terms of the value around which the macrocell capacity oscillates) than starting with zero-initialized (scratch) Q-tables.

Fig. 2. Convergence of the macrocell capacity over the Q-iterations on a certain subcarrier, where Nf was initially 5 and then incremented until Nf = 30.
Fig. 3. Aggregate femtocells capacity over the Q-iterations, where Nf was initially 5 and then incremented until Nf = 30.

In terms of speed of convergence, it can be noticed that, although the learning process may need a large number of iterations initially, CL reduces the dynamics of the learning process and hence makes it faster. This can be noticed from Figure 2, where CL converged almost at the 300th iteration, which is much earlier than the IL paradigm. Also, after the deployment of each new femtocell, CL took fewer than 10 iterations (only around 0.01 seconds) to re-achieve convergence. Figure 3 shows the aggregate femtocell capacity over the learning iterations. It can be noticed that with the CL paradigm the aggregate capacity increases as more femtocells are deployed in the network, while in the IL paradigm, since convergence is not maintained, the aggregate capacity shows a sporadic behavior, which clearly indicates that IL does not react efficiently to the network dynamics. However, in the CL paradigm, from the 64000th to the 72000th iteration (the 640th to 720th according to the figure's scale), it can be noticed that the aggregate capacity decreases.
The reason is that as more femtocells are deployed, the network becomes very dense, and since the CL paradigm makes the cooperating femtocells use the same powers, this may force the macrocell capacity to violate the convergence range. Thus, all the femtocells have to decrease their transmission powers to bring the macrocell capacity back within the convergence range, leading to a decrease in their aggregate capacity. Note that at the 64000th and 72000th iterations, ǫ has already been removed, which shows that R1 learns well even when ǫ is removed. Finally, in order to compare the aggregate capacity achieved by the CL paradigm after the incremental deployment of femtocells to the ideal value, we used the small-scale problem again. This is shown in Figure 1(b), where we started with Nf = 2 and added an extra femtocell at the 8000th, 12000th and 13500th iterations⁴. Again, it can be observed that CL achieves an aggregate capacity that is very close to the optimal one, while the IL paradigm is far from it.

VI. CONCLUSION

In this paper, three Q-learning based power allocation algorithms for the cognitive femtocell scenario were presented: SBDPC-Q, FBDPC-Q, and CPC-Q. Although SBDPC-Q was presented in previous work as DPC-Q, in this paper it was extended, in both of its learning paradigms (IL and CL), to evaluate its performance, robustness, and scalability. In terms of performance, SBDPC-Q was extended to FBDPC-Q and then compared to CPC-Q, where the simulations showed that the CL paradigm outperforms IL and achieves an aggregate femtocell capacity that is very close to the optimal one. In terms of robustness, the CL paradigm was found to be much more robust against the deployment of new femtocells during the learning process, where the results showed that the CL paradigm outperforms the IL paradigm in: 1) maintaining convergence, 2) learning better (i.e. reacting better to the network dynamics), especially when a suitable reward function such as the one defined in the simulations is used, 3) converging to the target capacity regardless of whether the old femtocells share their experience (i.e. Q-tables) with the newly deployed ones or not, and 4) speeding up convergence. Finally, in terms of scalability, the CL paradigm reacted better to the network dynamics and maintained convergence even when the number of femtocells is large.

REFERENCES

[1] V. Chandrasekhar, J. Andrews, and A. Gatherer, "Femtocell networks: a survey," IEEE Communications Magazine, vol. 46, September 2008.
[2] S. Saunders, S. Carlaw, A. Giustina et al., Femtocells: Opportunities and Challenges for Business and Technology. Great Britain: John Wiley and Sons Ltd, 2009.
[3] P. Xia, V. Chandrasekhar, and J. G. Andrews, "Open vs closed access femtocells in the uplink," CoRR, vol. abs/1002.2964, 2010.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[5] J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, 1992.
[6] J. R. Kok and N. Vlassis, "Collaborative multiagent reinforcement learning by payoff propagation," J. Mach. Learn. Res., vol. 7, December 2006.
[7] A. Galindo-Serrano, L. Giupponi, and M. Dohler, "Cognition and docition in OFDMA-based femtocell networks," in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), May 2010.
[8] A. Galindo-Serrano and L. Giupponi, "Distributed Q-learning for interference control in OFDMA-based femtocell networks," in Proceedings of the 71st IEEE Vehicular Technology Conference, 2010.
[9] H. Saad, A. Mohamed, and T. ElBatt, "Distributed cooperative Q-learning for power allocation in cognitive femtocell networks," in Proceedings of the IEEE 76th Vehicular Technology Conference, Sep. 2012.
[10] A. Burkov and B. Chaib-draa, "Labeled initialized adaptive play Q-learning for stochastic games," in Proceedings of the AAMAS'07 Workshop on Adaptive and Learning Agents (ALAg'07), May 2007.
[11] F. S. Melo, "Convergence of Q-learning: A simple proof," Institute of Systems and Robotics, Tech. Rep., 2001.

³In Figures 2 and 3, ǫ is removed at the 50000th iteration and the figures were drawn with a step of 100 in order to achieve better resolution.
⁴In Figure 1(b), ǫ is removed at the 12000th iteration and the figure was drawn with a step of 10 in order to achieve better resolution.