Abstract
We evaluate the performance of the Whittle index policy for restless Markovian bandits. It is shown in Weber and Weiss (J Appl Probab 27(3):637–648, 1990) that if the bandit is indexable and the associated deterministic system has a globally attracting fixed point, then the Whittle index policy is asymptotically optimal in the regime where the number of arms that can be activated grows proportionally with the total arm population. In this paper, we show that, under the same conditions, the convergence to optimality is exponentially fast in the arm population, unless the fixed point is singular (to be defined later), which almost never happens in practice. Our result holds for the continuous-time model of Weber and Weiss (1990) and for a discrete-time model in which all arms make synchronous transitions. Our proof rests on the nature of the deterministic dynamics governing the stochastic system: we show that it is a piecewise affine continuous dynamical system inside the simplex of the empirical measure of the arms. Using simulations and numerical solvers, we also investigate the singular cases, as well as how the level of singularity influences the (exponential) convergence rate. We illustrate our theorem on a Markovian fading channel model.
Notes
The most efficient algorithm to test indexability and compute the index can be found in Gast et al. [13]. For a given model with d states, the complexity of this algorithm is \(o(d^3)\).
If two or more states had the same index, specifying an index policy would require a tie-breaking rule. Our proof would still work provided the tie-breaking rule defines a strict order on the states.
The code and parameters to reproduce all experiments and figures of the paper are available in a Git repository https://gitlab.inria.fr/phdchenyan/code_ap2021.
Recall that \(\phi \) is an application from \(\Delta ^d\) to \(\Delta ^d\). This means in particular that all the rows of each matrix \(\textbf{K}_i\) sum to 1. Therefore, each of these matrices has an eigenvalue 1. When we write "the norm of all eigenvalues of \(\textbf{K}_i\) is smaller than 1", we mean that 1 is an eigenvalue of \(\textbf{K}_i\) of multiplicity one, and that all other eigenvalues are of norm strictly less than 1.
In what follows, we write \(-0.4 \dots \) to mean a number that approaches \(-0.4\).
We refer to our Git repository for a more thorough numerical exploration of this case.
The rates \(Q^{a^n(t)}_{ij}\) are well defined because the arms evolve independently and the probability that two arms make a transition at the same time is 0.
References
Aalto, S., Lassila, P., Osti, P.: Whittle index approach to size-aware scheduling with time-varying channels. In: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 57–69 (2015)
Ansell, P., Glazebrook, K.D., Niño-Mora, J., et al.: Whittle’s index policy for a multi-class queueing system with convex holding costs. Math. Methods Oper. Res. 57(1), 21–39 (2003)
Avrachenkov, K.E., Borkar, V.S.: Whittle index policy for crawling ephemeral content. IEEE Trans. Control Netw. Syst. 5(1), 446–455 (2016)
Blondel, V.D., Bournez, O., Koiran, P., et al.: The stability of saturated linear dynamical systems is undecidable. J. Comput. Syst. Sci. 62(3), 442–462 (2001)
Brown, D.B., Smith, J.E.: Index policies and performance bounds for dynamic selection problems. Manag. Sci. 66, 3029–3050 (2020)
Darling, R., Norris, J.: Differential equation approximations for Markov chains. Probab. Surv. 5, 37–79 (2008)
Duff, M.O.: Q-learning for bandit problems. In: Proceedings of the Twelfth International Conference on International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML’95, pp. 209–217 (1995)
Duran, S., Verloop, M.: Asymptotic optimal control of Markov-modulated restless bandits. In: International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2018), vol 2. ACM: Association for Computing Machinery, Irvine, US, pp. 7:1–7:25 (2018)
Gast, N.: Expected Values Estimated via Mean-Field Approximation are 1/N-Accurate. In: ACM SIGMETRICS/ International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’17, Urbana-Champaign, United States, p. 26 (2017)
Gast, N., Van Houdt, B.: A refined mean field approximation. In: Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(28) (2017)
Gast, N., Bortolussi, L., Tribastone, M.: Size expansions of mean field approximation: transient and steady-state analysis. In: 2018–36th International Symposium on Computer Performance, Modeling, Measurements and Evaluation, Toulouse, France, pp. 1–2 (2018)
Gast, N., Latella, D., Massink, M.: A refined mean field approximation of synchronous discrete-time population models. Perform. Eval. 126, 1–21 (2018)
Gast, N., Gaujal, B., Khun, K.: Computing Whittle (and Gittins) index in subcubic time. arXiv preprint arXiv:2203.05207 (2022)
Gast, N., Gaujal, B., Yan, C.: LP-based policies for restless bandits: necessary and sufficient conditions for (exponentially fast) asymptotic optimality (2022)
Gittins, J., Glazebrook, K., Weber, R.: Multi-armed Bandit Allocation Indices. John Wiley & Sons, Hoboken (2011)
Gittins, J.C.: Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B 148–177 (1979)
Hodge, D.J., Glazebrook, K.D.: On the asymptotic optimality of greedy index heuristics for multi-action restless bandits. Adv. Appl. Probab. 47(3), 652–667 (2015)
Hu, W., Frazier, P.: An asymptotically optimal index policy for finite-horizon restless bandits (2017)
Kifer, Y.: Random Perturbations of Dynamical Systems. Progress in Probability. Birkhäuser, Boston (1988)
Kurtz, T.G.: Strong approximation theorems for density dependent Markov chains. Stoch. Process. Appl. 6(3), 223–240 (1978)
Larranaga, M., Ayesta, U., Verloop, I.M.: Dynamic control of birth-and-death restless bandits: application to resource-allocation problems. IEEE/ACM Trans. Netw. 24(6), 3812–3825 (2016)
Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press, Cambridge (2020)
Liu, K., Zhao, Q.: Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 56(11), 5547–5567 (2010)
Meshram, R., Manjunath, D., Gopalan, A.: On the whittle index for restless multiarmed hidden Markov bandits. IEEE Trans. Autom. Control 63(9), 3046–3053 (2018)
Niño-Mora, J., Villar, S.S.: Sensor scheduling for hunting elusive hiding targets via Whittle’s restless bandit index policy. In: International Conference on NETwork Games, Control and Optimization (NetGCooP 2011). IEEE, pp. 1–8 (2011)
Ouyang, W., Eryilmaz, A., Shroff, N.B.: Asymptotically optimal downlink scheduling over Markovian fading channels. In: 2012 Proceedings IEEE INFOCOM, IEEE, pp. 1224–1232 (2012)
Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of optimal queuing network control. Math. Oper. Res. 293–305 (1999)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. John Wiley & Sons Inc, New York (1994)
Raghunathan, V., Borkar, V., Cao, M., et al.: Index policies for real-time multicast scheduling for wireless broadcast systems. In: IEEE INFOCOM 2008-The 27th Conference on Computer Communications, IEEE, pp. 1570–1578 (2008)
Verloop, M.: Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab. 26(4), 1947–1995 (2016)
Weber, R.R., Weiss, G.: On an index policy for restless bandits. J. Appl. Probab. 27(3), 637–648 (1990)
Weber, R.R., Weiss, G.: Addendum to: On an index policy for restless bandits. Adv. Appl. Probab. 23(2), 429–430 (1991)
Whittle, P.: Restless bandits: activity allocation in a changing world. J. Appl. Probab. 25A, 287–298 (1988)
Ying, L.: Stein’s method for mean field approximations in light and heavy traffic regimes. POMACS 1(1), 1–27 (2017)
Zayas-Caban, G., Jasin, S., Wang, G.: An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv. Appl. Probab. 51(3), 745–772 (2019)
Zhang, X., Frazier, P.I.: Restless bandits with many arms: beating the central limit theorem (2021)
Zhang, X., Frazier, P.I.: Near-optimality for infinite-horizon restless bandits with many arms. arXiv preprint arXiv:2203.15853 (2022)
Acknowledgements
This work was supported by the ANR project REFINO (ANR-19-CE23-0015). Chen Yan would like to express his gratitude to Maaike Verloop for her warm hospitality at Toulouse INP during November 2019, and for the numerous engaging discussions on Whittle indices throughout his stay there.
Appendices
A Proof of Theorem 1
Proof
Let \(\textbf{m}^*\) be the fixed point of \(\phi \). As \(\textbf{P}^0\) and \(\textbf{P}^1\) are rational, each coordinate of \(\textbf{m}^*\) is a rational number. Let \(\{ N_k \}_{k \ge 0}\) be an increasing sequence of integers going to \(\infty \), such that for all \(k \ge 0\) and all \(1 \le i \le d\), \(m^*_i N_k\) and \(\alpha N_k\) are integers. We then fix an N from this sequence \(\{ N_k \}_{k \ge 0}\). Recall that \(m_i N\) is the number of arms in state i in configuration \(\textbf{m}\) and that \(S_n(t)\) is the state of arm n at time t. We use \(\textbf{S}(t)\) to denote the state vector of the N-arm system at time t. Let \(\textbf{S}^*\) be a state vector corresponding to configuration \(\textbf{m}^*\) with N arms. This is possible because \(m^*_i N\) is an integer for all \(i\in \{1, \dots , d\}\).
Note that in configuration \(\textbf{m}^*\) (i.e., state vector \(\textbf{S}^*\)), an optimal action \(\textbf{a}^*\) under the relaxed constraint (5) activates exactly \(\alpha N\) arms. As \(\textbf{a}^*\) may be sub-optimal for the original N-arm problem (2)-(3), we have
where in the above equation the function \(V:\textbf{S}\rightarrow \mathbb {R}\) is the bias of the MDP. The first line corresponds to Bellman’s equation (see e.g., Equation 8.4.2 in Chapter 8 of Puterman [28]), the second line is because \(\textbf{a}^*\) is a valid action for the N-arms MDP but might not be the optimal action, and the last line is because \(\sum _{n=1}^{N} R^{a^*_n}_{S^*_n}=V^{(N)}_{\textrm{rel}}(\alpha )=N V^{(1)}_{\textrm{rel}}(\alpha )\).
We hence obtain
In the following, we bound \(\mathbb {E}_{\textbf{a}^*}[V\big (\textbf{S}(1)\big ) - h(\textbf{S}^*)]\). This will be achieved in two steps.
Step One
We define for two state vectors \(\textbf{y}\), \(\textbf{z}\) the distance
which counts the number of arms (among the N arms) that are in different states in the two vectors. This distance satisfies the following property: for all \(\textbf{y}\) and \(\textbf{z}\) such that \(\delta (\textbf{y},\textbf{z}) = k\), we can find a sequence of state vectors \(\textbf{z}_1, \textbf{z}_2,\ldots , \textbf{z}_{k-1}\) satisfying \(\delta (\textbf{y},\textbf{z}_1) = \delta (\textbf{z}_1, \textbf{z}_2)=\cdots = \delta (\textbf{z}_{k-2},\textbf{z}_{k-1}) = \delta (\textbf{z}_{k-1}, \textbf{z}) = 1\). In what follows, we show that there exists \(C>0\), independent of N, such that for all state vectors \(\textbf{y}\) and \(\textbf{z}\), the bias function h(.) satisfies:
In view of the above property of \(\delta \), we only need to prove this for \(\delta (\textbf{y},\textbf{z})=1\), i.e.,
Let \(\textbf{y}\), \(\textbf{z}\) be two state vectors such that \(\delta (\textbf{y},\textbf{z})=1\), and assume without loss of generality that it is arm 1 that is in a different state: \(y_1\ne z_1\) and \(y_n=z_n\) for \(n\in \{2\dots N\}\). We use a coupling argument as follows: We consider two trajectories of the N-arm system, \(\textbf{Y}\) and \(\textbf{Z}\), that start, respectively, in state vectors \(\textbf{Y}(0)=\textbf{y}\) and \(\textbf{Z}(0)=\textbf{z}\). Let \(\pi ^*\) be the optimal policy of the N-arm MDP, and suppose that we apply \(\pi ^*\) to the trajectory \(\textbf{Z}\). At time t, the action vector will be \( \pi ^*(\textbf{Z}(t))\). We couple the trajectories \(\textbf{Y}\) and \(\textbf{Z}\) by applying the same action vectors \(\pi ^*(\textbf{Z}(t))\) to \(\textbf{Y}\) and keeping \(Y_n(t)=Z_n(t)\) for arms \(n\in \{2\dots N\}\). The \(\textbf{Z}\) trajectory follows an optimal trajectory, hence Bellman’s equation is satisfied: for any \(T>0\), we have:
Since \(\textbf{Y}\) follows a possibly sub-optimal trajectory, we have:
Recall that the matrices \(\textbf{P}^0,\textbf{P}^1\) are such that each arm is recurrent and aperiodic. Hence, the mixing time of a single arm is bounded (independently of N): for any policy \(\pi \in \Pi \)
Because of the coupling, for \(0 \le t \le T\) and \(1 \le n \le N\), \(Y_n(t) \ne Z_n(t)\) is only possible for \(n=1\). Furthermore, as the mixing time of an arm is bounded, for T large enough there is a positive probability, say at least \(p > 0\), that \(Y_1 (T) = Z_1 (T)\). Hence, with probability at most \(1-p\), we have \(\delta \big (\textbf{Y}(T), \textbf{Z}(T)\big ) = 1\), conditional on \(\textbf{Y}(0)= \textbf{y}\) and \(\textbf{Z}(0)= \textbf{z}\).
Let \(r:= 2 \max _{1 \le i \le d, a\in \{0,1\}} | R^a_i |\). Subtracting (17) from (18) gives
This being true for all \(\textbf{y}, \textbf{z}\) with \(\delta (\textbf{y},\textbf{z})=1\), it implies that \(\max _{\textbf{U},\textbf{V}: \ \delta (\textbf{U},\textbf{V})=1} \big \{ |h(\textbf{U})-h(\textbf{V})| \big \} \le T \cdot r / p\), and we can take the constant \(C:=T \cdot r/p\).
Step Two
Recall that the state vector \(\textbf{S}^*\) corresponds to the optimal (relaxed) configuration \(\textbf{m}^*\). We now prove that
with a constant D independent of N, where \( \textbf{S}(1) \) is the random vector conditional on \(\textbf{S}(0) = \textbf{S}^*\) under action vector \(\textbf{a}^*\).
Indeed, let \(\textbf{x}^*:= \textbf{m}^* N\), and let \(\textbf{X}:= \textbf{m}(1) N\) be the random d-dimensional vector, where \(\textbf{m}(1)\) is the random configuration corresponding to \(\textbf{S}(1)\). For each \(1 \le i \le d\), we may write
where \(B_{i,j}^a \sim Binomial (x^*_{j,a}, P^{a}_{ji})\) for \(1 \le j \le d\) and \(a \in \{ 0,1 \}\), with \(x^*_{j,0} + x^*_{j,1} = x^*_j\); here \(x^*_{j,a}\) is the number of arms in state j taking action a when the optimal action vector \(\textbf{a}^*\) is applied to state vector \(\textbf{S}^*\).
By stationarity, we have
and
Consequently, we can bound
with a constant D independent of N.
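The \(\mathcal {O}(1/\sqrt{N})\) bound just obtained can be illustrated numerically. The sketch below uses placeholder kernels \(\textbf{P}^0,\textbf{P}^1\) and an arbitrary action split (none of these come from the paper); it samples the next configuration by drawing, for each (state, action) group, a multinomial vector, which is the joint version of the marginal binomials \(B_{i,j}^a\) above, and checks that the mean deviation from the deterministic image shrinks like \(1/\sqrt{N}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder kernels (illustrative only, not from the paper).
P0 = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
P1 = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])

def mean_deviation(N, reps=2000):
    """Average L1 deviation of one synchronous transition of N arms."""
    # x[j, a]: number of arms in state j taking action a (arbitrary split).
    x = np.array([[30, 10], [20, 20], [10, 10]]) * (N // 100)
    assert x.sum() == N
    # Deterministic image: each group (j, a) spreads according to row j of P^a.
    target = sum(x[j, a] * (P1 if a else P0)[j]
                 for j in range(3) for a in (0, 1)) / N
    devs = np.empty(reps)
    for r in range(reps):
        X = np.zeros(3)
        for j in range(3):
            for a, P in enumerate((P0, P1)):
                X += rng.multinomial(x[j, a], P[j])
        devs[r] = np.abs(X / N - target).sum()
    return devs.mean()

d100, d1600 = mean_deviation(100), mean_deviation(1600)
print(d100, d1600)  # N multiplied by 16: deviation shrinks by roughly 4
```

With the seed fixed, the ratio of the two deviations is close to \(\sqrt{16}=4\), as the bound predicts.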
To summarize, we have
hence
which implies that \(V^{(N)}_{\textrm{opt}} (\alpha )/ N \rightarrow V^{(1)}_{\textrm{rel}}(\alpha )\) when N goes to \(+\infty \). Moreover, from (19), the convergence rate is at least as fast as \(\mathcal {O}(1/ \sqrt{N})\). \(\square \)
B Proof of Lemma 2
In this appendix, we prove Lemma 2. We first show the piecewise affine property in Lemma 6, which gives (i) and (ii). We then show the uniqueness of the fixed point via a bijectivity property in Lemma 7, from which we conclude (iii).
Lemma 6
(Piecewise affine) \(\phi \) is a piecewise affine continuous function, with d affine pieces.
Proof
Let \(\textbf{m}\in \Delta ^d\) be a configuration and recall \(s(\textbf{m})\in \{1, \dots , d\}\) is the state such that \(\sum _{i=1}^{s(\textbf{m})-1}m_i\le \alpha < \sum _{i=1}^{s(\textbf{m})}m_i\). When the system is in configuration \(\textbf{m}\) at time t, WIP will activate all arms that are in states 1 to \(s(\textbf{m})-1\) and not activate any arm in states \(s(\textbf{m})+1\) to d. Among the \(Nm_{s(\textbf{m})}\) arms in state \(s(\textbf{m})\), \(N(\alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i)\) of them will be activated and the rest will not be activated.
This implies that the expected number of arms in state j at time \(t+1\) will be equal to
This justifies expression (8). Note that (8) can be rearranged as
Consequently \(\phi (\textbf{m}) = \textbf{m}\cdot \textbf{K}_{s(\textbf{m})} + \textbf{b}_{s(\textbf{m})}\), where \(\textbf{b}_{s(\textbf{m})} = \alpha (\textbf{P}^1_{s(\textbf{m})} - \textbf{P}^{0}_{s(\textbf{m})})\), and \(\textbf{K}_{s(\textbf{m})} = \) \( \begin{pmatrix} \textbf{P}^1_1 - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}_{s(\textbf{m})}^0 \\ \textbf{P}^1_2 - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}_{s(\textbf{m})}^0 \\ \vdots \\ \textbf{P}^1_{s(\textbf{m})-1} - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}^0_{s(\textbf{m})} \\ \textbf{P}^0_{s(\textbf{m})} \\ \textbf{P}^0_{s(\textbf{m})+1} \\ \vdots \\ \textbf{P}^0_{d} \end{pmatrix} \).
Let \(\mathcal {Z}_i:= \{\textbf{m}\in \Delta ^d \mid s(\textbf{m})=i\}\). The above expression of \(\phi \) implies that this map is affine on each zone \(\mathcal {Z}_i\). There are d such zones, for \(1 \le i \le d\). It is clear from this expression that \(\phi (\textbf{m})\) is continuous in \(\textbf{m}\). \(\square \)
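The affine decomposition of Lemma 6 can be checked numerically. The following Python sketch (the kernels \(\textbf{P}^0,\textbf{P}^1\) and budget \(\alpha \) are placeholders, not the paper's example) builds \(\textbf{K}_s\) and \(\textbf{b}_s\) and verifies that \(\phi (\textbf{m}) = \textbf{m}\,\textbf{K}_{s(\textbf{m})} + \textbf{b}_{s(\textbf{m})}\) and that each row of \(\textbf{K}_s\) sums to 1, as noted in the footnotes.

```python
import numpy as np

# Illustrative kernels and budget (placeholders, not the paper's example);
# states are assumed already ordered by Whittle index.
P0 = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
P1 = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
alpha = 0.4

def s_of(m):
    """The state s(m) with m_1 + ... + m_{s-1} <= alpha < m_1 + ... + m_s."""
    return int(np.searchsorted(np.cumsum(m), alpha, side='right')) + 1

def phi(m):
    """One synchronous step of WIP: activated mass per state, then mix."""
    a = np.diff(np.concatenate(([0.0], np.minimum(np.cumsum(m), alpha))))
    return a @ P1 + (m - a) @ P0

def affine_piece(s):
    """K_s and b_s such that phi(m) = m @ K_s + b_s on zone Z_s (s 1-based)."""
    K = P0.copy()
    K[:s - 1] = P1[:s - 1] - P1[s - 1] + P0[s - 1]
    b = alpha * (P1[s - 1] - P0[s - 1])
    return K, b

m = np.array([0.2, 0.5, 0.3])           # here s(m) = 2
K, b = affine_piece(s_of(m))
print(np.allclose(phi(m), m @ K + b))   # True: the affine form matches
print(np.allclose(K.sum(axis=1), 1.0))  # True: each row of K_s sums to 1
```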
Lemma 7
(Bijectivity) Let \(\pi (s,\theta ) \in \Pi \) be the policy that activates all arms in states \(1,\dots ,s-1\), does not activate arms in states \(s+1, s+2, \dots , d\), and that activates arms in state s with probability \(\theta \). Denote by \(\tilde{\alpha }(s,\theta )\) the proportion of time that the active action is taken using policy \(\pi (s,\theta )\). Then, the function \((s,\theta )\mapsto \tilde{\alpha }(s,\theta )\) is a bijective map from \(\{1 \dots d \} \times [0,1)\) to [0, 1).
Proof
The following proof is partially adapted from the proof of (Weber and Weiss, [31], Lemma 1). For a given \(\nu \in \mathbb {R}\), denote by \(\gamma (\nu )\) the value of the subsidy-\(\nu \) problem, i.e.,
We similarly define \(\gamma _{\pi } (\nu )\) as the value under policy \(\pi \) of such a subsidy-\(\nu \) problem. Note that for fixed \(\pi \), the function \(\gamma _{\pi } (\nu )\) is affine and increasing in \(\nu \).
By definition of indexability, \(\gamma (\nu ) = \max _{\pi \in \Pi } \gamma _{\pi } (\nu )\) is a piecewise affine, continuous and convex function of \(\nu \): it is affine on \((-\infty ;\nu _d]\), on \([\nu _1;+\infty )\) and on all \([\nu _s;\nu _{s-1}]\) for \(s\in \{2\dots d\}\).
Moreover, for \(s\in \{2\dots d-1\}\) and \(\nu \in [\nu _s;\nu _{s-1}]\), the optimal policy of (21) is to activate all arms up to state \(s-1\). Hence,
Similarly, and as \(\tilde{\alpha }(s+1,0)=\tilde{\alpha }(s,1)\), for \(\nu \in [\nu _{s+1};\nu _{s}]\) we have:
Consequently
The convexity of \(\gamma (\nu )\) implies that \(1-\tilde{\alpha }(s,0) > 1-\tilde{\alpha }(s,1)\), hence \(\tilde{\alpha }(s,1) > \tilde{\alpha }(s,0)\).
Now suppose that \(\textbf{m}^0\) and \(\textbf{m}^1\) are the equilibrium distributions of policies \(\pi (s,0)\) and \(\pi (s,1)\). Let \(0< \theta < 1\). The equilibrium distribution \(\textbf{m}^{\theta }\) induced by \(\pi (s,\theta )\) is then a linear combination of \(\textbf{m}^0\) and \(\textbf{m}^1\), namely \(\textbf{m}^{\theta } = p\cdot \textbf{m}^0 + (1-p)\cdot \textbf{m}^1\), with
Hence,
and
Observe that \(\tilde{\alpha }(s,\theta )\) is the ratio of two affine functions of \(\theta \) and is therefore monotone as \(\theta \) ranges from 0 to 1; since \(\tilde{\alpha }(s,1) > \tilde{\alpha }(s,0)\), it is increasing. We hence obtain
which concludes the proof. \(\square \)
We are now ready to finish the proof of Lemma 2(iii). Let \(\textbf{m}\) be a fixed point of the continuous map \(\phi \) (which exists by Brouwer’s fixed-point theorem). Under configuration \(\textbf{m}\), all arms that are in states from 1 to \(s(\textbf{m})-1\) are activated, and a fraction \(\theta (\textbf{m})=(\alpha -\sum _{i=1}^{s(\textbf{m})-1} m_i)/m_{s(\textbf{m})}\) of the arms that are in state \(s(\textbf{m})\) are activated. This shows that \(\textbf{m}\) also corresponds to the stationary distribution of the policy \(\pi (s(\textbf{m}),\theta (\textbf{m}))\). The proportion of activated arms of this policy is \(\tilde{\alpha }(s(\textbf{m}),\theta (\textbf{m}))=\alpha \). Consequently, if \(\textbf{m}'\) were another fixed point of \(\phi \), then \(\textbf{m}'\) would have to be the stationary distribution of some other policy of the form \(\pi (s',\theta ')\), with \(\tilde{\alpha }(s',\theta ') = \alpha \). As the function \((s,\theta )\mapsto \tilde{\alpha }(s,\theta )\) is a bijection, this implies that \(s' = s(\textbf{m})\) and \(\theta '=\theta (\textbf{m})\). Hence, the fixed point of \(\phi \) is unique.
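The structural properties of \(\tilde{\alpha }\) used above can be sanity-checked numerically. The sketch below (placeholder kernels, not necessarily an indexable instance; only the structure of the policies \(\pi (s,\theta )\) is used) computes \(\tilde{\alpha }(s,\theta )\) from the stationary distribution of the induced chain and verifies the consistency relation \(\tilde{\alpha }(s,1)=\tilde{\alpha }(s+1,0)\) and the monotonicity in \(\theta \).

```python
import numpy as np

# Placeholder kernels (illustrative only, not the paper's example).
P0 = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
P1 = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
d = 3

def stationary(P):
    """Stationary row vector of an irreducible stochastic matrix P."""
    A = np.vstack([P.T - np.eye(d), np.ones(d)])
    b = np.zeros(d + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def alpha_tilde(s, theta):
    """Active fraction of policy pi(s, theta) (s is 1-based)."""
    P = P0.copy()
    P[:s - 1] = P1[:s - 1]                            # states < s: active
    P[s - 1] = theta * P1[s - 1] + (1 - theta) * P0[s - 1]
    pi = stationary(P)
    return pi[:s - 1].sum() + theta * pi[s - 1]

print(alpha_tilde(1, 0.0))                            # ~0: nothing activated
print(abs(alpha_tilde(1, 1.0) - alpha_tilde(2, 0.0)))  # same policy: ~0
vals = [alpha_tilde(2, t) for t in np.linspace(0, 1, 11)]
print(all(x < y for x, y in zip(vals, vals[1:])))     # increasing in theta
```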
C Proof of Theorem 3
In this appendix, we give the technical details of the proof of our main result, Theorem 3. In the following, we denote by \(\mathcal {B}(\textbf{m}^*, r)\) the ball centered at \(\textbf{m}^*\) with radius r.
Theorem 8
Under the same assumptions as in Theorem 3, assume moreover that \(\textbf{M}^{(N)}(0)\) is already in the stationary regime. Then there exist two constants \(b,c >0\) such that

(i) \(\Vert \mathbb {E}[\textbf{M}^{(N)} (0)] - \textbf{m}^* \Vert \le b \cdot e^{-cN}\);

(ii) \(\mathbb {P}\left[ \textbf{M}^{(N)}(0)\not \in \mathcal {Z}_{s(\textbf{m}^*)}\right] \le b \cdot e^{-cN}\).
Let us first explain how Theorem 8 implies Theorem 3. To do so, we need the following lemma.
Lemma 9
Assume that the bandits are indexable, and let \(\rho (\textbf{m})\) be the instantaneous arm-averaged reward of WIP when the system is in configuration \(\textbf{m}\). Then:

(i) \(\rho \) is affine on each of the zones \(\mathcal {Z}_i\), and for all \(\textbf{m}\in \Delta ^d\):
$$\begin{aligned} \rho (\textbf{m})=&\sum _{i=1}^{s(\textbf{m})-1} m_i R^1_{i} + \left( \alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i\right) R^1_{s(\textbf{m})} + \left( \sum _{i=1}^{s(\textbf{m})}m_i - \alpha \right) R^0_{s(\textbf{m})}\nonumber \\&+ \sum _{i=s(\textbf{m})+1}^{d} m_i R^0_{i}. \end{aligned}$$
(22)

(ii) \(\rho (\textbf{m}^*)=V^{(1)}_{\textrm{rel}}(\alpha )\).
Proof
Let \(\textbf{m}\in \Delta ^d\) be a configuration and recall that \(s(\textbf{m})\in \{1, \dots , d\}\) is the state such that \(\sum _{i=1}^{s(\textbf{m})-1}m_i\le \alpha < \sum _{i=1}^{s(\textbf{m})}m_i\). Similarly to our analysis in Lemma 6, when the system is in configuration \(\textbf{m}\), WIP activates all arms that are in states 1 to \(s(\textbf{m})-1\). This yields an instantaneous reward of \(\sum _{i=1}^{s(\textbf{m})-1}Nm_iR^1_i\). WIP does not activate arms that are in states \(s(\textbf{m})+1\) to d. This yields an instantaneous reward of \(\sum _{i=s(\textbf{m})+1}^{d}Nm_iR^0_i\). Among the \(Nm_{s(\textbf{m})}\) arms in state \(s(\textbf{m})\), \(N(\alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i)\) of them are activated and the rest are not. This shows that \(\rho (\textbf{m})\) is given by (22).
For (ii), recall that \(\textbf{m}^*\) is the unique fixed point, and consider a subsidy-\(\nu _{s(\textbf{m}^*)}\) MDP, where \(\nu _{s(\textbf{m}^*)}\) is the Whittle index of state \(s(\textbf{m}^*)\). Denote by L the value of this MDP:
By definition of the Whittle index, any policy of the form \(\pi (s(\textbf{m}^*),\theta )\) defined in Lemma 7 is optimal for (23). Moreover, if \(\theta ^*\) is such that \(\tilde{\alpha }(s(\textbf{m}^*),\theta ^*)=\alpha \), then such a policy satisfies the constraint (5): \(\lim _{T\rightarrow \infty }\frac{1}{T} \sum _{t=0}^{T-1}\mathbb {E}\left[ a_n(t)\right] = \alpha \). This shows that \(L=V^{(1)}_{\textrm{rel}}(\alpha )\) and, as all arms are identical, we have \(N \cdot V^{(1)}_{\textrm{rel}}(\alpha )=V^{(N)}_{\textrm{rel}}(\alpha )\), and \(\pi (s(\textbf{m}^*),\theta ^*)\) is an optimal policy under the relaxed constraint (5).
It remains to show that the reward of policy \(\pi (s(\textbf{m}^*),\theta ^*)\) is \(\rho (\textbf{m}^*)\). This comes from the fact that the steady-state of the Markov chain induced by this policy is \(\textbf{m}^*\), and \(\pi (s(\textbf{m}^*), \theta ^*)\) is such that \(\alpha N\) arms are activated on average. Indeed, the arm-averaged reward of this policy is:
As the proportion of activated arms is \(\alpha \), we have \(\sum _{i=1}^{s(\textbf{m}^*)-1} m^*_i + \theta ^* m^*_{s(\textbf{m}^*)}=\alpha \). Hence, (24) coincides with the expression of \(\rho (\textbf{m}^*)\) in (22), and \(\rho (\textbf{m}^*)= L = V^{(1)}_{\textrm{rel}}(\alpha )\). This concludes the proof of Lemma 9. \(\square \)
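Expression (22) can be cross-checked against the direct accounting of the reward, \(\textbf{a}\cdot R^1 + (\textbf{m}-\textbf{a})\cdot R^0\), where \(\textbf{a}\) is the activated mass per state under WIP. A short sketch (rewards and budget below are placeholders, not the paper's example):

```python
import numpy as np

# Placeholder rewards and budget (illustrative only).
R0 = np.array([0.1, 0.4, 0.2]); R1 = np.array([0.8, 0.5, 0.3])
alpha = 0.4

def activated_mass(m):
    """Mass activated per state under WIP (states ordered by index)."""
    return np.diff(np.concatenate(([0.0], np.minimum(np.cumsum(m), alpha))))

def rho_direct(m):
    """Direct accounting: active arms earn R^1, passive arms earn R^0."""
    a = activated_mass(m)
    return a @ R1 + (m - a) @ R0

def rho_formula(m):
    """Expression (22)."""
    s = int(np.searchsorted(np.cumsum(m), alpha, side='right')) + 1
    head = m[:s - 1].sum()
    return (m[:s - 1] @ R1[:s - 1] + (alpha - head) * R1[s - 1]
            + (m[:s].sum() - alpha) * R0[s - 1] + m[s:] @ R0[s:])

m = np.array([0.2, 0.5, 0.3])
print(np.isclose(rho_direct(m), rho_formula(m)))  # True
```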
By definition, the performance of WIP is \(V^{(N)}_{\textrm{WIP}} (\alpha )= N \cdot \mathbb {E}\left[ \rho (\textbf{M}^{(N)}(0))\right] \). Hence from Lemma 9 we have
As \(\rho \) is affine on \(\mathcal {Z}_{s(\textbf{m}^*)}\), by Theorem 8(i) the first term inside the above expectation is exponentially small; by Theorem 8(ii) and the boundedness of the rewards, so is the second term.
In the rest of this section, we first prove a few technical lemmas and then conclude by proving Theorem 8.
1.1 C.1 Hoeffding’s inequality (for one transition)
Lemma 10
(Hoeffding’s inequality) For all \(t \in \mathbb {N}\), we have
where the random vector \({\varvec{\epsilon }}^{(N)} (t+1)\) is such that
and for all \(\delta >0\):
Proof
Since the N arms evolve independently, we may apply the following form of Hoeffding’s inequality: let \(X_1\), \(X_2\),..., \(X_N\) be N independent random variables taking values in the interval [0, 1], and define their empirical mean \(\overline{X}:= \frac{1}{N} (X_1 + X_2 +\cdots + X_N)\); then
More precisely, for a fixed \(1 \le j \le d\), we have
where, for \(1 \le i \le d\) and \(1 \le k \le N \cdot M^{(N)}_i(t)\), the \(U_{i,k}\)’s are in total N independent and identically distributed uniform (0, 1) random variables, and \(P_{ij}(\textbf{m})\) is the probability that an arm in state i moves to state j under WIP when the N-arm system is in configuration \(\textbf{m}\).
By definition, we have
Hence,
and
where the last inequality comes from the above form of Hoeffding’s inequality. \(\square \)
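The two-sided Hoeffding bound used above, \(\mathbb {P}(|\overline{X} - \mathbb {E}[\overline{X}]| \ge \delta ) \le 2 e^{-2N\delta ^2}\), can be checked empirically on Bernoulli variables (the parameters below are arbitrary choices for illustration):

```python
import math
import numpy as np

# Empirical check of P(|Xbar - E[Xbar]| >= delta) <= 2 exp(-2 N delta^2)
# for the mean of N i.i.d. Bernoulli(p) variables.
rng = np.random.default_rng(1)
N, delta, p, reps = 200, 0.08, 0.3, 20000

xbar = rng.binomial(N, p, size=reps) / N      # empirical means
empirical = np.mean(np.abs(xbar - p) >= delta)
bound = 2 * math.exp(-2 * N * delta ** 2)
print(empirical <= bound)  # True: the frequency is well below the bound
```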
1.2 C.2 Hoeffding’s inequality (for t transitions)
Lemma 11
There exists a positive constant K such that for all \(t \in \mathbb {N}\) and for all \(\delta > 0\),
Proof
Since \(\phi \) is piecewise affine with finitely many affine pieces, it is in particular K-Lipschitz: there is a constant \(K > 0\) such that for all \(\textbf{m}_1, \textbf{m}_2 \in \Delta ^d\):
Let \(t \in \mathbb {N}\) and \(\textbf{m}\in \Delta ^d\) be fixed. We have
By iterating the above inequality, we obtain
where for each \(0 \le s \le t\), we have by Lemma 10: for all \(\delta > 0\),
Hence, using the union bound, we obtain
and this ends the proof of Lemma 11. \(\square \)
1.3 C.3 Exponential stability of \(\textbf{m}^*\)
Lemma 12
Under the assumptions of Theorem 3, there exist constants \(b_1,b_2>0\) such that for all \(t\ge 0\) and all \(\textbf{m}\in \Delta ^d\):
Proof
As \(\phi \) is locally stable, for all \(\varepsilon >0\), there exists \(\delta >0\) such that if \(\Vert \textbf{m}-\textbf{m}^*\Vert \le \delta \), then for all \(t \ge 0\): \(\Vert \Phi _t(\textbf{m})-\textbf{m}^*\Vert \le \varepsilon \). Recall that for all \(\textbf{m}\in \mathcal {Z}_{s(\textbf{m}^*)}\), we have \(\phi (\textbf{m})=(\textbf{m}-\textbf{m}^*) \cdot \textbf{K}_{s(\textbf{m}^*)}+\textbf{m}^*\). We choose \(\varepsilon >0\) so that \(\mathcal {B}(\textbf{m}^*,\varepsilon )\subset \mathcal {Z}_{s(\textbf{m}^*)}\).
Let us now show that there exists \(T>0\) such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _T(\textbf{m})\in \mathcal {B}(\textbf{m}^*,\varepsilon )\). We reason by contradiction: if this were not true, there would exist a sequence of times \(t \in \mathbb {N}\) going to infinity and corresponding configurations \(\{\textbf{m}_t\}_{t}\) such that \(\Vert \Phi _t(\textbf{m}_t)-\textbf{m}^*\Vert \ge \varepsilon \). As \(\Delta ^d\) is compact, there exists a subsequence of \(\{\textbf{m}_t\}_{t}\) (denoted again by \(\{\textbf{m}_t\}_{t}\)) that converges to an element \(\bar{\textbf{m}}\). On the other hand, as \(\textbf{m}^*\) is an attractor, there exists \(T_1\) such that \(\Phi _{T_1}(\bar{\textbf{m}})\in \mathcal {B}(\textbf{m}^*,\delta /2)\). Since \(\Phi _{T_1}(\cdot )\) is continuous, there exists \(\eta >0\) such that if \(\Vert \textbf{m}-\bar{\textbf{m}}\Vert \le \eta \), then \(\Vert \Phi _{T_1}(\textbf{m})-\Phi _{T_1}(\bar{\textbf{m}})\Vert \le \delta /2\). As \(\{\textbf{m}_t\}_{t}\) converges to \(\bar{\textbf{m}}\), there exists \(T_2\) such that for \(t\ge T_2\), we have \(\Vert \textbf{m}_t - \bar{\textbf{m}}\Vert \le \eta \). Consequently, for \(t \ge T_2\), we have
Hence, for \(t \ge \max (T_1,T_2)\), by our choice of \(\varepsilon \) and \(\delta \) from the local stability of \(\phi \), we deduce that
This is a contradiction. Consequently, there exists T such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _T(\textbf{m})\in \mathcal {B}(\textbf{m}^*,\varepsilon )\). This implies in particular that \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix: the moduli of all its eigenvalues are smaller than one. Moreover, we have for all \(\textbf{m}\in \Delta ^d\) and \(t\ge T\):
As \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix, this implies that (25) holds for all \(\textbf{m}\in \Delta ^d\). \(\square \)
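The stability property and the resulting geometric decay can be visualized on a toy instance. The sketch below builds a zone matrix \(\textbf{K}_s\) as in Lemma 6 from placeholder kernels, checks the spectral condition of the footnotes (eigenvalue 1 simple, all other eigenvalues of modulus strictly less than 1), and iterates a zero-sum perturbation, which is the form that a difference \(\textbf{m}-\textbf{m}^*\) takes.

```python
import numpy as np

# Zone matrix K_s built as in Lemma 6 from placeholder kernels (here s = 2).
P0 = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
P1 = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
s = 2
K = P0.copy()
K[:s - 1] = P1[:s - 1] - P1[s - 1] + P0[s - 1]

# "Stable" in the paper's sense: eigenvalue 1 is simple, the rest lie
# strictly inside the unit circle.
mods = np.sort(np.abs(np.linalg.eigvals(K)))
print(np.isclose(mods[-1], 1.0), mods[-2] < 1.0)

# A difference m - m* sums to 0; along such directions the affine dynamics
# contract geometrically, as in (25).
x = np.array([0.3, -0.1, -0.2])        # zero-sum perturbation
norms = []
for _ in range(30):
    norms.append(np.abs(x).sum())
    x = x @ K
print(norms[-1] < 1e-4 * norms[0])     # exponential decay
```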
1.4 C.4 Proof of Theorem 8
We are now ready to prove the main theorem.
Proof
The proof consists of several parts.
1.4.1 C.4.1 Choice of a neighborhood \(\mathcal {N}\)
The fixed point \(\textbf{m}^*\) is in zone \(\mathcal {Z}_{s(\textbf{m}^*)}\) in which \(\phi \) can be written as
As \(\textbf{m}^*\) is not singular, we can choose a neighborhood \(\mathcal {N}_1\) of \(\textbf{m}^*\) included in \(\mathcal {Z}_{s(\textbf{m}^*)}\). Since \(\textbf{m}^*\) is locally stable, \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix. We can therefore choose a smaller neighborhood \(\mathcal {N}_2 \subset \mathcal {N}_1\) such that \(\Phi _t (\mathcal {N}_2) \subset \mathcal {N}_1\) for all \(t\ge 0\), i.e., the image of \(\mathcal {N}_2\) under the maps \(\Phi _{t}\) remains inside \(\mathcal {N}_1\); this is possible by stability of \(\textbf{m}^*\). We next choose a neighborhood \(\mathcal {N}_3 \subset \mathcal {N}_2\) and a \(\delta > 0\) such that \((\phi (\mathcal {N}_3))^{\delta } \subset \mathcal {N}_2\), i.e., the image of \(\mathcal {N}_3\) under \(\phi \) remains inside \(\mathcal {N}_2\) at distance at least \(\delta \) from the boundary of \(\mathcal {N}_2\). We finally fix \(r > 0\) such that \(\mathcal {B}(\textbf{m}^*, r) \cap \Delta ^d \subset \mathcal {N}_3\), and we choose our neighborhood \(\mathcal {N}\) as
Note that the choice of r and \(\delta \) is independent of N. From (ii) of Lemma 12, we furthermore denote by \(\tilde{T}:= T(r/2)\) the finite time such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _{\tilde{T}+1} (\textbf{m}) \in \mathcal {B}(\textbf{m}^*, r/2)\).
1.4.2 C.4.2 Definition and properties of the function G.
Following the generator approach used, for instance, in Gast et al. [12], we define \(G: \Delta ^d \rightarrow \mathbb {R}^d\) for \(\textbf{m}\in \Delta ^d\) as
By Lemma 12, for all \(\textbf{m}\in \Delta ^d\) we have \(\Vert G(\textbf{m}) \Vert \le \sum _{t=0}^{\infty } b_1 \cdot e^{-b_2t} \cdot \Vert \textbf{m}- \textbf{m}^* \Vert < \infty \). This shows that the function G is well defined and bounded. Set \( \overline{G}:= \sup _{\textbf{m}\in \Delta ^d} \Vert G(\textbf{m}) \Vert <\infty \).
By our choice of \(\mathcal {N}_2\) defined above, for all \(t\ge 0\) and \(\textbf{m}\in \mathcal {N}_2\) we have:
Hence, for all \(\textbf{m}\in \mathcal {N}_2\), we have
where the last equality holds because \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix. Hence, in \(\mathcal {N}_2\), \(G(\textbf{m})\) is an affine function of \(\textbf{m}\).
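Concretely, on the zone the series behind G reduces to \(\sum _{t\ge 0}(\textbf{m}-\textbf{m}^*)\textbf{K}^t\). Although \(I-\textbf{K}\) itself is singular (\(\textbf{K}\) has eigenvalue 1), the series converges on the zero-sum subspace, and its value can be recovered from the classical fundamental matrix \((I-\textbf{K}+\mathbb {1}\pi )^{-1}\) of Markov chain theory, where \(\pi \) is the stationary distribution of \(\textbf{K}\). A numerical sketch (placeholder matrices, as before):

```python
import numpy as np

# Zone matrix K_s from placeholder kernels (here s = 2), as in Lemma 6.
P0 = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
P1 = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
s = 2
K = P0.copy()
K[:s - 1] = P1[:s - 1] - P1[s - 1] + P0[s - 1]

# Stationary distribution pi of K (rows of K sum to 1).
A = np.vstack([K.T - np.eye(3), np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0., 0., 0., 1.]), rcond=None)[0]

x = np.array([0.3, -0.1, -0.2])             # plays the role of m - m*
partial, term = np.zeros(3), x.copy()
for _ in range(200):                        # partial sums of sum_t x K^t
    partial += term
    term = term @ K

# Fundamental matrix: for zero-sum x, x Z equals the series above.
Z = np.linalg.inv(np.eye(3) - K + np.outer(np.ones(3), pi))
print(np.allclose(partial, x @ Z, atol=1e-8))  # True
```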
From the definition of function G, we see that for all \(\textbf{m}\in \Delta ^d\):
Hence,
In the following, we bound (27) and (28) separately.
1.4.3 C.4.3 Bound on (27)
As G is bounded by \(\overline{G}\), we have
We are left to bound \(\mathbb {P}\left[ \textbf{M}^{(N)} (0) \notin \mathcal {N}\right] \). Let \(u:= \big ( \frac{r}{2(1 + K + K^2 +\cdots + K^{\tilde{T}})} \big )^2\), where K is the Lipschitz constant of \(\phi \). By Lemma 11, we have:
This shows that
where the last equality comes from our choice of \(\tilde{T} = T(r/2)\).
C.4.4 Bound on (28)
By Lemma 10, we have
By our choice of \(\mathcal {N}\) and \(\delta \), for the first part of the above expectation, i.e., when the event \(\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert < \delta \} \) occurs, \(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)\) will remain in \(\mathcal {N}_2\), hence \(G\big ( \phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1) \big )\) takes the same affine form as \(G(\phi (\textbf{m}))\). Consequently
For the second part of the above expectation,
So finally
C.4.5 Conclusion of the proof
To summarize, we have obtained by (29):
and
where b, c can be taken as \(b:= (2\overline{G}+1)(\tilde{T}+2)\), \(c:= \min (\delta ^2, u)\), and this concludes the proof of Theorem 8. \(\square \)
D Proof of Theorem 5
Recall that \(\textbf{M}^{(N)}(t)\) is the configuration of the system at time t, which means that \(M^{(N)}_i(t)\) is the fraction of arms that are in state i at time t. Let \(\textbf{e}_i\) be the d-dimensional vector that has all its components equal to 0 except the ith one, which equals 1. The process \(\textbf{M}^{(N)}\) is a continuous-time Markov chain that jumps from a configuration \(\textbf{m}\) to a configuration \(\textbf{m}+\frac{1}{N}(\textbf{e}_j-\textbf{e}_i)\) when an arm jumps from state i to state j. For \(i<s(\textbf{m})\), this occurs at rate \(Nm_iQ^1_{ij}\) as all of these arms are activated. For \(i>s(\textbf{m})\), this occurs at rate \(Nm_iQ^0_{ij}\) as these arms are not activated. For \(i=s(\textbf{m})\), this occurs at rate \(N\big ((\alpha -\sum _{k=1}^{s(\textbf{m})-1} m_k)Q^1_{ij} + (\sum _{k=1}^{s(\textbf{m})} m_k-\alpha )Q^0_{ij}\big )\). Let us define:
The process \(\textbf{M}^{(N)}\) jumps from \(\textbf{m}\) to \(\textbf{m}+(\textbf{e}_j-\textbf{e}_i)/N\) at rate \(N \lambda _{ij}(\textbf{m})\). This shows that \(\textbf{M}^{(N)}\) is a density-dependent population process as defined in Kurtz [20]. It is shown in Kurtz [20] that, for any finite time t, the trajectories of \(\textbf{M}^{(N)}(t)\) converge to the solution of a differential equation \(\dot{\textbf{m}}=f(\textbf{m})\) as N grows, with \(f(\textbf{m}):= \sum _{i\ne j}\lambda _{ij}(\textbf{m})(\textbf{e}_j-\textbf{e}_i)\). The function \(f(\textbf{m})\) is called the drift of the system. It should be clear that \(f(\textbf{m})=\tau (\phi (\textbf{m})-\textbf{m})\), where \(\phi \) is defined for the discrete-time version of our continuous-time bandit problem.
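To make the density-dependent scaling concrete, here is a minimal simulation sketch of the Kurtz-type convergence. It uses a toy two-state chain with made-up constant rates a and b, not the bandit dynamics of the paper: an arm in state 1 jumps to state 2 at rate a and back at rate b, so \(\lambda _{12}(\textbf{m})=a m_1\) and \(\lambda _{21}(\textbf{m})=b m_2\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy parameters (placeholders, not from the paper).
a, b, T = 1.0, 0.5, 5.0

def simulate(N):
    # Gillespie simulation of the N-arm empirical measure; returns m_1(T).
    m1, t = 1.0, 0.0          # all arms start in state 1
    while t < T:
        r12, r21 = N * a * m1, N * b * (1 - m1)
        total = r12 + r21
        t += rng.exponential(1 / total)
        if t >= T:
            break
        m1 += -1 / N if rng.random() < r12 / total else 1 / N
    return m1

# Deterministic limit: dm_1/dt = -a m_1 + b (1 - m_1), solved in closed form.
m1_inf = b / (a + b)
m1_ode = m1_inf + (1.0 - m1_inf) * np.exp(-(a + b) * T)

big = np.mean([simulate(2000) for _ in range(20)])
print(abs(big - m1_ode) < 0.02)
```

For large N, the empirical measure at time T concentrates around the ODE solution, which is the finite-horizon convergence invoked from Kurtz [20].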
For \(t \ge 0\), denote by \(\Phi _t \textbf{m}\) the value at time t of the solution of the differential equation that starts in \(\textbf{m}\) at time 0; it satisfies
Following Gast and Van Houdt [10]; Ying [34], we denote by \(L^{(N)}\) the generator of the N-arm system and by \(\Lambda \) the generator of the differential equation. They associate to each almost-everywhere differentiable function h two functions \(L^{(N)}h\) and \(\Lambda h\) that are defined as
with Dh being the differential of function h. The function \(\Lambda h \) is defined only on points \(\textbf{m}\) for which \(h(\textbf{m})\) is differentiable. Remark that if \(h(\textbf{m})\) is an affine function in \(\textbf{m}\), i.e., \(h(\textbf{m}) = \textbf{m}\cdot \textbf{B}+ \textbf{b}\), with \(\textbf{B}\) a \(d\times d\) matrix and \(\textbf{b}\) a d-dimensional vector, then \(\big (L^{(N)}h\big )(\textbf{m}) = \big (\Lambda h \big ) (\textbf{m}) = f(\textbf{m}) \cdot \textbf{B}\).
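The identity \(\big (L^{(N)}h\big )(\textbf{m}) = f(\textbf{m}) \cdot \textbf{B}\) for affine h can be checked directly: the difference \(h(\textbf{m}+(\textbf{e}_j-\textbf{e}_i)/N)-h(\textbf{m})\) is exactly \((\textbf{e}_j-\textbf{e}_i)\textbf{B}/N\). Below is a small numerical sketch with arbitrary stand-in rates \(\lambda _{ij}(\textbf{m})\) and matrices; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 3, 50  # toy dimension and arm population (placeholders)

# Stand-in transition rates lambda_ij(m); any nonnegative function works
# for this identity, so we take lambda_ij(m) = m_i * R[i, j] with R arbitrary.
R = rng.random((d, d)); np.fill_diagonal(R, 0.0)
lam = lambda m: m[:, None] * R

def drift(m):
    # f(m) = sum_{i != j} lambda_ij(m) (e_j - e_i): inflow minus outflow.
    L = lam(m)
    return L.sum(axis=0) - L.sum(axis=1)

# Affine test function h(m) = m B + b.
B = rng.random((d, d)); b = rng.random(d)
h = lambda m: m @ B + b

def LN_h(m):
    # (L^N h)(m) = N sum_{i != j} lambda_ij(m) [h(m + (e_j - e_i)/N) - h(m)]
    out, e, L = np.zeros(d), np.eye(d), lam(m)
    for i in range(d):
        for j in range(d):
            if i != j:
                out += N * L[i, j] * (h(m + (e[j] - e[i]) / N) - h(m))
    return out

m = np.array([0.2, 0.5, 0.3])
print(np.allclose(LN_h(m), drift(m) @ B))
```

The equality holds exactly (up to floating point) and for every N, which is why affine functions are so convenient in the generator comparison below.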
Now the analogue of Theorem 8(i) in the continuous-time case is
Theorem 13
Under the same assumptions as in Theorem 5, and assuming further that \(\textbf{M}^{(N)} (0)\) is already in the stationary regime, there exist two constants \(b,c >0\) such that
Note first that, similarly to the discrete-time case, Theorem 13 implies Theorem 5.
Proof
Define the continuous-time version of function G as
As in the discrete-time case, our assumptions imply that the unique fixed point is an exponentially stable attractor, and a result similar to Lemma 12 can be obtained in continuous time. This implies that the function G is well defined, continuous and bounded.
Recall that the function f is affine in \(\mathcal {Z}_{s(\textbf{m}^*)}\): if \(\textbf{m}\in \mathcal {Z}_{s(\textbf{m}^*)}\), then \(\phi (\textbf{m})=(\textbf{m}-\textbf{m}^*)\textbf{K}+\textbf{m}^*\), where \(\textbf{K}\) is a \(d\times d\) matrix, and hence \(f(\textbf{m})= \tau (\phi (\textbf{m})-\textbf{m}) = \tau (\textbf{m}-\textbf{m}^*)(\textbf{K}-\textbf{I})\). Now suppose that \(\textbf{m}\in \Delta ^d\) is such that \(\Phi _t \textbf{m}\) remains inside \(\mathcal {Z}_{s(\textbf{m}^*)}\) for all \(t\ge 0\); then
So as for the discrete-time case, \(G(\textbf{m})\) is an affine function of \(\textbf{m}\), with affine factor \(\textbf{B}:= \frac{1}{\tau }(\textbf{K}-\textbf{I})^{-1}\).
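As an illustrative numerical check (not part of the proof), one can verify that the integral \(\int _0^\infty (\Phi _t \textbf{m}- \textbf{m}^*)\,dt\) is indeed an affine function of \(\textbf{m}\) inside the affine zone. In the sketch below, the stochastic matrix K and \(\tau \) are stand-ins; since \(\textbf{K}-\textbf{I}\) is singular on the full space, the closed form is written with the standard deviation-matrix inverse \((\textbf{1}\pi - Q)^{-1}\), which agrees with the formal inverse on the zero-sum hyperplane containing \(\textbf{m}-\textbf{m}^*\).

```python
import numpy as np

rng = np.random.default_rng(2)
d, tau = 3, 1.0  # toy dimension and rate parameter (placeholders)

# Stand-in for K: a stochastic matrix, so Q = tau (K - I) is a CTMC generator.
K = rng.random((d, d)); K /= K.sum(axis=1, keepdims=True)
Q = tau * (K - np.eye(d))

# Stationary pi (also satisfies pi Q = 0); arbitrary m, m* in the simplex.
pi = np.ones(d) / d
for _ in range(500):
    pi = pi @ K
m, m_star = np.array([0.6, 0.1, 0.3]), pi
x = m - m_star        # sums to 0

# Numerically integrate int_0^T x(t) dt, where x'(t) = x(t) Q,
# i.e., x(t) = Phi_t m - m*, using RK4 with midpoint accumulation.
dt, T, integral = 0.001, 100.0, np.zeros(d)
for _ in range(int(T / dt)):
    k1 = x @ Q; k2 = (x + dt/2 * k1) @ Q
    k3 = (x + dt/2 * k2) @ Q; k4 = (x + dt * k3) @ Q
    step = dt / 6 * (k1 + 2*k2 + 2*k3 + k4)
    integral += dt * (x + step / 2)   # midpoint-rule quadrature
    x = x + step

# Closed form via the deviation matrix: for x0 with x0 . 1 = 0,
# int_0^inf x0 e^{Qt} dt = x0 (1 pi - Q)^{-1}.
x0 = m - m_star
closed = x0 @ np.linalg.inv(np.outer(np.ones(d), pi) - Q)
print(np.allclose(integral, closed, atol=1e-4))
```

The integral depends linearly on \(\textbf{m}-\textbf{m}^*\), matching the affine form of G stated above.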
As \(\textbf{m}^*\) is non-singular, it is at a positive distance from the other zones \(\mathcal {Z}_i \ne \mathcal {Z}_{s(\textbf{m}^*)}\), and we therefore define \(\delta := \min _{i\ne s(\textbf{m}^*)} d(\textbf{m}^*,\mathcal {Z}_i)/2>0\), where \(d(\cdot ,\cdot )\) is the distance under the \(\Vert \cdot \Vert \)-norm. We then choose a neighborhood \(\mathcal {N}_1:= \mathcal {B}(\textbf{m}^*, \epsilon _1) \cap \Delta ^d\) of \(\textbf{m}^*\) such that \(\Phi _t(\textbf{m}) \in \mathcal {B}(\textbf{m}^{*}, \delta )\) for all \(t \ge 0 \) and every initial condition \(\textbf{m}\in \mathcal {N}_1\). This is possible by the exponentially stable attractor property of \(\textbf{m}^*\). Following Theorem 3.2 of Gast [9], we have
where \(\mathcal {N}:= \mathcal {B}(\textbf{m}^*, \epsilon _1/2) \cap \Delta ^d\). Let \(N_0:= \lceil 2/\epsilon _1 \rceil \). For \(N\ge N_0\), every \(\textbf{m}\in \mathcal {N}\) additionally satisfies \(\Phi _t \big ( \textbf{m}+ \frac{\textbf{e}_j-\textbf{e}_i}{N}\big ) \in \mathcal {Z}_{s(\textbf{m}^*)}\) for all \(1 \le i \ne j \le d\) and \(t\ge 0\). Hence, G is locally affine, and for all \(\textbf{m}\in \mathcal {N}\) and \(N \ge N_0\), we have:
This shows that the first term of (30) is equal to zero.
For the second term, note that both G and \(\Lambda G\) are continuous functions defined on the compact region \(\Delta ^d\), hence they are both bounded, while \(L^{(N)}G\) grows at most linearly with N. Hence, we can choose constants \(u,v > 0\) independent of N such that:
We are left to bound \(\mathbb {P} \big ( \textbf{M}^{(N)} (0) \notin \mathcal {N} \big )\) exponentially from above. This could be done by using the (unnamed) proposition on page 644 of Weber and Weiss [31]; however, we were not able to find the paper referenced for the proof of that proposition, so we provide a direct proof below. To achieve this, we rely on an exponential martingale concentration inequality, borrowed from Darling and Norris [6], which in our situation can be stated as follows.
Lemma 14
Fix \(T > 0\). Let K be the Lipschitz constant of the drift f, denote \(\lambda := \max _{i,j} \lambda _{ij}\), and \(c_1:= e^{-2KT} / 18T\). If \(\epsilon > 0\) is such that
then we have
The above lemma plays the role of Lemma 11 in the discrete-time case. Note that its original form, stated as Theorem 4.2 in Darling and Norris [6], is set in a more general framework: it considers a continuous-time Markov chain with countable state space evolving in \(\mathbb {R}^d\) and discusses a differential-equation approximation to the trajectories of such a Markov chain. As such, the right-hand side of (34) has an additional term \(\mathbb {P} (\Omega _0^c \cup \Omega _1^c \cup \Omega _2^c)\), with \(\Omega _i^c\) being the complement of \(\Omega _i\). In our case, \(\Omega _0 = \Omega _1 = \Omega \) trivially holds, while the analysis of \(\Omega _2\) is more involved. However, as remarked before the statement of Theorem 4.2 in Darling and Norris [6], if the maximum jump rate (in our case \(N \lambda \)) and the maximum jump size (in our case 1/N) of the Markov chain satisfy a certain inequality, which in our situation can be stated as (33), then \(\Omega _2 = \Omega \). Note that the constraint (33) is satisfied as long as \(\epsilon \) is sufficiently small, and consequently \(\mathbb {P} (\Omega _0^c \cup \Omega _1^c \cup \Omega _2^c) = 0\).
Now let \(\epsilon >0\) be such that \(\mathcal {B}(\textbf{m}^*, 2\epsilon ) \cap \Delta ^d \subset \mathcal {N}\). The uniform global attractor assumption on \(\textbf{m}^{*}\) ensures that there exists \(T>0\) such that \(\Phi _t \textbf{m}\in \mathcal {B}(\textbf{m}^*,\epsilon )\) for all \(\textbf{m}\in \Delta ^d\) and \(t\ge T\). We then take such T and \(\epsilon \) as in Lemma 14 that additionally verify (33); this is possible because the right-hand side of (33) converges to 0 when \(\epsilon \) is small and T is large.
We then have:
So in summary, (30)-(31) gives
Moreover, for any \(c'>0\) and \(0<c<c'\), \(N \cdot e^{-c'N}=\mathcal {O}(e^{-cN})\), so the right-hand side of (35) can be bounded by a term of the form \(b \cdot e^{-cN}\). This concludes the proof of Theorem 13. \(\square \)
Gast, N., Gaujal, B. & Yan, C. Exponential asymptotic optimality of Whittle index policy. Queueing Syst 104, 107–150 (2023). https://doi.org/10.1007/s11134-023-09875-x