Bandits
We consider the following Bernoulli $K$-armed bandit setting. Let $\mu \in [0, 1]^K$. At each time $t \ge 1$, a learner chooses $I_t \in \{1, \dots, K\}$ and receives reward $X_t(I_t) \sim B(\mu_{I_t})$. In all the experiments, we will consider a Bernoulli bandit with $K = 2$ arms and means $\mu_1 = 0.6$ and $\mu_2 = 0.5$. The goal of the learner is to maximize his cumulative reward $\sum_{t=1}^{n} X_t(I_t)$. To do so, the learner minimizes his cumulative regret, defined as
\[
R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^{n} X_t(I_t)\right], \qquad \text{where } \mu^* = \max_{1 \le i \le K} \mu_i.
\]
We look for algorithms that achieve sublinear regret, $R_n = o(n)$, so that the average reward of the learner tends to the optimal reward $\mu^*$.
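Throughout the questions below it may help to have a small simulation harness. Here is a minimal sketch of the environment in Python (the class name `BernoulliBandit`, its methods, and the use of NumPy are our own choices, not prescribed by the exercise):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit with mean vector mu (hypothetical helper, not part of the exercise)."""

    def __init__(self, mu, rng=None):
        self.mu = np.asarray(mu, dtype=float)      # arm means, each in [0, 1]
        self.K = len(self.mu)
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, i):
        """Play arm i (0-indexed) and return a Bernoulli(mu[i]) reward."""
        return float(self.rng.random() < self.mu[i])

    def gap(self, i):
        """Suboptimality gap Delta_i = mu* - mu_i of arm i."""
        return float(self.mu.max() - self.mu[i])

# The experiments below use mu = (0.6, 0.5), i.e. a single gap Delta = 0.1.
bandit = BernoulliBandit([0.6, 0.5])
```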
1. Prove that the regret satisfies $R_n = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(n)]$, where $T_i(n) = \sum_{t=1}^{n} \mathbf{1}\{I_t = i\}$ and $\Delta_i = \mu^* - \mu_i$.
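One possible way to organize the proof is sketched below (using the tower rule; the notation is the one defined above):
\[
R_n = \mathbb{E}\Big[\sum_{t=1}^{n} (\mu^* - \mu_{I_t})\Big]
    = \mathbb{E}\Big[\sum_{t=1}^{n} \sum_{i=1}^{K} \mathbf{1}\{I_t = i\}\,\Delta_i\Big]
    = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(n)],
\]
where the first equality uses $\mathbb{E}[X_t(I_t) \mid I_t] = \mu_{I_t}$ together with the tower rule.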
3. Implement Follow-the-Leader (FTL).
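A minimal sketch of FTL under the setup above (greedy on empirical means, with each arm pulled once to initialize; `BernoulliBandit` is the hypothetical class sketched earlier, and returning the full regret trajectory is a convenience for the later plots):

```python
def run_ftl(bandit, n):
    """Follow-the-Leader: at each round play the arm with the highest empirical mean.

    Returns the cumulative pseudo-regret trajectory as a length-n array.
    """
    counts = np.zeros(bandit.K)        # number of pulls T_i(t) of each arm
    sums = np.zeros(bandit.K)          # total reward collected on each arm
    regret = np.zeros(n)
    for t in range(n):
        if t < bandit.K:
            i = t                                   # initialization: play each arm once
        else:
            i = int(np.argmax(sums / counts))       # empirical leader (ties -> smallest index)
        sums[i] += bandit.pull(i)
        counts[i] += 1
        regret[t] = bandit.gap(i)                   # pseudo-regret of this round
    return np.cumsum(regret)
```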
4. Using a horizon of n = 100, run 1000 simulations of your implementation of Follow-the-Leader on the
Bernoulli bandit above and record the (random) regret, Rn , in each simulation.
5. Plot the results as a histogram and explain what you observe in the figure.
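A possible driver for questions 4 and 5, reusing the sketches above (the helper `simulate` and the matplotlib calls are our own choices):

```python
import matplotlib.pyplot as plt

def simulate(algo, n, n_runs=1000, **kwargs):
    """Stack the cumulative-regret trajectories of n_runs independent runs of `algo`."""
    return np.stack([algo(BernoulliBandit([0.6, 0.5]), n, **kwargs)
                     for _ in range(n_runs)])

# Questions 4-5: distribution of the (random) regret R_n of FTL at n = 100.
traj = simulate(run_ftl, n=100)
plt.hist(traj[:, -1], bins=30)
plt.xlabel("regret at n = 100")
plt.ylabel("number of runs")
plt.show()
```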
6. Rerun the experiment with a horizon of n = 1000, saving the cumulative regret at every t ∈ {1, . . . , n}. Plot the average regret obtained over the 1000 simulations as a function of t.
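For question 6, the same helper gives the average regret curve (again only a sketch):

```python
# Question 6: average cumulative regret of FTL as a function of t, for n = 1000.
n = 1000
traj = simulate(run_ftl, n=n)
plt.plot(np.arange(1, n + 1), traj.mean(axis=0))
plt.xlabel("t")
plt.ylabel("average cumulative regret")
plt.show()
```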
7. Explain the plot. Do you think Follow-the-Leader is a good algorithm? Why/why not?
8. Implement Explore-Then-Commit (ETC).
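A minimal sketch of ETC in the same style (it explores each arm m times in a round-robin fashion, then commits to the empirically best arm; it assumes n ≥ K·m):

```python
def run_etc(bandit, n, m=10):
    """Explore-Then-Commit: pull each arm m times, then commit to the empirical best.

    Returns the cumulative pseudo-regret trajectory as a length-n array.
    """
    counts = np.zeros(bandit.K)
    sums = np.zeros(bandit.K)
    regret = np.zeros(n)
    committed = None
    for t in range(n):
        if t < bandit.K * m:
            i = t % bandit.K                         # exploration: round-robin over the arms
        else:
            if committed is None:                    # commit once, after K*m exploration rounds
                committed = int(np.argmax(sums / counts))
            i = committed
        sums[i] += bandit.pull(i)
        counts[i] += 1
        regret[t] = bandit.gap(i)
    return np.cumsum(regret)
```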
9. Using a horizon of n = 100, run 1000 simulations of ETC for each m ∈ {1, 2, 5, 10, 15, 20, 25, 30, 40} and plot the average regret as a function of m. What is the best choice of m?
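A sketch of the sweep over m for question 9 (reusing the `simulate` helper above):

```python
# Question 9: average final regret of ETC at n = 100 as a function of m.
ms = [1, 2, 5, 10, 15, 20, 25, 30, 40]
avg_final_regret = [simulate(run_etc, n=100, m=m)[:, -1].mean() for m in ms]
plt.plot(ms, avg_final_regret, marker="o")
plt.xlabel("m")
plt.ylabel("average regret at n = 100")
plt.show()
```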
10. For the different choices of m, rerun the experiment of question 7. Is Explore-Then-Commit a good algorithm?
11. We consider the case K = 2 and assume without loss of generality that the first arm is optimal (i.e., $\mu_1 = \mu^*$). We also assume that n ≥ 2m.
(a) Show that
\[
\mathbb{E}[T_i(n)] = m + (n - 2m)\,\mathbb{P}\Big(\hat{\mu}_i(2m) \ge \max_{j \ne i} \hat{\mu}_j(2m)\Big).
\]
(b) Show that
\[
\mathbb{P}\Big(\hat{\mu}_i(2m) - \mu_i - \big(\hat{\mu}_1(2m) - \mu_1\big) \ge \Delta_i\Big) \le \exp\!\Big(-\frac{m \Delta_i^2}{4}\Big).
\]
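A possible route for this bound (a sketch, treating the Bernoulli rewards as 1-sub-Gaussian, which holds because they take values in [0, 1]): after the first 2m rounds each arm has been pulled exactly m times, so $\hat{\mu}_i(2m) - \mu_i$ and $\hat{\mu}_1(2m) - \mu_1$ are independent averages of m centered 1-sub-Gaussian variables, hence their difference is $\sqrt{2/m}$-sub-Gaussian. The sub-Gaussian tail bound $\mathbb{P}(Z \ge \varepsilon) \le \exp(-\varepsilon^2/(2\sigma^2))$ with $\sigma^2 = 2/m$ and $\varepsilon = \Delta_i$ then gives $\exp(-m\Delta_i^2/4)$. Combined with (a) and the inclusion $\{\hat{\mu}_i(2m) \ge \hat{\mu}_1(2m)\} \subseteq \{\hat{\mu}_i(2m) - \mu_i - (\hat{\mu}_1(2m) - \mu_1) \ge \Delta_i\}$, this controls $\mathbb{E}[T_i(n)]$ for the suboptimal arm.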
(e) Deduce that, for a suitable choice of m,
\[
R_n \le \Delta + \frac{C}{\Delta},
\]
where C is a constant up to log factors. Do you recover a value close to the one obtained by the experiments?
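One way to obtain this (a sketch, combining question 1 with (a) and (b) above): these give $R_n \le m\Delta + n\Delta\, e^{-m\Delta^2/4}$, and choosing $m \approx \lceil (4/\Delta^2)\log(n\Delta^2/4) \rceil$ (when this quantity is at least 1) yields $R_n \le \Delta + \frac{4}{\Delta}\big(1 + \log(n\Delta^2/4)\big)$, i.e. a bound of the form $\Delta + C/\Delta$ up to logarithmic factors.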
(f) Remark that the previous bound is unbounded when ∆ tends to zero. Show that the worst-case bound is $R_n = O(\sqrt{n})$ regardless of the value of ∆.
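One standard way to see the worst-case order (a sketch): the regret also always satisfies the trivial bound $R_n \le n\Delta$, so $R_n \le \min\big(n\Delta,\; \Delta + C/\Delta\big)$ up to log factors. The first bound is the better one for small $\Delta$ and the second for large $\Delta$; they cross at $\Delta \approx \sqrt{C/n}$, which gives $R_n = O(\sqrt{Cn}) = O(\sqrt{n})$ up to logarithmic factors, uniformly over $\Delta$.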