
TD – Multi-Armed Bandits

MACHINE LEARNING COURSE, ÉCOLE NORMALE SUPÉRIEURE, OCTOBER 2018

Aude Genevay (aude.genevay@ens.fr)

We consider the following Bernoulli K-armed bandit setting. Let µ ∈ [0, 1]^K. At each time t ≥ 1, a learner
chooses I_t ∈ {1, . . . , K} and receives reward X_t(I_t) ∼ B(µ_{I_t}). In all the experiments, we will consider a
Bernoulli bandit with K = 2 arms and means µ_1 = 0.6 and µ_2 = 0.5. The goal of the learner is to maximize
his cumulative reward \sum_{t=1}^{n} X_t(I_t). To do so, the learner minimizes his cumulative regret, defined as
\[
  R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^{n} X_t(I_t)\right], \quad \text{where } \mu^* = \max_{1 \le i \le K} \mu_i.
\]

We look for algorithms that achieve sublinear regret R_n = o(n), so that the average reward of the learner
tends to the optimal reward µ^*.
1. Prove that the regret satisfies R_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)], where T_i(n) = \sum_{t=1}^{n} \mathbf{1}_{I_t = i} and \Delta_i = \mu^* - \mu_i.

2. Implement a Bernoulli bandit environment in Python.
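
A minimal sketch of such an environment is given below. The class name and interface are illustrative choices, not prescribed by the exercise; only the setting (Bernoulli rewards, K arms, means µ) comes from the text above.

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit with mean vector mu."""

    def __init__(self, mu, rng=None):
        self.mu = np.asarray(mu, dtype=float)   # success probability of each arm
        self.K = len(self.mu)
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, i):
        """Play arm i (0-indexed) and return a Bernoulli(mu[i]) reward."""
        return float(self.rng.random() < self.mu[i])

# The experiments use K = 2 arms with means 0.6 and 0.5.
bandit = BernoulliBandit([0.6, 0.5])
```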

The Follow-the-Leader algorithm (FTL)


The “Follow-the-Leader” algorithm (FTL) chooses each action once and subsequently chooses the action
with the largest average observed so far. Ties should be broken randomly:
\[
  I_t \in \arg\max_{1 \le i \le K} \hat{\mu}_i(t-1), \quad \text{where } \hat{\mu}_i(t-1) := \frac{1}{T_i(t-1)} \sum_{s=1}^{t-1} X_s(I_s) \, \mathbf{1}_{I_s = i}.
\]
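
For reference when answering question 3, here is one possible FTL sketch. It assumes the illustrative `BernoulliBandit` interface sketched above; the function name and return values are not prescribed by the exercise.

```python
import numpy as np

def run_ftl(bandit, n, rng=None):
    """Follow-the-Leader: play each arm once, then always the arm with the best empirical mean."""
    rng = rng if rng is not None else np.random.default_rng()
    counts = np.zeros(bandit.K)   # T_i(t-1): number of pulls of each arm so far
    sums = np.zeros(bandit.K)     # cumulative reward collected from each arm
    rewards = np.zeros(n)
    for t in range(n):
        if t < bandit.K:
            i = t                                                   # initialisation: one pull per arm
        else:
            means = sums / counts
            i = rng.choice(np.flatnonzero(means == means.max()))    # break ties at random
        x = bandit.pull(i)
        counts[i] += 1
        sums[i] += x
        rewards[t] = x
    return rewards, counts
```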

3. Implement FTL.
4. Using a horizon of n = 100, run 1000 simulations of your implementation of Follow-the-Leader on the
Bernoulli bandit above and record the (random) regret, Rn , in each simulation.
5. Plot the results as a histogram and explain the resulting figure (a possible simulation loop is sketched
after this list).
6. Rerun the experiment with horizon n = 1000, saving the regret for each t ∈ {1, . . . , n}. Plot the average
regret over the 1000 simulations as a function of t.
7. Explain the plot. Do you think Follow-the-Leader is a good algorithm? Why/why not?
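
A possible Monte Carlo loop for questions 4 and 5, assuming the illustrative `BernoulliBandit` and `run_ftl` sketches above. Here the regret recorded per run is the realised regret n µ^* − Σ_t X_t(I_t); recording the pseudo-regret Σ_i ∆_i T_i(n) instead is an equally valid choice.

```python
import numpy as np
import matplotlib.pyplot as plt

mu = [0.6, 0.5]
n, n_sims = 100, 1000
regrets = np.zeros(n_sims)
for s in range(n_sims):
    rewards, _ = run_ftl(BernoulliBandit(mu), n)
    regrets[s] = n * max(mu) - rewards.sum()   # realised regret of this run

plt.hist(regrets, bins=30)
plt.xlabel("regret at horizon n = 100")
plt.ylabel("number of simulations")
plt.show()
```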

The Explore-then-Commit algorithm (ETC)


The Explore-then-Commit algorithm starts by exploring all arms m times (i.e., during the first mK rounds)
before choosing the action maximizing \hat{\mu}_i(mK) for the remaining rounds. Formally, it chooses

\[
  I_t =
  \begin{cases}
    i & \text{if } (t \bmod K) + 1 = i \text{ and } t \le mK, \\
    \arg\max_i \hat{\mu}_i(mK) & \text{if } t > mK.
  \end{cases}
  \tag{1}
\]
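
A minimal ETC sketch for question 8, again assuming the illustrative `BernoulliBandit` interface above; names are placeholders.

```python
import numpy as np

def run_etc(bandit, n, m):
    """Explore-then-Commit: round-robin exploration for mK rounds, then commit to the empirical best arm."""
    counts = np.zeros(bandit.K)
    sums = np.zeros(bandit.K)
    rewards = np.zeros(n)
    commit = None
    for t in range(n):
        if t < m * bandit.K:
            i = t % bandit.K                              # explore each arm m times in turn
        else:
            if commit is None:
                commit = int(np.argmax(sums / counts))    # empirical best arm after mK rounds
            i = commit
        x = bandit.pull(i)
        counts[i] += 1
        sums[i] += x
        rewards[t] = x
    return rewards, counts
```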

8. Implement ETC.

9. Using a horizon of n = 100, run 1000 simulations for m ∈ {1, 2, 5, 10, 15, 20, 25, 30, 40} and plot the
average regret as a function of m. What is the best choice of m?
10. For the different choices of m, rerun the experiment of question 7. Is Explore-then-Commit a good
algorithm?
11. We consider the case K = 2 and assume without loss of generality that the first arm is optimal (i.e.,
µ_1 = µ^*). We also assume that n ≥ 2m.
(a) Show that
\[
  \mathbb{E}[T_i(n)] = m + (n - 2m) \, \mathbb{P}\left( \hat{\mu}_i(2m) \ge \max_{j \ne i} \hat{\mu}_j(2m) \right).
\]

(b) Show that
\[
  \mathbb{P}\left( \hat{\mu}_i(2m) \ge \max_{j \ne i} \hat{\mu}_j(2m) \right) \le \mathbb{P}\left( \hat{\mu}_i(2m) - \mu_i - (\hat{\mu}_1(2m) - \mu_1) \ge \Delta_i \right).
\]
(c) Using Chernoff's inequality, prove that
\[
  \mathbb{P}\left( \hat{\mu}_i(2m) - \mu_i - (\hat{\mu}_1(2m) - \mu_1) \ge \Delta_i \right) \le \exp\left( -\frac{m \Delta_i^2}{4} \right).
\]

(d) Conclude that
\[
  R_n \le m \Delta_2 + n \Delta_2 \exp\left( -\frac{m \Delta_2^2}{4} \right).
\]
(e) Optimize the bound in m, assuming n is large, and show a bound of the form
\[
  R_n \le \Delta_2 + \frac{C}{\Delta_2},
\]
where C is a constant up to log factors. Do you recover a value close to the one obtained in the
experiments?
(f) Remark that the previous bound is unbounded when ∆_2 tends to zero. Show that the worst-case
bound is R_n = O(\sqrt{n}) regardless of the value of ∆_2.

The Upper-Confidence-Bound algorithm (UCB)


A drawback of ETC is that the optimal value of m depends on ∆, which is unknown in advance. Furthermore,
all arms are sampled the same number of times during the exploration stage, whereas one would rather
sample more often the arms that are close to optimal and quickly detect and stop exploring very bad arms.
These drawbacks are addressed by the Upper-Confidence-Bound algorithm, which assigns to each arm a value,
called the upper confidence bound, that with high probability is an upper bound on the unknown mean of
the arm. UCB chooses the action
\[
  I_t \in \arg\max_{1 \le i \le K} \left( \hat{\mu}_i(t-1) + \sqrt{\frac{4 \log n}{T_i(t-1)}} \right).
\]
It is possible to show that
\[
  R_n \le 3 \sum_{i=1}^{K} \Delta_i + \sum_{i : \Delta_i > 0} \frac{16 \log n}{\Delta_i}.
\]
In the worst case, this also implies R_n = O(\sqrt{n}) up to log factors.
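
For question 12, a minimal UCB sketch implementing the index above, again assuming the illustrative `BernoulliBandit` interface; names are placeholders.

```python
import numpy as np

def run_ucb(bandit, n):
    """UCB: play each arm once, then the arm maximising the upper confidence bound above."""
    counts = np.zeros(bandit.K)
    sums = np.zeros(bandit.K)
    rewards = np.zeros(n)
    for t in range(n):
        if t < bandit.K:
            i = t                                                    # initialisation: one pull per arm
        else:
            ucb = sums / counts + np.sqrt(4 * np.log(n) / counts)    # index of each arm
            i = int(np.argmax(ucb))
        x = bandit.pull(i)
        counts[i] += 1
        sums[i] += x
        rewards[t] = x
    return rewards, counts
```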
12. Implement UCB.
13. Rerun the experiment of question 7 for UCB and compare the cumulative regret of UCB with those
obtained by ETC (for different values of m) and FTL (a possible comparison loop is sketched below).
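
One possible way to produce the comparison of question 13, assuming the `BernoulliBandit`, `run_ftl`, `run_etc` and `run_ucb` sketches above; the helper name and the choice m = 10 are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def average_regret_curve(algo, mu, n, n_sims, **kwargs):
    """Realised regret t*mu_star - sum_{s<=t} X_s as a function of t, averaged over n_sims runs."""
    mu_star = max(mu)
    curves = np.zeros((n_sims, n))
    for s in range(n_sims):
        rewards, _ = algo(BernoulliBandit(mu), n, **kwargs)
        curves[s] = mu_star * np.arange(1, n + 1) - np.cumsum(rewards)
    return curves.mean(axis=0)

mu, n, n_sims = [0.6, 0.5], 1000, 1000
for label, algo, kwargs in [("FTL", run_ftl, {}), ("ETC (m = 10)", run_etc, {"m": 10}), ("UCB", run_ucb, {})]:
    plt.plot(average_regret_curve(algo, mu, n, n_sims, **kwargs), label=label)
plt.xlabel("t")
plt.ylabel("average regret")
plt.legend()
plt.show()
```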
