
Bandit Algorithms

Multi-arm bandits are one of the fundamental problems in reinforcement learning. The
problem can be described as follows: there are multiple slot machines (arms), each
with a different reward distribution, and the objective is to find the machine with the
highest expected reward by playing them sequentially. The problem arises in many
real-world applications, such as advertising, healthcare, and finance, where an agent
must repeatedly choose the best option from a set of alternatives, and it has been
studied extensively in the literature.

Types of Multi-Arm Bandits


There are two main types of multi-arm bandits: stochastic and adversarial. In
stochastic multi-arm bandits, the rewards of the arms are generated from a fixed
probability distribution, which is unknown to the agent. In adversarial multi-arm
bandits, the rewards of the arms are chosen by an adversary, who tries to make the
problem more challenging for the agent. The adversary can be either deterministic or
randomized.

Algorithms for Multi-Arm Bandits


There are many algorithms for solving the multi-arm bandit problem, each with its
own strengths and weaknesses. The best-known algorithms are epsilon-greedy,
UCB1, and Thompson Sampling.

Epsilon-Greedy: Epsilon-greedy is a simple algorithm that selects the arm with the
highest estimated reward with probability 1-epsilon and selects a random arm with
probability epsilon. The value of epsilon is usually chosen to balance exploration and
exploitation. If epsilon is set to zero, the algorithm always selects the arm with the
highest estimated reward, which can lead to suboptimal solutions if the estimates are
inaccurate. On the other hand, if epsilon is set to one, the algorithm always selects a
random arm, which amounts to pure exploration and never exploits what has been
learned.
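
For illustration, a minimal epsilon-greedy agent in Python might look like the
following sketch; the class name, the default epsilon of 0.1, and the incremental
running-mean update are illustrative choices, not prescribed by the text above.

import random

class EpsilonGreedy:
    """Minimal epsilon-greedy agent; arms are indexed 0..n_arms-1."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # times each arm has been played
        self.values = [0.0] * n_arms    # running mean reward per arm

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental update of the sample-mean estimate for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

Each round, the caller would call select_arm(), play that arm, observe a reward,
and feed it back through update().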

UCB: UCB1 (Upper Confidence Bound) is an algorithm that balances exploration and
exploitation by selecting the arm with the highest upper confidence bound on its
reward. The bound consists of two terms: the estimated mean reward and an
exploration bonus. The bonus is proportional to the square root of the logarithm of
the total number of plays divided by the number of times the arm has been played, so
it shrinks as an arm is sampled more often. UCB1 has been shown to have strong
performance guarantees in stochastic environments; its guarantees do not extend
directly to the adversarial setting.
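
A sketch of UCB1 along these lines, assuming rewards scaled to [0, 1] and using the
standard sqrt(2 ln t / n) exploration bonus, could look like this:

import math

class UCB1:
    """Minimal UCB1 agent: play every arm once, then pick the arm with the
    highest mean-plus-bonus score."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # times each arm has been played
        self.values = [0.0] * n_arms    # running mean reward per arm
        self.total_plays = 0

    def select_arm(self):
        # First, play any arm that has not been tried yet.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Otherwise pick the arm maximizing mean + sqrt(2 ln t / n_arm).
        def score(arm):
            bonus = math.sqrt(2 * math.log(self.total_plays) / self.counts[arm])
            return self.values[arm] + bonus
        return max(range(len(self.counts)), key=score)

    def update(self, arm, reward):
        self.total_plays += 1
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]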

Thompson Sampling: Thompson Sampling is a Bayesian algorithm that updates its
beliefs about the reward distribution of each arm after each play. On every round, the
algorithm samples a plausible mean reward from each arm's posterior distribution
and selects the arm with the highest sample. Thompson Sampling has been shown to
perform well in stochastic environments, but its behavior in adversarial environments
is less well understood.
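
For Bernoulli (0/1) rewards with a Beta prior on each arm's success probability, a
Thompson Sampling sketch might look like the following; the uniform Beta(1, 1)
prior is an illustrative choice.

import random

class ThompsonSamplingBernoulli:
    """Thompson Sampling for 0/1 rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms):
        self.successes = [1] * n_arms   # Beta alpha parameters
        self.failures = [1] * n_arms    # Beta beta parameters

    def select_arm(self):
        # Sample a plausible success probability for each arm from its
        # posterior and play the arm with the highest sample.
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=lambda a: samples[a])

    def update(self, arm, reward):
        # Bayesian update: a reward of 1 counts as a success, 0 as a failure.
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1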

Extensions of Multi-Arm Bandits

There are many extensions of the multi-arm bandit problem, such as contextual
multi-arm bandits, collaborative filtering, and dynamic pricing. In contextual multi-
arm bandits, the rewards of the arms depend not only on the arm played but also on a
context vector that is observed by the agent. In collaborative filtering, the agent has to
recommend items to users based on their preferences. In dynamic pricing, the agent
has to set the price of a product to maximize revenue.
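
As a rough illustration of the contextual setting, the sketch below keeps one linear
reward model per arm and applies the same epsilon-greedy rule to the models'
predictions. The learning rate, epsilon, and gradient update are illustrative choices,
not a standard contextual-bandit algorithm.

import random

class ContextualEpsilonGreedy:
    """Toy contextual bandit: one linear reward model per arm, updated by a
    small gradient step on squared error after each play."""

    def __init__(self, n_arms, n_features, epsilon=0.1, lr=0.05):
        self.epsilon = epsilon
        self.lr = lr
        self.weights = [[0.0] * n_features for _ in range(n_arms)]

    def predict(self, arm, context):
        # Predicted reward of an arm for the observed context vector.
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def select_arm(self, context):
        if random.random() < self.epsilon:
            return random.randrange(len(self.weights))
        return max(range(len(self.weights)),
                   key=lambda a: self.predict(a, context))

    def update(self, arm, context, reward):
        # One gradient step on squared error for the played arm's model.
        error = reward - self.predict(arm, context)
        self.weights[arm] = [w + self.lr * error * x
                             for w, x in zip(self.weights[arm], context)]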
