Bandit Algorithms
Multi-armed bandits are one of the fundamental problems in reinforcement learning. The
problem can be described as follows: there are multiple slot machines (arms), each
with a different reward distribution, and the objective is to identify the machine with the
highest expected reward by playing them sequentially. Multi-armed bandits are widely
studied in the literature and arise in many real-world applications, such as advertising,
healthcare, and finance, where an agent must repeatedly choose the best option from a
set of alternatives.
Epsilon-Greedy: Epsilon-greedy is a simple algorithm that selects the arm with the
highest estimated reward with probability 1 - epsilon and selects a random arm with
probability epsilon. The value of epsilon is chosen to balance exploration and
exploitation. If epsilon is set to zero, the algorithm always selects the arm with the
highest estimated reward, which can lead to suboptimal choices if the estimates are
inaccurate. On the other hand, if epsilon is set to one, the algorithm always selects a
random arm, so it never exploits what it has learned; in practice, a small constant or a
schedule that decays epsilon over time is common.
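As an illustration, here is a minimal Python sketch of epsilon-greedy on simulated Bernoulli arms. The pull function and its reward probabilities are hypothetical stand-ins for a real environment:

import random

def pull(arm, means=(0.2, 0.5, 0.7)):
    # Hypothetical simulator: Bernoulli reward with a fixed mean per arm.
    return 1.0 if random.random() < means[arm] else 0.0

def epsilon_greedy(n_arms=3, epsilon=0.1, n_steps=10_000):
    counts = [0] * n_arms          # number of times each arm was played
    estimates = [0.0] * n_arms     # running mean reward of each arm
    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates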
UCB: UCB (Upper Confidence Bound) is an algorithm that balances exploration and
exploitation by selecting the arm with the highest upper confidence bound on its
expected reward. The UCB score consists of two terms: the estimated reward and a
confidence bonus. In the classic UCB1 variant, the bonus is proportional to the square
root of the logarithm of the total number of plays divided by the number of times the
arm itself has been played, so rarely tried arms receive a larger bonus. UCB1 has been
shown to achieve logarithmic regret in stochastic environments; adversarial settings
are typically handled by related algorithms such as EXP3.
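The same simulated environment can be reused for a sketch of UCB1, where each arm's score is its estimated reward plus the bonus sqrt(2 ln t / n_a) described above; the pull simulator is again hypothetical:

import math
import random

def pull(arm, means=(0.2, 0.5, 0.7)):
    # Hypothetical simulator: Bernoulli reward with a fixed mean per arm.
    return 1.0 if random.random() < means[arm] else 0.0

def ucb1(n_arms=3, n_steps=10_000):
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once so every count is nonzero
        else:
            # Estimated reward plus confidence bonus favors under-played arms.
            arm = max(range(n_arms),
                      key=lambda a: estimates[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates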
Thompson Sampling: Thompson Sampling is a Bayesian algorithm that maintains a
posterior distribution over each arm's expected reward and updates its beliefs about
the reward distribution of each arm after each play. At each step, the algorithm
samples a value from each arm's posterior and selects the arm with the highest
sample. Thompson Sampling has been shown to have good empirical performance
and strong regret guarantees, often competitive with UCB in practice.
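For Bernoulli rewards, Thompson Sampling takes a particularly clean form with Beta posteriors. The sketch below assumes that conjugate setup and reuses the same hypothetical pull simulator:

import random

def pull(arm, means=(0.2, 0.5, 0.7)):
    # Hypothetical simulator: Bernoulli reward with a fixed mean per arm.
    return 1.0 if random.random() < means[arm] else 0.0

def thompson_sampling(n_arms=3, n_steps=10_000):
    # Beta(1, 1) prior per arm: alpha counts successes, beta counts failures.
    alpha = [1] * n_arms
    beta = [1] * n_arms
    for _ in range(n_steps):
        # Sample a plausible mean from each arm's posterior, play the best one.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if pull(arm) == 1.0:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    # Posterior mean reward estimate for each arm.
    return [a / (a + b) for a, b in zip(alpha, beta)]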
There are many extensions of the multi-armed bandit problem, such as contextual
bandits, collaborative filtering, and dynamic pricing. In contextual multi-armed bandits,
the rewards of the arms depend not only on the arm played but also on a context
vector that is observed by the agent. In collaborative filtering, the agent has to
recommend items to users based on their preferences. In dynamic pricing, the agent
has to set prices over time, learning how demand responds in order to maximize
revenue.
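As one concrete example of the contextual setting, the sketch below follows the disjoint LinUCB approach (a linear reward model per arm with an uncertainty bonus); the dimension, the alpha parameter, and the choose/update split here are illustrative assumptions, not a prescription:

import numpy as np

d, n_arms, alpha = 5, 3, 1.0
A = [np.eye(d) for _ in range(n_arms)]    # per-arm ridge statistics: I + sum of x x^T
b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward-weighted context sums

def choose(x):
    # Score each arm: predicted reward for context x plus an uncertainty bonus.
    scores = []
    for A_a, b_a in zip(A, b):
        A_inv = np.linalg.inv(A_a)
        theta = A_inv @ b_a  # ridge-regression coefficient estimate for this arm
        scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

def update(arm, x, reward):
    # Fold the observed (context, reward) pair into the played arm's statistics.
    A[arm] += np.outer(x, x)
    b[arm] += reward * x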