1 Introduction
Poker-playing computer bots can be divided into two categories. On the one hand, there are the game-theoretic bots, which play according to a strategy that gives rise to an approximate Nash equilibrium. These bots are difficult, or even impossible, to beat, but they are also unable to exploit possible non-optimalities in their opponents. Game-theoretic strategy development also becomes very complex as the number of opponents increases, so most bots that try to play Nash equilibria do so in a heads-up setting, i.e., with only one opponent. On the other hand, there is the exploiting bot, which employs game tree search and opponent modeling techniques to discover and exploit possible weaknesses in its opponents. In this paper, we focus on the second type of bot.
Research in computer Poker has mainly dealt with the limit variant of Texas Hold'em. In limit Poker, bet sizes are fixed, as is the total number of times a player can raise the size of the pot. Only recently [1] has attention started shifting to the no-limit variant and the increased complexity that it brings. In fact, all exploiting bots developed so far deal with the heads-up limit version of the game. The limit game tree consists of about $10^{18}$ nodes, and modern bots are capable of traversing the complete tree in search of optimized actions through a full-width depth-first search [2]. Gilpin et al. [1] estimate the game tree in heads-up no-limit Texas Hold'em (with bet-size discretization) to reach a node count of about $10^{71}$.
2 Texas Hold'em Poker
Texas Hold'em Poker is played by two or more people, who each hold two hidden cards. The game starts with a betting round in which players can invest money in a shared pot by placing a bet. Alternatively, they can check and let the next player make the first bet. Players must match every bet made by investing an equal amount (calling) or by investing even more (raising). When a player fails to do so and gives up, he folds. The betting round ends when everybody has folded or called. To make sure there is something to play for in every game, the first two players in the first round are forced to place bets, called the small and big blinds. The big blind is double the amount of chips of the small blind.
Next, a set of three community cards (called the Flop) is dealt to the table,
followed by a new betting round. These cards are visible to every player and
the goal is to combine them with the two hidden cards to form a combination
of cards (also called a hand). One card is added to the set of community cards
twice more (the first one called the Turn and the last called the River) to make
a total of five, each time initiating a new betting round. Finally, the player with
the best hand that hasn’t folded wins the game and takes the pot.
From a game tree perspective, this gives rise to the following node types:
Leaf nodes: Evaluation of the expected value of the game in these nodes is usually trivial. In Poker, we choose to employ the expected sum of money a player has after the game is resolved. In case the player loses, this is the amount of money he didn't bet. In case the player wins, it is that amount plus the current pot, with a possible split of the pot if multiple players hold hands of equal strength. Leaves can also take probability distributions over currently unknown cards into account.
Decision nodes: These are the nodes where the bot itself is in control of the
game. In non-deterministic, fully observable games, these would be the nodes
where the child that maximizes the expected outcome is selected. While in
Poker, it can be beneficial to not always select the optimal action and remain
somewhat unpredictable, they should still be treated as maximization nodes
during the game tree search.
Chance nodes: These nodes are comparable to those encountered in expectimax games. In Poker, they occur when new cards are dealt to the table. In
these nodes, children are selected according to the probability distribution
connected to the stochastic process steering the game at these points.
Opponent nodes: In complete-information games, the opponent will try to minimize the value in the nodes where he can make a choice. In Poker, however, we do not know the evaluation function of the opponent, which will typically be different because the opponent has different information; neither does the opponent know our evaluation function. Instead of treating the opponent nodes as min-nodes, one can therefore consider these nodes as chance nodes. The big difference is that the probability distribution in opponent nodes is unknown and often not static. The probability distribution over the different options the opponent can select from can be represented (and learned) as an opponent model.
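To make the tree structure concrete, the following sketch (ours, not the paper's implementation; the class and field names are invented) captures the four node types in Python:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)  # action -> child

@dataclass
class LeafNode(Node):
    payoff: float = 0.0  # expected money after the game is resolved

@dataclass
class DecisionNode(Node):
    pass  # our bot acts here: treated as a maximization node during search

@dataclass
class ChanceNode(Node):
    # known probability distribution over the cards that can be dealt
    probs: Dict[str, float] = field(default_factory=dict)

@dataclass
class OpponentNode(Node):
    # unknown, often non-static distribution, estimated by the opponent model
    model_probs: Dict[str, float] = field(default_factory=dict)
```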
Together, these node types make up a miximax game tree that models the expected value of each game state. Because this tree is far too large to construct explicitly, we can only approximate the expected values using an incomplete search procedure such as MCTS. To steer the search through the opponent nodes, we use an opponent model consisting of two probability distributions:
1. $P(A_i \mid A_0 \ldots A_{i-1}, C_0 \ldots C_i)$: This model predicts the actions of all players, taking previous actions and dealt community cards into account.
2. $P(H \mid A_0 \ldots A_n, C_0 \ldots C_n)$: This model predicts the hand cards of all active players at showdown (represented as step $n$).
The first probability can be easily estimated using observations made during the
game, as all relevant variables can be observed. This is only possible because we
leave hand cards out of the first model. Predicting the exact amount of a bet or
a raise is difficult. We decided to deal with this by treating minimal bets and
all-in bets (i.e. when a player bets his entire stack of money) as separate cases
and discretizing the other values.
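For illustration, a discretization along these lines might look as follows (a sketch; the bin boundaries and labels are our own assumptions, only the special-casing of minimal and all-in bets comes from the text):

```python
def discretize_bet(amount: float, min_bet: float, stack: float, pot: float) -> str:
    """Map a raw bet amount to a discrete action label.

    Minimal bets and all-in bets are treated as separate cases;
    the pot-relative bin boundaries below are illustrative assumptions.
    """
    if amount >= stack:
        return "all-in"
    if amount <= min_bet:
        return "min-bet"
    ratio = amount / pot          # discretize relative to the pot size
    if ratio < 0.75:
        return "small-bet"
    if ratio < 1.5:
        return "pot-bet"
    return "big-bet"
```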
In the end, we need to know who won the game and therefore we need to model the hand cards with the second probability. This one can be harder to model, as "mucking" (i.e., throwing away cards without showing them to the other players) can hide information about the cards of players in a losing position. To abstract away from the high number of possible hand-card combinations, we reduce the possible outcomes to the rank of the best hand that the cards of each active player result in [5], and again discretize these ranks into a number of bins. This approach has the advantage that mucked hands still hold some information and can be attributed to bins with ranks lower than that of the current winning hand.
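A sketch of such a binning scheme (assuming a hand evaluator that returns a normalized strength in [0, 1), higher being better; the uniform attribution of mucked hands to lower bins is our simplification):

```python
import random

NUM_BINS = 10  # illustrative; the paper only says "a number of bins"

def rank_bin(strength: float) -> int:
    """Map a normalized hand strength in [0, 1) to a bin index."""
    return min(int(strength * NUM_BINS), NUM_BINS - 1)

def showdown_bins(shown: dict, mucked: list, winner_strength: float) -> dict:
    """Assign a rank bin to every active player at showdown.

    Players who showed their cards get their exact bin. A mucked hand
    still carries information: it must rank below the winning hand, so
    we attribute it (here: uniformly) to one of the lower bins.
    """
    bins = {player: rank_bin(s) for player, s in shown.items()}
    top = rank_bin(winner_strength)
    for player in mucked:
        bins[player] = random.randint(0, max(top - 1, 0))
    return bins
```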
The actual models are learned using the Weka toolkit [6] from a dataset of a million games played in an online casino. Each action in the dataset is an instance to learn the first model from; each showdown is used to learn the second model. The game state at the time of an action or showdown is represented by a number of attributes. These attributes include information about the current game, such as the round, the pot and the stack size of the player involved. They also include statistics of the player from previous games, such as his bet and fold frequencies for each round, and information about his opponents, such as the average raise frequency of the active players at the table. With these attributes, Weka's M5P algorithm [7] was used to learn a regression tree that predicts the probabilities of each action, or of each hand-rank bin at showdown.
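One training instance could then look as follows (illustrative only; the attribute names and values are invented, the paper only names the kinds of attributes used):

```python
# One training instance for the action model (hypothetical attribute names).
instance = {
    # current game state
    "round": "flop",
    "pot": 24.0,                   # in small bets
    "stack": 176.0,                # stack of the acting player
    # windowed statistics of the acting player over recent games
    "bet_freq_flop": 0.31,
    "fold_freq_preflop": 0.55,
    # statistics about the opposition
    "avg_raise_freq_active": 0.18,
    # target: the observed (discretized) action
    "action": "pot-bet",
}
```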
The use of features that represent windowing statistics about a player's previous games (statistics kept over a limited number of past games, so that they adapt to changing game circumstances) in a model learned on a non-player-specific data set has the advantage of adapting the opponent model to the current opponent(s). It is important to note that there are usually not enough examples from a single player to learn anything significant. We circumvent this problem by learning what is common to all players from a large database of games. At the same time, to exploit player-specific behavior, we learn a function from the game situation and player-specific statistics to a prediction of the player's action. A key issue here is to select player-specific statistics that are easy to collect from only a few observed games, while capturing as much as possible about the player's behavior.
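A minimal sketch of such a windowed statistic (our own implementation; the window size of 100 games is an arbitrary example):

```python
from collections import deque

class WindowedStat:
    """Frequency of an event over the last `window` observed games."""

    def __init__(self, window: int = 100):
        self.events = deque(maxlen=window)  # old games fall out automatically

    def record(self, happened: bool) -> None:
        self.events.append(happened)

    @property
    def frequency(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

# e.g. one WindowedStat per (player, round) pair for fold frequency
fold_freq = WindowedStat(window=100)
fold_freq.record(True)
fold_freq.record(False)
print(fold_freq.frequency)  # 0.5
```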
Monte-Carlo Tree Search (MCTS) incrementally builds a game tree in memory, estimating the expected values of the nodes it stores. Each iteration consists of four steps:

Selection: Starting from the root, the algorithm selects in each stored node the branch it wants to explore further, until it reaches a stored leaf (this is not necessarily a leaf of the game tree). The selection strategy is a parameter of the MCTS approach.
Expansion: One (or more) leaves are added to the stored tree as child(ren) of the leaf reached in the previous step.
Simulation: A sample game starting from the added leaf is played (using a simple and fast game-playing strategy) until conclusion. The value of the reached result (i.e., of the reached game tree leaf) is recorded. MCTS does not require an evaluation heuristic, as each game is simulated to its conclusion.
Backpropagation: The estimate of the expected value $V^*(P) = E[r(P)]$ (and the selection counter $T(P)$) of each stored node $P$ on the explored path is updated according to the recorded result. The backpropagation strategy is also a parameter of the MCTS approach.
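In outline, one iteration of these four steps could look as follows (a generic sketch, not the paper's implementation; the selection, expansion and rollout strategies are passed in as callbacks, and the running-mean update is just one possible backpropagation strategy):

```python
import random

class TreeNode:
    def __init__(self, state):
        self.state = state      # opaque game state
        self.children = []
        self.visits = 0         # T(P)
        self.value = 0.0        # running estimate of the expected value

def mcts_iteration(root, select_child, expand, rollout):
    """One MCTS iteration: selection, expansion, simulation, backpropagation.

    `select_child(node)` implements the selection strategy (UCT, UCT+, ...);
    `expand(node)` returns the new children of a stored leaf;
    `rollout(node)` plays a fast game to conclusion and returns its value.
    """
    # 1. Selection: walk down the stored tree to a stored leaf.
    path, node = [root], root
    while node.children:
        node = select_child(node)
        path.append(node)

    # 2. Expansion: add the leaf's children to the stored tree.
    node.children = expand(node)
    if node.children:
        node = random.choice(node.children)
        path.append(node)

    # 3. Simulation: play a sample game to conclusion (no heuristic needed).
    reward = rollout(node)

    # 4. Backpropagation: update counters and running means along the path.
    for n in path:
        n.visits += 1
        n.value += (reward - n.value) / n.visits
```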
The CrazyStone selection strategy [8] chooses options according to the probability that they will return a better value than the child with the highest expected value. Besides the estimated expected value $E[r(P)]$, each node $P$ also stores the standard deviation $\sigma_{\hat V,P}$ of the estimate of the expected value.
(Regret in a selection task is the difference in cumulative return compared to the return that could be attained using the optimal strategy.)
Each child $c_i$ is thus selected with a probability that approximates $P(V^*(c_i) > V^*(c_{best}))$, where $c_{best}$ is the option with the highest expected value. Under the assumption that the values follow a Gaussian distribution, this target probability can be approximated from the stored means and standard deviations.
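Under this Gaussian assumption, the probability that a child beats the current best child follows from the difference of two independent normal variables; a sketch:

```python
from math import erf, sqrt

def prob_better(mu_i, sigma_i, mu_best, sigma_best):
    """P(value of child i exceeds value of the current best child),
    assuming both value estimates are independent Gaussians."""
    denom = sqrt(sigma_i ** 2 + sigma_best ** 2)
    if denom == 0.0:
        return float(mu_i > mu_best)
    z = (mu_i - mu_best) / denom
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
```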
The CrazyStone selection strategy is only applicable to deterministic minimax trees. Non-deterministic nodes cause each node to have a probability distribution over expected rewards that does not converge to a single value. This prevents the standard deviation $\sigma_{\hat V,P}$ from converging to 0 and causes the number of samples spent on non-optimal options to remain non-negligible.
For both chance and opponent nodes, the expected values of the child nodes are given a weight proportional to the number of samples used:

$$\hat V(P) = \frac{\sum_{i} T(c_i)\, \hat V(c_i)}{\sum_{i} T(c_i)} \qquad (3)$$

This is simply the average of all values sampled through the current node.
In decision nodes, we have a choice of how to propagate expected values. The only requirement is that the propagated value converges to the maximal expected value as the number of samples rises. Two simple strategies present themselves:
– Use the average sampling result, as in the mixed nodes above: with a suitable selection strategy, for example UCT, the majority of samples will eventually stem from the child with the highest expected value, and this forces Equation (3) to converge to the intended value. Before convergence, this strategy will usually underestimate the true expected value, as results from non-optimal children are also included.
– Use the maximum of the estimates of the expected values of all child nodes: as, in the limit, each estimate $\hat V(c_i)$ converges to the true value $V^*(c_i)$, the maximum of all estimates will also converge to the correct maximum. This approach will usually overestimate the true expected value, as any noise present on the sampling means is also maximized.
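The opposite biases of these two strategies are easy to see in a toy simulation (a sketch with invented numbers; for simplicity both children get equal sample counts, whereas a real selection strategy would focus samples on the better child):

```python
import random

def estimate(n_samples, true_values=(0.0, 1.0), noise=1.0):
    """Return (average-based, max-based) estimates of max(true_values)."""
    samples = [[v + random.gauss(0, noise) for _ in range(n_samples)]
               for v in true_values]
    means = [sum(s) / len(s) for s in samples]
    avg_all = sum(sum(s) for s in samples) / (2 * n_samples)
    return avg_all, max(means)

# With few samples, the first estimate is biased low (it mixes in the
# weaker child), the second biased high (it maximizes over noise).
print(estimate(10))
```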
As an alternative selection strategy, we propose to select the child $c_i$ that maximizes

$$\hat V(c_i) + C\, \sigma_{\hat V,c_i} \qquad (5)$$

where $C$ is a constant and $\sigma_{\hat V,c_i}$ is the standard error on $\hat V(c_i)$. We call this strategy UCT+. Here, the first term is again the exploitation term and the second term an exploration term. In contrast to Equation (1), it takes the uncertainty based on the actually observed samples of the child node into account.
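A sketch of this selection rule next to standard UCT (the node fields `value`, `visits` and `stderr` are placeholder names, and every child is assumed to have been visited at least once; the UCB1-style formula for plain UCT is the standard one from [3, 12]):

```python
from math import log, sqrt

def uct_score(child, parent_visits, C):
    # standard UCT: exploitation + visit-count based exploration
    return child.value + C * sqrt(log(parent_visits) / child.visits)

def uct_plus_score(child, C):
    # UCT+: the exploration term uses the standard error of the child's
    # value estimate instead of visit counts
    return child.value + C * child.stderr

def select_child(node, C, plus=True):
    return max(node.children,
               key=lambda c: uct_plus_score(c, C) if plus
               else uct_score(c, node.visits, C))
```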
Let $P$ be a node of the search tree. We denote by $V^*(P)$ the value of $P$, i.e., the expected value of the gain of the player under perfect play. Unfortunately, we do not yet know these values, nor the optimal strategies that achieve them. Assume therefore that we have estimates $\hat V(P)$ of the $V^*(P)$. It is important to know how accurate our estimates are. Therefore, let

$$\sigma_{\hat V,P}^{2} = E\bigl[(\hat V(P) - V^*(P))^2\bigr].$$
For example, if $P$ is a leaf node, and if we have $J$ samples $x_1 \ldots x_J$ of the value of $P$, then we can estimate

$$\hat V(P) = \frac{1}{J} \sum_{j=1}^{J} x_j.$$
Now let $c_1 \ldots c_n$ be the children of an internal node $P$, and define the estimation errors

$$e_i = \hat V(c_i) - V^*(c_i).$$

If the estimates $\hat V(c_i)$ are obtained by independent sampling, then the errors $e_i$ are independent and we can write

$$\hat V(P) = \int_{e_1 \ldots e_n} \Bigl( \max_{i=1}^{n} \bigl( \hat V(c_i) - e_i \bigr) \Bigr) \prod_{i=1}^{n} P(e_i)\, de_1 \ldots de_n$$
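For larger $n$ this integral is inconvenient to evaluate in closed form, but assuming, for the sake of the sketch, Gaussian error distributions $e_i \sim \mathcal{N}(0, \sigma_{\hat V,c_i}^2)$, it can be approximated by straightforward sampling:

```python
import random

def max_distribution_value(v_hat, sigma, n_samples=10_000):
    """Monte-Carlo estimate of E[max_i (v_hat[i] - e_i)], where the
    errors e_i ~ N(0, sigma[i]^2) are independent (Gaussian assumption)."""
    total = 0.0
    for _ in range(n_samples):
        total += max(v - random.gauss(0.0, s) for v, s in zip(v_hat, sigma))
    return total / n_samples

# e.g. two children with equal estimates but different uncertainty
print(max_distribution_value([1.0, 1.0], [0.1, 2.0]))
```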
For chance and opponent nodes with child probabilities $p_i$, we propagate

$$\hat V(P) = \sum_{i=1}^{n} p_i\, \hat V(c_i)$$

and

$$\sigma_{\hat V,P}^{2} = \sum_{i=1}^{n} p_i\, \sigma_{\hat V,c_i}^{2}.$$
Both the total expected value and its standard error are weighted by $p_i$. In chance nodes, $p_i$ describes the stochastic process; in opponent nodes, the probability is provided by the opponent model.
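In code, this mixture propagation is only a few lines (a sketch; `p` comes from the card probabilities in chance nodes and from the opponent model in opponent nodes):

```python
def propagate_mix_node(p, v_hat, var_hat):
    """Weighted propagation for chance and opponent nodes: the value and
    the error variance of the children are both weighted by p_i."""
    value = sum(pi * vi for pi, vi in zip(p, v_hat))
    variance = sum(pi * si for pi, si in zip(p, var_hat))
    return value, variance
```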
5 Experiments
5.1 Setup
Ideally, we would let our bots play against strong existing bots. Unfortunately, the few no-limit bots that exist are not freely available. Such a comparison would also not be entirely fair, as existing bots only play heads-up poker, whereas our bot is more generally applicable. Because ours is the first exploiting no-limit bot, a comparison of exploitative behaviour is likewise impossible.
We implemented four poker-playing bots to experiment with. StdBot and NewBot both use the opponent model described in Section 2; RuleBotHandOnly and RuleBotBestHand do not use any opponent model. In each experiment, we let the bots (with certain parameter settings) play a large number of games against each other. Each player starts every game with a stack of 200 small bets (sb). The constant $C$ in Equation (1) is always equal to 50 000. To reduce the variance in the experiments, players are dealt each other's cards in consecutive games. We report the average gain of a bot per game in small bets.
[Figure: average profit [sb/game] of NewBot, RuleBotBestHand and RuleBotHandOnly.]
To test the new backpropagation algorithm, we let NewBot, which uses the maximum-distribution backpropagation strategy, play against StdBot, which uses the standard backpropagation algorithm, in which a sample-weighted average of the expected values of the children is propagated. To avoid overestimation, the maximum distribution only replaces the standard algorithm once a node has been sampled 200 times. Both bots were allowed 1 second of computation time for each decision and were otherwise configured with the same settings. Figure 2 shows the average profit of both bots.

[Figure 2: average profit [sb/game] of NewBot and StdBot over the number of games played.]

The backpropagation algorithm we propose is significantly superior in this experiment. It wins by 3 small bets per game, which would be a very large profit among human players. Using other settings for the UCT selection algorithm in other experiments, the results were not always definitive, and sometimes the proposed strategy even performed slightly worse than the standard algorithm. Therefore, it is hard to draw general conclusions for question Q2 based on these experiments. Further parameter tuning for both approaches is necessary before any strong claims are possible either way.
Fig. 3. Average profit of NewBot against StdBot depending on the calculation time.

In a third experiment, we let a NewBot using Equation (5) for sampling, with $C$ equal to 2, play against a NewBot using standard UCT. Both bots are allowed to sample the game tree 25 000 times. All else being equal, the new UCT+ sample selection strategy clearly outperforms the standard UCT (Figure 4). As for Q2, not all parameter settings gave conclusive results and more parameter tuning is needed.

Fig. 4. Average profit of NewBot using UCT+ against NewBot using standard UCT.
6 Conclusions
We introduced MCTS in the context of Texas Hold'em Poker and the miximax game trees it gives rise to. Using a new backpropagation strategy that explicitly models sample distributions, and a new sample selection strategy, we developed the first exploiting multi-player bots for no-limit Texas Hold'em. In a number of
experimental evaluations, we studied some of the conditions that allow the new strategies to result in a measurable advantage. An increase in available computation time translates into additional profit for the new approach. We expect a similar trend as the complexity of the opponent model surpasses the complexity of the backpropagation algorithm. In future work, we want to explore several directions. First, we want to better integrate the opponent model with the search. Second, we want to find a good way to treat the opponent nodes both as minimization nodes and as non-deterministic nodes. Finally, we want to shift the goal of the search from finding the optimal deterministic action to finding the optimal randomized action (taking into account the fact that information is hidden).
Acknowledgements
Jan Ramon and Kurt Driessens are post-doctoral fellows of the Fund for Scientific
Research (FWO) of Flanders.
References
1. Gilpin, A., Sandholm, T., Sørensen, T.: A heads-up no-limit Texas Hold'em poker player: Discretized betting models and automatically generated equilibrium-finding programs. In: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Volume 2, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC (2008) 911–918
2. Billings, D.: Algorithms and Assessment in Computer Poker. PhD thesis, University of Alberta, Edmonton, Alta., Canada (2006)
3. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. Lecture Notes in Computer Science 4212 (2006) 282–293
4. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall (2003)
5. Suffecool, K.: Cactus Kev's poker hand evaluator. http://www.suffecool.net/poker/evaluator.html (July 2007)
6. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
7. Wang, Y., Witten, I.H.: Induction of model trees for predicting continuous classes (1996)
8. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. Lecture Notes in Computer Science 4630 (2007) 72–83
9. Gelly, S., Wang, Y.: Exploration exploitation in Go: UCT for Monte-Carlo Go. In: Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006) (2006)
10. Chaslot, G., Winands, M., van den Herik, H., Uiterwijk, J., Bouzy, B.: Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation 4(3) (2008) 343–357
11. Van Lishout, F., Chaslot, G., Uiterwijk, J.: Monte-Carlo tree search in Backgammon. In: Computer Games Workshop (2007) 175–184
12. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2–3) (2002) 235–256