
1

Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
3/2/2016

How it works

2

AlphaGo vs European Champion (Fan Hui 2-Dan)
October 5 – 9, 2015
<Official match>
- Time limit: 1 hour
- AlphaGo Wins (5:0)
* rank

3

AlphaGo vs World Champion (Lee Sedol 9-Dan)
March 9 – 15, 2016
<Official match>
- Time limit: 2 hours
Venue: Seoul, Four Seasons Hotel
Image source: Josun Times, Jan 28th, 2015

4

Lee Sedol
Photo source: Maeil Economics 2013/04, wiki

5

Computer Go AI?

6

Computer Go AI – Definition

s (state), d = 1:

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

(e.g. we can represent the board in a matrix-like form)
* The actual model uses other features than board positions as well
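A minimal sketch of this matrix-style state, assuming a plain NumPy array with 0 = empty and ±1 for the two colors (the real model uses many more input feature planes, as the note above says):

```python
import numpy as np

BOARD_SIZE = 19  # full Go board; the slide uses a smaller example grid

# 0 = empty, +1 / -1 = stones of the two players
state = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=np.int8)
state[2, 6] = 1  # e.g. a single stone on an otherwise empty board, as in the d = 1 example
```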

7

Computer Go AI – Definition

s (state): d = 1, d = 2
a (action)
Given s, pick the best a

Computer Go Artificial Intelligence: s --(a)--> s'

8

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, …)

How about simulating all possible board positions?

9

Computer Go AI – An Implementation Idea?

(search tree keeps expanding: d = 1, d = 2, d = 3, …)

10

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, d = 3, …, d = maxD)

Process the simulation until the game ends,
then report win / lose results

11

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, d = 3, …, d = maxD)

Process the simulation until the game ends,
then report win / lose results
e.g. it wins 13 times if the next stone gets placed here; other candidate moves win 37,839 times and 431,320 times
Choose the “next action / stone” that has the most win-counts in the full-scale simulation
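A hedged sketch of this brute-force idea, just to make the win-counting concrete; `legal_moves`, `play`, and `random_playout_wins` are hypothetical helpers, not anything from the paper:

```python
def best_move_by_full_simulation(state, num_playouts=100_000):
    """Naive idea from the slide: simulate games after each candidate
    move and pick the move that wins most often. Intractable for Go."""
    win_counts = {}
    for move in legal_moves(state):               # hypothetical helper
        next_state = play(state, move)            # hypothetical helper
        # play many games to the very end from this position
        win_counts[move] = random_playout_wins(next_state, num_playouts)
    # e.g. {move_a: 13, move_b: 37_839, move_c: 431_320, ...}
    return max(win_counts, key=win_counts.get)
```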

12

This is NOT possible; it is said that the possible configurations of the board exceed the number of atoms in the universe.

13

Key: To Reduce Search Space

14

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

IF there is a model that can tell you that these moves
are not common / probable (e.g. by experts, etc.) …

15

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

Remove these from search candidates in advance (breadth reduction)
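As a sketch of what breadth reduction could look like, assuming a hypothetical `policy` callable that assigns each candidate move a probability:

```python
def prune_candidates(state, policy, k=5):
    """Breadth reduction: keep only the k moves that an (expert-trained)
    policy considers most probable; drop the rest from the search tree."""
    probs = policy(state)                 # hypothetical: dict of move -> probability
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]
```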

16

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

Instead of simulating until the maximum depth …

17

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

(search tree: d = 1, d = 2, d = 3, with leaf evaluations V = 1, V = 2, V = 10)

IF there is a function that can measure:
V(s): “board evaluation of state s”
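A minimal sketch of depth reduction, assuming hypothetical `legal_moves`, `play`, and `game_over` helpers and an externally supplied evaluation function V(s):

```python
def search_value(state, depth, cutoff, V):
    """Depth reduction: instead of simulating to d = maxD,
    stop at `cutoff` and return the evaluation V(s)."""
    if depth == cutoff or game_over(state):   # hypothetical helper
        return V(state)
    # otherwise keep expanding (negamax-style sign flip between the two players,
    # assuming V(s) is scored from the perspective of the player to move)
    return max(-search_value(play(state, m), depth + 1, cutoff, V)
               for m in legal_moves(state))
```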

18

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)
2. Position evaluation ahead of time (Depth Reduction)

19

1. Reducing “action candidates”

Learning: P ( next action | current state ) = P ( a | s )

20

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Current State → Prediction Model → Next State
s1 → s2, s2 → s3, s3 → s4

Data: online Go experts (5~9 dan)
160K games, 30M board positions

21

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model: Current Board → Next Board

22

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model: Current Board → Next Action
There are 19 x 19 = 361 possible actions (with different probabilities)

23

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, f: s → a (Current Board → Next Action)

Current board s (stones encoded as 1 / -1):

0  0  0  0  0  0  0  0  0
0  0  0  0  0  1  0  0  0
0 -1  0  0  1 -1  1  0  0
0  1  0  0  1 -1  0  0  0
0  0  0  0 -1  0  0  0  0
0  0  0  0  0  0  0  0  0
0 -1  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0  0

Next action a (one-hot over board positions):

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

24

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)

Current board s: same matrix as above.

Predicted p(a|s) (probabilities over board positions):

0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0.2  0.1  0 0
0 0 0 0 0 0.4  0.2  0 0
0 0 0 0 0 0.1  0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0

25

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)
(same board matrix s as above)

26

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Deep Learning (13-layer CNN), g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)
(same board matrix s and probability map p(a|s) as above)
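A rough sketch of such a convolutional policy network in PyTorch; the channel counts, depth, and single-plane input below are simplifications for illustration, not the paper’s 13-layer architecture:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a board tensor s to a probability distribution p(a|s)
    over the 19 x 19 = 361 possible moves."""
    def __init__(self, channels=64, n_conv=5):  # the paper uses many more filters and layers
        super().__init__()
        layers = [nn.Conv2d(1, channels, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(n_conv - 1):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, kernel_size=1)]   # one logit per intersection
        self.body = nn.Sequential(*layers)

    def forward(self, board):                    # board: (batch, 1, 19, 19)
        logits = self.body(board).flatten(1)     # (batch, 361)
        return torch.softmax(logits, dim=1)      # p(a|s)

# picking the next action as on the slide: a = argmax_a p(a|s)
# probs = PolicyNet()(board_tensor); a = probs.argmax(dim=1)
```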

27

Convolutional Neural Network (CNN)

CNN is a powerful model for image recognition tasks; it abstracts out the input image through convolution layers.
(Image source)

28

Convolutional Neural Network (CNN)

And they use this CNN model (similar architecture) to evaluate the board position, which learns “some” spatial invariance.

29

Go: abstraction is the key to winning
CNN: abstraction is its forte

30

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Expert Moves Imitator Model (w/ CNN): Current Board → Next Action

Training:
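A hedged sketch of that supervised training step, reusing the PolicyNet sketch above and assuming a hypothetical `expert_dataset` of (board tensor, expert move index) pairs:

```python
import torch
import torch.nn as nn

policy = PolicyNet()                               # sketch from above
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                    # expects raw logits

for board, expert_move in expert_dataset:          # hypothetical iterable of tensors
    logits = policy.body(board).flatten(1)         # use logits here, not the softmax output
    loss = loss_fn(logits, expert_move)            # expert_move: index in 0..360
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```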

31

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)

Improving by playing against itself

32

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)

Return: board positions, win/lose info

33

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Loss: z = -1

34

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Win: z = +1
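A minimal sketch of the reinforcement-learning update under these assumptions: one self-play game yields the visited (board, action) pairs plus the outcome z = +1 for a win or z = -1 for a loss, and the policy is nudged with a REINFORCE-style gradient (the paper’s exact update and baselines differ):

```python
import torch

def reinforce_update(policy, optimizer, game, z):
    """Policy-gradient sketch: make the moves of a won game (z = +1) more likely
    and the moves of a lost game (z = -1) less likely."""
    optimizer.zero_grad()
    loss = 0.0
    for board, action in game:                     # states and moves from one self-play game
        p = policy(board)                          # p(a|s), shape (1, 361)
        loss = loss - z * torch.log(p[0, action])  # -z * log p(a|s)
    loss.backward()
    optimizer.step()
```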

35

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.1  VS  Updated Model ver 1.3
Return: board positions, win/lose info

It uses the same topology as the expert moves imitator model, and just uses the updated parameters.
Older models vs. newer models

36

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.3  VS  Updated Model ver 1.7
Return: board positions, win/lose info

37

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.5  VS  Updated Model ver 2.0
Return: board positions, win/lose info

38

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 3204.1  VS  Updated Model ver 46235.2
Return: board positions, win/lose info

39

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1,000,000  VS  Expert Moves Imitator Model

The final model wins 80% of the time when playing against the first model.

40

2. Board Evaluation

41

2. Board Evaluation

Value Prediction Model (Regression), built on the Updated Model ver 1,000,000
Training: Board Position → Win / Loss
Output: Win (0~1)

Adds a regression layer to the model
Predicts values between 0~1
Close to 1: a good board position
Close to 0: a bad board position
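A rough sketch of the value-network idea in PyTorch: a convolutional body with a regression head squashed into 0~1, trained with squared error against the game outcome; the layer sizes are illustrative, not the paper’s:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps a board tensor s to V(s) in (0, 1): closer to 1 = better position."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),     # regression output in (0, 1)
        )

    def forward(self, board):                    # board: (batch, 1, 19, 19)
        return self.head(self.conv(board))

# training sketch: minimize (V(s) - outcome)^2, with outcome in {0, 1}
# loss = nn.MSELoss()(ValueNet()(board), outcome)
```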

42

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction) → Policy Network
2. Board Evaluation (Depth Reduction) → Value Network

43

Looking ahead (w/ Monte Carlo Tree Search)

Action Candidates Reduction (Policy Network)
Board Evaluation (Value Network)
(Rollout): a faster version of estimating p(a|s) → uses shallow networks (3 ms → 2 µs)
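A heavily simplified sketch of how these pieces could combine during lookahead; it omits the visit-count backup of real MCTS and is not the paper’s exact PUCT formula, and `legal_moves`, `play`, `fast_rollout`, and the `node` statistics are hypothetical:

```python
import math

def select_move(node, policy, value_net, c_puct=1.0, mix=0.5):
    """One step of lookahead: the policy network narrows the candidates (breadth),
    while the value network and a fast rollout score each leaf (depth)."""
    priors = policy(node.state)                   # p(a|s): hypothetical dict of move -> prior
    best, best_score = None, -float("inf")
    for move in legal_moves(node.state):          # hypothetical helper
        child = play(node.state, move)            # hypothetical helper
        # leaf evaluation: blend value network with a fast rollout result
        leaf = mix * value_net(child) + (1 - mix) * fast_rollout(child)
        # exploration bonus guided by the prior, in the spirit of PUCT-style selection
        visits = node.visit_count.get(move, 0)    # hypothetical per-move statistics
        score = leaf + c_puct * priors[move] * math.sqrt(node.total_visits) / (1 + visits)
        if score > best_score:
            best, best_score = move, score
    return best
```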

44

Results

Elo rating system
Performance with different combinations of AlphaGo components

45

Takeaways

Use the networks trained for a certain task (with different loss objectives) for several other tasks.

46

Lee Sedol 9-dan vs AlphaGo

47

Lee Sedol 9-dan vs AlphaGo
Energy Consumption

Lee Sedol:
- Recommended calories for a man per day: ~2,500 kCal
- Assumption: Lee consumes the entire amount of per-day calories in this one game
2,500 kCal * 4,184 J/kCal ~= 10M [J]

AlphaGo:
- Assumption: CPU ~100 W, GPU ~300 W
- 1,202 CPUs, 176 GPUs
170,000 J/sec * 5 hr * 3,600 sec/hr ~= 3,000M [J]

A very, very rough calculation ;)
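The same back-of-the-envelope arithmetic, spelled out with the numbers from the slide:

```python
# Lee Sedol: one day's worth of calories spent on the game
lee_joules = 2_500 * 4_184                  # kCal * J/kCal ~= 10,460,000 J  (~10M J)

# AlphaGo: distributed version, rough power draw over a ~5-hour game
power_watts = 1_202 * 100 + 176 * 300       # CPUs at ~100 W + GPUs at ~300 W ~= 173,000 W
alphago_joules = power_watts * 5 * 3_600    # W * hours * sec/hour ~= 3.1e9 J  (~3,000M J)
```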

48

AlphaGo is estimated to be around ~5-dan
(= multiple machines)
European champion

49

Taking CPU / GPU resources to virtually infinity?

But Google has promised not to use more CPU/GPUs than they used for Fan Hui for the game with Lee.

No one knows how it will converge.

50

AlphaGo learns millions of Go games every day

AlphaGo will presumably converge to some point eventually.
However, in the Nature paper they don’t report how AlphaGo’s performance improves
as a function of the number of times AlphaGo plays against itself (self-play games).

51

What if AlphaGo learns Lee’s game strategy?

Google said they won’t use Lee’s game plays as AlphaGo’s training data.
Even if it did, it wouldn’t be easy to modify a model trained over millions of
data points with just a few games against Lee (prone to over-fitting, etc.).

52

AlphaGo’s Weakness?

53

AlphaGo – How It Works

Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
me@shanemoon.com
3/2/2016

54

Reference

• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
