
1

Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
3/2/2016

How it works

2

AlphaGo vs European Champion (Fan Hui 2-Dan)
October 5 – 9, 2015
<Official match>
- Time limit: 1 hour
- AlphaGo Wins (5:0)
* rank

3

AlphaGo vs World Champion (Lee Sedol 9-Dan)
March 9 – 15, 2016
<Official match>
- Time limit: 2 hours
Venue: Seoul, Four Seasons Hotel
Image source: Josun Times, Jan 28th, 2015

4

Lee Sedol
Photo source: Maeil Economics 2013/04, wiki

5

Computer Go AI?

6

Computer Go AI – Definition

s (state), d = 1:

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

(e.g. we can represent the board in a matrix-like form)
* The actual model uses other features than board positions as well
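A minimal sketch of this matrix-style state, assuming a plain NumPy array with 0 = empty and ±1 for the two colors (the real model uses many more input feature planes, as the note above says):

```python
import numpy as np

BOARD_SIZE = 19  # full Go board; the slide uses a smaller example grid

# 0 = empty, +1 / -1 = stones of the two players
state = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=np.int8)
state[2, 6] = 1  # e.g. a single stone on an otherwise empty board, as in the d = 1 example
```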

7

Computer Go AI – Definition

s (state): d = 1, d = 2
a (action)
Given s, pick the best a

Computer Go Artificial Intelligence: s --(a)--> s'

8

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, …)

How about simulating all possible board positions?

9

Computer Go AI – An Implementation Idea?

(search tree keeps expanding: d = 1, d = 2, d = 3, …)

10

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, d = 3, …, d = maxD)

Process the simulation until the game ends,
then report win / lose results

11

Computer Go AI – An Implementation Idea?

(search tree: d = 1, d = 2, d = 3, …, d = maxD)

Process the simulation until the game ends,
then report win / lose results
e.g. it wins 13 times if the next stone gets placed here; other candidate moves win 37,839 times and 431,320 times
Choose the “next action / stone” that has the most win-counts in the full-scale simulation
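A hedged sketch of this brute-force idea, just to make the win-counting concrete; `legal_moves`, `play`, and `random_playout_wins` are hypothetical helpers, not anything from the paper:

```python
def best_move_by_full_simulation(state, num_playouts=100_000):
    """Naive idea from the slide: simulate games after each candidate
    move and pick the move that wins most often. Intractable for Go."""
    win_counts = {}
    for move in legal_moves(state):               # hypothetical helper
        next_state = play(state, move)            # hypothetical helper
        # play many games to the very end from this position
        win_counts[move] = random_playout_wins(next_state, num_playouts)
    # e.g. {move_a: 13, move_b: 37_839, move_c: 431_320, ...}
    return max(win_counts, key=win_counts.get)
```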

12

This is NOT possible; it is said that the possible configurations of the board exceed the number of atoms in the universe.

13

Key: To Reduce Search Space

14

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

IF there is a model that can tell you that these moves
are not common / probable (e.g. by experts, etc.) …

15

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

Remove these from search candidates in advance (breadth reduction)
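As a sketch of what breadth reduction could look like, assuming a hypothetical `policy` callable that assigns each candidate move a probability:

```python
def prune_candidates(state, policy, k=5):
    """Breadth reduction: keep only the k moves that an (expert-trained)
    policy considers most probable; drop the rest from the search tree."""
    probs = policy(state)                 # hypothetical: dict of move -> probability
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]
```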

16

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

(search tree: d = 1, d = 2, d = 3, …, d = maxD → Win? Loss?)

Instead of simulating until the maximum depth …

17

Reducing Search Space

2. Position evaluation ahead of time (Depth Reduction)

(search tree: d = 1, d = 2, d = 3, with leaf evaluations V = 1, V = 2, V = 10)

IF there is a function that can measure:
V(s): “board evaluation of state s”
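A minimal sketch of depth reduction, assuming hypothetical `legal_moves`, `play`, and `game_over` helpers and an externally supplied evaluation function V(s):

```python
def search_value(state, depth, cutoff, V):
    """Depth reduction: instead of simulating to d = maxD,
    stop at `cutoff` and return the evaluation V(s)."""
    if depth == cutoff or game_over(state):   # hypothetical helper
        return V(state)
    # otherwise keep expanding (negamax-style sign flip between the two players,
    # assuming V(s) is scored from the perspective of the player to move)
    return max(-search_value(play(state, m), depth + 1, cutoff, V)
               for m in legal_moves(state))
```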

18

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)
2. Position evaluation ahead of time (Depth Reduction)

19

1. Reducing “action candidates”

Learning: P ( next action | current state ) = P ( a | s )

20

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Current State → Prediction Model → Next State
s1 → s2, s2 → s3, s3 → s4

Data: online Go experts (5~9 dan)
160K games, 30M board positions

21

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model: Current Board → Next Board

22

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model: Current Board → Next Action
There are 19 x 19 = 361 possible actions (with different probabilities)

23

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, f: s → a (Current Board → Next Action)

Current board s (stones encoded as 1 / -1):

0  0  0  0  0  0  0  0  0
0  0  0  0  0  1  0  0  0
0 -1  0  0  1 -1  1  0  0
0  1  0  0  1 -1  0  0  0
0  0  0  0 -1  0  0  0  0
0  0  0  0  0  0  0  0  0
0 -1  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0  0

Next action a (one-hot over board positions):

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

24

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)

Current board s: same matrix as above.

Predicted p(a|s) (probabilities over board positions):

0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0.2  0.1  0 0
0 0 0 0 0 0.4  0.2  0 0
0 0 0 0 0 0.1  0    0 0
0 0 0 0 0 0    0    0 0
0 0 0 0 0 0    0    0 0

25

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Prediction Model, g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)
(same board matrix s as above)

26

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Deep Learning (13-layer CNN), g: s → p(a|s), then a = argmax p(a|s) (Current Board → Next Action)
(same board matrix s and probability map p(a|s) as above)
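A rough sketch of such a convolutional policy network in PyTorch; the channel counts, depth, and single-plane input below are simplifications for illustration, not the paper’s 13-layer architecture:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a board tensor s to a probability distribution p(a|s)
    over the 19 x 19 = 361 possible moves."""
    def __init__(self, channels=64, n_conv=5):  # the paper uses many more filters and layers
        super().__init__()
        layers = [nn.Conv2d(1, channels, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(n_conv - 1):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, kernel_size=1)]   # one logit per intersection
        self.body = nn.Sequential(*layers)

    def forward(self, board):                    # board: (batch, 1, 19, 19)
        logits = self.body(board).flatten(1)     # (batch, 361)
        return torch.softmax(logits, dim=1)      # p(a|s)

# picking the next action as on the slide: a = argmax_a p(a|s)
# probs = PolicyNet()(board_tensor); a = probs.argmax(dim=1)
```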

27

Convolutional Neural Network (CNN)

CNN is a powerful model for image recognition tasks; it abstracts out the input image through convolution layers.
(Image source)

28

Convolutional Neural Network (CNN)

And they use this CNN model (similar architecture) to evaluate the board position, which learns “some” spatial invariance.

29

Go: abstraction is the key to winning
CNN: abstraction is its forte

30

1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

Expert Moves Imitator Model (w/ CNN): Current Board → Next Action

Training:
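A hedged sketch of that supervised training step, reusing the PolicyNet sketch above and assuming a hypothetical `expert_dataset` of (board tensor, expert move index) pairs:

```python
import torch
import torch.nn as nn

policy = PolicyNet()                               # sketch from above
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                    # expects raw logits

for board, expert_move in expert_dataset:          # hypothetical iterable of tensors
    logits = policy.body(board).flatten(1)         # use logits here, not the softmax output
    loss = loss_fn(logits, expert_move)            # expert_move: index in 0..360
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```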

31

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)

Improving by playing against itself

32

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)

Return: board positions, win/lose info

33

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Loss: z = -1

34

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Win: z = +1
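A minimal sketch of the reinforcement-learning update under these assumptions: one self-play game yields the visited (board, action) pairs plus the outcome z = +1 for a win or z = -1 for a loss, and the policy is nudged with a REINFORCE-style gradient (the paper’s exact update and baselines differ):

```python
import torch

def reinforce_update(policy, optimizer, game, z):
    """Policy-gradient sketch: make the moves of a won game (z = +1) more likely
    and the moves of a lost game (z = -1) less likely."""
    optimizer.zero_grad()
    loss = 0.0
    for board, action in game:                     # states and moves from one self-play game
        p = policy(board)                          # p(a|s), shape (1, 361)
        loss = loss - z * torch.log(p[0, action])  # -z * log p(a|s)
    loss.backward()
    optimizer.step()
```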

35

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.1  VS  Updated Model ver 1.3
Return: board positions, win/lose info

It uses the same topology as the expert moves imitator model, and just uses the updated parameters.
Older models vs. newer models

36

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.3  VS  Updated Model ver 1.7
Return: board positions, win/lose info

37

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1.5  VS  Updated Model ver 2.0
Return: board positions, win/lose info

38

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 3204.1  VS  Updated Model ver 46235.2
Return: board positions, win/lose info

39

1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Updated Model ver 1,000,000  VS  Expert Moves Imitator Model

The final model wins 80% of the time when playing against the first model.

40

2. Board Evaluation

41

2. Board Evaluation

Value Prediction Model (Regression), built on the Updated Model ver 1,000,000
Training: Board Position → Win / Loss
Output: Win (0~1)

Adds a regression layer to the model
Predicts values between 0~1
Close to 1: a good board position
Close to 0: a bad board position
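A rough sketch of the value-network idea in PyTorch: a convolutional body with a regression head squashed into 0~1, trained with squared error against the game outcome; the layer sizes are illustrative, not the paper’s:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps a board tensor s to V(s) in (0, 1): closer to 1 = better position."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),     # regression output in (0, 1)
        )

    def forward(self, board):                    # board: (batch, 1, 19, 19)
        return self.head(self.conv(board))

# training sketch: minimize (V(s) - outcome)^2, with outcome in {0, 1}
# loss = nn.MSELoss()(ValueNet()(board), outcome)
```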

42

Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction) → Policy Network
2. Board Evaluation (Depth Reduction) → Value Network

43

Looking ahead (w/ Monte Carlo Tree Search)

Action Candidates Reduction (Policy Network)
Board Evaluation (Value Network)
(Rollout): a faster version of estimating p(a|s) → uses shallow networks (3 ms → 2 µs)
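A heavily simplified sketch of how these pieces could combine during lookahead; it omits the visit-count backup of real MCTS and is not the paper’s exact PUCT formula, and `legal_moves`, `play`, `fast_rollout`, and the `node` statistics are hypothetical:

```python
import math

def select_move(node, policy, value_net, c_puct=1.0, mix=0.5):
    """One step of lookahead: the policy network narrows the candidates (breadth),
    while the value network and a fast rollout score each leaf (depth)."""
    priors = policy(node.state)                   # p(a|s): hypothetical dict of move -> prior
    best, best_score = None, -float("inf")
    for move in legal_moves(node.state):          # hypothetical helper
        child = play(node.state, move)            # hypothetical helper
        # leaf evaluation: blend value network with a fast rollout result
        leaf = mix * value_net(child) + (1 - mix) * fast_rollout(child)
        # exploration bonus guided by the prior, in the spirit of PUCT-style selection
        visits = node.visit_count.get(move, 0)    # hypothetical per-move statistics
        score = leaf + c_puct * priors[move] * math.sqrt(node.total_visits) / (1 + visits)
        if score > best_score:
            best, best_score = move, score
    return best
```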

44

Results

Elo rating system
Performance with different combinations of AlphaGo components

45

Takeaways

Use the networks trained for a certain task (with different loss objectives) for several other tasks.

46

Lee Sedol 9-dan vs AlphaGo

47

Lee Sedol 9-dan vs AlphaGo
Energy Consumption

Lee Sedol:
- Recommended calories for a man per day: ~2,500 kCal
- Assumption: Lee consumes the entire amount of per-day calories in this one game
2,500 kCal * 4,184 J/kCal ~= 10M [J]

AlphaGo:
- Assumption: CPU ~100 W, GPU ~300 W
- 1,202 CPUs, 176 GPUs
170,000 J/sec * 5 hr * 3,600 sec/hr ~= 3,000M [J]

A very, very rough calculation ;)
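The same back-of-the-envelope arithmetic, spelled out with the numbers from the slide:

```python
# Lee Sedol: one day's worth of calories spent on the game
lee_joules = 2_500 * 4_184                  # kCal * J/kCal ~= 10,460,000 J  (~10M J)

# AlphaGo: distributed version, rough power draw over a ~5-hour game
power_watts = 1_202 * 100 + 176 * 300       # CPUs at ~100 W + GPUs at ~300 W ~= 173,000 W
alphago_joules = power_watts * 5 * 3_600    # W * hours * sec/hour ~= 3.1e9 J  (~3,000M J)
```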

48

AlphaGo is estimated to be around ~5-dan
(= multiple machines)
European champion

49

Taking CPU / GPU resources to virtually infinity?

But Google has promised not to use more CPU/GPUs than they used for Fan Hui for the game with Lee.

No one knows how it will converge.

50

AlphaGo learns millions of Go games every day

AlphaGo will presumably converge to some point eventually.
However, in the Nature paper they don’t report how AlphaGo’s performance improves
as a function of the number of times AlphaGo plays against itself (self-play games).

51

What if AlphaGo learns Lee’s game strategy?

Google said they won’t use Lee’s game plays as AlphaGo’s training data.
Even if it did, it wouldn’t be easy to modify a model trained over millions of
data points with just a few games against Lee (prone to over-fitting, etc.).

52

AlphaGo’s Weakness?

53

AlphaGo – How It Works

Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
me@shanemoon.com
3/2/2016

54

Reference

• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
