Cooperating Intelligent Systems: Learning From Observations Chapter 18, AIMA
Deductive: Derive new rules/facts from already known rules/facts, e.g. (A ⇒ B, B ⇒ C) ⇒ (A ⇒ C).
Inductive: Learn new rules/facts from a data set D.
Unsupervised: No access to a teacher. Instead, the machine must search for "order" and "structure" in the environment.
Inductive learning - example A
[Figure: three example 3×3 input patterns with entries in {−1, 0, +1}, labelled x with f(x) = +1, f(x) = −1, and f(x) = 0 respectively. Etc...]
• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x), given a training set of examples D = {[x_i, f(x_i)]}, i = 1, 2, …, N
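As a concrete illustration of this learning task, here is a minimal sketch (my own example, not from the slides): a tiny training set D and a small hypothesis class H, from which we pick the h with the lowest training error. The data and candidate hypotheses are assumptions made for illustration.

# Minimal sketch of inductive learning: pick h from a small hypothesis
# class H so that h(x) approximates f(x) on the training set D.
# Training data and candidate hypotheses are illustrative assumptions.
D = [(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)]   # examples [x_i, f(x_i)]

H = {
    "identity": lambda x: x,
    "square":   lambda x: x ** 2,
    "constant": lambda x: 1.0,
}

def training_error(h, data):
    # Mean squared disagreement between h(x) and f(x) over the data.
    return sum((h(x) - fx) ** 2 for x, fx in data) / len(data)

best_name = min(H, key=lambda name: training_error(H[name], D))
print(best_name, training_error(H[best_name], D))   # -> square 0.0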
[Figure: the hypothesis space H and the best hypothesis h_opt(x) ∈ H; the error between h_opt(x) and the target f(x) is the expected generalization error ⟨E_gen⟩. Nested hypothesis spaces H_1 ⊂ H_2 ⊂ H_3 are also shown.]
X^D → C^K (input space → output/category space)
x = (x_1, x_2, …, x_D) ∈ X^D
c = (c_1, c_2, …, c_K) ∈ C^K, e.g. a 0/1 indicator vector such as (0, 1, …, 0)
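To make the notation concrete, a small sketch (the dimensions and values are made up): an input vector x with D components and its category encoded as a K-dimensional 0/1 indicator vector c.

# Illustrative example of one input/output pair: x lives in a
# D-dimensional input space, c is a K-dimensional 0/1 indicator vector.
D_dim, K = 4, 3                          # assumed dimensions
x = [0.2, -1.3, 0.7, 4.0]                # x = (x_1, ..., x_D)
category = 1                             # this example belongs to class 2 (0-based index 1)
c = [1 if k == category else 0 for k in range(K)]   # -> [0, 1, 0]
print(x, c)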
Example: Robot color vision
f(x) = g(x) + ε
x: observed input
f(x): observed output
g(x): true underlying function
ε: i.i.d. noise process with zero mean
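A small sketch of this observation model, with an assumed true function g and zero-mean i.i.d. Gaussian noise (both choices are mine, for illustration only):

import random

def g(x):
    # Assumed true underlying function (illustrative choice).
    return 0.5 * x + 1.0

def f_observed(x, sigma=0.1):
    # Observed output f(x) = g(x) + eps, where eps is i.i.d. with zero mean.
    return g(x) + random.gauss(0.0, sigma)

samples = [(x, f_observed(x)) for x in range(10)]
print(samples[:3])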
Example: Predict price for cotton futures
Input: past history of closing prices and trading volume
Output: predicted closing price
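One common way to turn this into supervised examples [x, f(x)] is a sliding window over the history; the window length, features and numbers below are assumptions for illustration, not from the slides.

# Sketch: build supervised examples from a price/volume history.
# Each input x holds the last `window` closing prices and volumes;
# the target f(x) is the next closing price. The data is made up.
closes  = [71.2, 71.5, 70.9, 71.8, 72.3, 72.0, 72.9]
volumes = [11000, 9000, 13000, 10000, 12000, 11000, 14000]

def make_examples(closes, volumes, window=3):
    examples = []
    for t in range(window, len(closes)):
        x = closes[t - window:t] + volumes[t - window:t]   # past history
        fx = closes[t]                                      # next closing price
        examples.append((x, fx))
    return examples

print(make_examples(closes, volumes)[0])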
Decision trees
[Figure: a binary decision tree; each internal node tests a single variable (e.g. x2 > β ?, x2 > γ ?) and branches on yes/no.]
• Splits usually on a single variable
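Such a tree is simply a nest of single-variable tests. A minimal sketch of how the pictured tree classifies a point; the root test x1 > α, the thresholds and the leaf labels are assumptions added for illustration.

# Sketch of a small decision tree as nested single-variable tests.
# Thresholds (alpha, beta, gamma) and leaf labels are illustrative.
ALPHA, BETA, GAMMA = 0.0, 1.0, -1.0

def classify(x1, x2):
    if x1 > ALPHA:                                       # root test
        return "class A" if x2 > BETA else "class B"     # x2 > beta ?
    else:
        return "class A" if x2 > GAMMA else "class B"    # x2 > gamma ?

print(classify(0.5, 2.0))   # -> class A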
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
Inductive learning of decision tree
T = True, F = False. The 12 training examples contain 6 True and 6 False.
Entropy = −(6/12) log(6/12) − (6/12) log(6/12) = 0.30
(base-10 logarithms are used, so an even 6/6 split gives log 2 ≈ 0.30)
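The same computation as a small helper (my sketch, not code from the slides); note the base-10 logarithm, which is what makes the even 6/6 split come out as 0.30.

import math

def entropy(counts):
    # Entropy of a label distribution, base-10 logs; 0*log 0 is taken as 0.
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

print(round(entropy([6, 6]), 2))   # 6 True, 6 False -> 0.3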
Decision tree learning example
Alternate?
Yes: 3 T, 3 F    No: 3 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] = 0.30
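The weighted average entropy after the split can be checked numerically; a quick self-contained sketch for the Alternate split above (both branches contain 3 T, 3 F):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

branches = [(3, 3), (3, 3)]   # Yes: 3 T, 3 F  /  No: 3 T, 3 F
n_total = 12
after_split = sum(sum(b) / n_total * entropy(b) for b in branches)
print(round(after_split, 2))  # -> 0.3, so no entropy decrease for Alternate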
Bar?
Yes: 3 T, 3 F    No: 3 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] = 0.30
Fri/Sat?
Yes: 2 T, 3 F    No: 4 T, 3 F
Entropy = (5/12)[−(2/5) log(2/5) − (3/5) log(3/5)] + (7/12)[−(4/7) log(4/7) − (3/7) log(3/7)] = 0.29
Hungry?
Yes: 5 T, 2 F    No: 1 T, 4 F
Entropy = (7/12)[−(5/7) log(5/7) − (2/7) log(2/7)] + (5/12)[−(1/5) log(1/5) − (4/5) log(4/5)] = 0.24
Raining?
Yes: 2 T, 2 F    No: 4 T, 4 F
Entropy = (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] + (8/12)[−(4/8) log(4/8) − (4/8) log(4/8)] = 0.30
Reservation?
Yes: 3 T, 2 F    No: 3 T, 4 F
Entropy = (5/12)[−(3/5) log(3/5) − (2/5) log(2/5)] + (7/12)[−(3/7) log(3/7) − (4/7) log(4/7)] = 0.29
Patrons?
None: 2 F    Some: 4 T    Full: 2 T, 4 F
Entropy = (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] + (4/12)[−(4/4) log(4/4) − (0/4) log(0/4)] + (6/12)[−(2/6) log(2/6) − (4/6) log(4/6)] = 0.14
(with the convention 0·log 0 = 0)
Price?
$: 3 T, 3 F    $$: 2 T    $$$: 1 T, 3 F
Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] + (2/12)[−(2/2) log(2/2) − (0/2) log(0/2)] + (4/12)[−(1/4) log(1/4) − (3/4) log(3/4)] = 0.23
Type?
French: 1 T, 1 F    Italian: 1 T, 1 F    Thai: 2 T, 2 F    Burger: 2 T, 2 F
Entropy = (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] = 0.30
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Est. waiting time?
0-10: 4 T, 2 F    10-30: 1 T, 1 F    30-60: 1 T, 1 F    > 60: 2 F
Entropy = (6/12)[−(4/6) log(4/6) − (2/6) log(2/6)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)] + (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] = 0.24
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Largest entropy decrease (0.16) is achieved by splitting on Patrons.
[Figure: root node Patrons? with branches None (2 F), Some (4 T), and Full (2 T, 4 F); the Full branch is split further on some attribute X?.]
Continue like this, making new splits, always purifying nodes.
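The whole attribute-selection step can be written compactly. A sketch (not from the slides) that recomputes the entropy decrease for every attribute, using the (True, False) branch counts listed above and base-10 entropy; it confirms that Patrons gives the largest decrease, about 0.16.

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

def split_entropy(branches, n_total=12):
    # Weighted average entropy after splitting into the given branches.
    return sum(sum(b) / n_total * entropy(b) for b in branches)

# (True, False) counts per branch, taken from the slides above.
splits = {
    "Alternate":         [(3, 3), (3, 3)],
    "Bar":               [(3, 3), (3, 3)],
    "Fri/Sat":           [(2, 3), (4, 3)],
    "Hungry":            [(5, 2), (1, 4)],
    "Raining":           [(2, 2), (4, 4)],
    "Reservation":       [(3, 2), (3, 4)],
    "Patrons":           [(0, 2), (4, 0), (2, 4)],
    "Price":             [(3, 3), (2, 0), (1, 3)],
    "Type":              [(1, 1), (1, 1), (2, 2), (2, 2)],
    "Est. waiting time": [(4, 2), (1, 1), (1, 1), (0, 2)],
}

root_entropy = entropy([6, 6])   # 0.30
for name, branches in splits.items():
    decrease = root_entropy - split_entropy(branches)
    print(f"{name}: {decrease:.2f}")
# Patrons shows the largest decrease (0.16) and is chosen as the root split.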
Decision tree learning example
[Figure: induced tree (from examples)]
[Figure: true tree]
[Figure: induced tree (from examples)]
E_gen ≈ E_val
[Figure: the data set is split into a training part D_train and a held-out validation part D_val; several such splits are shown. The validation set is used to estimate the generalization error.]
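A minimal sketch of the idea E_gen ≈ E_val (the data, model and split below are made-up assumptions): hold out part of the data as D_val, fit on D_train only, and report the validation error as the estimate of the generalization error.

import random

# Made-up data: noisy observations of a linear target function.
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in range(1, 21)]
random.shuffle(data)

D_train, D_val = data[:15], data[15:]    # hold out part of the data as D_val

# Fit a one-parameter hypothesis h(x) = w*x on D_train (least squares).
w = sum(x * y for x, y in D_train) / sum(x * x for x, _ in D_train)

def mean_squared_error(dataset):
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)

print("E_train =", round(mean_squared_error(D_train), 4))
print("E_val   =", round(mean_squared_error(D_val), 4))   # estimate of E_gen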
[Figure: instance space X with the regions labelled by the target f and a hypothesis h; the shaded region is where f and h disagree. Image adapted from F. Hoffmann @ KTH]
Probability for bad hypothesis
Suppose we have a bad hypothesis h with error(h) > ε.
What is the probability that it is consistent with N samples?
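The usual argument (a sketch of the standard reasoning, not text from the slides): if error(h) > ε, then a single randomly drawn example agrees with h with probability at most 1 − ε, so for N independently drawn examples

P(h is consistent with all N examples) ≤ (1 − ε)^N ≤ e^(−εN).

Applying a union bound over all hypotheses in H then bounds the probability that any bad hypothesis survives the training set by |H|·e^(−εN).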