Bayesian Learning
• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
Two Roles for Bayesian Methods
• Provides practical learning algorithms:
  • Naive Bayes learning
  • Bayesian belief network learning
  • Combine prior knowledge (prior probabilities) with observed data
  • Requires prior probabilities
• Provides useful conceptual framework:
  • Provides a "gold standard" for evaluating other learning algorithms
  • Additional insight into Occam's razor
Bayes Theorem
  P(h|D) = P(D|h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
Choosing Hypotheses
• Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis:
  hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)
• If we assume P(hi) = P(hj) for all hi, hj, we can further simplify and choose the maximum likelihood (ML) hypothesis:
  hML = argmax_{hi ∈ H} P(D|hi)
Bayes Theorem
• Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease is
not present. Furthermore, .008 of the entire population have
this cancer.
  P(cancer) = .008            P(¬cancer) = .992
  P(+|cancer) = .98           P(−|cancer) = .02
  P(+|¬cancer) = .03          P(−|¬cancer) = .97
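A quick numeric check of this example as a Python sketch (variable names are illustrative): applying Bayes theorem shows that hMAP = ¬cancer even after a positive test.

    # Posterior P(cancer | +) for the lab-test example, via Bayes theorem.
    p_cancer = 0.008                 # prior P(cancer)
    p_not_cancer = 1.0 - p_cancer    # P(~cancer) = 0.992
    p_pos_given_cancer = 0.98        # P(+|cancer)
    p_pos_given_not = 1.0 - 0.97     # P(+|~cancer) = 0.03

    # Unnormalized posteriors P(+|h) P(h)
    joint_cancer = p_pos_given_cancer * p_cancer   # 0.00784
    joint_not = p_pos_given_not * p_not_cancer     # 0.02976

    # Normalize: P(cancer|+) = P(+|cancer) P(cancer) / P(+)
    p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not)
    print(p_cancer_given_pos)   # ~0.21, so hMAP is still "no cancer"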
Basic Formulas for Probabilities
• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h ∈ H} P(h|D)
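A minimal sketch of what the brute-force search looks like in code (the function name and the prior/likelihood callables are placeholders, not from the slides); P(D) can be dropped because it does not affect the argmax.

    # Brute-force MAP learner: score every hypothesis by P(D|h) P(h)
    # and return the one with the highest (unnormalized) posterior.
    def brute_force_map(hypotheses, prior, likelihood, data):
        """hypotheses: iterable of candidate h
           prior(h): returns P(h)
           likelihood(data, h): returns P(D|h)"""
        return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))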
Relation to Concept Learning (1/2)
• Assume a uniform prior P(h) = 1/|H| and noise-free data, so that P(D|h) = 1 if h is consistent with D and 0 otherwise.
• Then,
  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise
  where VS_{H,D} is the version space of H with respect to D.
Evolution of Posterior Probabilities
Characterizing Learning Algorithms
by Equivalent MAP Learners
Learning A Real Valued Function (1/2)
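The result this slide develops can be sketched as follows, assuming each training value is di = f(xi) + ei with ei drawn independently from a zero-mean Gaussian: the maximum likelihood hypothesis is the one that minimizes the sum of squared errors.

\begin{align*}
h_{ML} &= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)
        = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
          \, e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
        = \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^2
\end{align*}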
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
• Training examples <xi, di>, where di is 1 or 0
• Want to train neural network to output a probability given xi (not a 0 or 1)
• In this case one can show that the maximum likelihood hypothesis maximizes the cross entropy:
  hML = argmax_{h ∈ H} Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
  where h(xi) is the network output (the estimated probability that di = 1) for input xi
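For a single sigmoid output unit, gradient ascent on this criterion gives the familiar weight update Δwj = η Σ_i (di − h(xi)) xij; a minimal NumPy sketch (the function name, array shapes, and the learning rate eta are choices of the sketch):

    import numpy as np

    # One gradient-ascent step on the cross entropy for a sigmoid unit:
    # the gradient works out to sum_i (d_i - h(x_i)) * x_i.
    def cross_entropy_step(w, X, d, eta=0.1):
        """X: (m, n) inputs, d: (m,) targets in {0, 1}, w: (n,) weights."""
        h = 1.0 / (1.0 + np.exp(-X @ w))   # outputs, interpreted as probabilities
        return w + eta * X.T @ (d - h)     # weight update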
Minimum Description Length Principle (1/2)
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
  hMDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]
  where L_C(x) is the description length of x under encoding C
Minimum Description Length Principle (2/2)
Bayes Optimal Classifier
• Bayes optimal classification:
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
• Example:
  P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
  P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
  P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
  therefore
  Σ_{hi ∈ H} P(+|hi) P(hi|D) = .4   and   Σ_{hi ∈ H} P(−|hi) P(hi|D) = .6
  and
  argmax_{vj ∈ {+,−}} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −
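The same computation in a few lines of Python (the dictionary names are illustrative):

    # Bayes optimal classification for the three-hypothesis example above.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
    p_plus = {"h1": 1.0, "h2": 0.0, "h3": 0.0}         # P(+|h)

    score_plus = sum(p_plus[h] * posteriors[h] for h in posteriors)        # 0.4
    score_minus = sum((1 - p_plus[h]) * posteriors[h] for h in posteriors) # 0.6
    print("+" if score_plus > score_minus else "-")    # prints "-"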
Gibbs Classifier
• Bayes optimal classifier provides best result, but can be
expensive if many hypotheses.
• Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance
• Surprising fact: Assume the target concepts are drawn at random from H according to the priors on H. Then:
  E[errorGibbs] ≤ 2 E[errorBayesOptimal]
• Suppose a correct, uniform prior distribution over H; then
  • Pick any hypothesis from VS with uniform probability
  • Its expected error is no worse than twice that of the Bayes optimal classifier
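A minimal sketch of the Gibbs procedure (hypotheses as callables and a matching list of posterior weights are assumptions of the sketch):

    import random

    def gibbs_classify(hypotheses, posteriors, x):
        """hypotheses: list of callables h(x); posteriors: matching list of P(h|D)."""
        h = random.choices(hypotheses, weights=posteriors, k=1)[0]  # sample h ~ P(h|D)
        return h(x)                                                 # classify with the sampled h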
Naive Bayes Classifier (1/2)
• Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.
• When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally
independent given classification
• Successful applications:
• Diagnosis
• Classifying text documents
Naive Bayes Classifier (2/2)
• Assume target function f : X → V, where each instance x is described by attributes <a1, a2 … an>.
• The most probable value of f(x) is:
  vMAP = argmax_{vj ∈ V} P(vj|a1, a2 … an)
       = argmax_{vj ∈ V} P(a1, a2 … an|vj) P(vj) / P(a1, a2 … an)
       = argmax_{vj ∈ V} P(a1, a2 … an|vj) P(vj)
• Naive Bayes assumption: P(a1, a2 … an|vj) = Π_i P(ai|vj), which gives the Naive Bayes classifier:
  vNB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)
Naive Bayes Algorithm
• Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)
• Classify_New_Instance(x)
    vNB = argmax_{vj ∈ V} P̂(vj) Π_{ai ∈ x} P̂(ai|vj)
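A minimal Python sketch of these two procedures for discrete attributes (the data layout, function names, and the simple relative-frequency estimates are choices of the sketch):

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, target_value) pairs."""
        class_counts = Counter(v for _, v in examples)
        attr_counts = defaultdict(Counter)   # (attribute_position, target_value) -> value counts
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                attr_counts[(i, v)][a] += 1
        p_v = {v: c / len(examples) for v, c in class_counts.items()}          # P^(vj)
        p_a_given_v = {key: {a: c / sum(cnt.values()) for a, c in cnt.items()}
                       for key, cnt in attr_counts.items()}                    # P^(ai|vj)
        return p_v, p_a_given_v

    def naive_bayes_classify(p_v, p_a_given_v, x):
        def score(v):
            s = p_v[v]
            for i, a in enumerate(x):
                s *= p_a_given_v[(i, v)].get(a, 0.0)
            return s
        return max(p_v, key=score)   # vNB = argmax_vj P^(vj) prod_i P^(ai|vj)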
Naive Bayes: Example
• Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
• Want to compute:
  vNB = argmax_{vj ∈ {yes, no}} P(vj) P(Outlk = sun|vj) P(Temp = cool|vj) P(Humid = high|vj) P(Wind = strong|vj)
Naive Bayes: Subtleties (1/2)
1. Conditional independence assumption is often violated
• ...but it works surprisingly well anyway. Note that the estimated posteriors P̂(vj|x) need not be correct; we need only that
  argmax_{vj ∈ V} P̂(vj) Π_i P̂(ai|vj) = argmax_{vj ∈ V} P(vj) P(a1,…, an|vj)
Naive Bayes: Subtleties (2/2)
2. What if none of the training instances with target value vj have attribute value ai? Then
   P̂(ai|vj) = 0, and so P̂(vj) Π_i P̂(ai|vj) = 0
• Typical solution is the Bayesian m-estimate:
   P̂(ai|vj) ← (nc + m·p) / (n + m)
   where
   • n is the number of training examples for which v = vj
   • nc is the number of examples for which v = vj and a = ai
   • p is the prior estimate for P̂(ai|vj)
   • m is the weight given to the prior (i.e., the number of "virtual" examples)
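In code the fix is a one-liner; a sketch (the function name and the example numbers are illustrative):

    def m_estimate(nc, n, p, m):
        """Smoothed estimate of P(ai|vj): (nc + m*p) / (n + m)."""
        return (nc + m * p) / (n + m)

    # e.g. with no observed matches (nc = 0), n = 14, uniform prior p = 0.5, m = 1:
    # m_estimate(0, 14, 0.5, 1)  ->  0.0333...  rather than 0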
Learning to Classify Text (1/4)
• Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic
Learning to Classify Text (2/4)
Target concept Interesting? : Document → {+, −}
1. Represent each document by a vector of words
  • one attribute per word position in the document
2. Learning: Use training examples to estimate
  • P(+), P(−)
  • P(doc|+), P(doc|−)
Naive Bayes conditional independence assumption
  • P(doc|vj) = Π_{i=1..length(doc)} P(ai = wk|vj)
    where P(ai = wk|vj) is the probability that the word in position i is wk, given vj
Learning to Classify Text (3/4)
LEARN_NAIVE_BAYES_TEXT (Examples, V)
• docsj ← the subset of Examples for which the target value is vj
• Textj ← a single document created by concatenating all members of docsj
Learning to Classify Text (4/4)
• n ← total number of words in Textj (counting duplicate words multiple times)
• for each word wk in Vocabulary
  * nk ← number of times word wk occurs in Textj
  * P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where
  vNB = argmax_{vj ∈ V} P(vj) Π_{i ∈ positions} P(ai|vj)
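A bag-of-words sketch of the two procedures above (the dictionary layout and function names are assumptions of the sketch; log probabilities are used so that long documents do not underflow):

    import math
    from collections import Counter

    def learn_naive_bayes_text(docs_by_class, vocabulary):
        """docs_by_class: {vj: list of token lists}; returns priors and word probabilities."""
        total_docs = sum(len(d) for d in docs_by_class.values())
        p_v, p_w = {}, {}
        for vj, docs in docs_by_class.items():
            p_v[vj] = len(docs) / total_docs
            text_j = Counter(w for doc in docs for w in doc if w in vocabulary)
            n = sum(text_j.values())
            p_w[vj] = {w: (text_j[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return p_v, p_w

    def classify_naive_bayes_text(p_v, p_w, doc, vocabulary):
        positions = [w for w in doc if w in vocabulary]
        return max(p_v, key=lambda vj: math.log(p_v[vj]) +
                   sum(math.log(p_w[vj][w]) for w in positions))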
Twenty NewsGroups
Bayesian Belief Networks
Interesting because:
• Naive Bayes assumption of conditional independence is too restrictive
• But it's intractable without some such assumptions...
• Bayesian belief networks describe conditional independence among subsets of variables
• allows combining prior knowledge about (in)dependencies among variables with observed training data
(also called Bayes Nets)
Conditional Independence
• Definition: X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the value of Y
given the value of Z; that is, if
(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
more compactly, we write
P(X|Y, Z) = P(X|Z)
• Example: Thunder is conditionally independent of Rain, given Lightning:
  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Inference in Bayesian Networks
• How can one infer the (probabilities of) values of one or more
network variables, given observed values of others?
• Bayes net contains all information needed for this inference
• If only one variable with unknown value, easy to infer it
• In the general case, the problem is NP-hard
Learning of Bayesian Networks
• Several variants of this learning task
• Network structure might be known or unknown
• Training examples might provide values of all network variables, or just some
Learning Bayes Nets
• Suppose structure known, variables partially observable
• e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not
Lightning, Campfire...
• Similar to training neural network with hidden units
• In fact, can learn network conditional probability tables using gradient ascent!
• Converge to network h that (locally) maximizes P(D|h)
Gradient Ascent for Bayes Nets
• Let wijk denote one entry in the conditional probability
table for variable Yi in the network
wijk = P(Yi = yij|Parents(Yi) = the list uik of values)
• e.g., if Yi = Campfire, then uik might be
<Storm = T, BusTourGroup = F >
• Perform gradient ascent by repeatedly
  1. updating all wijk using the training data D:
     wijk ← wijk + η Σ_{d ∈ D} P_h(yij, uik | d) / wijk
  2. then renormalizing the wijk to ensure that Σ_j wijk = 1 and 0 ≤ wijk ≤ 1
Summary: Bayesian Belief Networks
• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample
complexity
• Active research area
• Extend from boolean to real-valued variables
• Parameterized distributions instead of tables
• Extend to first-order instead of propositional systems
• More effective inference methods
• …
Expectation Maximization (EM)
• When to use:
• Data is only partially observable
• Unsupervised clustering (target value unobservable)
• Supervised learning (some instance attributes unobservable)
• Some uses:
• Train Bayesian Belief Networks
• Unsupervised clustering (AUTOCLASS)
• Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians
EM for Estimating k Means (2/2)
• EM Algorithm: Pick random initial h = <μ1, μ2>, then iterate:
  E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds:
    E[zij] = p(x = xi | μ = μj) / Σ_{n=1..2} p(x = xi | μ = μn)
           = e^(−(xi − μj)²/2σ²) / Σ_{n=1..2} e^(−(xi − μn)²/2σ²)
  M step: Calculate a new maximum likelihood hypothesis h' = <μ1', μ2'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above; then replace h by h':
    μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
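A minimal sketch of these two steps for k = 2 means with a known, shared variance (the fixed sigma, the fixed iteration count instead of a convergence test, and the NumPy layout are assumptions of the sketch):

    import numpy as np

    def em_two_means(x, sigma=1.0, iters=50, rng=np.random.default_rng(0)):
        """x: 1-D array of observations; returns the estimated means <mu1, mu2>."""
        mu = rng.choice(x, size=2, replace=False).astype(float)   # random initial h = <mu1, mu2>
        for _ in range(iters):
            # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
            resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
            resp /= resp.sum(axis=1, keepdims=True)
            # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
            mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        return mu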
EM Algorithm
• Converges to local maximum likelihood h and
provides estimates of hidden variables zij
General EM Problem
• Given:
• Observed data X = {x1,…, xm}
• Unobserved data Z = {z1,…, zm}
• Parameterized probability distribution P(Y|h), where
• Y = {y1,…, ym} is the full data, where yi = xi ∪ zi
• h are the parameters
• Determine: h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
• Train Bayesian belief networks
• Unsupervised clustering (e.g., k means)
• Hidden Markov Models
General EM Method
• Define the likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:
  Q(h'|h) ← E[ln P(Y|h') | h, X]
• EM Algorithm:
  • Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
    Q(h'|h) ← E[ln P(Y|h') | h, X]
  • Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
    h ← argmax_{h'} Q(h'|h)