
Bayesian Learning

Bayesian Learning
• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
2
Two Roles for Bayesian Methods
• Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

• Provides useful conceptual framework
  • Provides “gold standard” for evaluating other learning algorithms
  • Additional insight into Occam’s razor

3
Bayes Theorem

  P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior)
• P(D|h) = probability of D given h (likelihood)

4
Choosing Hypotheses

• Generally want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis hMAP:

  hMAP = argmax_{h∈H} P(h|D)
       = argmax_{h∈H} P(D|h) P(h) / P(D)
       = argmax_{h∈H} P(D|h) P(h)

• If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

  hML = argmax_{h∈H} P(D|h)

5
Bayes Theorem
• Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease is
not present. Furthermore, .008 of the entire population have
this cancer.

P(cancer) = .008          P(¬cancer) = .992
P(+|cancer) = .98         P(−|cancer) = .02
P(+|¬cancer) = .03        P(−|¬cancer) = .97

6
Basic Formulas for Probabilities
• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A1,…, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then
  P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)

7
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h∈H} P(h|D)
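A minimal sketch of this brute-force learner; the inputs `hypotheses`, `prior`, and `likelihood` are hypothetical names, assumed to be supplied as plain Python functions:

```python
# Brute-force MAP learner: score every hypothesis by its (unnormalized) posterior.
#   prior(h)         -> P(h)
#   likelihood(D, h) -> P(D|h)
def brute_force_map(hypotheses, prior, likelihood, D):
    # P(D) is the same for every h, so it can be dropped from the argmax
    scores = {h: likelihood(D, h) * prior(h) for h in hypotheses}
    return max(scores, key=scores.get)   # hMAP

# toy usage: three hypotheses with a uniform prior
hs = ["h1", "h2", "h3"]
print(brute_force_map(hs,
                      prior=lambda h: 1 / 3,
                      likelihood=lambda D, h: {"h1": 0.2, "h2": 0.7, "h3": 0.1}[h],
                      D=None))   # -> "h2"
```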

8
Relation to Concept Learning(1/2)

• Consider our usual concept learning task
  • instance space X, hypothesis space H, training examples D
  • consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})
• What would Bayes rule produce as the MAP hypothesis?
• Does FindS output a MAP hypothesis?
9
Relation to Concept Learning(2/2)
• Assume fixed set of instances <x1,…, xm>
• Assume D is the set of classifications: D = <c(x1),…,c(xm)>
• Choose P(D|h):
• P(D|h) = 1 if h consistent with D
• P(D|h) = 0 otherwise

• Choose P(h) to be the uniform distribution
  • P(h) = 1/|H| for all h in H

• Then,
  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and P(h|D) = 0 otherwise

10
Evolution of Posterior Probabilities

11
Characterizing Learning Algorithms
by Equivalent MAP Learners

12
Learning A Real Valued
Function(1/2)

Consider any real-valued target function f

Training examples <xi, di>, where di is a noisy training value
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean = 0

Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
  hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
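A small numeric sketch of this equivalence, assuming unit-variance Gaussian noise and two hypothetical candidate hypotheses: the one with the higher Gaussian log-likelihood is exactly the one with the lower sum of squared errors.

```python
import math, random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                        # hypothetical true target
xs = [random.uniform(0, 1) for _ in range(50)]
ds = [f(x) + random.gauss(0.0, 1.0) for x in xs]   # noisy training values d_i

def log_likelihood(h, sigma=1.0):
    # sum_i log N(d_i ; h(x_i), sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - h(x)) ** 2 / (2 * sigma**2) for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

h_good = lambda x: 2.0 * x + 1.0    # close to f
h_bad = lambda x: 0.5 * x + 3.0     # far from f
# higher log-likelihood  <=>  lower sum of squared errors
print(log_likelihood(h_good) > log_likelihood(h_bad))  # True
print(sse(h_good) < sse(h_bad))                        # True
```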
13
Learning A Real Valued
Function(2/2)

• Maximize the natural log of this instead:
  ln p(D|h) = Σ_{i=1}^{m} ( −½ ln(2πσ²) − (di − h(xi))² / (2σ²) )
  Maximizing this is equivalent to minimizing Σ_{i=1}^{m} (di − h(xi))², which gives the result on the previous slide.

14
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
• Training examples <xi, di>, where di is 1 or 0
• Want to train neural network to output a probability given xi (not a 0 or 1)
• In this case one can show
  hML = argmax_{h∈H} Σ_{i=1}^{m} di ln h(xi) + (1 − di) ln(1 − h(xi))   (the cross entropy)

• Weight update rule for a sigmoid unit:
  wjk ← wjk + Δwjk
  where Δwjk = η Σ_{i=1}^{m} (di − h(xi)) xijk
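A minimal sketch of this update for a single sigmoid unit, assuming a hypothetical two-input training set and a fixed learning rate η:

```python
import math, random

random.seed(1)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# hypothetical training set: x_i is a feature vector, d_i in {0, 1}
data = [([1.0, x], 1 if x > 0.5 else 0) for x in [random.random() for _ in range(100)]]

w = [0.0, 0.0]     # weights; the constant input 1.0 acts as a bias
eta = 0.1          # learning rate

for _ in range(200):
    # gradient-ascent step on the cross entropy: delta_w_k = eta * sum_i (d_i - h(x_i)) * x_ik
    grad = [0.0, 0.0]
    for x, d in data:
        h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        for k in range(len(w)):
            grad[k] += (d - h) * x[k]
    w = [wk + eta * gk for wk, gk in zip(w, grad)]

print(w)   # the unit's output sigmoid(w . x) now approximates P(d = 1 | x)
```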

15
Minimum Description Length Principle (1/2)
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis hMDL that minimizes
  hMDL = argmin_{h∈H} L_{C1}(h) + L_{C2}(D|h)      (1)

where L_C(x) is the description length of x under encoding C

Example: H = decision trees, D = training data labels


• L_{C1}(h) is # bits to describe tree h
• L_{C2}(D|h) is # bits to describe D given h
• Note L_{C2}(D|h) = 0 if the examples are classified perfectly by h. Need only describe the exceptions.
• Hence hMDL trades off tree size for training errors

16
Minimum Description Length Principle
(2/2)

Interesting fact from information theory:


The optimal (shortest expected coding length) code for an event with probability p is −log2 p bits.
So interpret (1):
• −log2 P(h) is the length of h under the optimal code
• −log2 P(D|h) is the length of D given h under the optimal code
⇒ prefer the hypothesis that minimizes
  length(h) + length(misclassifications)
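A tiny numeric sketch (with made-up probabilities) of the equivalence being used here: minimizing −log2 P(h) − log2 P(D|h) selects the same hypothesis as maximizing P(D|h) P(h).

```python
import math

# made-up priors and likelihoods for three hypotheses
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.01, "h2": 0.10, "h3": 0.05}

description_length = {h: -math.log2(priors[h]) - math.log2(likelihoods[h]) for h in priors}
posterior_score = {h: priors[h] * likelihoods[h] for h in priors}

print(min(description_length, key=description_length.get))   # "h2"
print(max(posterior_score, key=posterior_score.get))          # "h2" as well
```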
17
Most Probable Classification
of New Instances

• So far we’ve sought the most probable hypothesis given the data D (i.e., hMAP)
• Given a new instance x, what is its most probable classification?
• hMAP(x) is not the most probable classification!
• Consider:
• Three possible hypotheses:
P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x,
h1(x) = +, h2(x) = −, h3(x) = −
• What’s most probable classification of x?
18
Bayes Optimal Classifier
• Bayes optimal classification:
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

• Example:
  P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
  P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
  P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
therefore
  Σ_{hi∈H} P(+|hi) P(hi|D) = .4
  Σ_{hi∈H} P(−|hi) P(hi|D) = .6
and
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −
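A minimal sketch of this computation in code, using the three posteriors from the example (here each P(v|h) is 0 or 1, since each hypothesis classifies x deterministically):

```python
# Bayes optimal classification: argmax_v sum_h P(v|h) P(h|D)
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # each hypothesis's label for x

def bayes_optimal(values=("+", "-")):
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```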
19
Gibbs Classifier
• Bayes optimal classifier provides best result, but can be
expensive if many hypotheses.
• Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance
• Surprising fact: Assume target concepts are drawn at random from H according to the priors on H. Then:
  E[errorGibbs] ≤ 2 E[errorBayesOptimal]
• Suppose correct, uniform prior distribution over H, then
• Pick any hypothesis from VS, with uniform probability
• Its expected error no worse than twice Bayes optimal
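A minimal sketch of the Gibbs procedure, reusing the posteriors and per-hypothesis predictions from the previous example:

```python
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # classification of the new instance x

def gibbs_classify():
    # 1. draw one hypothesis at random according to P(h|D)
    h = random.choices(list(posteriors), weights=list(posteriors.values()))[0]
    # 2. use it to classify the new instance
    return predictions[h]

print(gibbs_classify())   # "-" with probability .6, "+" with probability .4
```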

20
Naive Bayes Classifier (1/2)
• Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.
• When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally
independent given classification
• Successful applications:
• Diagnosis
• Classifying text documents

21
Naive Bayes Classifier (2/2)
• Assume target function f : X → V, where each instance x is described by attributes <a1, a2 … an>.
• Most probable value of f(x) is:
  vMAP = argmax_{vj∈V} P(vj | a1, a2 … an)
       = argmax_{vj∈V} P(a1, a2 … an | vj) P(vj)

Naive Bayes assumption:
  P(a1, a2 … an | vj) = Π_i P(ai | vj)

which gives the Naive Bayes classifier:
  vNB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
22
Naive Bayes Algorithm
• Naive Bayes Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

• Classify New Instance(x)
  vNB = argmax_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai|vj)
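A compact sketch of these two procedures, assuming categorical attributes and examples given as (attribute-dict, label) pairs; the function and variable names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attrs, v) where attrs is a dict {attribute: value}."""
    class_counts = Counter(v for _, v in examples)
    value_counts = defaultdict(Counter)            # value_counts[(attr, v)][a_i]
    for attrs, v in examples:
        for attr, a in attrs.items():
            value_counts[(attr, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}   # P^(v_j)
    def cond(attr, a, v):                          # estimate of P^(a_i | v_j)
        return value_counts[(attr, v)][a] / class_counts[v]
    return priors, cond

def classify_new_instance(x, priors, cond):
    def score(v):
        s = priors[v]
        for attr, a in x.items():
            s *= cond(attr, a, v)
        return s
    return max(priors, key=score)                  # v_NB

# toy usage
examples = [({"sky": "sunny"}, "yes"), ({"sky": "rainy"}, "no"), ({"sky": "sunny"}, "yes")]
priors, cond = naive_bayes_learn(examples)
print(classify_new_instance({"sky": "sunny"}, priors, cond))   # "yes"
```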

23
Naive Bayes: Example
• Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
• Want to compute:
  vNB = argmax_{vj∈{y,n}} P(vj) P(sun|vj) P(cool|vj) P(high|vj) P(strong|vj)

  P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
  P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
⇒ vNB = n

24
Naive Bayes: Subtleties (1/2)
1. Conditional independence assumption is often violated

• ...but it works surprisingly well anyway. Note that we don’t need the estimated posteriors P̂(vj|x) to be correct; we need only that
  argmax_{vj∈V} P̂(vj) Π_i P̂(ai|vj) = argmax_{vj∈V} P(vj) P(a1,…, an|vj)
• see [Domingos & Pazzani, 1996] for analysis


• Naive Bayes posteriors often unrealistically close to 1 or 0

25
Naive Bayes: Subtleties (2/2)
2. What if none of the training instances with target value vj have attribute value ai? Then
  P̂(ai|vj) = 0, and so P̂(vj) Π_i P̂(ai|vj) = 0

Typical solution is a Bayesian (m-)estimate for P̂(ai|vj):
  P̂(ai|vj) = (nc + m p) / (n + m)
where
• n is the number of training examples for which v = vj
• nc is the number of examples for which v = vj and a = ai
• p is a prior estimate for P̂(ai|vj)
• m is the weight given to the prior (i.e. number of “virtual” examples)
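A one-function sketch of this m-estimate; the uniform prior p = 1/k over k attribute values used in the example call is an assumption, not from the slide:

```python
def m_estimate(nc, n, p, m):
    """Smoothed estimate of P(a_i | v_j) = (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# e.g. attribute value never seen with this class (nc = 0), 10 class examples,
# uniform prior over 3 attribute values, and m = 3 "virtual" examples
print(m_estimate(nc=0, n=10, p=1/3, m=3))   # 0.0769..., instead of a hard 0
```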
26
Learning to Classify Text (1/4)

• Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

• Naive Bayes is among most effective algorithms


• What attributes shall we use to represent text documents?

27
Learning to Classify Text (2/4)
Target concept Interesting? : Document → {+, −}
1. Represent each document by a vector of words
  • one attribute per word position in the document
2. Learning: Use training examples to estimate
  • P(+) and P(−)
  • P(doc|+) and P(doc|−)

Naive Bayes conditional independence assumption
  P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk | vj)
where P(ai = wk | vj) is the probability that the word in position i is wk, given vj

One more assumption: P(ai = wk | vj) = P(am = wk | vj), ∀ i, m
28
Learning to Classify Text (3/4)
LEARN_NAIVE_BAYES_TEXT (Examples, V)
1. collect all words and other tokens that occur in Examples
• Vocabulary  all distinct words and other tokens in
Examples
2. calculate the required P(vj) and P(wk | vj) probability terms
• For each target value vj in V do
  • docsj ← subset of Examples for which the target value is vj
  • P(vj) ← |docsj| / |Examples|
  • Textj ← a single document created by concatenating all members of docsj
29
Learning to Classify Text (4/4)
  • n ← total number of words in Textj (counting duplicate words multiple times)
  • for each word wk in Vocabulary
    • nk ← number of times word wk occurs in Textj
    • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

CLASSIFY_NAIVE_BAYES_TEXT (Doc)
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where
  vNB = argmax_{vj∈V} P(vj) Π_{i∈positions} P(ai|vj)
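A condensed sketch of LEARN_NAIVE_BAYES_TEXT / CLASSIFY_NAIVE_BAYES_TEXT, assuming documents are plain whitespace-separated strings and using log-probabilities to avoid underflow (an implementation choice, not on the slide):

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, target_value)."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for v in {label for _, label in examples}:
        docs_v = [doc for doc, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)            # P(v_j)
        text_v = " ".join(docs_v).split()                  # Text_j
        counts = Counter(text_v)
        n = len(text_v)
        # P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    positions = [w for w in doc.split() if w in vocabulary]
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
    return max(priors, key=log_score)                      # v_NB

# toy usage
examples = [("great fun movie", "+"), ("boring slow movie", "-")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("fun movie", *model))      # "+"
```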

30
Twenty NewsGroups

• Given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from

comp.graphics              misc.forsale          alt.atheism               sci.space
comp.os.ms-windows.misc    rec.autos             soc.religion.christian    sci.crypt
comp.sys.ibm.pc.hardware   rec.motorcycles       talk.religion.misc        sci.electronics
comp.sys.mac.hardware      rec.sport.baseball    talk.politics.mideast     sci.med
comp.windows.x             rec.sport.hockey      talk.politics.misc
                                                 talk.politics.guns

• Naive Bayes: 89% classification accuracy


31
Learning Curve for 20 Newsgroups

• Accuracy vs. Training set size (1/3 withheld for test)


32
Bayesian Belief Networks

Interesting because:
• Naive Bayes assumption of conditional independence too
restrictive
• But it’s intractable without some such assumptions...
• Bayesian belief networks describe conditional independence among subsets of variables
⇒ allows combining prior knowledge about (in)dependencies among variables with observed training data
(also called Bayes Nets)
33
Conditional Independence
• Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  (∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
more compactly, we write
P(X|Y, Z) = P(X|Z)

• Example: Thunder is conditionally independent of Rain, given Lightning
  P(Thunder|Rain, Lightning) = P(Thunder|Lightning)

• Naive Bayes uses conditional independence to justify
  P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
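A tiny sketch checking this identity numerically on a made-up joint distribution in which X and Y are conditionally independent given Z:

```python
from itertools import product

# made-up conditional distributions (binary variables), chosen so that X is independent of Y given Z
p_z = {0: 0.3, 1: 0.7}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}

# joint P(X, Y, Z) built from the factorization P(Z) P(X|Z) P(Y|Z)
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def p_xy_given_z(x, y, z):
    # recover P(X, Y | Z) from the joint by conditioning on Z
    return joint[(x, y, z)] / sum(joint[(xx, yy, z)] for xx, yy in product((0, 1), repeat=2))

# check P(X, Y | Z) = P(X | Z) P(Y | Z) for every assignment
print(all(abs(p_xy_given_z(x, y, z) - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
          for x, y, z in product((0, 1), repeat=3)))   # True
```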
34
Bayesian Belief Network (1/2)

• Network represents a set of conditional independence assertions:
  • Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
  • Directed acyclic graph
35
Bayesian Belief Network (2/2)
• Represents joint probability distribution over all
variables
• e.g., P(Storm, BusTourGroup, . . . , ForestFire)
• in general,
  P(y1,…, yn) = Π_{i=1}^{n} P(yi | Parents(Yi))
  where Parents(Yi) denotes the immediate predecessors of Yi in the graph
• so, the joint distribution is fully defined by the graph, plus the P(yi | Parents(Yi))
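A minimal sketch of this factorization on a tiny made-up network (Storm → Lightning → Thunder, with invented CPT numbers; this is not the network in the slide's figure):

```python
# Joint probability from a Bayes net: P(y1,...,yn) = prod_i P(yi | Parents(Yi))
parents = {"Storm": [], "Lightning": ["Storm"], "Thunder": ["Lightning"]}

# CPTs: cpt[var][(parent values...)][value] = P(var = value | parents = parent values)
cpt = {
    "Storm":     {(): {True: 0.1, False: 0.9}},
    "Lightning": {(True,): {True: 0.7, False: 0.3}, (False,): {True: 0.01, False: 0.99}},
    "Thunder":   {(True,): {True: 0.95, False: 0.05}, (False,): {True: 0.05, False: 0.95}},
}

def joint_probability(assignment):
    """P(full assignment) as the product of local conditional probabilities."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][parent_values][value]
    return p

print(joint_probability({"Storm": True, "Lightning": True, "Thunder": True}))  # 0.1 * 0.7 * 0.95
```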

36
Inference in Bayesian Networks

• How can one infer the (probabilities of) values of one or more
network variables, given observed values of others?
• Bayes net contains all information needed for this inference
• If only one variable with unknown value, easy to infer it
• In the general case, the problem is NP-hard

• In practice, can succeed in many cases


• Exact inference methods work well for some network structures
• Monte Carlo methods “simulate” the network randomly to calculate
approximate solutions

37
Learning of Bayesian Networks
• Several variants of this learning task
• Network structure might be known or unknown
• Training examples might provide values of all network variables, or just some

• If the structure is known and we observe all variables
  • Then it’s as easy as training a Naive Bayes classifier

38
Learning Bayes Nets
• Suppose structure known, variables partially observable
• e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not
Lightning, Campfire...
• Similar to training neural network with hidden units
• In fact, can learn network conditional probability tables using gradient ascent!
• Converge to network h that (locally) maximizes P(D|h)

39
Gradient Ascent for Bayes Nets
• Let wijk denote one entry in the conditional probability
table for variable Yi in the network
wijk = P(Yi = yij|Parents(Yi) = the list uik of values)
• e.g., if Yi = Campfire, then uik might be
<Storm = T, BusTourGroup = F >
• Perform gradient ascent by repeatedly
  1. updating all wijk using training data D:
     wijk ← wijk + η Σ_{d∈D} P_h(Yi = yij, Ui = uik | d) / wijk
  2. then renormalizing the wijk to assure
     Σ_j wijk = 1 and 0 ≤ wijk ≤ 1
40
More on Learning Bayes Nets
• EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new wijk to maximize E[ln P(D|h)] where D now includes both
observed and (calculated probabilities of) unobserved variables

• When structure unknown...


• Algorithms use greedy search to add/subtract edges and nodes
• Active research topic

41
Summary: Bayesian Belief Networks
• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample
complexity
• Active research area
• Extend from boolean to real-valued variables
• Parameterized distributions instead of tables
• Extend to first-order instead of propositional systems
• More effective inference methods
• …

42
Expectation Maximization (EM)
• When to use:
• Data is only partially observable
• Unsupervised clustering (target value unobservable)
• Supervised learning (some instance attributes unobservable)
• Some uses:
• Train Bayesian Belief Networks
• Unsupervised clustering (AUTOCLASS)
• Learning Hidden Markov Models

43
Generating Data from Mixture of k Gaussians

• Each instance x generated by


1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that
Gaussian
44
EM for Estimating k Means (1/2)
• Given:
• Instances from X generated by mixture of k Gaussian distributions
• Unknown means <μ1,…, μk> of the k Gaussians
• Don’t know which instance xi was generated by which Gaussian
• Determine:
• Maximum likelihood estimates of <μ1,…, μk>

• Think of the full description of each instance as yi = <xi, zi1, zi2>, where
• zij is 1 if xi was generated by the jth Gaussian
• xi is observable
• zij is unobservable

45
EM for Estimating k Means (2/2)
• EM Algorithm: Pick a random initial h = <μ1, μ2>, then iterate:

E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds:
  E[zij] = p(x = xi | μ = μj) / Σ_{n=1}^{2} p(x = xi | μ = μn)
         = exp(−(xi − μj)² / 2σ²) / Σ_{n=1}^{2} exp(−(xi − μn)² / 2σ²)

M step: Calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above:
  μ'j ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]
Then replace h = <μ1, μ2> by h' = <μ'1, μ'2>.
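A minimal sketch of these two steps for k = 2 one-dimensional Gaussians with a known, shared variance; the data and the initialization are made up for illustration:

```python
import math, random

random.seed(2)
sigma = 1.0
# made-up data drawn from two Gaussians with true means 0 and 5
xs = [random.gauss(0.0, sigma) for _ in range(100)] + [random.gauss(5.0, sigma) for _ in range(100)]

mu = [min(xs), max(xs)]    # simple deterministic initial hypothesis h = <mu1, mu2>

for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    E = []
    for x in xs:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        E.append([wj / sum(w) for wj in w])
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(E[i][j] * xs[i] for i in range(len(xs))) / sum(E[i][j] for i in range(len(xs)))
          for j in range(2)]

print(sorted(mu))   # approximately [0, 5]
```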

46
EM Algorithm
• Converges to local maximum likelihood h and
provides estimates of hidden variables zij

• In fact, a local maximum in E[ln P(Y|h)]
  • Y is the complete (observable plus unobservable) data
  • The expected value is taken over the possible values of the unobserved variables in Y

47
General EM Problem
• Given:
• Observed data X = {x1,…, xm}
• Unobserved data Z = {z1,…, zm}
• Parameterized probability distribution P(Y|h), where
  • Y = {y1,…, ym} is the full data, yi = xi ∪ zi
  • h are the parameters
• Determine: h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
• Train Bayesian belief networks
• Unsupervised clustering (e.g., k means)
• Hidden Markov Models

48
General EM Method
• Define a likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:
  Q(h'|h) ← E[ln P(Y|h') | h, X]
• EM Algorithm:
• Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
  Q(h'|h) ← E[ln P(Y|h') | h, X]

• Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.

49
