W8-Supervised Learning Methods
RSCI
Dr. Ayesha Kashif
• Bayesian Inference
– Naïve Bayes Classifier
• Predictive Regression
– Linear Regression
– Logistic Regression
Bayesian Classification: Why?
– A statistical classifier:
• performs probabilistic prediction, i.e., predicts class membership
probabilities
– Foundation:
• Based on Bayes’ Theorem.
– Performance:
• A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers
– Incremental:
• Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
• prior knowledge can be combined with observed data
Bayes’ Theorem: Basics
– Bayes’ Theorem:
P(H|X) = P(X|H) P(H) / P(X)
• Let X be a data sample (“evidence”): class
label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (i.e., the posterior
probability): the probability that the hypothesis holds given
the observed data sample X
• P(H) (prior probability): the initial probability
– E.g., X will buy computer, regardless of age,
income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of
observing the sample X, given that the
hypothesis holds
– E.g., Given that X will buy computer, the prob.
that X is 31..40, medium income
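A small Python sketch of the quantities above; the probabilities are made-up placeholders, not values from the lecture, and only illustrate how the posterior is obtained from the prior, the likelihood, and the evidence.

```python
# Hypothetical numbers, only to illustrate Bayes' theorem; they are not
# taken from the lecture's dataset.
p_h = 0.30          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.25  # P(X|H): likelihood of observing this profile among buyers
p_x = 0.20          # P(X): overall probability of observing this profile

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.3f}")   # 0.375
```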
Prediction Based on Bayes’ Theorem
• Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Informally, this can be viewed as
posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
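The decision rule can be sketched in a few lines of Python. The priors and likelihoods below reuse the numbers from the worked example later in this lecture; since P(X) is identical for every class, it can be dropped when comparing classes.

```python
# Minimal sketch of the maximum-posterior decision rule. In practice the
# priors and likelihoods are estimated from the training data.
priors = {"yes": 0.643, "no": 0.357}        # P(Ci)
likelihoods = {"yes": 0.044, "no": 0.019}   # P(X|Ci)

# Predict the class Ci that maximizes P(X|Ci) * P(Ci); P(X) is a common
# factor across classes and does not affect the comparison.
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)
print(scores)       # {'yes': ~0.028, 'no': ~0.007}
print(prediction)   # 'yes'
```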
Classification Is to Derive the Maximum A Posteriori
• Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …,
xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum a posteriori probability, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• A simplified (naïve) assumption: attributes are conditionally independent
given the class, i.e., P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: only the class distribution
needs to be counted
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ,
and P(xk|Ci) is
g(x, μ, σ) = (1 / (√(2π) σ)) · e^( −(x − μ)² / (2σ²) )
P(xk|Ci) = g(xk, μCi, σCi)
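A minimal sketch of this Gaussian estimate for a continuous attribute; the mean and standard deviation below are hypothetical class statistics, not values from the lecture's dataset.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(xk|Ci) for a continuous attribute, modeled as a Gaussian with the
    class-conditional mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical class statistics for an attribute such as age within class Ci:
mu_ci, sigma_ci = 38.0, 12.0
print(gaussian_likelihood(35.0, mu_ci, sigma_ci))  # density of age = 35 under Ci
```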
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayes Classifier: An Example
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
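A short Python sketch that reproduces the hand computation above directly from the training table; the variable names are my own, not part of the lecture.

```python
from collections import Counter

# The 14 training tuples from the slide:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # tuple to classify

class_counts = Counter(row[-1] for row in data)   # 9 'yes', 5 'no'
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(data)                       # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                 # naïve independence assumption
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c               # P(xk|Ci)
    scores[c] = likelihood * prior                # P(X|Ci) * P(Ci)

print(scores)                        # {'no': ~0.007, 'yes': ~0.028}
print(max(scores, key=scores.get))   # 'yes'
```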
Exercise
For the simple linear model y = α + βx, minimizing the sum of squared errors
gives two normal equations, which may be solved simultaneously to yield
computing formulas for α and β. Using standard relations for the mean values,
the regression coefficients for this simple case of optimization are
Slope:     β = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
Intercept: α = ȳ − β · x̄
Beta equals the covariance between x and y divided by the variance of x.
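A brief Python sketch of these least-squares formulas; the data points are made up for illustration and are not the example's training data.

```python
# Made-up sample points, only to show how the slope and intercept formulas
# are applied.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# beta = covariance(x, y) / variance(x)
beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs))
alpha = mean_y - beta * mean_x   # intercept = mean(y) - beta * mean(x)

print(f"slope beta = {beta:.3f}, intercept alpha = {alpha:.3f}")
```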
Linear Regression: Example
– Training Data: samples of two attributes A and B, with meanA = 5 and meanB = 6
• The α and β coefficients can be calculated from the previous formulas using
these mean values