
W8-Supervised Learning Methods


Artificial Intelligence

RSCI
Dr. Ayesha Kashif
• Bayesian Inference
– Naïve Bayes Classifier
• Predictive Regression
– Linear Regression
– Logistic Regression
Bayesian Classification: Why?
– A statistical classifier:
• performs probabilistic prediction, i.e., predicts class membership
probabilities
– Foundation:
• Based on Bayes’ Theorem.
– Performance:
• A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers
– Incremental:
• Each training example can incrementally increase/decrease the
probability that a hypothesis is correct
• prior knowledge can be combined with observed data
Bayes’ Theorem: Basics
– Bayes’ Theorem:
P(H | X) = P(X | H) P(H) / P(X)
• Let X be a data sample (“evidence”): class
label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (i.e., the
posterior probability): the probability that the
hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
– E.g., X will buy computer, regardless of age,
income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of
observing the sample X, given that the
hypothesis holds
– E.g., Given that X will buy computer, the prob.
that X is 31..40, medium income
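As a quick illustration, the sketch below plugs hypothetical values for the prior, likelihood, and evidence into Bayes' theorem; the numbers are made up for illustration and are not from the slides.

```python
# Minimal sketch of Bayes' theorem with hypothetical numbers.
p_h = 0.4          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.3  # P(X|H): likelihood of observing this evidence among buyers
p_x = 0.2          # P(X): overall probability of observing the evidence

# Bayes' theorem: posterior = likelihood * prior / evidence
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6
```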
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H | X) = P(X | H) P(H) / P(X)

P(Ci | X) = P(X | Ci) P(Ci) / P(X)
• Informally, this can be viewed as
posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the
Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …,
xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal
P(Ci|X)
• This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only
P(Ci | X) = P(X | Ci) P(Ci)
needs to be maximized
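A minimal sketch of this maximum-posteriori decision rule: since P(X) is the same for every class, it is enough to compare P(X|Ci)·P(Ci). The values below are hypothetical and happen to anticipate the buys_computer example worked out later.

```python
# Pick the class that maximizes P(X|Ci) * P(Ci).
likelihoods = {"C1": 0.044, "C2": 0.019}   # P(X|Ci)
priors      = {"C1": 0.643, "C2": 0.357}   # P(Ci)

scores = {c: likelihoods[c] * priors[c] for c in priors}
predicted = max(scores, key=scores.get)
print(scores, "->", predicted)   # C1 wins: ~0.028 vs ~0.007
```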
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent
(i.e., no dependence relation between attributes):
P(X | Ci) = ∏(k = 1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

• This greatly reduces the computation cost: Only counts the class
distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ
( x− )2
1 −
and P(xk|Ci) is g ( x,  ,  ) = e 2 2

2 

P ( X | C i ) = g ( xk ,  Ci ,  Ci )
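For a continuous attribute, the class-conditional density above can be evaluated directly. A minimal sketch (the function name and the sample numbers are illustrative, not from the slides):

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): Gaussian density used for P(x_k | C_i) on continuous attributes."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical example: attribute value 35, class mean 38, class standard deviation 12.
print(gaussian(35, 38, 12))
```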
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
P(Ci | X) = P(X | Ci) P(Ci)

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayes Classifier: An Example
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
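The same calculation can be reproduced by simple counting over the training table. A minimal sketch (the tuple encoding and attribute order are my own):

```python
# Training tuples: (age, income, student, credit_rating, buys_computer) from the table above.
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")  # sample X to classify

scores = {}
for c in ("yes", "no"):
    rows = [t for t in data if t[-1] == c]
    prior = len(rows) / len(data)                       # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                       # P(X|Ci) = product of P(x_k|Ci)
        likelihood *= sum(1 for t in rows if t[k] == value) / len(rows)
    scores[c] = likelihood * prior                      # P(X|Ci) * P(Ci)

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # 'yes'
```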
Exercise
• Given the table above, predict the classification of the new sample X = {1, 2, 2, class = ?}.
Predictive Regression
Linear Regression
• The prediction of continuous values can be modeled by a statistical
technique called regression.
• Regression analysis is the process of determining how a variable Y is
related to one or more other variables x1, x2, …, xn.
• The relationship that fits a set of data is characterized by a prediction
model called a regression equation.

• Common reasons for performing regression analysis include:
1. the output is expensive to measure but the inputs are not, and so a
cheap prediction of the output is sought;
2. the values of the inputs are known before the output is known, and a
working prediction of the output is required;
3. by controlling the input values, we can predict the behavior of the
corresponding outputs; and
4. there might be a causal link between some of the inputs and the
output, and we want to identify the links.
Regression And Model Building
• An engineer visits 25 randomly chosen retail outlets having
vending machines, and the in-outlet delivery time (in minutes)
and the volume of product delivered (in cases) are observed for
each.
• Plotting these observations gives a scatter diagram. This display clearly
suggests a relationship between delivery time and delivery
volume.
Regression And Model Building
• Correlation coefficients measure the strength and sign of a
relationship, but not the slope.
• There are several ways to estimate the slope; the most
common is a linear least squares fit.
• A “linear fit” is a line intended to model the relationship
between variables.
• A “least squares” fit is one that minimizes the mean
squared error (MSE) between the line and the data.
Linear Regression
• Equation of a straight line: Y = mX + b (equivalently, Y = b + mX)
• Where Y represents the dependent variable
• X represents the independent variable
• ‘b’ represents the Y-intercept (i.e. the value of Y when X is equal to zero)
• ‘m’ represents the slope of the line
(i.e. the value of tan Θ, where Θ represents the angle between the line and the
horizontal axis)
Linear Regression
• Linear regression with one input variable is the
simplest form of regression. It models a random
variable Y (called a response variable) as a linear
function of another random variable X (called a
predictor variable).
• Given n samples or data points of the form (x1, y1),
(x2, y2), …, (xn, yn), where xi ∈ X and yi ∈ Y, linear
regression can be expressed as
Y = α + β X + ε
• where the intercept α and slope β are unknown
constants or regression coefficients, and ε is a
random error component.
Linear Regression
– Find the Least Square Error
• minimizes the error between the actual
data points and the estimated line
• LS minimizes the Sum of the Squared
Differences (errors) (SSE):
SSE = Σ(i = 1..n) (yi − yi’)²
• where yi is the real output value given
in the data set, and yi’ is the response
value obtained from the model.
• Squaring has the obvious feature of
treating positive and negative residuals
the same.
Linear Regression: Regression coefficients
Differentiating SSE with respect to α and β, setting the partial derivatives equal to
zero (minimization of the total error), and rearranging the terms yields two equations
which may be solved simultaneously to give computing formulas for α and β. Using
standard relations for the mean values, the regression coefficients for this simple case of
optimization are
Slope:     β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
Intercept: α = ȳ − β x̄
Beta equals the covariance between x and y divided by the variance of x.
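A minimal sketch of these computing formulas; the data values below are hypothetical and only illustrate the calculation:

```python
# Ordinary least squares for one input: beta = Cov(x, y) / Var(x), alpha = mean(y) - beta * mean(x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical inputs
ys = [1.9, 4.1, 6.0, 8.2, 9.8]   # hypothetical outputs

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs))
alpha = mean_y - beta * mean_x

print(alpha, beta)   # intercept and slope of the fitted line y' = alpha + beta * x
```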
Linear Regression: Example
– Training Data
• where α and β coefficients can be calculated based on previous
formulas (using meanA = 5, and meanB = 6), and they have the
values

• The optimal regression line is


Linear Regression: Goodness of fit
– Mean Square Error
• Suppose that you are trying to guess someone’s weight. If
you didn’t know anything about them, your best strategy
would be to guess ȳ; in that case the MSE of your guesses
would be Var(Y):
MSE = (1/n) Σ (yi − ȳ)² = Var(Y)
Linear Regression: Goodness of fit
• The number given by MSE is still hard to interpret on its own. Is this a good
prediction?
• To measure the predictive power of a model, we can compute the
coefficient of determination, more commonly known as “R-squared”:
Linear Regression: Goodness of fit
• To measure the predictive power of a model, we can
compute the coefficient of determination, more commonly
known as “R-squared”:

R² = 1 − Var(ε) / Var(Y)
• So the term Var(ε)/Var(Y) is the ratio of mean squared
error with and without the explanatory variable, which is
the fraction of variability left unexplained by the model.
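A minimal sketch of both quantities, using hypothetical data and hypothetical regression coefficients (not the lecture's numbers):

```python
# MSE of the residuals and R^2 = 1 - Var(eps) / Var(Y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.0, 8.2, 9.8]
alpha, beta = 0.0, 2.0                   # hypothetical regression coefficients

n = len(ys)
mean_y = sum(ys) / n
preds = [alpha + beta * x for x in xs]

var_y   = sum((y - mean_y) ** 2 for y in ys) / n            # MSE when always guessing the mean
var_eps = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n  # MSE of the model (residual variance)
r_squared = 1 - var_eps / var_y

print(var_eps, r_squared)
```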
Linear Regression
– Quality of the linear regression model
• One parameter, which expresses the strength of the linear association
between two variables by means of a single number, is called the
correlation coefficient r.
r = Cov(x, y) / (σx · σy)
• where Cov(x, y) is the covariance between x and y, and σx and σy are the
standard deviations of x and y.
• A correlation coefficient r = 0.85 indicates a good linear
relationship between two variables.
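A minimal sketch of r on hypothetical data (the same illustrative values used above):

```python
import math

# Correlation coefficient r = Cov(x, y) / (std(x) * std(y)).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.0, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov   = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

print(cov / (std_x * std_y))   # close to 1.0 for this nearly linear data
```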
Logistic Regression
Logistic Regression
– Probability of dependent variable
• Rather than predicting the value of the dependent variable,
the logistic regression method tries to estimate the
probability that the dependent variable will have a given
value.
– Customer Credit Rating example
• If the estimated probability is greater than 0.50 then the
prediction is closer to YES (a good credit rating),
• otherwise the output is closer to NO (a bad credit rating is
more probable).
Logistic Regression
– Odds Ratio
• Logistic regression uses the concept of odds ratios to
calculate the probability.
• For example, the probability of a sports team to win a
certain match might be 0.75.
• The probability of that team losing would be 1 – 0.75 = 0.25.
• The odds of that team winning would then be 0.75/0.25
= 3.
• This can be said as the odds of the team winning are 3 to
1.
Logistic Regression
– Linear Logistic Model
• Suppose that output Y has two possible categorical values
coded as 0 and 1. (output is a vector)

logit(pj) = log(pj / [1 − pj]) = α + β1·x1j + β2·x2j + … + βn·xnj
• This equation is known as the linear logistic model. The
function log(pj /[1 − pj]) is often written as logit(p).
• The main reason for using the logit form of output is to
prevent the predicted probabilities from becoming values
out of the required range [0, 1].
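A minimal sketch of the model: the coefficients below are hypothetical, not the ones used in the lecture's example.

```python
import math

# Linear logistic model: logit(p) = alpha + b1*x1 + b2*x2 + b3*x3, then p = 1 / (1 + exp(-logit)).
alpha, betas = 1.5, [-1.0, 0.5, 2.0]     # hypothetical regression coefficients
x = [1, 0, 1]                            # input sample {x1, x2, x3}

logit = alpha + sum(b * xi for b, xi in zip(betas, x))
p = 1.0 / (1.0 + math.exp(-logit))       # probability that Y = 1, kept inside [0, 1]

print(logit, p)
print("predict Y=1" if p > 0.5 else "predict Y=0")
```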
Logistic Regression
– Example
• Suppose that the new sample for classification has input values
{x1, x2, x3} = {1, 0, 1}.
• Using the linear logistic model, it is possible to estimate the probability
of the output value 1, p(Y = 1), for this sample.
• First, calculate the corresponding logit(p),
• and then the probability of the output value 1 for the given inputs.
• Based on the final value for probability p, we
may conclude that the output value Y = 1 is more
probable than the other categorical value Y = 0.
• Curves of the form p = 1 / (1 + e^(−logit(p))) are called sigmoidal
because they are S-shaped and nonlinear.
References
• Allen B. Downey, Think Stats, O’Reilly Media, Inc. (2018)
• https://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares
• https://towardsdatascience.com/linear-regression-derivation-d362ea3884c2
