
CH 5


 Following are examples of cases where the data analysis
task is Classification −
› A bank loan officer wants to analyze the data in order to
know which customers (loan applicants) are risky and which
are safe.
› A marketing manager at a company needs to predict
whether a customer with a given profile will buy a new
computer.
 Following are examples of cases where the data analysis
task is Prediction −
› Suppose the marketing manager needs to predict how
much a given customer will spend during a sale at his
company. In this example we are interested in predicting a
numeric value.
 Classification
› predicts categorical class labels (discrete or
nominal)
› classifies data (constructs a model) based on
the training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
 Prediction
› A prediction model predicts continuous-valued
functions
 Typical applications
› Medical diagnosis
› Fraud detection
› Loan approval
› Target marketing
 Model construction: describing a set of predetermined
classes
› Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
› The set of tuples used for model construction is the training set
› The model is represented as classification rules, decision
trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
› Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that
are correctly classified by the model
 Test set is independent of training set, otherwise over-
fitting will occur
› If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces a classifier (model).

Training Data:
    NAME   RANK             YEARS   TENURED
    Mike   Assistant Prof   3       no
    Mary   Assistant Prof   7       yes
    Bill   Professor        2       yes
    Jim    Associate Prof   7       yes
    Dave   Assistant Prof   6       no
    Anne   Associate Prof   3       no

Resulting classifier (model):
    IF rank = ‘professor’ OR years > 6
    THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied to testing data to estimate accuracy, and then to unseen data.

Testing Data:
    NAME      RANK             YEARS   TENURED
    Tom       Assistant Prof   2       no
    Merlisa   Associate Prof   7       no
    George    Professor        5       yes
    Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
 Machine learning is building machines
that can adapt and learn from
experience without being explicitly
programmed.
 Supervised learning (classification)
› Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
› New data is classified based on the training set
 Unsupervised learning (clustering)
› The class labels of training data are unknown
› Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
 Data cleaning
› Preprocess data in order to reduce noise
and handle missing values
 Relevance analysis (feature selection)
› Remove the irrelevant or redundant
attributes
 Data transformation
› Generalize and/or normalize data
 Accuracy
› classifier accuracy: predicting class label
› predictor accuracy: guessing value of predicted
attributes
 Speed
› time to construct the model (training time)
› time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: the ability to construct the classifier or
predictor efficiently given large amounts of data
 Interpretability
› understanding and insight provided by the model

 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
 Decision tree induction
 Bayesian classifier
 Rule based classifier
 Artificial neural network
 Nearest neighbour classifier
 Support vector machine
 Decision tree induction is the learning of decision trees
from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure, where
each internal node (non-leaf node) denotes a test on
an attribute, each branch represents an outcome of
the test, and each leaf node (or terminal node) holds a
class label.
Output: A Decision Tree for “buys_computer”

How to Use Decision Trees

Example tuple: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak → PlayTennis = No

Decision tree for PlayTennis:
    Outlook?
    ├─ Sunny    → Humidity? (High → No, Normal → Yes)
    ├─ Overcast → Yes
    └─ Rain     → Wind? (Strong → No, Weak → Yes)
“How are decision trees used for classification?”
› Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested
against the decision tree.
› A path is traced from the root to a leaf node, which holds
the class prediction for that tuple. Decision trees can easily
be converted to classification rules.
The benefits of having a decision tree are as follows −
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are
simple and fast.
Decision Tree Induction Algorithms −
 ID3
 C4.5
 CART
 Basic algorithm (a greedy algorithm)
› Tree is constructed in a top-down recursive divide-and-
conquer manner
› At start, all the training examples are at the root
› Attributes are categorical (if continuous-valued, they are
discretized in advance)
› Examples are partitioned recursively based on selected
attributes
› Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
› All samples for a given node belong to the same class
› There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
› There are no samples left
 An attribute selection measure is a heuristic for selecting the
splitting criterion that “best” separates a given data partition,
D, of class-labeled training tuples into individual classes.

 The attribute selection measure provides a ranking for each
attribute describing the given training tuples. The attribute
having the best score for the measure is chosen as the
splitting attribute for the given tuples
 Attribute selection measures −
› information gain,
› gain ratio,
› gini index
 ID3 uses information gain as its attribute selection measure.
 The attribute with the highest information gain is chosen as
the splitting attribute for node N
 This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions
 Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times I(D_j)

 Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
 Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

    Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

 Per-partition counts for age:

    age       pi   ni   I(pi, ni)
    <=30      2    3    0.971
    31...40   4    0    0
    >40       3    2    0.971

    Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

    Here \frac{5}{14} I(2,3) means that the branch “age <= 30” has 5 out of 14 samples,
    with 2 yes’es and 3 no’s. Hence

    Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

 Training data D (buys_computer):

    age       income   student   credit_rating   buys_computer
    <=30      high     no        fair            no
    <=30      high     no        excellent       no
    31...40   high     no        fair            yes
    >40       medium   no        fair            yes
    >40       low      yes       fair            yes
    >40       low      yes       excellent       no
    31...40   low      yes       excellent       yes
    <=30      medium   no        fair            no
    <=30      low      yes       fair            yes
    >40       medium   yes       fair            yes
    <=30      medium   yes       excellent       yes
    31...40   medium   no        excellent       yes
    31...40   high     yes       fair            yes
    >40       medium   no        excellent       no
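As a sanity check on the numbers above, here is a minimal Python sketch (standard library only) that recomputes Info(D) and the information gains from the 14 training tuples. The `data` list, `ATTRS` map, and function names are our own naming, not part of the slides; later sketches in these notes reuse them.

```python
# Recompute Info(D) and Gain(A) for the 14-tuple buys_computer data above.
from math import log2
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(tuples):
    """Expected information (entropy) of the class label in a partition."""
    counts = Counter(t[-1] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(tuples, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    idx = ATTRS[attr]
    partitions = Counter(t[idx] for t in tuples)
    info_a = sum(
        (n / len(tuples)) * info([t for t in tuples if t[idx] == v])
        for v, n in partitions.items()
    )
    return info(tuples) - info_a

print(round(info(data), 3))            # ≈ 0.940
for a in ATTRS:                        # age 0.246, income 0.029,
    print(a, round(gain(data, a), 3))  # student 0.151, credit_rating 0.048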
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
› Sort the values of A in increasing order
› Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
› The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split (see the sketch below):
› D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
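A minimal sketch of this midpoint search, assuming a small made-up list of (value, label) pairs; `best_split_point` is an illustrative name, not an algorithm from the slides.

```python
# Sort the values, try the midpoint of each adjacent pair, and keep the split
# with the lowest expected information requirement.
from math import log2
from collections import Counter

def info(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def best_split_point(samples):
    """samples: list of (numeric_value, class_label) tuples."""
    samples = sorted(samples)                  # sort by attribute value
    best_point, best_info = None, float("inf")
    for (a, _), (b, _) in zip(samples, samples[1:]):
        if a == b:
            continue
        mid = (a + b) / 2                      # candidate split point
        left = [lbl for v, lbl in samples if v <= mid]
        right = [lbl for v, lbl in samples if v > mid]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(samples)
        if expected < best_info:
            best_point, best_info = mid, expected
    return best_point, best_info

# Made-up (age, label) pairs for illustration only.
ages = [(23, "no"), (27, "no"), (29, "yes"), (34, "yes"), (41, "yes"), (52, "no")]
print(best_split_point(ages))                  # (28.0, ...) for this toy data
```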
 Information gain measure is biased towards
attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to
overcome the problem (normalization of information gain):

    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

› GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. For income (4 low, 6 medium, 4 high tuples):

    SplitInfo_income(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557

› gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected
as the splitting attribute
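A short sketch of the gain-ratio calculation, reusing `data`, `ATTRS`, and `gain()` from the information-gain sketch above; the printed values are for the income attribute.

```python
# Gain ratio = Gain(A) / SplitInfo_A(D); requires data, ATTRS, gain() from above.
from math import log2
from collections import Counter

def split_info(tuples, idx):
    counts = Counter(t[idx] for t in tuples)
    total = len(tuples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(tuples, attr):
    return gain(tuples, attr) / split_info(tuples, ATTRS[attr])

print(round(split_info(data, ATTRS["income"]), 3))   # ≈ 1.557
print(round(gain_ratio(data, "income"), 3))          # ≈ 0.019
```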
 If a data set D contains examples from n classes, the gini index,
gini(D), is defined as

    gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as

    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

 Reduction in impurity:

    \Delta gini(A) = gini(D) - gini_A(D)

 The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”

    gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

 Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 tuples in D2: {high}

    gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2) = 0.443

Similarly, gini_{income \in \{low, high\}}(D) = 0.458 and gini_{income \in \{medium, high\}}(D) = 0.450,
so the split on {low, medium} and {high} is the best since it has the lowest gini index.
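A sketch that recomputes the Gini numbers above, again reusing the `data` list from the information-gain sketch; `gini_split` is our own helper name.

```python
# Gini index for the binary income splits on the 14-tuple buys_computer data.
from collections import Counter

def gini(tuples):
    counts = Counter(t[-1] for t in tuples)
    total = len(tuples)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(tuples, idx, subset):
    """Gini index of splitting on attribute `idx` into `subset` vs. the rest."""
    d1 = [t for t in tuples if t[idx] in subset]
    d2 = [t for t in tuples if t[idx] not in subset]
    n = len(tuples)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini(data), 3))                              # ≈ 0.459
print(round(gini_split(data, 1, {"low", "medium"}), 3))  # ≈ 0.443 (best)
print(round(gini_split(data, 1, {"medium", "high"}), 3)) # ≈ 0.450
print(round(gini_split(data, 1, {"low", "high"}), 3))    # ≈ 0.458
```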
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split
values
 Can be modified for categorical attributes
 The three measures, in general, return good results
but
› Information gain:
 biased towards multivalued attributes
› Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
› Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized
partitions and purity in both partitions
 CHAID: a popular decision tree algorithm, measure based on χ2
test for independence
 C-SEP: performs better than info. gain and gini index in certain
cases
 G-statistic: has a close approximation to the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest
solution is preferred):
› The best tree as the one that requires the fewest # of bits to
both (1) encode the tree, and (2) encode the exceptions to
the tree
 Multivariate splits (partition based on multiple variable
combinations)
› CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
› Most give good results, but none is significantly superior to the others
 Overfitting: An induced tree may overfit the training data
› Too many branches, some may reflect anomalies due to noise
or outliers
› Poor accuracy for unseen samples
 Two approaches to avoid overfitting
› Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
 Difficult to choose an appropriate threshold
› Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
 Use a set of data different from the training data to decide
which is the “best pruned tree”
 Allow for continuous-valued attributes
› Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
 Handle missing attribute values
› Assign the most common value of the attribute
› Assign probability to each of the possible values
 Attribute construction
› Create new attributes based on existing ones that
are sparsely represented
› This reduces fragmentation, repetition, and
replication
 “What if D, the disk-resident training set of class-labeled
tuples, does not fit in memory? In other words, how
scalable is decision tree induction?”
 In data mining applications, very large training sets of
millions of tuples are common.
 Most often, the training data will not fit in memory!
Therefore, decision tree construction becomes
inefficient due to swapping of the training tuples in and
out of main and cache memories.
 More scalable approaches, capable of handling
training data that are too large to fit in memory, are
required.
 Scalable decision tree induction methods:
› RainForest
› BOAT
 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
› Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
› Set of AVC-sets of all predictor attributes at the node n
Training examples: the 14-tuple buys_computer data shown earlier.

AVC-set on age:
    age       buys_computer = yes   buys_computer = no
    <=30      2                     3
    31...40   4                     0
    >40       3                     2

AVC-set on income:
    income    buys_computer = yes   buys_computer = no
    high      2                     2
    medium    4                     2
    low       3                     1

AVC-set on student:
    student   buys_computer = yes   buys_computer = no
    yes       6                     1
    no        3                     4

AVC-set on credit_rating:
    credit_rating   buys_computer = yes   buys_computer = no
    fair            6                     2
    excellent       3                     3
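A rough sketch of how an AVC-group could be built in one pass over a node's tuples, reusing `data` and `ATTRS` from earlier; this illustrates the idea, not the RainForest implementation itself.

```python
# Build {attribute: {value: {class_label: count}}} for one tree node.
from collections import defaultdict

def avc_group(tuples, attrs):
    """Set of AVC-sets (Attribute, Value, Class_label counts) for a node."""
    group = {a: defaultdict(lambda: defaultdict(int)) for a in attrs}
    for t in tuples:
        label = t[-1]
        for a, idx in attrs.items():
            group[a][t[idx]][label] += 1
    return group

avc = avc_group(data, ATTRS)
print(dict(avc["age"]["31...40"]))   # {'yes': 4}
print(dict(avc["income"]["low"]))    # {'yes': 3, 'no': 1}
```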
 Use a statistical technique called bootstrapping to
create several smaller samples (subsets), each of which fits in
memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
› It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.
 In rule-based classifiers, the learned model is
represented as a set of IF-THEN rules.
 Rules can be generated from a decision tree or directly
from the training data using a sequential covering
algorithm.
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
› Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
› ncovers = # of tuples covered by R
› ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, need conflict resolution
› Size ordering: assign the highest priority to the triggering
rule that has the “toughest” requirement, where
toughness is measured by the rule antecedent size. That is,
the triggering rule with the most attribute tests is fired.
› Class-based ordering: the classes are sorted in order of
decreasing “importance” such as by decreasing order of
prevalence or misclassification cost per class
› Rule-based ordering (decision list): rules are organized into
one long priority list, according to some measure of rule
quality such as accuracy, coverage, or size or by experts.
The triggering rule that appears earliest in the list has the
highest priority, and so it gets to fire its class prediction.
 Rules are easier to understand than large trees
 One rule is created for each path from the
root to a leaf
 Each attribute-value pair along a path forms
a conjunction; the leaf holds the class
prediction

Decision tree for buys_computer:
    age?
    ├─ <=30   → student? (no → no, yes → yes)
    ├─ 31..40 → yes
    └─ >40    → credit rating? (excellent → yes, fair → no)
 Example: Rule extraction from our buys_computer decision-tree
 Rules are mutually exclusive and exhaustive
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
› Rules are learned one at a time
› Each time a rule is learned, the tuples covered by the rules are
removed
› The process repeats on the remaining tuples until a termination
condition holds, e.g., when there are no more training examples or
when the quality of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules
simultaneously
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy
› Picks the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
› FOIL-gain (in FOIL & RIPPER): assesses info_gain by extending the condition

    FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)

It favors rules that have high accuracy and cover many positive
tuples
 Rule pruning based on an independent set of test tuples:

    FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}

where pos/neg are the numbers of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
Confusion matrix (buys_computer):

    classes               buys_computer = yes   buys_computer = no   total    recognition (%)
    buys_computer = yes   6954                  46                   7000     99.34
    buys_computer = no    412                   2588                 3000     86.27
    total                 7366                  2634                 10000    95.42

 True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
 True negatives(TN): These are the negative tuples that were
correctly labeled by the classifier. Let TN be the number of true
negatives.
 False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys computer
= no for which the classifier predicted buys computer=yes). Let FP
be the number of false positives.
 False negatives (FN): These are the positive tuples that were
incorrectly labeled as negative (e.g., tuples of class buys_computer
= yes for which the classifier predicted buys_computer = no). Let FN
be the number of false negatives.
 Accuracy of a classifier M, acc(M): percentage of test set tuples
that are correctly classified by the model M: (TP + TN)/(P + N)
 Error rate (misclassification rate) of M = 1 – acc(M)
 Alternative accuracy measures (e.g., for cancer diagnosis)
 Sensitivity is also referred to as the true positive (recognition)
rate (i.e., the proportion of positive tuples that are correctly
identified), while specificity is the true negative rate (i.e., the
proportion of negative tuples that are correctly identified)
› sensitivity = TP/P /* true positive recognition rate */
› specificity = TN/N /* true negative recognition rate */
› precision = TP/(TP+FP)
› Precision can be thought of as a measure of exactness (i.e., what
percentage of tuples labeled as positive are actually such),
whereas recall is a measure of completeness (what percentage
of positive tuples are labeled as such)
› recall=TP/(TP+FN)
 F1 Score
 The F1 Score is used to measure a test’s
accuracy. It is the harmonic mean of
precision and recall. The greater the F1
Score, the better the performance of our
model.

    F_1 = \frac{2 \times precision \times recall}{precision + recall}

 Fβ

    F_\beta = \frac{(1 + \beta^2) \times precision \times recall}{\beta^2 \times precision + recall}
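A small sketch computing these measures from the confusion matrix above (TP = 6954, FN = 46, FP = 412, TN = 2588); the variable names are ours.

```python
# Accuracy, sensitivity, specificity, precision, recall, F1 from the counts above.
TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, TN + FP

accuracy    = (TP + TN) / (P + N)          # 0.9542
error_rate  = 1 - accuracy
sensitivity = TP / P                       # true positive rate, 0.9934
specificity = TN / N                       # true negative rate, 0.8627
precision   = TP / (TP + FP)               # 0.9441
recall      = TP / (TP + FN)               # same as sensitivity here
f1          = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.4f} sens={sensitivity:.4f} "
      f"spec={specificity:.4f} prec={precision:.4f} f1={f1:.4f}")
```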
 Measure predictor accuracy: measure how far off the predicted value is
from the actual known value
 Loss function: measures the error betw. yi and the predicted value yi’
› Absolute error: |y_i - y_i'|
› Squared error: (y_i - y_i')^2
 Test error (generalization error): the average loss over the test set
› Mean absolute error:

    MAE = \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|

› Mean squared error:

    MSE = \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2

› Relative absolute error:

    RAE = \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}

› Relative squared error:

    RSE = \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}

 The mean squared error exaggerates the presence of outliers
 Popularly used: the (square) root mean squared error and, similarly, the root
relative squared error
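A minimal sketch of these error measures on made-up actual/predicted values; the numbers are illustrative only.

```python
# MAE, MSE, RMSE, relative absolute error, relative squared error.
from math import sqrt

y_true = [3.0, 5.0, 7.5, 10.0, 12.0]   # made-up actual values
y_pred = [2.5, 5.5, 7.0, 11.0, 11.5]   # made-up predictions
d = len(y_true)
y_bar = sum(y_true) / d

mae  = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / d
mse  = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / d
rmse = sqrt(mse)
rae  = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / sum(abs(a - y_bar) for a in y_true)
rse  = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / sum((a - y_bar) ** 2 for a in y_true)

print(mae, mse, rmse, rae, rse)
```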
 Holdout method
› Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
› Random sampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the accuracies
obtained
 Cross-validation (k-fold, where k = 10 is most popular)
› Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
› At i-th iteration, use Di as test set and others as training set
› Leave-one-out: k folds where k = # of tuples, for small sized
data
› Stratified cross-validation: folds are stratified so that class dist. in
each fold is approx. the same as that in the initial data
 Bootstrap
› Works well with small data sets
› Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
 Several bootstrap methods; a common one is the .632 bootstrap
› Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a training set of d
samples. The data tuples that did not make it into the training set
end up forming the test set. About 63.2% of the original data will
end up in the bootstrap, and the remaining 36.8% will form the test
set (since (1 − 1/d)^d ≈ e^{-1} = 0.368)
› Repeat the sampling procedure k times; overall accuracy of
the model:

    acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)
 Ensemble methods
› Use a combination of models to increase accuracy
› Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
 Popular ensemble methods
› Bagging: averaging the prediction over a collection of
classifiers
› Boosting: weighted vote with a collection of classifiers
› Ensemble: combining a set of heterogeneous classifiers
 Analogy: Diagnosis based on multiple doctors’ majority vote
 Training
› Given a set D of d tuples, at each iteration i, a training set Di of
d tuples is sampled with replacement from D (i.e., bootstrap)
› A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
› Each classifier Mi returns its class prediction
› The bagged classifier M* counts the votes and assigns the class
with the most votes to X
 Prediction: can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test
tuple
 Accuracy
› Often significantly better than a single classifier derived from D
› For noisy data: not considerably worse, more robust
› Gives improved accuracy in prediction
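A bare-bones bagging sketch along the lines described above, again assuming scikit-learn's decision tree as the base learner; the class and parameter names are ours.

```python
# Bagging: train k classifiers on bootstrap samples, classify by majority vote.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class BaggedClassifier:
    def __init__(self, k=25, seed=0):
        self.k, self.rng, self.models = k, np.random.default_rng(seed), []

    def fit(self, X, y):
        d = len(X)
        for _ in range(self.k):
            idx = self.rng.integers(0, d, size=d)   # bootstrap sample D_i
            self.models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.array([m.predict(X) for m in self.models])   # k x n predictions
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

# usage: BaggedClassifier(k=25).fit(X_train, y_train).predict(X_test)
```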
 Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
 How does boosting work?
› Weights are assigned to each training tuple
› A series of k classifiers is iteratively learned
› After a classifier Mi is learned, the weights are updated to allow
the subsequent classifier, Mi+1, to pay more attention to the
training tuples that were misclassified by Mi
› The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 The boosting algorithm can be extended for the prediction of
continuous values
 Comparing with bagging: boosting tends to achieve greater
accuracy, but it also risks overfitting the model to misclassified data
 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all the weights of tuples are set the same (1/d)
 Generate k classifiers in k rounds. At round i,
› Tuples from D are sampled (with replacement) to form a
training set Di of the same size
› Each tuple’s chance of being selected is based on its weight
› A classification model Mi is derived from Di
› Its error rate is calculated using Di as a test set
› If a tuple is misclassified, its weight is increased; otherwise it is
decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier
Mi’s error rate is the sum of the weights of the misclassified tuples:

    error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)

 The weight of classifier Mi’s vote is

    \log \frac{1 - error(M_i)}{error(M_i)}
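A sketch of one round of this weight update; the 0/1 labels and predictions are made up, and the update rule follows the description above (correctly classified tuples are down-weighted by error/(1 − error), then the weights are renormalized).

```python
# One AdaBoost-style round: recompute weights and the classifier's vote weight.
import math

def update_weights(weights, y_true, y_pred):
    """Return (new_weights, classifier_vote_weight). Assumes 0 < error < 1."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    vote = math.log((1 - error) / error)
    beta = error / (1 - error)
    new_w = [w * beta if t == p else w          # shrink correctly classified tuples
             for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_w)
    return [w / norm for w in new_w], vote

d = 8
weights = [1 / d] * d                           # all tuples start with weight 1/d
y_true = [1, 0, 1, 1, 0, 0, 1, 0]               # made-up labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]               # made-up predictions, two mistakes
weights, vote = update_weights(weights, y_true, y_pred)
print([round(w, 3) for w in weights], round(vote, 3))
```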
 (Numerical) prediction is similar to classification
› construct a model
› use model to predict continuous or ordered value for a given
input
 Prediction is different from classification
› Classification refers to predict categorical class label
› Prediction models continuous-valued functions
 Major method for prediction: regression
› model the relationship between one or more independent or
predictor variables and a dependent or response variable
 Regression analysis
› Linear and multiple regression
› Non-linear regression
› Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
A straight line: Y = mX + b, where m is the slope (change in Y / change in X) and b is the Y-intercept.
 1. The relationship between the variables is a linear function:

    Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

    where Y_i is the dependent (response) variable (e.g., CD4+ count), X_i is the
    independent (explanatory) variable (e.g., years since seroconversion), \beta_0 is
    the population Y-intercept, \beta_1 is the population slope, and \varepsilon_i is the
    random error.
 The population regression line gives the mean response at each X:

    E(Y | X) = \beta_0 + \beta_1 X_i

 The sample (fitted) regression line estimated from the observed values is

    \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i

    with residual \hat{\varepsilon}_i, the difference between an observed value and the
    fitted value; unsampled observations also scatter randomly around the
    population line.
 1. Plot all (Xi, Yi) pairs in a scatter plot
 2. The plot suggests how well the model will fit

How would you draw a line through the points? How do you determine which line
“fits best”? Many candidate lines pass through the same scatter of points: keeping
the intercept and changing the slope, or keeping the slope and changing the
intercept, gives a different line each time, so a criterion is needed for choosing
among them.
 1. “Best fit” means the differences between
actual Y values and predicted Y values
are a minimum. But positive differences
offset negative ones, so we square the errors:

    \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2

 2. Least squares (LS) minimizes the sum of the squared
differences (errors), SSE:

    LS minimizes \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2

Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5), (4, 3.2).
For the first line, which predicts 1, 2, 3, and 4 at these points, the sum of squared
differences is (2 − 1)^2 + (4 − 2)^2 + (1.5 − 3)^2 + (3.2 − 4)^2 = 6.89.
For the second line, horizontal at height 2.5, it is
(2 − 2.5)^2 + (4 − 2.5)^2 + (1.5 − 2.5)^2 + (3.2 − 2.5)^2 = 3.99.
The smaller the sum of squared differences, the better the fit of the
line to the data.
 Prediction equation:

    \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

 Sample slope:

    \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

 Sample Y-intercept:

    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
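A small sketch of these two formulas, applied to the four points from the sum-of-squared-differences example earlier; the aptitude-test data of the next example is not reproduced in these notes, so it is not used here.

```python
# Least-squares slope and intercept: b1 = SS_xy / SS_xx, b0 = y_bar - b1 * x_bar.
def least_squares(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx                 # sample slope
    b0 = y_bar - b1 * x_bar            # sample intercept
    return b0, b1

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]                  # the four points from the earlier example
b0, b1 = least_squares(xs, ys)
print(b0, b1, b0 + b1 * 3)             # intercept, slope, prediction at x = 3
```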
 Problem Statement
 Last year, five randomly selected students took a math
aptitude test before they began their statistics course. The
Statistics Department has three questions.
 What linear regression equation best predicts statistics
performance, based on math aptitude scores?
 If a student made an 80 on the aptitude test, what grade
would we expect her to make in statistics?
 The xi column shows scores
on the aptitude test.
Similarly, the yi column shows
statistics grades.
 The last two rows
show sums and mean
scores that we will use
to conduct the
regression analysis.
 we need to
compute the
product of the
deviation scores.
 we also need to
compute the
squares of the
deviation
scores (the last
two columns in
the table
below).
 Once we know the value of the regression slope
(b1), we can solve for the regression intercept (b0):
 Therefore, the regression equation is: ŷ = 26.768 +
0.644x.
 If a student made an 80 on the aptitude test, the
estimated statistics grade (ŷ) would be:
 ŷ = b0 + b1x
 ŷ = 26.768 + 0.644x = 26.768 + 0.644 × 80
 ŷ = 26.768 + 51.52 = 78.288
 A statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance
with decision tree and selected neural network
classifiers
 Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct — prior knowledge can be
combined with observed data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
 Let X be a data sample (“evidence”): class label is
unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
› E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (posterior probability of X conditioned on H): the
probability of observing the sample X, given that the hypothesis holds
› E.g., given that X will buy computer, the prob. that X is
31..40 with medium income
 Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows Bayes’ theorem:

    P(H | X) = \frac{P(X | H) P(H)}{P(X)}

 Informally, this can be written as
posterior = likelihood × prior / evidence
 Predicts that X belongs to Ci iff the probability P(Ci|X) is
the highest among all the P(Ck|X) for all the k classes
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
 Let D be a training set of tuples and their associated
class labels, and each tuple is represented by an n-D
attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e.,
the maximal P(Ci|X)
 This can be derived from Bayes’ theorem:

    P(C_i | X) = \frac{P(X | C_i) P(C_i)}{P(X)}

 Since P(X) is constant for all classes, only

    P(X | C_i) P(C_i)

needs to be maximized
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

    P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times \dots \times P(x_n | C_i)

 This greatly reduces the computation cost: only
counts of the class distribution are needed
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci
having value xk for Ak divided by |Ci, D| (# of tuples of
Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:

    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

and

    P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
 Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
 Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 Training data: the 14-tuple buys_computer table shown earlier.
 P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
 Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
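A sketch that reproduces the calculation above from the 14-tuple `data` list defined in the information-gain sketch; `nb_score` is our own helper returning P(X|Ci) · P(Ci).

```python
# Naive Bayes score for X = (<=30, medium, yes, fair) on the buys_computer data.
def nb_score(tuples, x, label):
    """P(X|Ci) * P(Ci) under the class-conditional independence assumption."""
    in_class = [t for t in tuples if t[-1] == label]
    prior = len(in_class) / len(tuples)
    likelihood = 1.0
    for idx, value in enumerate(x):
        matches = sum(1 for t in in_class if t[idx] == value)
        likelihood *= matches / len(in_class)
    return prior * likelihood

x = ("<=30", "medium", "yes", "fair")
print(round(nb_score(data, x, "yes"), 3))   # ≈ 0.028
print(round(nb_score(data, x, "no"), 3))    # ≈ 0.007
```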

 Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise, the predicted probability will be zero:

    P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)

 Ex. Suppose a dataset with 1000 tuples has income = low (0 tuples),
income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
› Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
› The “corrected” prob. estimates are close to their
“uncorrected” counterparts
 Advantages
› Easy to implement
› Good results obtained in most of the cases
 Disadvantages
› Assumption: class conditional independence, therefore
loss of accuracy
› Practically, dependencies exist among variables
 E.g., hospital patient data: Profile (age, family history, etc.),
Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes,
etc.)
 Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
 How to deal with these dependencies?
› Bayesian Belief Networks
 A Bayesian belief network allows a subset of the
variables to be conditionally independent
 A graphical model of causal relationships
› Represents dependency among the variables
› Gives a specification of joint probability distribution
 Nodes: random variables
 Links: dependency
 Example: a network with nodes X, Y, Z, and P, where X and Y are the parents
of Z, and Y is the parent of P; there is no dependency between Z and P
 The graph has no loops or cycles
 Example: variables Family History (FH), Smoker (S), LungCancer (LC),
Emphysema, PositiveXRay, and Dyspnea, where FH and S are the parents of LC
 The conditional probability table (CPT) for variable LungCancer:

           (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC     0.8       0.5        0.7        0.1
    ~LC    0.2       0.5        0.3        0.9

 The CPT shows the conditional probability for each possible combination of the
values of a node’s parents
 Derivation of the probability of a particular combination of values x1, ..., xn of X
from the CPTs:

    P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i | Parents(X_i))
 Several scenarios:
› Given both the network structure and all
variables observable: learn only the CPTs
› Network structure known, some hidden variables:
gradient descent (greedy hill-climbing) method,
analogous to neural network learning
› Network structure unknown, all variables
observable: search through the model space to
reconstruct network topology
› Unknown structure, all hidden variables: No good
algorithms known for this purpose
