Chapter 8. Classification: Basic Concepts
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
■ The set of tuples used for model construction is the training set
■ The model is represented as classification rules, decision trees, or
mathematical formulae
■ Model usage: for classifying future or unknown objects
■ Estimate accuracy of the model
■ The known label of each test sample is compared with the classified result from the model
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
[Figure: the classifier is applied to testing data and to unseen data, e.g., (Jeff, Professor, 4)]
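A minimal sketch of the model-usage step, applying the tenure rule shown above to the unseen tuple (Jeff, Professor, 4). The function name and the default class "no" for tuples the rule does not cover are assumptions made for illustration.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: tuples not covered by the rule fall back to 'no'.
    return "no"


# Unseen tuple from the slide: (name, rank, years)
name, rank, years = ("Jeff", "Professor", 4)
jeff_prediction = predict_tenured(rank, years)
print(name, "tenured =", jeff_prediction)
```

Jeff satisfies the rank condition, so the rule fires even though his years value is below the threshold.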
[Figure: decision tree with branches <=30, 31..40, >40 and no/yes leaves]
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive
divide-and-conquer manner
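■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are discretized in advance)
■ Examples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)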
Attribute Selection Measure: Information
Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
■ Expected information (entropy) needed to classify a tuple in D:
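Written out with the symbols defined above, the standard ID3 definitions are as follows. Here m is the number of classes; v (the number of distinct values of attribute A) and D_j (the partition of D holding the j-th value) are notation assumed for this sketch.

```latex
\[
Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
\]
\[
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j),
\qquad
Gain(A) = Info(D) - Info_A(D)
\]
```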
Attribute Selection: Information Gain
■ Class P: buys_computer = “yes”
■ Class N: buys_computer = “no”
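As a rough illustration, information gain for the age attribute can be computed from class counts. The 9 "yes" / 5 "no" totals and the per-age counts are the ones that appear in the Naïve Bayes example and the AVC-set on age elsewhere in these slides; the function names are assumptions.

```python
from math import log2


def entropy(counts):
    """Info(D) for a sequence of per-class tuple counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(partitions):
    """Gain(A); partitions holds the per-class counts for each value of A."""
    class_totals = [sum(col) for col in zip(*partitions)]
    total = sum(class_totals)
    expected = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_totals) - expected


# (yes, no) counts of buys_computer for age <=30, 31..40, >40
age_partitions = [(2, 3), (4, 0), (3, 2)]
info_d = entropy([9, 5])
gain_age = info_gain(age_partitions)
print(f"Info(D) = {info_d:.3f}, Gain(age) = {gain_age:.3f}")
```

The pure 31..40 partition contributes zero entropy, which is why age scores well on this data.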
Computing Information-Gain for
Continuous-Valued Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■ Sort the values of A in increasing order
■ Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
■ (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
■ The point with the minimum expected information
requirement for A is selected as the split-point for A
■ Split:
■ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
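The split-point search described above can be sketched as follows. The data is hypothetical, and the helper names are assumptions; the code simply tries every midpoint between adjacent distinct values and keeps the one with the minimum expected information.

```python
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))


def best_split_point(values, labels):
    """Return (split_point, expected_info) minimising Info_A(D) over midpoints."""
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        split = (lo + hi) / 2                      # midpoint of adjacent values
        d1 = [c for v, c in pairs if v <= split]   # tuples with A <= split-point
        d2 = [c for v, c in pairs if v > split]    # tuples with A > split-point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if best is None or info < best[1]:
            best = (split, info)
    return best


# Hypothetical continuous attribute (age) with yes/no class labels
ages = [23, 25, 31, 35, 42, 47]
labels = ["no", "no", "yes", "yes", "yes", "no"]
split, info = best_split_point(ages, labels)
print(split, round(info, 3))
```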
Gain Ratio for Attribute Selection (C4.5)
■ Information gain measure is biased towards attributes with a
large number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
■ GainRatio(A) = Gain(A)/SplitInfo(A)
■ Ex.
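A hedged sketch of the gain-ratio computation, using the income counts from the AVC-set shown later in these slides. It assumes the standard C4.5 definition SplitInfo_A(D) = -Σ_j (|D_j|/|D|) log2(|D_j|/|D|), which is the entropy of the partition sizes.

```python
from math import log2


def entropy(counts):
    """Entropy of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(partitions):
    """GainRatio(A) = Gain(A) / SplitInfo(A) from per-value class counts."""
    sizes = [sum(p) for p in partitions]
    total = sum(sizes)
    class_totals = [sum(col) for col in zip(*partitions)]
    gain = entropy(class_totals) - sum(
        s / total * entropy(p) for s, p in zip(sizes, partitions))
    split_info = entropy(sizes)  # zero only if A has a single value
    return gain / split_info


# (yes, no) counts of buys_computer for income = high, medium, low
income_partitions = [(2, 2), (4, 2), (3, 1)]
ratio_income = gain_ratio(income_partitions)
print(f"GainRatio(income) = {ratio_income:.3f}")
```

The attribute with the maximum gain ratio would be chosen as the splitting attribute.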
■ Reduction in Impurity:
noise or outliers
■ Poor accuracy for unseen samples
■ Attribute construction
■ Create new attributes based on existing ones that are
sparsely represented
■ This reduces fragmentation, repetition, and replication
Classification in Large Databases
■ Classification—a classical problem extensively studied by
statisticians and machine learning researchers
■ Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
■ Why is decision tree induction popular?
■ relatively faster learning speed (than other classification methods)
■ convertible to simple and easy-to-understand classification rules
■ can use SQL queries for accessing databases
Scalability Framework for RainForest
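RainForest: Training Set and Its AVC Sets

AVC-set on Age
Age     | Buy_Computer = yes | Buy_Computer = no
<=30    | 2                  | 3
31..40  | 4                  | 0
>40     | 3                  | 2

AVC-set on Income
Income  | Buy_Computer = yes | Buy_Computer = no
high    | 2                  | 2
medium  | 4                  | 2
low     | 3                  | 1

AVC-set on Student
Student | Buy_Computer = yes | Buy_Computer = no
yes     | 6                  | 1
no      | 3                  | 4

AVC-set on Credit_rating
Credit_rating | Buy_Computer = yes | Buy_Computer = no
fair          | 6                  | 2
excellent     | 3                  | 3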
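An AVC-set is just a table of (attribute value, class label) counts, so it can be collected in a single pass over the tuples. The sketch below shows the idea on a hypothetical five-tuple training set (not the training set from the slide); the function name and data layout are assumptions.

```python
from collections import Counter, defaultdict


def build_avc_sets(tuples, attributes, class_attr):
    """Return {attribute: {value: Counter of class labels}} in one pass."""
    avc = {a: defaultdict(Counter) for a in attributes}
    for row in tuples:
        for a in attributes:
            avc[a][row[a]][row[class_attr]] += 1
    return avc


# Hypothetical mini training set, for illustration only
rows = [
    {"student": "yes", "credit_rating": "fair", "buys_computer": "yes"},
    {"student": "no", "credit_rating": "fair", "buys_computer": "no"},
    {"student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"student": "no", "credit_rating": "excellent", "buys_computer": "no"},
    {"student": "no", "credit_rating": "fair", "buys_computer": "yes"},
]
avc_sets = build_avc_sets(rows, ["student", "credit_rating"], "buys_computer")
for attribute, table in avc_sets.items():
    for value, counts in table.items():
        print(attribute, value, dict(counts))
```

Because only counts are kept, the size of each AVC-set depends on the number of distinct attribute values and classes rather than on the number of tuples.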
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
■ Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
■ Each subset is used to create a tree, resulting in several trees
tree T’
■ It turns out that T’ is very close to the tree that would
be generated using the whole data set together
■ Advantages: requires only two scans of the database; it is also an incremental algorithm
Presentation of Classification Results
■ Bayes’ Theorem:
medium income
Prediction Based on Bayes’ Theorem
■ Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
Classification Is to Derive the Maximum Posteriori
■ Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
■ Suppose there are m classes C1, C2, …, Cm.
■ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
■ This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
■ Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
■ A simplifying assumption: attributes are conditionally
independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
and P(xk|Ci) is
Naïve Bayes Classifier: Training Dataset
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
■ Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
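The arithmetic above can be reproduced with a few lines of Python. This is only a check of the worked example, using the fractions listed in the slide; the variable names are assumptions.

```python
from math import prod

# Class priors P(Ci)
priors = {"yes": 9 / 14, "no": 5 / 14}

# P(x_k | Ci) for age <= 30, income = medium, student = yes, credit_rating = fair
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no": [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

# P(X | Ci) * P(Ci) under the conditional-independence assumption
scores = {c: prod(likelihoods[c]) * priors[c] for c in priors}
predicted = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, "->", predicted)
```

P(X) is never computed because it is the same for both classes and does not affect which score is larger.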
Avoiding the Zero-Probability Problem
■ Naïve Bayesian prediction requires each conditional probability to be
non-zero. Otherwise, the predicted probability will be zero
“uncorrected” counterparts
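One common remedy is the Laplacian correction (add-one smoothing): add 1 to the count of every attribute value before estimating the probabilities. The sketch below uses hypothetical counts for one attribute within one class; the function name is an assumption.

```python
def laplace_estimates(counts):
    """Add 1 to every value count so that no conditional probability is zero."""
    total = sum(counts.values()) + len(counts)
    return {value: (n + 1) / total for value, n in counts.items()}


# Hypothetical counts of income values among the tuples of one class
income_counts = {"low": 0, "medium": 990, "high": 10}
smoothed = laplace_estimates(income_counts)
for value, p in smoothed.items():
    raw = income_counts[value] / sum(income_counts.values())
    print(f"{value}: uncorrected {raw:.3f}, corrected {p:.3f}")
```

With counts this large, the corrected estimates stay close to the uncorrected ones while the zero is removed.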
Naïve Bayes Classifier: Comments
■ Advantages
■ Easy to implement
■ Disadvantages
■ Assumption: class conditional independence, therefore loss of accuracy
■ Practically, dependencies exist among variables
■ Dependencies among these cannot be modeled by the Naïve Bayes Classifier
■ How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
[Figure: decision tree with branches <=30, 31..40, >40, internal nodes student? (no/yes) and credit rating? (excellent/fair), and no/yes leaves]
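Rule extraction from a tree can be sketched as a walk over every root-to-leaf path. The tree below is an assumption reconstructed from the figure labels (age at the root, student? and credit rating? as internal nodes); the nested-dict layout and the function name are also illustrative.

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, class_label) rule per root-to-leaf path."""
    if not isinstance(tree, dict):  # a leaf holds the class prediction
        yield conditions, tree
        return
    attribute = tree["attribute"]
    for value, subtree in tree["branches"].items():
        yield from extract_rules(subtree, conditions + ((attribute, value),))


# Assumed tree structure, consistent with the branch labels in the figure
tree = {"attribute": "age", "branches": {
    "<=30": {"attribute": "student",
             "branches": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"}},
}}

rules = list(extract_rules(tree))
for conds, label in rules:
    body = " AND ".join(f"{a} = {v}" for a, v in conds)
    print(f"IF {body} THEN buys_computer = {label}")
```

Because each tuple follows exactly one path, the extracted rules are mutually exclusive and exhaustive.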
■ Each time a rule is learned, the tuples covered by the rule are removed
■ Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
■ Compare with decision-tree induction, which learns a set of rules simultaneously
Sequential Covering Algorithm
[Figure: examples covered by Rule 1, Rule 2, and Rule 3]
Rule Generation
■ To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break
[Figure: positive examples and negative examples]
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Adding new attributes by adopting a greedy depth-first strategy
■ Picks the one that most improves the rule quality
■ favors rules that have high accuracy and cover many positive tuples
■ Rule pruning based on an independent set of test tuples
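One quality measure with these properties is FOIL-gain, FOIL_Gain = pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))), where pos/neg are the positive/negative tuples covered before a predicate is added and pos'/neg' after. The sketch below scores two candidate predicates using counts taken from the student and credit_rating AVC-sets, treating buys_computer = yes as the positive class; the function and variable names are assumptions.

```python
from math import log2


def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain of extending a rule that covered (pos, neg) to cover (pos_new, neg_new)."""
    if pos_new == 0:
        return float("-inf")  # the extended rule covers no positive tuples
    return pos_new * (log2(pos_new / (pos_new + neg_new))
                      - log2(pos / (pos + neg)))


# The empty rule covers all tuples: 9 positive, 5 negative
base_pos, base_neg = 9, 5

# (positive, negative) tuples covered after adding each candidate predicate
candidates = {"student = yes": (6, 1), "credit_rating = fair": (6, 2)}

best = max(candidates,
           key=lambda p: foil_gain(base_pos, base_neg, *candidates[p]))
print(best, round(foil_gain(base_pos, base_neg, *candidates[best]), 3))
```

Both candidates cover six positive tuples, so the one that admits fewer negatives wins.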
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
■ Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
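A short sketch of how these measures follow from the confusion-matrix cells. It assumes the usual definitions precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 as their harmonic mean; the counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a rare positive class
tp, fn, fp, tn = 90, 210, 140, 9560

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```

With a rare positive class, accuracy can look high while precision and recall remain low, which is why the three are reported together.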
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
■ Holdout method
■ Given data is randomly partitioned into two independent sets
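A minimal sketch of both evaluation schemes named in the slide title, using only the standard library. The two-thirds training fraction is a common convention assumed here, and the function names are illustrative.

```python
import random


def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Randomly partition data into independent training and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


train, test = holdout_split(list(range(30)))
folds = list(k_fold_indices(25, k=10))
print(len(train), len(test), len(folds))
```

In k-fold cross-validation every tuple is used for testing exactly once, whereas holdout tests on a single fixed subset.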
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
■ Suppose we have 2 classifiers, M1 and M2, which one is better?
■ These mean error rates are just estimates of error on the true
population of future data cases
Estimating Confidence Intervals:
Null Hypothesis
■ Perform 10-fold cross-validation
■ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
■ Use t-test (or Student’s t-test)
■ Null Hypothesis: M1 & M2 are the same
■ If we can reject null hypothesis, then
■ we conclude that the difference between M1 & M2 is
statistically significant
■ Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
Estimating Confidence Intervals:
Table for t-distribution
■ Symmetric
■ Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
■ Confidence limit, z = sig/2
Estimating Confidence Intervals:
Statistical Significance
■ Are M1 & M2 significantly different?
■ Compute t. Select significance level (e.g. sig = 5%)
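The test can be sketched as a standard paired t-test over the per-fold error rates of the two models. The error rates below are hypothetical, the sample standard deviation is used (an assumption about the exact variance formula), and 2.262 is the t-table value for 9 degrees of freedom at sig/2 = 0.025.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(err1, err2):
    """t statistic for paired per-fold error rates (k - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(err1, err2)]
    k = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(k))


# Hypothetical error rates from the same 10 cross-validation folds
m1_errors = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.18, 0.22, 0.21, 0.20]
m2_errors = [0.17, 0.18, 0.18, 0.17, 0.20, 0.16, 0.17, 0.19, 0.18, 0.16]

t = paired_t_statistic(m1_errors, m2_errors)
T_CRITICAL_9DF = 2.262  # sig = 5%, looked up at sig/2 = 0.025, 9 degrees of freedom
significant = abs(t) > T_CRITICAL_9DF
print(f"t = {t:.2f}, reject null hypothesis: {significant}")
```

If |t| exceeds the table value, the null hypothesis that M1 and M2 are the same is rejected at the chosen significance level.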
Model Selection: ROC Curves
■ ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
■ Originated from signal detection theory
■ Shows the trade-off between the true
positive rate and the false positive rate
■ The area under the ROC curve is a measure of the accuracy of the model
■ Rank the test tuples in decreasing order: the one that is most likely to
belong to the positive class appears at the top of the list
■ The closer to the diagonal line (i.e., the closer the area is to 0.5), the
less accurate is the model
■ Vertical axis represents the true positive rate
■ Horizontal axis represents the false positive rate
■ The plot also shows a diagonal line
■ A model with perfect accuracy will have an area of 1.0
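The ranking procedure above can be sketched as follows: sort the test tuples by predicted probability, step up for each positive and right for each negative, then take the trapezoidal area. The scored tuples are hypothetical, and tied scores are not treated specially in this sketch.

```python
def roc_points(scored):
    """scored: (probability_of_positive, actual_label) pairs with labels 'P'/'N'."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    pos = sum(1 for _, y in ranked if y == "P")
    neg = len(ranked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for _, label in ranked:
        if label == "P":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier outputs for ten test tuples
scored = [(0.90, "P"), (0.80, "P"), (0.70, "N"), (0.60, "P"), (0.55, "P"),
          (0.54, "N"), (0.53, "N"), (0.51, "N"), (0.50, "P"), (0.40, "N")]
points = roc_points(scored)
area = auc(points)
print(f"AUC = {area:.2f}")
```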
Issues Affecting Model Selection
■ Accuracy
■ classifier accuracy: predicting class label
■ Speed
■ time to construct the model (training time)
■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■ understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
■ Ensemble methods
■ Use a combination of models to increase accuracy
■ Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
■ Analogy: Diagnosis based on multiple doctors’ majority vote
■ Training
■ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
■ A classifier model Mi is learned for each training set Di
■ Classification: classify an unknown sample X
■ Each classifier Mi returns its class prediction
■ The bagged classifier M* counts the votes and assigns the class with the
most votes to X
■ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
■ Accuracy
■ Often significantly better than a single classifier derived from D
■ For noisy data: not considerably worse, more robust
■ Proven to improve accuracy in prediction
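A compact sketch of the bagging procedure described above: k bootstrap samples of size d, one model per sample, and a majority vote at classification time. The base learner (a one-dimensional decision stump) and the toy data are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def train_stump(sample):
    """Hypothetical base learner: best single-threshold classifier on 1-D data."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for left, right in (("no", "yes"), ("yes", "no")):
            errors = sum((left if x <= t else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging(data, k, seed=0):
    """Learn k models on bootstrap samples; return the majority-vote classifier M*."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]  # d tuples, with replacement
        models.append(train_stump(sample))

    def classify(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]

    return classify


data = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"), (5, "yes"), (6, "yes")]
m_star = bagging(data, k=25)
print(m_star(1), m_star(6))
```

For numeric prediction, the vote in `classify` would be replaced by the average of the individual predictions.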
Boosting
■ Analogy: Consult several doctors, based on a combination of
weighted diagnoses, with each weight assigned based on the previous
diagnosis accuracy
■ How does boosting work?
■ Weights are assigned to each training tuple
■ A series of k classifiers is iteratively learned
■ After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
■ The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its accuracy
■ Boosting algorithm can be extended for numeric prediction
■ Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
■ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
■ Initially, all the weights of tuples are set the same (1/d)
■ Generate k classifiers in k rounds. At round i,
■ Tuples from D are sampled (with replacement) to form a training set
Di of the same size
■ Each tuple’s chance of being selected is based on its weight
■ A classification model Mi is derived from Di
■ Its error rate is calculated using Di as a test set
■ If a tuple is misclassified, its weight is increased; otherwise it is decreased
■ Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error
rate is the sum of the weights of the misclassified tuples: error(Mi) = Σj wj × err(Xj)
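One round of the weight update can be sketched as below. It assumes the textbook form of the update: weights of correctly classified tuples are multiplied by error/(1 − error), all weights are then normalized to keep their original sum, and the classifier's vote weight is log((1 − error)/error) (natural log is used here). The function name and the example flags are illustrative.

```python
from math import log


def adaboost_round(weights, misclassified):
    """Return (error, new_weights, vote_weight) for one boosting round.

    Assumes 0 < error < 1; a classifier with error above 0.5 would
    normally be discarded rather than used.
    """
    # Error rate: sum of the weights of the misclassified tuples
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Scale down correctly classified tuples ...
    factor = error / (1 - error)
    updated = [w if wrong else w * factor
               for w, wrong in zip(weights, misclassified)]
    # ... then normalise so the weights keep their original total
    scale = sum(weights) / sum(updated)
    new_weights = [w * scale for w in updated]
    vote_weight = log((1 - error) / error)
    return error, new_weights, vote_weight


d = 5
weights = [1 / d] * d                      # initial weights: 1/d each
misclassified = [False, True, False, False, False]
error, new_weights, vote = adaboost_round(weights, misclassified)
print(round(error, 3), [round(w, 3) for w in new_weights], round(vote, 3))
```

After normalization the misclassified tuple carries a larger weight, so it is more likely to be sampled into the next training set.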
Random Forest (Breiman 2001)
■ Random Forest:
■ Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
■ During classification, each tree votes and the most popular class is returned
■ Two Methods to construct Random Forest:
■ Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
■ Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
■ Comparable in accuracy to Adaboost, but more robust to errors and outliers
■ Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets
Summary (II)
■ Significance tests and ROC curves are useful for model selection.
■ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
■ No single method has been found to be superior over all others
for all data sets
■ Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve
trade-offs, further complicating the quest for an overall superior method
CS412 Midterm Exam Statistics
■ Opinion Question Answering:
■ Like the style: 70.83%, dislike: 29.16%
■ >=90: 24
■ 80-89: 54
■ 70-79: 46
■ 60-69: 37
■ 50-59: 15
■ 40-49: 2
■ <40: 2
■ Final grades are based on overall score accumulation
and relative class distributions
Issues: Evaluating Classification Methods
■ Accuracy
■ classifier accuracy: predicting class label
■ Speed
■ time to construct the model (training time)
Predictor Error Measures
■ Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
■ Loss function: measures the error between yi and the predicted value yi’
■ Absolute error: |yi – yi’|
■ Squared error: (yi – yi’)^2
■ Test error (generalization error): the average loss over the test set
■ Mean absolute error: (1/d) Σi |yi – yi’|; mean squared error: (1/d) Σi (yi – yi’)^2, where d is the number of test tuples
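The two test-error measures are simply the average absolute and squared losses over the test set. A small sketch with hypothetical actual and predicted values:

```python
def mean_absolute_error(actual, predicted):
    """Average of |y_i - y_i'| over the test set."""
    return sum(abs(y - yp) for y, yp in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (y_i - y_i')^2 over the test set."""
    return sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / len(actual)


# Hypothetical test-set values
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```

Squared error penalizes the single large miss (2.5 vs. 4.0) more heavily than absolute error does, which is why the two measures can rank predictors differently.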
Scalable Decision Tree Induction Methods
tree earlier
■ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
■ Builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
■ Integration of generalization with decision-tree induction
(Kamber et al.’97)
■ Classification at primitive concept levels
■ E.g., precise temperature, humidity, outlook, etc.
■ Low-level concepts, scattered classes, bushy classification trees
■ Semantic interpretation problems
■ Cube-based multi-level classification
■ Relevance analysis at multi-levels
■ Information-gain analysis with dimension + level