MISY 631 Final Review
Calculators will be provided for the exam.
Topic 1: Naïve Bayes Classification (classification models)
1. Be able to understand and apply Bayes’ rule
Naïve Bayes is another way of classifying data. It is built on Bayes' rule:
P(A|B) = P(B|A)P(A) / P(B)
Derivation: P(AB) = P(A)P(B|A) and P(AB) = P(B)P(A|B),
so P(A)P(B|A) = P(B)P(A|B), which gives Bayes' rule above.
Note on (in)dependence: A and B are independent if P(AB) = P(A)P(B), i.e., the
outcome of A does not affect P(B); if they are dependent, P(AB) ≠ P(A)P(B).
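As a quick illustration, Bayes' rule can be applied directly once the three quantities on the right-hand side are known. The numbers below are made up for illustration only:

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# All probabilities below are hypothetical illustration values.
p_a = 0.10          # P(A): prior, e.g., P(Default = yes)
p_b_given_a = 0.60  # P(B|A): likelihood of the evidence given A
p_b = 0.25          # P(B): overall probability of the evidence

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.24
```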
2. Be able to calculate the probability for an instance (e.g., the probability given in slide
10, lecture 8) using Naïve Bayes Classification
P(Default=yes | Balance>=50K, Age<45, Employed=No)
= P(Balance>=50K, Age<45, Employed=No | Default=yes) P(Default=yes)
/ P(Balance>=50K, Age<45, Employed=No)
In general:
P(C=1 | d1=v1, d2=v2, ..., dn=vn)
= P(d1=v1, d2=v2, ..., dn=vn | C=1) P(C=1)
/ [P(d1=v1, ..., dn=vn | C=1) P(C=1) + P(d1=v1, ..., dn=vn | C=0) P(C=0)]
C: class e.g., default (1), not default (0)
d1, d2, …, dn: descriptive attributes
Why it is difficult to calculate:
Assuming every descriptive attribute is binary, the number of possible
combinations of (d1=v1, d2=v2, ..., dn=vn) is 2^n.
With two classes, there are 2^(n+1) probabilities to estimate.
Some combinations may not appear in the training data, and hence there is no way
to estimate their probabilities.
Solution:
Assume the descriptive attributes are independent of each other given C:
P(d1=v1, d2=v2, ..., dn=vn | C=1)
= P(d1=v1 | C=1) P(d2=v2 | C=1) ... P(dn=vn | C=1)
This might sacrifice prediction accuracy to some extent but significantly
reduces the level of difficulty.
3. Laplace smoothing for estimating P(C)
P(C) = (# of training instances belonging to class C + 1) / (# of training instances + K)
K: number of predefined classes
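Putting the pieces together, the Naïve Bayes procedure (conditional independence for the likelihoods, Laplace smoothing for the priors) can be sketched on a tiny made-up data set. The attributes and values below are hypothetical, not from the lecture:

```python
# Naïve Bayes sketch on a tiny hypothetical training set (made-up data).
# Each instance: (balance_high, age_young, employed, default) with default = class.
train = [
    (1, 1, 0, 1), (1, 0, 0, 1), (0, 1, 0, 1),
    (0, 0, 1, 0), (1, 0, 1, 0), (0, 1, 1, 0), (0, 0, 0, 0),
]

def prior(c, k=2):
    # Laplace-smoothed prior: (# instances of class c + 1) / (# instances + K)
    n_c = sum(1 for row in train if row[-1] == c)
    return (n_c + 1) / (len(train) + k)

def cond(attr_idx, value, c):
    # P(d_i = v | C = c), estimated by relative frequency within class c
    in_class = [row for row in train if row[-1] == c]
    return sum(1 for row in in_class if row[attr_idx] == value) / len(in_class)

def posterior(x):
    # Bayes' rule with the conditional-independence assumption,
    # normalized over both classes (the denominator of the general formula)
    scores = {}
    for c in (0, 1):
        p = prior(c)
        for i, v in enumerate(x):
            p *= cond(i, v, c)
        scores[c] = p
    total = scores[0] + scores[1]
    return {c: s / total for c, s in scores.items()}

probs = posterior((1, 1, 0))  # balance high, young, not employed
print(max(probs, key=probs.get))  # predicted class
```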
4. Be able to understand the concept of AUC (i.e., what is AUC?) and the meaning of
special points on ROC and special ROC curve (see slide 28, lecture 8)
AUC: Area under the ROC Curve
Different cut-offs give different classification results even when the predicted
probabilities are the same, so we need a more robust and comprehensive measure for
evaluating a classification method. That measure is the ROC (receiver operating
characteristic) curve: instead of using only the 0.5 cut-off, we evaluate a classifier
over a group of cut-off probabilities. Each cut-off point yields a different pair of
TPR and FPR; plotting these pairs produces the ROC curve.
A graphical approach for displaying trade-off between detection rate (True Positive Rate)
and false alarm rate (False Positive Rate).
ROC curve plots TPR against FPR
Performance of a model represented as a point in an ROC curve
Changing the threshold (i.e., cutoff) parameter of classifier changes the location
of the point
TPR (True Positive Rate) or Recall: the fraction of actual positive instances that are
predicted as positive. TPR = TP/(TP+FN)
FPR (False Positive Rate): the fraction of actual negative instances that are
predicted as positive. FPR = FP/(FP+TN)
Calculate the area below curves M1 and M2; a larger area is better, so M1 is the better
model in the small-FPR region if its curve encloses more area there.
If the AUC is close to 0.5, the model is useless (no better than random guessing); if it
is less than 0.5, the model should be discarded because it is worse than guessing.
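The cut-off sweep described above can be sketched as follows. The scores and labels are made-up illustration data, and the area is computed with the trapezoid rule:

```python
# Sweep cut-off probabilities to trace an ROC curve, then compute AUC.
# Scores (predicted P(Y=1|x)) and labels are made-up illustration data.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,    0,   1,   0,   0]

def tpr_fpr(cutoff):
    # Predict positive when score >= cutoff, then count hits and false alarms
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Each cut-off gives one (FPR, TPR) point; together they form the ROC curve.
# The extra cut-off 1.01 forces the (0, 0) corner; the lowest score gives (1, 1).
cutoffs = [1.01] + sorted(set(scores), reverse=True)
roc = sorted((fpr, tpr) for tpr, fpr in (tpr_fpr(c) for c in cutoffs))

# Trapezoid-rule area under the curve; 0.5 = random guessing, 1.0 = perfect.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(auc)  # 0.8125
```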
1. What is a linear model? The (graphical) difference between linear model and decision
tree.
A linear model has one decision boundary, but a decision tree can have
multiple decision boundaries.
A linear model uses a decision boundary (a line, in two dimensions) to classify data
(e.g., default or no default).
A linear model employs a linear combination of descriptive attributes, called the
decision boundary or f(x), to classify instances. (The objective of a linear model is to
learn all the w's from training data.)
f(x) = w0 +w1 x1 +w2 x2 +…+ wn xn
Values of w0 , w1 , w2 , … wn are learned from training data.
A new instance (not in the training data) is classified as one class if f(x) > 0 or the
other class if f(x) ≤ 0
The absolute values of w1, w2, ..., wn generally indicate the importance of their
respective descriptive attributes in classifying an instance:
generally, the larger the absolute value of wi, the more important its associated
attribute in classification.
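A minimal sketch of a linear classifier; the weights below are made up for illustration (in practice w0, w1, ..., wn are learned from training data):

```python
# Linear model: classify by the sign of f(x) = w0 + w1*x1 + ... + wn*xn.
# Weights are made-up illustration values, not learned.
w0 = -1.0
w = [0.8, -0.5]  # |wi| hints at the importance of attribute i

def f(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def classify(x):
    # One class if f(x) > 0, the other if f(x) <= 0 (a single decision boundary)
    return 1 if f(x) > 0 else 0

print(classify([3.0, 1.0]))  # f = -1 + 2.4 - 0.5 = 0.9 > 0  -> 1
print(classify([1.0, 2.0]))  # f = -1 + 0.8 - 1.0 = -1.2 <= 0 -> 0
```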
2. What is logistic regression? Formulas for logistic regression (slide 15, lecture 9)?
Why is it a linear model?
Logistic regression learns from training data to predict the probability P(Y=C|x).
C: class e.g., default or no default
x= (x1,x2,…xn) is a set of descriptive attributes, e.g., age, balance, employed
Example: P (Y=Default|Age<45, Balance>= 50K, Employed=No)
Logistic regression assumes:
P(Y=1|x) = 1 / (1 + exp(w0 + w1x1 + w2x2 + ... + wnxn))
P(Y=0|x) = 1 − P(Y=1|x)
= exp(w0 + w1x1 + ... + wnxn) / (1 + exp(w0 + w1x1 + ... + wnxn))
exp() is the exponential function. With this parametrization, as
f(x) = w0 + w1x1 + ... + wnxn grows large, P(Y=0|x) gets closer to 1; as it becomes
very negative, P(Y=1|x) gets closer to 1.
Logistic regression:
Learns values of w0, w1, w2, ..., wn from training data.
Then predicts P(Y=1|x) and P(Y=0|x) for a new instance using the formulas above.
Classifies a new instance as class 1 if P(Y=1|x) > P(Y=0|x), or class 0 otherwise.
WHY LINEAR?
If P(Y=0|x)/P(Y=1|x) > 1, then classify x as 0.
From the formulas above, P(Y=0|x)/P(Y=1|x) = exp(w0 + w1x1 + ... + wnxn), so:
if exp(w0 + w1x1 + ... + wnxn) > 1, classify x as 0;
taking logarithms, ln(exp(...)) > ln(1) = 0, i.e.,
if w0 + w1x1 + ... + wnxn > 0, classify x as 0.
This matches the definition of a linear model: the decision depends on the sign of a
linear combination of the attributes.
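The two formulas can be sketched directly, using the same parametrization as the lecture (P(Y=1|x) = 1/(1 + exp(f(x)))). The weights and the instance encoding below are made up:

```python
import math

# Logistic regression with the lecture's parametrization:
# P(Y=1|x) = 1 / (1 + exp(f(x))), where f(x) = w0 + w1*x1 + ... + wn*xn.
# Weights are hypothetical; real values would be learned from training data.
w0, w = -2.0, [0.03, 1.5]

def f(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def p_y1(x):
    return 1 / (1 + math.exp(f(x)))

x = [40, 0]   # e.g., age = 40, employed = 0 (hypothetical encoding)
p1 = p_y1(x)
p0 = 1 - p1   # P(Y=0|x) = 1 - P(Y=1|x)

# Classify as 1 if P(Y=1|x) > P(Y=0|x); under this parametrization
# that is equivalent to f(x) < 0.
print(1 if p1 > p0 else 0)  # 1, since f(x) = -0.8 < 0
```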
3. Comparison among the three classification methods
Functionality: All three methods (decision tree, Naïve Bayes, logistic regression) can
classify a new instance and predict the probability that a new instance belongs to a
class.
Performance: No single method is better in every situation. Depending on the
real-world case, apply cross-validation (e.g., 10-fold) to evaluate the performance of
each method, compare the average AUC, and choose the best one.
Methodology:
Decision tree is a piecewise classifier (i.e., it picks one attribute at a time) and
creates many decision boundaries to partition an instance space.
Logistic regression considers all attributes at the same time (i.e., a linear
combination of attributes) and creates one decision boundary to partition an instance
space.
Naïve Bayes is not a partitioning approach; it employs Bayes' rule and the conditional
independence assumption to estimate the probability that a new instance belongs to
a class.
Comprehensibility:
Decision tree visualizes the produced model, and it is easy to understand for
managers without a strong background in statistics and data mining. However, the
tree may grow too large in size.
Logistic regression is easy to understand for managers with a background in
statistics.
For managers, Naïve Bayes only produces a probability estimate and functions
like a black box.
Explanatory power:
Decision tree has good explanatory power. Generally, attributes appearing in
higher levels of a tree are more important for classification than those in lower
levels.
Logistic regression has strong explanatory power. The absolute values of w1, w2,
..., wn generally indicate the importance of their respective descriptive attributes
in classifying an instance.
For managers, the explanatory power of Naïve Bayes is low.
4. Cost-sensitive learning
Making classification decisions based on probabilities ONLY could be problematic:
Probability (only)-based classification: classify an instance as 1 if
P(Y=1|X)>P(Y=0|X) or P(Y=1|X)>0.5
Cost-sensitive learning:
We need to consider both probabilities and cost/utility of a decision to
minimize cost or maximize utility
Learn probabilities from training data
Construct cost/utility matrix from a business context
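A sketch of a cost-sensitive decision: choose the class with the lower expected cost rather than simply the higher probability. The cost matrix values below are hypothetical:

```python
# Cost-sensitive decision sketch. cost[predicted][actual] gives the cost of a
# decision; the values are hypothetical, for illustration only.
cost = {
    1: {1: 0,   0: 10},  # predict default: small cost of acting on a good customer
    0: {1: 500, 0: 0},   # predict no default but customer defaults: large loss
}

def decide(p1):
    # Expected cost of each decision, given P(Y=1|x) = p1
    p0 = 1 - p1
    expected = {
        1: cost[1][1] * p1 + cost[1][0] * p0,
        0: cost[0][1] * p1 + cost[0][0] * p0,
    }
    return min(expected, key=expected.get)

# With P(Y=1|x) = 0.1, probability-only classification would say class 0,
# but expected cost of predicting 0 is 500*0.1 = 50 vs 10*0.9 = 9 for predicting 1.
print(decide(0.1))  # 1
```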
3. How does SVM handle non-separable training data and non-linearly separable
training data?
Real-world data can be complicated: the non-separable case.
Define a penalty for each misclassified training example (e.g., each red dot on the slide).
Objective: maximize [margin − (sum of penalties)]
Subject to the constraints:
correctly classified training examples above H1 belong to the "+" class;
correctly classified training examples below H2 belong to the dot class;
a constraint is defined for each misclassified example.
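One standard way to express the per-example penalty is the hinge loss (this is a common soft-margin formulation, not necessarily the exact notation on the slides). The sketch below, with made-up data and a fixed boundary, computes the sum of penalties:

```python
# Soft-margin sketch: each example that is misclassified or falls inside its
# margin contributes a penalty (slack); the SVM trades margin size against the
# sum of these penalties. Data and weights are made-up illustration values.
w0, w = -0.5, [1.0, 1.0]

def f(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def hinge_penalty(x, y):
    # y in {+1, -1}; penalty is 0 when the example lies beyond its margin
    # hyperplane (H1 for the "+" class, H2 for the dot class), positive otherwise.
    return max(0.0, 1 - y * f(x))

data = [([2.0, 2.0], 1),    # beyond H1: no penalty
        ([0.0, 0.0], -1),   # inside the margin: small penalty
        ([1.0, 0.0], -1)]   # on the wrong side: larger penalty
total_penalty = sum(hinge_penalty(x, y) for x, y in data)
print(total_penalty)  # 2.0
```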
3. How can a neural network learn from a data set with non-linear decision boundaries?
Slide 14 of lecture 12
A perceptron is a linear model, and it is ineffective for classifying data sets with
non-linear decision boundaries.
Solution:
Change the activation function to a non-linear function.
Use a multilayer network.
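To see why a multilayer network helps: a single perceptron cannot represent XOR (its decision boundary is non-linear), but a two-layer network with step activations can compute it exactly. The weights below are chosen by hand for illustration, not learned:

```python
# A two-layer network computing XOR, which no single perceptron can represent.
# All weights are hand-picked for illustration, not learned from data.
def step(z):
    # Step activation: a non-linear activation function
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: computes OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: computes AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # outputs 0, 1, 1, 0 respectively
```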