BI Intro
Also adapted from the following sources:
• Tan, Steinbach, and Kumar (TSK book): Introduction to Data Mining
• Witten and Frank (WF, the Weka book): Data Mining
• Han and Kamber (HK book): Data Mining
The BI book is cited as "BI Chapter #...".
BI1.4 Business Intelligence Architectures
• Data Sources
  – Gather and integrate data
  – Challenges
• Data Warehouses and Data Marts
  – Extract, transform, and load (ETL) data
  – Multidimensional exploratory analysis
• Data Mining and Data Analytics
  – Extraction of information and knowledge from data
  – Build models for prediction
• An example: building a telecom customer retention model
  – Given a customer's telecom behavior, predict whether the customer will stay or leave
  – KDDCUP 2010 Data
BI3: Data Warehousing
• Data warehouse:
  – Repository for the data available for BI and decision support systems
  – Holds internal data, external data, and personal data
  – Internal data:
    • Back office: transactional records, orders, invoices, etc.
    • Front office: call center, sales office, marketing campaigns
    • Web-based: sales transactions on e-commerce websites
  – External data:
    • Market surveys, GIS systems
  – Personal data: data about individuals
  – Metadata: data about a whole data set, systems, etc. E.g., what structure is used in the data warehouse? The number of records in a data table, etc.
• Data marts: subsets of the data warehouse dedicated to one function (e.g., marketing).
• OLAP (online analytical processing): a set of tools that support BI analysis and decision making.
• OLTP (online transaction processing): online tools for transaction-related work, focusing on dynamic data.
Working with Data: BI Chap 7
• Let's first consider an example dataset. Each record has independent variables (the attributes) and a dependent variable (the class label). A sample record (from the WF book's weather dataset):

  Independent variables                      Dependent variable
  outlook   temperature   humidity   windy   play
  rainy     71            91         TRUE    no
Measures of Dispersion
• Variance:
  σ² = (1/(m−1)) · Σ_{i=1}^{m} (x_i − μ)²
• Standard deviation:
  σ = [ (1/(m−1)) · Σ_{i=1}^{m} (x_i − μ)² ]^{1/2}
• Normal distribution: the interval μ ± r·σ
  – r=1 contains approximately 68% of the observed values
  – r=2: approximately 95% of the observed values
  – r=3: approximately 99.7% of the observed values
  – Thus, if a sample falls outside (μ ± 3σ), it may be an outlier
• Thm 7.1 (Chebyshev's Theorem): let r ≥ 1 and let (x1, x2, …, xm) be a group of m values; then at least (1 − 1/r²) of the values will fall within the interval μ ± r·σ.
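As a quick illustration of these formulas, here is a minimal Python sketch (not from the slides; the data is made up) that computes the sample mean, variance, and standard deviation, and applies the μ ± 3σ outlier rule:

```python
# A minimal sketch: sample variance with the 1/(m-1) denominator,
# standard deviation, and the 3-sigma outlier check from the slide.
import math

def dispersion(xs):
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / (m - 1)   # sample variance
    return mu, var, math.sqrt(var)

xs = [10, 11, 12, 11, 10, 12, 11, 10, 11, 12, 40]    # hypothetical data
mu, var, sd = dispersion(xs)
outliers = [x for x in xs if abs(x - mu) > 3 * sd]   # outside mu +/- 3*sigma
print(f"mean={mu:.2f} var={var:.2f} sd={sd:.2f} outliers={outliers}")  # flags 40
```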
Heterogeneity Measures
• The Gini index: G = 1 − Σ_{h=1}^{H} f_h², where f_h is the relative frequency of class h among the H classes.
• (Wiki: The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper "Variability and Mutability" (Italian: Variabilità e mutabilità).)
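A minimal Python sketch of the Gini index above (the `gini` helper is written for this note, not from the slides):

```python
# Gini index G = 1 - sum(f_h^2) over the class frequencies of a label list.
from collections import Counter

def gini(labels):
    m = len(labels)
    return 1.0 - sum((n / m) ** 2 for n in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))   # maximum heterogeneity for 2 classes: 0.5
print(gini(["yes"] * 10))               # homogeneous set: 0.0
```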
Test of Significance
• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we conclude that M1 is really better than M2, given how few instances it was tested on? The confidence-interval machinery on the next slides addresses this.
Confidence Intervals
• Suppose the observed frequency f is 25%. How close is this to the true probability p?
• Prediction is just like tossing a biased coin
  – "Head" is a "success", "tail" is an "error"
• In statistics, a succession of independent events like this is called a Bernoulli process
  – Statistical theory provides us with confidence intervals for the true underlying proportion!
  – Mean and variance for a Bernoulli trial with success probability p: p and p(1−p)
Confidence Intervals
• We can say: p lies within a certain specified interval with a certain specified confidence
• Example: S = 750 successes in N = 1000 trials
  – Estimated success rate: f = 75%
  – How close is this to the true success rate p?
• Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Confidence Interval for Normal Distribution
• For large enough N, the observed frequency f follows (approximately) a normal distribution
• f can then be modeled with a standardized random variable X with mean 0
• The c% confidence interval [−z ≤ X ≤ z] for such an X is given by:
  Pr[−z ≤ X ≤ z] = c
  Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
  (Figure: standard normal density with area c = 1 − α between −z_{α/2} and z_{1−α/2}.)
Transforming f
• Transformed value for f:
  (f − p) / sqrt(p(1−p)/N)
  (i.e., subtract the mean and divide by the standard deviation)
• Resulting equation:
  Pr[ −z ≤ (f − p) / sqrt(p(1−p)/N) ≤ z ] = c
• Solving the resulting quadratic for p gives the (Wilson) interval:
  p = [ f + z²/(2N) ± z · sqrt( f/N − f²/N + z²/(4N²) ) ] / [ 1 + z²/N ]
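A minimal Python sketch of the solved equation (the `wilson_interval` name is mine, not from the slides); it reproduces the deck's running example (f = 75%, N = 1000, c = 80%):

```python
# Wilson score interval: solve Pr[-z <= (f-p)/sqrt(p(1-p)/N) <= z] = c for p.
import math

def wilson_interval(f, N, z):
    """f = observed success rate, N = #trials, z = normal quantile for confidence c."""
    center = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_interval(0.75, 1000, 1.28)   # c = 80% gives z = 1.28
print(f"p in [{lo:.3f}, {hi:.3f}]")          # [0.732, 0.767], matching the slide
```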
Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table below, z_{α/2} = 1.96

  1 − α   z
  0.99    2.58
  0.98    2.33
  0.95    1.96
  0.90    1.65

• The interval tightens as the number of test instances N grows (acc = 0.8, 95% confidence):

  N         50     100    500    1000   5000
  p(lower)  0.670  0.711  0.763  0.774  0.789
  p(upper)  0.888  0.866  0.833  0.824  0.811
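To sanity-check the N row above, the `wilson_interval` sketch from the Transforming f slide can be reused (the printed values match the table up to rounding):

```python
# Assumes wilson_interval from the earlier sketch; acc = 0.8, z = 1.96 (95%).
for N in (50, 100, 500, 1000, 5000):
    lo, hi = wilson_interval(0.8, N, 1.96)
    print(f"N={N:5d}  p(lower)={lo:.3f}  p(upper)={hi:.3f}")
```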
Confidence Limits
• Confidence limits for the normal distribution with mean 0 and variance 1 (a few standard values):

  Pr[X ≥ z]   z
  0.1%        3.09
  1%          2.33
  5%          1.65
  10%         1.28
Examples
• f = 75%, N = 1000, c = 80% (so that z = 1.28, since Pr[X ≥ 1.28] = 10% leaves 80% between −z and z):
  p ∈ [0.732, 0.767]
Implications
• First, the more test data the better
  – The larger N is, the tighter the confidence interval at a given confidence level
• Second, when training data is limited, how do we still get a large amount of test data?
  – Use cross validation, since all the training data can then participate in testing.
• Third, which model are we testing?
  – Each fold in an N-fold cross validation tests a different model!
  – We wish each such model to be close to the one trained with the whole data set
• Thus, it is a balancing act: the number of folds in a CV cannot be too large, or too small.
Cross Validation: Holdout Method
— Break the data up into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding out a different group in each iteration
(Figure: the held-out test group moves across the folds, one per iteration.)
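A minimal Python sketch of this procedure (the function names and the toy learner are placeholders, not from the slides):

```python
# k-fold holdout: split the data into k groups, hold each out once for testing.
import random

def k_fold_error(data, k, train_fn, error_fn, seed=0):
    data = data[:]                       # shuffle a copy, keep caller's order
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]                                  # held-out group
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        errors.append(error_fn(train_fn(train), test))
    return sum(errors) / k               # average error over the k iterations

def majority(train):                     # toy "learner": predict the majority label
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def err(model, test):                    # error rate of the constant prediction
    return sum(1 for _, y in test if y != model) / len(test)

data = [("x", 1)] * 6 + [("x", 0)] * 4
print(k_fold_error(data, 5, majority, err))
```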
Cross Validation (CV)
• Natural performance measure for classification problems: error rate
  – #Success: the instance's class is predicted correctly
  – #Error: the instance's class is predicted incorrectly
  – Error rate: proportion of errors made over the whole set of instances
• Training Error vs. Test Error
• Confusion Matrix
• Confidence
  – 2% error in 100 tests vs. 2% error in 10,000 tests
  – Which one do you trust more? Apply the confidence interval idea…
• Tradeoff in choosing the number of folds:
  – # of folds = N (the data size): Leave One Out CV
    • Each trained model is very close to the final model, but the test data (a single instance per fold) gives a very biased estimate
  – # of folds = 2
    • Each trained model is very unlike the final model, but the test data is close to the training distribution
ROC (Receiver Operating Characteristic)
• Page 298 of the TSK book.
• Many applications care about ranking (producing a queue from the most likely to the least likely)
• Examples…
• Which ranking order is better?
• ROC: developed in the 1950s in signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• An ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
• The performance of each classifier is represented as a point on the ROC curve
  – Changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point
Metrics for Performance Evaluation…

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL     Class=Yes    a (TP)      b (FN)
  CLASS      Class=No     c (FP)      d (TN)

• Widely-used metric:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
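A minimal Python sketch (not from the slides) tallying the four confusion-matrix cells and computing the accuracy metric above:

```python
# Count a (TP), b (FN), c (FP), d (TN) from actual vs. predicted labels.
def confusion(actual, predicted, positive="yes"):
    a = sum(1 for y, p in zip(actual, predicted) if y == positive and p == positive)  # TP
    b = sum(1 for y, p in zip(actual, predicted) if y == positive and p != positive)  # FN
    c = sum(1 for y, p in zip(actual, predicted) if y != positive and p == positive)  # FP
    d = sum(1 for y, p in zip(actual, predicted) if y != positive and p != positive)  # TN
    return a, b, c, d

a, b, c, d = confusion(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"])
print((a + d) / (a + b + c + d))   # accuracy = 0.5 for this toy example
```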
How to Construct an ROC Curve
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
  – TP rate, TPR = TP / (TP + FN)
  – FP rate, FPR = FP / (FP + TN)

  Instance   P(+|A)   True Class
  1          0.95     +
  2          0.93     +
  3          0.87     -
  4          0.85     -
  5          0.85     -
  6          0.85     +
  7          0.76     -
  8          0.53     +
  9          0.43     -
  10         0.25     +

  (P(+|A) is predicted by the classifier; the true class is the ground truth.)
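A minimal Python sketch of this construction using the ten instances above (not from the slides); sweeping the threshold reproduces the counts in the table on the next slide:

```python
# Threshold sweep over sorted scores: count TP/FP and derive TPR/FPR.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]   # ground truth

P = labels.count("+")                  # total positives
N = labels.count("-")                  # total negatives
for t in sorted(set(scores)) + [1.00]:
    # predict "+" whenever the score is >= the threshold
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
    print(f"threshold>={t:.2f}  TP={tp} FP={fp} TPR={tp/P:.1f} FPR={fp/N:.1f}")
```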
How to Construct an ROC Curve (cont.)

  Class          +     -     +     -     -     -     +     -     +     +
  Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

(ROC curve: plot the (FPR, TPR) pairs from the last two rows.)
Using ROC for Model Comparison
• Neither model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area Under the ROC curve: AUC
  – Ideal: area = 1
  – Random guess: area = 0.5
Area Under the ROC Curve (AUC)
• Interpreting points (TPR, FPR):
  – (0, 0): declare everything to be the negative class
  – (1, 1): declare everything to be the positive class
  – (1, 0): ideal (TPR = 1, FPR = 0)
• Diagonal line:
  – Random guessing
  – Below the diagonal line: prediction is the opposite of the true class
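To tie the last two slides together, a minimal sketch (not from the slides) that computes AUC by the trapezoidal rule over the (FPR, TPR) points from the worked example:

```python
# AUC as the area under the piecewise-linear ROC curve (trapezoidal rule).
pts = [(0, 0), (0, 0.2), (0, 0.4), (0.2, 0.4), (0.2, 0.6), (0.4, 0.6),
       (0.6, 0.6), (0.8, 0.6), (0.8, 0.8), (1, 0.8), (1, 1)]   # (FPR, TPR)
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"AUC = {auc:.2f}")   # 0.60 here; 1.0 is ideal, 0.5 is random guessing
```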