BI Intro
Also adapted from the following sources:
• Tan, Steinbach, and Kumar (TSK book): Introduction to Data Mining
• Witten and Frank (WF, the Weka book): Data Mining
• Han and Kamber (HK book): Data Mining
The BI book is cited as "BI Chapter #...".
BI1.4 Business Intelligence Architectures
• Data Sources
  – Gather and integrate data
  – Challenges
• Data Warehouses and Data Marts
  – Extract, transform, and load (ETL) data
  – Multidimensional exploratory analysis
• Data Mining and Data Analytics
  – Extraction of information and knowledge from data
  – Build models for prediction
• An example: building a telecom customer retention model
  – Given a customer's telecom behavior, predict whether the customer will stay or leave
  – KDDCUP 2010 Data
BI3: Data Warehousing
• Data warehouse:
  – Repository for the data available for BI and decision support systems
  – Holds internal data, external data, and personal data
  – Internal data:
    • Back office: transactional records, orders, invoices, etc.
    • Front office: call center, sales office, marketing campaigns
    • Web-based: sales transactions on e-commerce websites
  – External data:
    • Market surveys, GIS systems
  – Personal data: data about individuals
  – Metadata: data about a whole data set, systems, etc. E.g., what structure is used in the data warehouse? The number of records in a data table, etc.
• Data marts: subsets of the data warehouse dedicated to one function (e.g., marketing).
• OLAP (online analytical processing): a set of tools that support BI analysis and decision making.
• OLTP (online transaction processing): online tools for transaction-related work, focusing on dynamic data.
Working with Data: BI Chap 7
• Let's first consider an example dataset. Each record has independent variables (the attributes) and a dependent variable (the class label). A sample record (from the WF book's weather dataset):

  Independent variables                      Dependent variable
  outlook   temperature   humidity   windy   play
  rainy     71            91         TRUE    no
Measures of Dispersion
• Variance:
  σ² = (1/(m−1)) · Σ_{i=1}^{m} (x_i − μ)²
• Standard deviation:
  σ = [ (1/(m−1)) · Σ_{i=1}^{m} (x_i − μ)² ]^{1/2}
• Normal distribution: the interval μ ± r·σ
  – r=1 contains approximately 68% of the observed values
  – r=2: approximately 95% of the observed values
  – r=3: approximately 99.7% of the observed values
  – Thus, if a sample falls outside (μ ± 3σ), it may be an outlier
• Thm 7.1 (Chebyshev's Theorem): let r ≥ 1 and let (x1, x2, …, xm) be a group of m values; then at least (1 − 1/r²) of the values will fall within the interval μ ± r·σ.
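As a quick illustration of these formulas, here is a minimal Python sketch (not from the slides; the data is made up) that computes the sample mean, variance, and standard deviation, and applies the μ ± 3σ outlier rule:

```python
# A minimal sketch: sample variance with the 1/(m-1) denominator,
# standard deviation, and the 3-sigma outlier check from the slide.
import math

def dispersion(xs):
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / (m - 1)   # sample variance
    return mu, var, math.sqrt(var)

xs = [10, 11, 12, 11, 10, 12, 11, 10, 11, 12, 40]    # hypothetical data
mu, var, sd = dispersion(xs)
outliers = [x for x in xs if abs(x - mu) > 3 * sd]   # outside mu +/- 3*sigma
print(f"mean={mu:.2f} var={var:.2f} sd={sd:.2f} outliers={outliers}")  # flags 40
```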
Heterogeneity Measures
• The Gini index: G = 1 − Σ_{h=1}^{H} f_h², where f_h is the relative frequency of class h among the H classes.
• (Wiki: The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper "Variability and Mutability" (Italian: Variabilità e mutabilità).)
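A minimal Python sketch of the Gini index above (the `gini` helper is written for this note, not from the slides):

```python
# Gini index G = 1 - sum(f_h^2) over the class frequencies of a label list.
from collections import Counter

def gini(labels):
    m = len(labels)
    return 1.0 - sum((n / m) ** 2 for n in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))   # maximum heterogeneity for 2 classes: 0.5
print(gini(["yes"] * 10))               # homogeneous set: 0.0
```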
Test of Significance
• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we conclude that M1 is really better than M2, given how few instances it was tested on? The confidence-interval machinery on the next slides addresses this.
Confidence Intervals
• Suppose the observed frequency f is 25%. How close is this to the true probability p?
• Prediction is just like tossing a biased coin
  – "Head" is a "success", "tail" is an "error"
• In statistics, a succession of independent events like this is called a Bernoulli process
  – Statistical theory provides us with confidence intervals for the true underlying proportion!
  – Mean and variance for a Bernoulli trial with success probability p: p and p(1−p)
Confidence Intervals
• We can say: p lies within a certain specified interval with a certain specified confidence
• Example: S = 750 successes in N = 1000 trials
  – Estimated success rate: f = 75%
  – How close is this to the true success rate p?
• Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Confidence Interval for Normal Distribution
• For large enough N, the observed frequency f follows (approximately) a normal distribution
• f can then be modeled with a standardized random variable X with mean 0
• The c% confidence interval [−z ≤ X ≤ z] for such an X is given by:
  Pr[−z ≤ X ≤ z] = c
  Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
  (Figure: standard normal density with area c = 1 − α between −z_{α/2} and z_{1−α/2}.)
Transforming f
• Transformed value for f:
  (f − p) / sqrt(p(1−p)/N)
  (i.e., subtract the mean and divide by the standard deviation)
• Resulting equation:
  Pr[ −z ≤ (f − p) / sqrt(p(1−p)/N) ≤ z ] = c
• Solving the resulting quadratic for p gives the (Wilson) interval:
  p = [ f + z²/(2N) ± z · sqrt( f/N − f²/N + z²/(4N²) ) ] / [ 1 + z²/N ]
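A minimal Python sketch of the solved equation (the `wilson_interval` name is mine, not from the slides); it reproduces the deck's running example (f = 75%, N = 1000, c = 80%):

```python
# Wilson score interval: solve Pr[-z <= (f-p)/sqrt(p(1-p)/N) <= z] = c for p.
import math

def wilson_interval(f, N, z):
    """f = observed success rate, N = #trials, z = normal quantile for confidence c."""
    center = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_interval(0.75, 1000, 1.28)   # c = 80% gives z = 1.28
print(f"p in [{lo:.3f}, {hi:.3f}]")          # [0.732, 0.767], matching the slide
```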
Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table below, z_{α/2} = 1.96

  1 − α   z
  0.99    2.58
  0.98    2.33
  0.95    1.96
  0.90    1.65

• The interval tightens as the number of test instances N grows (acc = 0.8, 95% confidence):

  N         50     100    500    1000   5000
  p(lower)  0.670  0.711  0.763  0.774  0.789
  p(upper)  0.888  0.866  0.833  0.824  0.811
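To sanity-check the N row above, the `wilson_interval` sketch from the Transforming f slide can be reused (the printed values match the table up to rounding):

```python
# Assumes wilson_interval from the earlier sketch; acc = 0.8, z = 1.96 (95%).
for N in (50, 100, 500, 1000, 5000):
    lo, hi = wilson_interval(0.8, N, 1.96)
    print(f"N={N:5d}  p(lower)={lo:.3f}  p(upper)={hi:.3f}")
```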
Confidence Limits
• Confidence limits for the normal distribution with mean 0 and variance 1 (a few standard values):

  Pr[X ≥ z]   z
  0.1%        3.09
  1%          2.33
  5%          1.65
  10%         1.28
Examples
• f = 75%, N = 1000, c = 80% (so that z = 1.28, since Pr[X ≥ 1.28] = 10% leaves 80% between −z and z):
  p ∈ [0.732, 0.767]
Implications
• First, the more test data the better
  – The larger N is, the tighter the confidence interval at a given confidence level
• Second, when training data is limited, how do we still get a large amount of test data?
  – Use cross validation, since all the training data can then participate in testing.
• Third, which model are we testing?
  – Each fold in an N-fold cross validation tests a different model!
  – We wish each such model to be close to the one trained with the whole data set
• Thus, it is a balancing act: the number of folds in a CV cannot be too large, or too small.
Cross Validation: Holdout Method
— Break the data up into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding out a different group in each iteration
(Figure: the held-out test group moves across the folds, one per iteration.)
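A minimal Python sketch of this procedure (the function names and the toy learner are placeholders, not from the slides):

```python
# k-fold holdout: split the data into k groups, hold each out once for testing.
import random

def k_fold_error(data, k, train_fn, error_fn, seed=0):
    data = data[:]                       # shuffle a copy, keep caller's order
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]                                  # held-out group
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        errors.append(error_fn(train_fn(train), test))
    return sum(errors) / k               # average error over the k iterations

def majority(train):                     # toy "learner": predict the majority label
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def err(model, test):                    # error rate of the constant prediction
    return sum(1 for _, y in test if y != model) / len(test)

data = [("x", 1)] * 6 + [("x", 0)] * 4
print(k_fold_error(data, 5, majority, err))
```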
Cross Validation (CV)
• Natural performance measure for classification problems: error rate
  – #Success: the instance's class is predicted correctly
  – #Error: the instance's class is predicted incorrectly
  – Error rate: proportion of errors made over the whole set of instances
• Training Error vs. Test Error
• Confusion Matrix
• Confidence
  – 2% error in 100 tests vs. 2% error in 10,000 tests
  – Which one do you trust more? Apply the confidence interval idea…
• Tradeoff in choosing the number of folds:
  – # of folds = N (the data size): Leave One Out CV
    • Each trained model is very close to the final model, but the test data (a single instance per fold) gives a very biased estimate
  – # of folds = 2
    • Each trained model is very unlike the final model, but the test data is close to the training distribution
ROC (Receiver Operating Characteristic)
• Page 298 of the TSK book.
• Many applications care about ranking (producing a queue from the most likely to the least likely)
• Examples…
• Which ranking order is better?
• ROC: developed in the 1950s in signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• An ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
• The performance of each classifier is represented as a point on the ROC curve
  – Changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point
Metrics for Performance Evaluation…

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL     Class=Yes    a (TP)      b (FN)
  CLASS      Class=No     c (FP)      d (TN)

• Widely-used metric:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
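A minimal Python sketch (not from the slides) tallying the four confusion-matrix cells and computing the accuracy metric above:

```python
# Count a (TP), b (FN), c (FP), d (TN) from actual vs. predicted labels.
def confusion(actual, predicted, positive="yes"):
    a = sum(1 for y, p in zip(actual, predicted) if y == positive and p == positive)  # TP
    b = sum(1 for y, p in zip(actual, predicted) if y == positive and p != positive)  # FN
    c = sum(1 for y, p in zip(actual, predicted) if y != positive and p == positive)  # FP
    d = sum(1 for y, p in zip(actual, predicted) if y != positive and p != positive)  # TN
    return a, b, c, d

a, b, c, d = confusion(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"])
print((a + d) / (a + b + c + d))   # accuracy = 0.5 for this toy example
```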
How to Construct an ROC Curve
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
  – TP rate, TPR = TP / (TP + FN)
  – FP rate, FPR = FP / (FP + TN)

  Instance   P(+|A)   True Class
  1          0.95     +
  2          0.93     +
  3          0.87     -
  4          0.85     -
  5          0.85     -
  6          0.85     +
  7          0.76     -
  8          0.53     +
  9          0.43     -
  10         0.25     +

  (P(+|A) is predicted by the classifier; the true class is the ground truth.)
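A minimal Python sketch of this construction using the ten instances above (not from the slides); sweeping the threshold reproduces the counts in the table on the next slide:

```python
# Threshold sweep over sorted scores: count TP/FP and derive TPR/FPR.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]   # ground truth

P = labels.count("+")                  # total positives
N = labels.count("-")                  # total negatives
for t in sorted(set(scores)) + [1.00]:
    # predict "+" whenever the score is >= the threshold
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
    print(f"threshold>={t:.2f}  TP={tp} FP={fp} TPR={tp/P:.1f} FPR={fp/N:.1f}")
```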
How to Construct an ROC Curve (cont.)

  Class          +     -     +     -     -     -     +     -     +     +
  Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

(ROC curve: plot the (FPR, TPR) pairs from the last two rows.)
Using ROC for Model Comparison
• Neither model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area Under the ROC curve: AUC
  – Ideal: area = 1
  – Random guess: area = 0.5
Area Under the ROC Curve (AUC)
• Interpreting points (TPR, FPR):
  – (0, 0): declare everything to be the negative class
  – (1, 1): declare everything to be the positive class
  – (1, 0): ideal (TPR = 1, FPR = 0)
• Diagonal line:
  – Random guessing
  – Below the diagonal line: prediction is the opposite of the true class
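To tie the last two slides together, a minimal sketch (not from the slides) that computes AUC by the trapezoidal rule over the (FPR, TPR) points from the worked example:

```python
# AUC as the area under the piecewise-linear ROC curve (trapezoidal rule).
pts = [(0, 0), (0, 0.2), (0, 0.4), (0.2, 0.4), (0.2, 0.6), (0.4, 0.6),
       (0.6, 0.6), (0.8, 0.6), (0.8, 0.8), (1, 0.8), (1, 1)]   # (FPR, TPR)
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"AUC = {auc:.2f}")   # 0.60 here; 1.0 is ideal, 0.5 is random guessing
```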