Data Mining: Business Intelligence
Business Intelligence
Outline
Data Mining and KDD
Why Data Mining
Applications of Data Mining
Data Preprocessing
Data Mining techniques
Visualization of the results
Summary
Data Mining and KDD
Looking for knowledge
The Explosive Growth of Data
The World Wide Web
Business: e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube, forums,
blogs, Google & Co
We are drowning in data, but starving for knowledge!
Avoid data tombs
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets.
What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
[Figure: the KDD process, from bottom to top: data sources → data cleaning and data integration → data warehouse → data selection → task-relevant data → data mining → pattern evaluation]
Data Mining and Business Intelligence
[Figure: pyramid of layers with increasing potential to support business decisions, from Data Exploration (statistical summary, querying, and reporting) up to Decision Making (End User)]
[Figure: Data Mining at the confluence of Machine Learning, Visualization, Pattern Recognition, Algorithms, and Other Disciplines]
Why Data Mining?
Why is Data Mining so complex? A
matter of data dimensions
Tremendous amount of data
Walmart – Customer buying patterns – a data warehouse 7.5 Terabytes
large in 1995
VISA – Detecting credit card interoperability issues – 6800 payment
transactions per second
High-dimensionality of data
Many dimensions to be combined together
Data cube example: time, location, product sales
High complexity of data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Spatial, spatiotemporal, multimedia, text and Web data
What does Data Mining provide me
with? (1)
Multidimensional concept description: Characterization and
discrimination
Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
Characterization describes things in the same class,
discrimination describes how to separate different classes
Frequent patterns, association, correlation vs. causality
Wine → Spaghetti [0.3% of all basket cases, 75% of cases when
tomato sauce is bought]
Is this correlation or not?
What does Data Mining provide me
with? (2)
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based
on gas mileage
Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass
similarity
What does Data Mining provide me
with? (3)
Outlier analysis
Outlier: Data object that does not comply with the general
behavior of the data
Fraud detection is the main application area
Noise or exception?
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD
memory
Periodicity analysis
Similarity-based analysis
Applications of Data Mining
Market Analysis and Management
Data sources:
credit card transactions, loyalty cards, smart cards, discount
coupons, ...
Target marketing
Find clusters of “model” customers who share the same
characteristics:
• Geographics (lives in Rome, lives in Trentino)
• Demographics (married, between 21-35, at least one child, family income
more than 40.000€/year)
• Psychographics (likes new products, consistently uses the Web)
• Behaviors (searches for information on the Internet, always defends her decisions)
Determine customer purchasing patterns over time
Applications of Data Mining
Market Analysis and Management
Cross-market analysis
Find associations between product sales, and predict based on
such association
Compare the sales in the US and in Italy, find associations in
old products and predict if new ones will have success
Customer profiling
What types of customers buy what products
Customers aged 20-30 with income > 20K€ will buy
product A
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Applications of Data Mining
Corporate Analysis
Finance Planning and Asset Evaluation
Cash flow prediction and analysis
Cross-sectional and time-series analysis (financial ratio, trend
analysis)
Resource Planning
Summarize and compare the resources and spending
Competition
Monitor competitors and market directions
Group customers into classes and define a class-based pricing procedure
Set the pricing strategy in a highly competitive market
Other examples?
What’s next?
Data Preprocessing
Why is it needed?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy
Data Mining techniques
Frequent patterns, association rules
Classification and prediction
Cluster Analysis
Visualization of the results
Are you sleeping?
Summary
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“ ”, birthdate=“31/12/2099”
noisy: containing errors or outliers
• e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997” (we are in 2007!!)
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records. In one copy of the data
customer A has to pay 200.000€, in the second copy of the data A does not
have to pay anything.
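The three kinds of dirty data above can be detected with simple rule-based checks. The following is only a minimal sketch: the record layout, the field names, the `find_problems` helper, and the reference date are assumptions made for illustration, and the birthday is read as day/month/year.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: a required attribute value is missing or blank.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be correct.
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: the stored age disagrees with the age derived from the birthday.
    birthday = record["birthday"]
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    derived_age = today.year - birthday.year - (0 if had_birthday else 1)
    if derived_age != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

today = date(2007, 1, 1)  # assumed reference date ("we are in 2007")
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
clean = {"occupation": "teacher", "salary": 1000, "age": 9, "birthday": date(1997, 7, 3)}

print(find_problems(dirty, today))
print(find_problems(clean, today))
```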
Why is data dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was
collected and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Why Is Data Preprocessing
Important?
Data Preprocessing
1. Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data
warehousing”— Ralph Kimball
Data Preprocessing
1. Data cleaning – binning
Handle noisy data
Binning, clustering, regression (no details)
Binning
1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9
Bin 2: 15, 21, 21
Bin 3: 24, 25, 26
3. Smoothing by bin means:
Bin 1: 7, 7, 7
Bin 2: 19, 19, 19
Bin 3: 25, 25, 25
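The binning steps above can be reproduced in a few lines. This is a minimal sketch, assuming the number of values is divisible by the number of bins; the function names are chosen here for illustration.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the rounded mean of that bin."""
    return [[round(mean(b))] * len(b) for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)

print(bins)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
print(smoothed)  # [[7, 7, 7], [19, 19, 19], [25, 25, 25]]
```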
Data Preprocessing
1. Data cleaning – clustering
[Figure: clustering, with values falling outside the clusters marked as noise]
Data Preprocessing
2. Integration and transformation
Data Integration combines data from multiple sources (D1, D2, D3)
into a coherent store (D1,2,3)
Schema integration
Integrate metadata from different sources
A.cust-id ≡ B.cust-number
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different (e.g., cm vs. inch)
Data Preprocessing
2. Integration and transformation
Data integration can lead to redundant attributes
Same object (A.house = B.residence)
Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
Redundant attributes can be discovered via correlation
analysis
A mathematical method for detecting the correlation between two
attributes
Correlation coefficient (Pearson’s product moment coefficient):
the closer its absolute value is to 1, the stronger the correlation between attributes
χ² (chi-square) test
No details on these methods here
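As a rough illustration of correlation analysis, Pearson's coefficient can be computed directly from its definition. The sketch below uses hypothetical salary and rental income figures (not taken from the slides) to show that a derived attribute such as annual income correlates strongly with the attribute it is built from.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values (in thousands of euros), for illustration only.
salary = [30.0, 42.0, 55.0, 61.0, 78.0]
rental_income = [0.0, 5.0, 0.0, 12.0, 3.0]
annual_income = [s + r for s, r in zip(salary, rental_income)]

r_derived = pearson(salary, annual_income)    # close to 1: likely redundant
r_unrelated = pearson(salary, rental_income)  # much weaker correlation
print(round(r_derived, 3), round(r_unrelated, 3))
```

A coefficient near +1 or -1 suggests that one of the two attributes may be dropped; a value near 0 suggests no linear relationship.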
Data Preprocessing
2. Integration and transformation
Aggregation:
Sum the sales of different branches (in different data sources) to
compute the company sales
Generalization:
concept hierarchy climbing
From integer attribute age to classes of age (children, adult, old)
Normalization: scaled to fall within a small, specified range
Change the range from (-∞, +∞) to [-1, +1]
{-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
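The slide does not name the normalization method. The numbers in the example are consistent with dividing every value by the largest absolute value, so the sketch below assumes that method; the function name is hypothetical.

```python
def scale_by_max_abs(values):
    """Scale values into [-1, +1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

scaled = scale_by_max_abs([-13, -6, -3, 10, 100])
print(scaled)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```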
Data Preprocessing
3. Data reduction
Data reduction
Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results
Different reduction types (dimensions, numerosity,
discretization)
Dimensionality: Attribute subset selection
Example with a decision tree (left branches True, right False)
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with root test A4? and child tests A1? and A6?]
Reduced attribute set: {A1, A4, A6}
[Figure: 2 clusters. Sparse data leads to many clusters, which is not effective.]
Data Preprocessing
3. Data reduction
Numerosity: Sampling
obtaining a small sample s to represent the whole data set N
Problem: How to select a representative sampling set
Random sampling is not enough – representative samples should
be preserved
Stratified sampling: Approximate the percentage of each class (or
subpopulation of interest) in the overall database
[Figure: random sampling vs. stratified sampling; with random sampling, one region of the data yields no samples]
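Stratified sampling can be sketched as follows. The customer data, the `stratified_sample` helper, and the rule of keeping at least one record per class are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly the same fraction from every class (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Keep at least one record per stratum so small classes stay represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 90 "regular" and 10 "premium" customers.
customers = [("regular", i) for i in range(90)] + [("premium", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda c: c[0], fraction=0.1)

premium = sum(1 for c in sample if c[0] == "premium")
print(len(sample), premium)  # 10 1
```

A plain random sample of 10 customers could easily contain no premium customer at all, which is the situation the figure labels "no samples from here".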
Data Preprocessing
4. Discretization - concept hierarchy
Three types of attributes
Nominal — values from an unordered set (color, profession)
Ordinal — values from an ordered set (military or academic rank)
Continuous — numbers (integer or real numbers)
Discretization
Divide the range of a continuous attribute into intervals
Reduces data size and its complexity
Some data mining algorithms do not support continuous types, and in
those cases discretization is mandatory
Some useful methods:
Binning, clustering (already presented)
Entropy-based discretization (no details here)
Data Preprocessing
4. Discretization - concept hierarchy
Concept hierarchy generation
For categorical data
Specification of an ordering between attributes (schema level)
• street < city < state < country
Specification of a hierarchy of values (data level)
• {Urbana, Champaign, Chicago} < Illinois
Automatic generation using the number of distinct values
• For the set of attributes: {street, city, state, country}
• IF: |street| = 600.000, |city|=3.000, |state|=300, |country|=15
• THEN: street < city < state < country
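The automatic generation rule above amounts to sorting the attributes by their number of distinct values. A minimal sketch, using the counts from the slide and a function name chosen for illustration:

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attributes from most to fewest distinct values (lowest to highest level)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)

counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
hierarchy = hierarchy_from_distinct_counts(counts)
print(" < ".join(hierarchy))  # street < city < state < country
```

This heuristic is only a starting point: an attribute with few distinct values is not always higher in the hierarchy, so the result should be reviewed by someone who knows the data.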
Outline
Data Mining techniques
Frequent patterns, association rules
• Support and confidence
Classification and prediction
• Decision trees
• Bayesian classifiers
• Support Vector Machines
• Lazy learning
Cluster Analysis
Visualization of the results
Summary
Data Mining techniques
Frequent pattern analysis
What is it?
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
Frequent pattern analysis: searching for frequent patterns
Motivation: Finding inherent regularities in data
• Which products are bought together? Yesterday’s wine and spaghetti
example
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sale campaign analysis
Basic Concepts: Frequent Patterns
and Association Rules (1)
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
Itemsets (= transactions in this example)
Support and confidence
For a rule A → B:
support, s, probability that a transaction contains A ∪ B
s = P(A ∪ B)
confidence, c, conditional probability that a transaction
having A also contains B.
c = P(B|A).
Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold
(min_conf) are called strong.
Basic Concepts: Frequent Patterns
and Association Rules (2)
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%
Support is used to define frequent patterns (sets of
products that occur in more than s% of the itemsets)
{Wine} in itemsets 1, 2, 3 (support = 60%)
{Bread} in itemsets 1, 4, 5 (support = 60%)
{Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
{Cheese} in itemsets 3, 4, 5 (support = 60%)
{Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
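The support values listed above can be checked by counting. A minimal sketch, with the five transactions from the slide encoded as Python sets and a hypothetical `support` helper:

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

listed = [{"Wine"}, {"Bread"}, {"Spaghetti"}, {"Cheese"}, {"Wine", "Spaghetti"}]
for itemset in listed:
    print(sorted(itemset), support(itemset, transactions))

# Cocoa and Sugar appear in only 2 of 5 transactions (40%), below s = 50%.
cocoa_support = support({"Cocoa"}, transactions)
```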
Basic Concepts: Frequent Patterns
and Association Rules (3)
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%
Confidence defines association rules: X → Y rules in frequent
patterns whose confidence is bigger than c
Suggestion: {Wine, Spaghetti} is the only frequent
pattern to be considered. Why?
Association rules:
Wine → Spaghetti (support=60%, confidence=100%)
Spaghetti → Wine (support=60%, confidence=75%)
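The two rules above can be verified by counting transactions. In this sketch the helper names are illustrative; confidence is computed as count(X ∪ Y) / count(X), which equals P(Y|X).

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(antecedent, consequent):
    """Support of the rule: share of transactions containing both sides."""
    return count(set(antecedent) | set(consequent)) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule: P(consequent | antecedent)."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(support({"Wine"}, {"Spaghetti"}), confidence({"Wine"}, {"Spaghetti"}))  # 0.6 1.0
print(support({"Spaghetti"}, {"Wine"}), confidence({"Spaghetti"}, {"Wine"}))  # 0.6 0.75
```

Both rules have the same support because support depends only on the combined itemset, while confidence depends on the direction of the rule.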
Advanced concepts in Association
Rules discovery
Algorithms must face scalability problems
Apriori: If an itemset is infrequent, its supersets
should not be generated/tested!
Advanced problems
Boolean vs. quantitative associations
age(x, “30..39”) and income(x, “42..48K”) → buys(x, “car”)
[s=1%, c=75%]
Single level vs. multiple-level analysis
What brands of wine are associated with what brands of
spaghetti?
Are support and confidence
clear?
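The Apriori pruning idea mentioned above can be sketched as a level-wise search: candidates of size k are built only from frequent itemsets of size k-1, and a candidate is discarded as soon as one of its subsets is infrequent. This is a simplified illustration rather than an optimized implementation; it reuses the wine and spaghetti transactions, and `min_count=3` is an assumption that corresponds to "more than 50%" of 5 transactions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = set().union(*transactions)
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {}
    k = 1
    while level:
        frequent.update({c: count(c) for c in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) >= min_count}
    return frequent

transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]
result = apriori(transactions, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Because Cocoa and Sugar are already infrequent as single items, no pair or triple containing them is ever generated, which is where the savings come from on large data sets.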
Another example for association
rules
Transaction-id   Items bought
1                Margherita, Beer, Coke
2                Margherita, Beer
4                Margherita, Coke
Support s = 40%, Confidence c = 70%
Classification vs. Prediction
Classification
Characterizes (describes) a set of items belonging to a training
set; these items are already classified according to a label attribute
The characterization is a model
The model can be applied to classify new data (predict the class
they should belong to)
Prediction
models continuous-valued functions, i.e., predicts unknown or
missing values
Applications
Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
The class label attribute defines the class each item should belong
to
The set of items used for model construction is called training set
The model is represented as classification rules, decision trees, or
mathematical formulae
2. Model usage
Estimate accuracy of the model
• On the training set
• On a generalization of the training set
If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Classification: the process
Model construction
[Figure: training data → classification algorithms]
Classification: the process
Model usage
Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4). Tenured?
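The model usage step can be walked through in code: apply the classifier rule to the testing data to estimate its accuracy, then classify the unseen tuple. The function name and the tuple layout below are chosen for illustration.

```python
def predict_tenured(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy: share of testing tuples whose predicted label matches the known label.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)
print(accuracy)  # 0.75 (the rule misclassifies Merlisa)

# Classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(jeff)  # yes
```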
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Evaluating generated models
Accuracy
classifier accuracy: predicting class label
predictor accuracy: guessing value of predicted attributes
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Example of Classification
[Figure: decision tree with yes/no branches, including a test on Age > 60 and a Low risk leaf]
Classification techniques
Decision Trees (2)
How are the attributes in decision trees selected?
Two well-known indexes are used
• Information gain selects the most informative attribute in
distinguishing the items between the classes
• It is biased towards attributes with a large set of values
• Gain ratio addresses this limitation of information gain
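The difference between the two indexes can be seen on a toy example. The sketch below uses the standard entropy-based definitions of information gain and gain ratio; the eight rows, the attribute names, and the helper functions are hypothetical. The `id` attribute has a distinct value per row, so information gain prefers it even though it is useless for prediction, while gain ratio prefers `student`.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def gain_ratio(rows, attribute, target):
    """Information gain divided by the entropy of the attribute itself (split info)."""
    split_info = entropy([r[attribute] for r in rows])
    return information_gain(rows, attribute, target) / split_info if split_info else 0.0

# Hypothetical training set: 4 students (all buy), 4 non-students (1 buys).
labels = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
students = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
rows = [
    {"id": i, "student": s, "buys": b}
    for i, (s, b) in enumerate(zip(students, labels))
]

for attribute in ("id", "student"):
    print(
        attribute,
        round(information_gain(rows, attribute, "buys"), 3),
        round(gain_ratio(rows, attribute, "buys"), 3),
    )
```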
Classification techniques
Bayesian classifiers (2)
Bayesian classification
A statistical classification technique
• Predicts class membership probabilities
Founded on the Bayes theorem
P(H | X) = P(X | H) P(H) / P(X)
• What if X = “Red and rounded” and H = “Apple”?
Performance
• The simplest implementation (Naïve Bayes) is comparable in performance to decision trees
and neural networks
Incremental
• Each training example can increase/decrease the probability that a
hypothesis is correct
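As a worked instance of Bayes' theorem for X = “Red and rounded” and H = “Apple”, the sketch below plugs in made-up probabilities. All numbers are hypothetical and serve only to show how the terms combine.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H | X) = P(X | H) * P(H) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical probabilities, for illustration only.
p_h = 0.2              # P(H): 20% of the fruits are apples
p_x_given_h = 0.7      # P(X | H): 70% of the apples are red and rounded
p_x_given_not_h = 0.1  # P(X | not H): 10% of the other fruits are red and rounded

# P(X) by the law of total probability.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_x_given_h, p_h, p_x)
print(round(p_h_given_x, 3))  # 0.636
```

Observing "red and rounded" raises the probability of "apple" from the prior 0.2 to roughly 0.64 under these assumed numbers.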
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
Genetic algorithm
Rough set approach
Fuzzy set approaches
The k-Nearest Neighbor Algorithm
5 minutes break!
Classification techniques
Support Vector Machines
One of the most advanced classification techniques
Left figure: a small margin between the classes is found
Right figure: the largest margin is found
Support vector machines (SVMs) are able to identify the largest margin, shown in the right figure
Classification techniques
SVMs + Kernel Functions
Is data always linearly separable?
NO!!!
Solution: SVMs + Kernel Functions
Classification techniques
Lazy learning
Lazy learning
Simply stores training data (or does only minor processing) and waits
until it is given a test tuple
Less time in training but more time in predicting
Uses a richer hypothesis space (many local linear functions),
which can improve accuracy
Instance-based learning
Subcategory of lazy learning
Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
An example: k-nearest neighbor approach
Classification techniques
k-nearest neighbor
All instances correspond to points in the n-dimensional
space; x is the instance to be classified
The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
For discrete-valued target functions, k-NN returns the most common value
among the k training examples nearest to x
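The k-nearest neighbor rule described above fits in a few lines. This is a minimal sketch for the discrete-valued case with Euclidean distance; the training points and class labels are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

def knn_classify(training, x, k):
    """Return the most common label among the k training points nearest to x."""
    neighbors = sorted(training, key=lambda item: dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional training instances with their class labels.
training = [
    ((1.0, 1.0), "low risk"),
    ((1.5, 2.0), "low risk"),
    ((2.0, 1.0), "low risk"),
    ((6.0, 6.0), "high risk"),
    ((7.0, 5.5), "high risk"),
    ((6.5, 7.0), "high risk"),
]

print(knn_classify(training, (1.2, 1.4), k=3))  # low risk
print(knn_classify(training, (6.2, 6.1), k=3))  # high risk
```

There is no training step: all the work happens at classification time, which is what makes the method "lazy".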
Prediction techniques
An overview
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
No details here
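For the simplest case, linear regression with one predictor variable, the least-squares line can be computed in closed form. The sketch below uses hypothetical data (years of experience vs. salary) and illustrative function names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary in thousands of euros.
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predict the unknown value for 6 years
print(slope, intercept, predicted)  # 5.0 25.0 55.0
```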
What is cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
It belongs to unsupervised learning
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis
Marketing:
Help marketers discover distinct groups in their customer bases
Land use:
Identification of areas of similar land use in an earth observation
database
Insurance:
Identifying groups of motor insurance policy holders with a high
average claim cost
City-planning:
Identifying groups of houses according to their house type, value,
and geographical location
Good clustering
A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically a metric: d(i, j)
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
It is hard to define “similar enough” or “good enough”
A small example
How to cluster this data?
Visualization of the results
Presentation of the results or knowledge obtained from
data mining in visual forms
Examples
Scatter plots
Association rules
Decision trees
Clusters
Summary
Data Mining and KDD
Why Data Mining?
Data preprocessing
Some scenarios
Classification
Clustering