Introduction To Data Mining
Understanding Data Mining
Data is growing at a phenomenal rate, almost doubling every year
Data is kept in files, but mostly in relational databases
Large operational databases are usually built for OLTP applications, e.g. banking
A large pool of past, historical data accumulates
Can we do something with this large amount of data?
Find meaningful and interesting information in the data
Can standard SQL do it? No, we need a different approach and different algorithms
Data mining is the answer
Why Now?
Data Mining
Credit ratings/targeted marketing
Given a database of 100,000 names, which persons are the least likely
to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
Customer relationship management
Which customers are likely to be the most loyal, and which are most
likely to leave for a competitor?
Data Mining helps extract such information
What is Data Mining?
Extracting or mining knowledge from large data sets to find patterns that are
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: it should be possible to act on the pattern
understandable: humans should be able to interpret the pattern
Also known as Knowledge Discovery in Databases (KDD)
Example
Which items are purchased together in a retail store?
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management
identify those who are likely to leave for a competitor
Targeted marketing
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production
automatically adjust knobs when process parameters change
Applications (continued)
Data Mining Versus KDD
The KDD process
Problem formulation
Data collection
Data cleaning: remove noise and inconsistent data
Data integration: combine data from multiple sources
Data selection: select the data relevant to the mining task
Data transformation: transform data (summarize, aggregate, or consolidate) into a form appropriate for mining
Data mining: find interesting patterns
Pattern evaluation: identify the truly interesting patterns
Result evaluation and visualization: presentation, GUI
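Taken together, these phases form a pipeline. A minimal sketch of the idea in Python (the record fields, data sources, and the trivial "big spender" pattern are all hypothetical, not from the slides):

```python
def clean(records):
    """Data cleaning: drop noisy / inconsistent records."""
    return [r for r in records if r.get("amount") is not None]

def integrate(*sources):
    """Data integration: combine records from multiple sources."""
    return [r for src in sources for r in src]

def select(records):
    """Data selection: keep only the fields relevant to the task."""
    return [{"customer": r["customer"], "amount": r["amount"]} for r in records]

def transform(records):
    """Data transformation: aggregate total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, threshold=100):
    """Data mining: a deliberately trivial 'pattern', big spenders."""
    return {c for c, t in totals.items() if t >= threshold}

# Hypothetical source data
bank = [{"customer": "alice", "amount": 80}, {"customer": "bob", "amount": None}]
web = [{"customer": "alice", "amount": 40}]
print(mine(transform(select(clean(integrate(bank, web))))))  # {'alice'}
```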
Data Mining Works with Warehouse Data
Data warehousing provides the enterprise with a memory
Data Mining Algorithms
Some basic data mining tasks
Predictive: predict values of data using known results found from different (historical) data
Regression
Classification
Time series analysis
Descriptive: identify patterns or relationships in the data
Clustering / similarity matching
Association rules and variants
Summarization
Sequence discovery
Supervised Learning vs. Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification
Maps data into predefined classes or groups
Given old data about customers and their payments, predict a new applicant's loan eligibility
Classification vs. Prediction
Classification
predicts categorical class labels (discrete or nominal)
Prediction
models continuous-valued functions, i.e. predicts unknown or missing values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Model Construction (Process I)
[Figure: training data is fed to classification algorithms, which construct the classifier (model)]
Classification
Use the Model in Prediction (Process II)
The classifier is applied to testing data, and then to unseen data such as (Jeff, Professor, 4): Tenured?

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Another example
Association Rules and Market Basket Analysis
What is Market Basket Analysis?
Customer analysis
Market basket analysis uses information about what a customer purchases to give us insight into who they are and why they make certain purchases
Product analysis
Market basket analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to promotion
Market Basket Example
Association Rules
There has been a considerable amount of research in the area of market basket analysis.
Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
Given
A database of transactions
Each transaction contains a set of items
Find all rules X -> Y that correlate the presence of one set of items X with another set of items Y
Example: when a customer buys bread and butter, they buy milk 85% of the time
Results: Useful, Trivial, or Inexplicable?
While association rules are easy to understand, they are not always useful.
How Does It Work?
Grocery Point-of-Sale Transactions

Customer  Items
1         Orange Juice, Soda
2         Milk, Orange Juice, Window Cleaner
3         Orange Juice, Detergent
4         Orange Juice, Detergent, Soda
5         Window Cleaner, Soda

Co-Occurrence of Products

                OJ  Window Cleaner  Milk  Soda  Detergent
OJ               4               1     1     2          1
Window Cleaner   1               2     1     1          0
Milk             1               1     1     0          0
Soda             2               1     0     3          1
Detergent        1               0     0     1          2
How Does It Work?
The co-occurrence table contains some simple patterns
Orange juice and soda are more likely to be purchased together than any other two items
Detergent is never purchased with window cleaner or milk
Milk is never purchased with soda or detergent
These simple observations are examples of associations and may suggest a formal rule like:
IF a customer purchases soda, THEN the customer also purchases orange juice
How Good Are the Rules?
In the data, two of the five transactions include both soda and orange juice. These two transactions support the rule, so the support for the rule is two out of five, or 40%.
Both transactions that contain soda also contain orange juice, so there is a high degree of confidence in the rule. In fact, every transaction that contains soda contains orange juice, so the rule "IF soda, THEN orange juice" has a confidence of 100%.
Confidence and Support: How Good Are the Rules?
A rule must have some minimum user-specified confidence
1 & 2 -> 3 has 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of the cases
A rule must have some minimum user-specified support
1 & 2 -> 3 should hold in some minimum percentage of transactions to have value
Confidence and Support

Transaction ID  Items
1               {1, 2, 3}
2               {1, 3}
3               {1, 4}
4               {2, 5, 6}

For minimum support = 50% (2 transactions) and minimum confidence = 50%:

One-Item Sets   Support
{1}             75%
{2}             50%
{3}             50%
{4}             25%

Two-Item Sets   Support
{1, 2}          25%
{1, 3}          50%
{1, 4}          25%
{2, 3}          25%

For the rule 1 -> 3:
Support = Support({1,3}) = 50%
Confidence(1 -> 3) = Support({1,3}) / Support({1}) = 66%
Confidence(3 -> 1) = Support({1,3}) / Support({3}) = 100%
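A minimal Python sketch (not from the slides) that reproduces these support and confidence figures for the four transactions above:

```python
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """confidence(lhs -> rhs) = support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({1, 3}))       # 0.5   -> support of rule 1 -> 3 is 50%
print(confidence({1}, {3}))  # 0.666 -> confidence(1 -> 3) = 66%
print(confidence({3}, {1}))  # 1.0   -> confidence(3 -> 1) = 100%
```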
Association Examples
Find all rules that have “Diet Coke” as a result. These rules may help
plan what the store should do to boost the sales of Diet Coke.
Find all rules that have “Yogurt” in the condition. These rules may
help determine what products may be impacted if the store
discontinues selling “Yogurt”.
Find all rules that have “Brats” in the condition and “mustard” in the
result. These rules may help in determining the additional items that
have to be sold together to make it highly likely that mustard will
also be sold.
Find the best k rules that have “Yogurt” in the result.
Example - Minimum Support Pruning / Rule Generation
[Figure: scan the database, find pairings, find the level of support for each]
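The scan/pair/support loop in that figure is the core of Apriori-style level-wise search. A simplified sketch on the toy transactions from the previous slides (the full algorithm also checks that every subset of each candidate is frequent):

```python
def frequent_itemsets(transactions, min_support):
    """Level-wise search: only itemsets that survived the previous level's
    minimum-support pruning are extended into new candidates."""
    n = len(transactions)
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent = {}
    while candidates:
        # Scan the database: count support for this level's candidates
        counts = {c: sum(c <= t for t in transactions) / n for c in candidates}
        survivors = {c: s for c, s in counts.items() if s >= min_support}
        frequent.update(survivors)
        # Minimum-support pruning: only survivors generate the next level
        keys = list(survivors)
        candidates = list({a | b for a in keys for b in keys
                           if len(a | b) == len(a) + 1})
    return frequent

txns = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
for itemset, s in frequent_itemsets(txns, 0.5).items():
    print(sorted(itemset), f"{s:.0%}")   # {1} 75%, {2} 50%, {3} 50%, {1,3} 50%
```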
Other Association Rule Applications
Quantitative Association Rules
Age[35..40] and Married[Yes] -> NumCars[2]
Association Rules with Constraints
Find all association rules where the prices of items are > 100 dollars
Temporal Association Rules
Diaper -> Beer (1% support, 80% confidence)
Diaper -> Beer (20% support) 7:00-9:00 PM weekdays
Optimized Association Rules
Given a rule of the form (l ≤ A ≤ u) and X -> Y, find values for l and u such that the rule's support exceeds a given threshold and its support and confidence are maximized
Check Balance [$30,000 .. $50,000] -> Certificate of Deposit (CD) = Yes
Classification by Decision Tree Learning
A classic machine learning / data mining problem
Develop rules for when a transaction belongs to a
class based on its attribute values
Smaller decision trees are better
ID3 is one particular algorithm
A Database… (Training Dataset)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
31…40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31…40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31…40 Medium No Excellent Yes
31…40 High Yes Fair Yes
>40 Medium No Excellent No
Output: A Decision Tree
[Figure: decision tree with root test "Age?" and leaf labels No / Yes / Yes]
Algorithm: Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
Attributes are assumed categorical (continuous-valued attributes are discretized in advance)
Examples are partitioned recursively based on selected attributes
Different Possibilities for Partitioning Tuples Based on Splitting Criterion
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
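A minimal sketch (not part of the original slides) that computes Info(D) and Gain(A) on the buys_computer training data shown earlier; it reproduces the classic values Gain(age) = 0.246 and Gain(income) = 0.029, the latter of which reappears in the gain-ratio example below:

```python
from collections import Counter
from math import log2

# Training data from the slide: (age, income, student, credit_rating, buys_computer)
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """Entropy of a list of class labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at index attr."""
    total = info([r[-1] for r in rows])
    n = len(rows)
    split = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[-1] for r in rows if r[attr] == v]
        split += len(subset) / n * info(subset)
    return total - split

print(round(gain(rows, 0), 3))  # age    -> 0.246
print(round(gain(rows, 1), 3))  # income -> 0.029
```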
Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
We must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point
(a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
The point with the minimum expected information requirement for A is selected as the split point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
GainRatio(A) = Gain(A) / SplitInfo(A)
Ex. for income (4 high, 6 medium, 4 low out of 14 tuples):
SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 0.926
gain_ratio(income) = 0.029 / 0.926 = 0.031
The attribute with the maximum gain ratio is selected as the splitting attribute
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as
gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity:
\Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
Gini index (CART, IBM IntelligentMiner)
Among the candidate binary splits on income, gini_{medium,high} is 0.30 and thus the best since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g. clustering, to get the possible split values
Can be modified for categorical attributes
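A minimal sketch of both formulas on the same training data, keeping only the income attribute; the function and variable names are illustrative:

```python
from collections import Counter

# (income, buys_computer) pairs from the training data shown earlier
data = [("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
        ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
        ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
        ("high", "yes"), ("medium", "no")]

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(data, left_values):
    """Weighted Gini of the binary split: D1 has income in left_values,
    D2 has the remaining tuples."""
    d1 = [label for v, label in data if v in left_values]
    d2 = [label for v, label in data if v not in left_values]
    n = len(data)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini([label for _, label in data]), 3))  # 0.459 for 9 yes / 5 no
print(round(gini_split(data, {"medium", "high"})， 3) if False else
      round(gini_split(data, {"medium", "high"}), 3))  # one candidate split
```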
Comparing Attribute Selection Measures
Regression
Regression
Example
A person wishes to reach a certain level of savings before retirement
He wants to predict his future savings based on the current value and several past values
He uses simple linear regression, fitting past behavior to a linear function and then using this function to predict the value at any point in time
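A minimal ordinary-least-squares sketch of this idea, using hypothetical savings figures:

```python
def linear_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

years = [0, 1, 2, 3, 4]                    # past time points
savings = [10.0, 12.1, 13.9, 16.2, 18.0]   # hypothetical balances
a, b = linear_fit(years, savings)
print(a + b * 6)                           # extrapolated savings at year 6
```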
Time series analysis
The value of an attribute is examined as it varies over time
Evenly spaced time points: daily, weekly, hourly, etc.
Three basic functions in time series analysis
Distance measures are used to determine the similarity between time series
The structure of the line is examined to determine its behavior
Historical time series plots are used to predict future values
Application
Stock market analysis: whether or not to buy a stock
Clustering: Unsupervised Learning
Similar to classification, except that the groups are not predefined; instead they are defined by the data alone
The most similar data are grouped into the same clusters
Dissimilar data should be in different clusters
Clustering Examples
Clustering
Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product
Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster
Key requirement: a good measure of similarity between instances
Identify micro-markets and develop policies for each
Clustering Example
Clustering Houses
[Figure: the same houses clustered two ways: size based vs. geographic distance based]
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering
Clustering Problem
Types of Clustering
Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created.
Incremental – Each element handled one at a time.
Simultaneous – All elements handled together.
Overlapping/Non-overlapping
Agglomerative Example

Distance matrix:

   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0

[Figure: dendrogram over A, B, C, D, E with merges at distance thresholds 1 through 5]
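A minimal single-link sketch that replays these merges from the distance matrix above:

```python
# Pairwise distances from the slide's matrix (symmetric, so one direction kept)
dist = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
        ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
        ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def d(x, y):
    """Single-link distance between two clusters: closest pair of members."""
    return min(dist.get((a, b), dist.get((b, a))) for a in x for b in y)

clusters = [frozenset(label) for label in "ABCDE"]
while len(clusters) > 1:
    # Merge the closest pair of clusters at each step
    x, y = min(((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
               key=lambda pair: d(*pair))
    print(f"merge {sorted(x)} + {sorted(y)} at distance {d(x, y)}")
    clusters = [c for c in clusters if c not in (x, y)] + [x | y]
```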
Partitional methods: K-means
Criteria: minimize the sum of squared distances
between each point and the centroid of its cluster, or
between each pair of points in the cluster
Algorithm:
Select an initial partition with K clusters: random, the first K points, or K well-separated points
Repeat until stabilization:
Assign each point to the closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting
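A minimal sketch of this loop, with random initialization, squared Euclidean distance, and illustrative points:

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means, until assignments stabilize."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to each centroid
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*cluster)) if cluster
            else centroids[j]                  # keep an empty cluster in place
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:         # assignments stabilized
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5)]
centers, groups = kmeans(points, k=2)
print(centers)
```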
Collaborative Filtering
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
Clementine
IBM's Intelligent Miner
SGI's MineSet
SAS's Enterprise Miner
All offer pretty much the same set of tools
Many embedded products:
fraud detection
electronic commerce applications
health care
customer relationship management: Epiphany
Large-scale Endeavors
Products

Vendor coverage across clustering, classification, association, sequence, and deviation detection (partial matrix):
SAS: Decision Trees
SPSS
Oracle (Darwin): ANN
IBM: Time Series, Decision Trees
DBMiner (Simon Fraser)
Vertical integration: Mining on the web
Some success stories
Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
Won over a (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
Major US bank: customer attrition prediction
First segmented customers based on financial behavior: found 3 segments
Built attrition models for each of the 3 segments
40-50% of attritions were predicted, a factor of 18 increase
Targeted credit marketing: major US banks
find customer segments based on 13 months of credit balances
build a response model based on surveys
increased response 4 times, to 2%
Relationship with other fields
The Future
Database: the RDBMS and SQL are the two milestones in the evolution of database systems
Currently, data mining is little more than a set of tools that can be used to uncover previously hidden information in databases
Many tools are available, but there is no all-encompassing model or approach
The future is to create all-encompassing, better-integrated tools that require less human interaction and human interpretation
A major development could be the creation of a sophisticated query language that includes standard SQL and complex OLAP functions
DMQL is a step in that direction
Thank You