Intro Data Mining
Introduction
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
◦ Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
Mining Large Data Sets - Motivation
[Chart: data growth vs. number of analysts, 1995-1999. The amount of data collected (scale 0 to 3,500,000) grows far faster than the number of analysts available to examine it.]
What is Data Mining?
Many Definitions
◦ Non-trivial extraction of implicit,
previously unknown and potentially
useful information from data
◦ Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
[Diagram: DM sits at the convergence of a growing base of data, increased computing power, statistical and learning algorithms, and improved data collection and management.]
Applications
Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce & Decision Support
Web and text mining
Market Analysis and Management (1)
Where are the data sources for analysis?
◦ Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
Target marketing
◦ Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns over
time
◦ Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Market Analysis and Management (2)
Customer profiling
◦ data mining can tell you what types of customers buy
what products (using techniques such as clustering or
classification)
Identifying customer requirements
◦ identifying the best products for different customers
◦ use prediction to find what factors will attract new
customers
Provides summary information
◦ various multidimensional summary reports
◦ statistical summary information
Fraud Detection and Management (1)
Applications
◦ widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
◦ use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
◦ auto insurance: detect a group of people who stage
accidents to collect on insurance
◦ money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
◦ medical insurance: detect "professional" patients and
rings of doctors and rings of references
Fraud Detection and Management (2)
Detecting telephone fraud
◦ Telephone call model: destination of the call,
duration, time of day or week. Analyze
patterns that deviate from an expected norm.
◦ British Telecom identified discrete groups of
callers with frequent intra-group calls,
especially mobile phones, and broke a
multimillion-dollar fraud. (source: Gartner, 2006)
Retail
◦ Analysts estimate that 38% of retail shrink is
due to dishonest employees. (Business Today, 2006)
Other Applications
Sports
◦ IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain a
competitive advantage for the New York Knicks and
Miami Heat
Astronomy
◦ JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover
customer preferences and behavior, analyze the
effectiveness of Web marketing, improve Web site
organization, etc.
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Object-oriented databases
Time-series data
Text databases and multimedia databases
Heterogeneous and legacy databases
Data Mining Techniques
Association Rules & Sequential Patterns
Classification
Clustering
Similar Images
Text/Web Mining
Outlier analysis (using statistics)
Examples of Data Mining
Consider this complex, tricky query
A sales executive wishes to see all the sales for the past three
years where profitability has been greater than xx percent.
He wishes to see it by month. And where the percentages
have been greater than yy percent, he wants to see whether
the sales team has been in place during this period or
whether there has been personnel turnover; he is looking
for territorial versus personnel factors in sales success. He
also wishes to see trends in profitability, so where all sales
by year have steadily increased by at least zz percent two
years in a row, he wishes to see the top five products ranked
by profitability.
This query requires
Sums
Percentages
Grouping
Trends
Time-based analysis
Comparisons
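Such a request is awkward as a single SQL statement but straightforward as an analysis script. Below is a minimal pandas sketch under assumed names: the sales table, its columns (date, product, revenue, profit_pct), and the threshold values standing in for "xx" and "zz" are all hypothetical placeholders for the query above.

```python
import pandas as pd

# Hypothetical schema; column names and thresholds are illustrative only.
sales = pd.read_csv("sales.csv", parse_dates=["date"])  # date, product, revenue, profit_pct
xx, zz = 15.0, 5.0  # placeholder thresholds for the "xx" and "zz" percentages

# Last three years, profitability above xx percent, grouped by month.
recent = sales[sales["date"] >= sales["date"].max() - pd.DateOffset(years=3)]
profitable = recent[recent["profit_pct"] > xx]
by_month = profitable.groupby(profitable["date"].dt.to_period("M"))["revenue"].sum()

# Trend check: yearly sales up by at least zz percent two years in a row,
# then the top five products ranked by profitability.
yearly = recent.groupby(recent["date"].dt.year)["revenue"].sum()
growth = yearly.pct_change() * 100
if (growth.tail(2) >= zz).all():
    top5 = (recent.groupby("product")["profit_pct"].mean()
                  .sort_values(ascending=False).head(5))
    print(top5)
```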
Data mining - Users
Executives - need top-level insights and
spend far less time with computers than
the other groups.
Analysts may be financial analysts,
statisticians, consultants, or database
designers.
End users are sales people, scientists,
market researchers, engineers, physicians,
etc.
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
◦ Clementine,
◦ IBM’s Intelligent Miner,
◦ SGI’s MineSet,
◦ SAS’s Enterprise Miner.
All offer pretty much the same set of tools
Many embedded products:
◦ fraud detection
◦ electronic commerce applications
◦ health care
◦ customer relationship management: Epiphany
Vertical integration
Mining on the web
Web log analysis for site design:
◦ what are popular pages,
◦ what links are hard to find.
Electronic stores sales enhancements:
◦ recommendations, advertisement:
◦ Collaborative filtering: Net perception,
Wisewire
◦ Inventory control: what was a shopper looking
for and could not find?
Data Mining
Through a variety of techniques, data mining identifies nuggets of
information in bodies of data.
Data mining extracts information in such a way that it can be used in
areas such as decision support, prediction, forecasts, and
estimation. Data is often voluminous but of low value and with little
direct usefulness in its raw form. It is the hidden information in the
data that has value.
In data mining, success comes from combining your (or your expert’s)
knowledge of the data with advanced, active analysis techniques in
which the computer identifies the underlying relationships and
features in the data.
The process of data mining generates models from historical data that
are later used for predictions, pattern detection, and more. The
technique for building these models is called machine learning, or
modeling.
Data Mining Tasks
Prediction Methods
◦ Use some variables to predict unknown or future
values of other variables.
Description Methods
◦ Find human-interpretable patterns that describe
the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Modeling Techniques
Predictive modeling methods include decision trees, neural
networks, and statistical models.
Clustering models focus on identifying groups of similar records
and labeling the records according to the group to which they
belong. Clustering methods include Kohonen, k-means, and
TwoStep.
Association rules associate a particular conclusion (such as the
purchase of a particular product) with a set of conditions (the
purchase of several other products).
Screening models can be used to screen data to locate fields
and records that are most likely to be of interest in modeling
and identify outliers that may not fit known patterns.
Available methods include feature selection and anomaly
detection.
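As a concrete illustration of the clustering family mentioned above, here is a minimal k-means sketch using scikit-learn; the two-feature customer array is invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented toy data: each row is a customer (annual spend, visits per month).
customers = np.array([[120, 2], [150, 3], [900, 10],
                      [950, 12], [400, 5], [430, 6]])

# Group customers into 3 segments and label each record by its segment,
# as the clustering models described above do.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment label for each customer
print(kmeans.cluster_centers_)  # the "profile" of each segment
```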
Typical Applications
Typical applications include the following:
Direct mail. Determine which demographic groups have
the highest response rate. Use this information to
maximize the response to future mailings.
Credit scoring. Use an individual’s credit history to make
credit decisions.
Human resources. Understand past hiring practices and
create decision rules to streamline the hiring process.
Medical research. Create decision rules that suggest
appropriate procedures based on medical evidence.
The Data Mining Process
Data preparation. After cataloging your data
resources, you will need to prepare your data for
mining. Preparation includes selecting, cleaning,
constructing, integrating, and formatting data.
Modeling.
Sophisticated analysis methods are used to
extract information from the data. This phase
involves selecting modeling techniques,
generating test designs, and building and
assessing models.
Evaluation.
Once you have chosen your models, you are
ready to evaluate how the data mining results can
help you to achieve your business objectives.
Elements of this phase include evaluating results,
reviewing the data mining process, and determining
the next steps.
Deployment.
This phase focuses on integrating your new
knowledge into your everyday business processes to
solve your original business problem.
It includes plan deployment, monitoring and
maintenance, producing a final report, and
reviewing the project.
Data Mining
Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Examples of rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Itemset
◦ A collection of one or more items
◦ k-itemset: an itemset that contains k items
Support count (σ)
◦ Frequency of occurrence of an itemset
◦ E.g. σ({Milk, Bread, Diaper}) = 2
Support
◦ Fraction of transactions that
contain an itemset
◦ E.g. s({Milk, Bread, Diaper}) =
2/5
Frequent Itemset
◦ An itemset whose support is
greater than or equal to a
minsup threshold
Definition: Association Rule
Association Rule
◦ An implication expression of the form X → Y, where X and Y are itemsets
◦ Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
◦ Support (s): fraction of transactions that contain both X and Y
◦ Confidence (c): how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
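These two formulas are easy to check in code. A minimal sketch, using the five-transaction table above and only the standard library (function names are mine):

```python
# Transactions from the example table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)        # support of the rule
c = sigma(X | Y, transactions) / sigma(X, transactions)   # confidence
print(f"s = {s:.1f}, c = {c:.2f}")  # s = 0.4, c = 0.67
```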
Association Rule Mining Task
Given a set of transactions T, the goal of
association rule mining is to find all rules
having
◦ support ≥ minsup threshold
◦ confidence ≥ minconf threshold
Brute-force approach:
◦ List all possible association rules
◦ Compute the support and confidence for each
rule
◦ Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive! (For d items there are 3^d − 2^(d+1) + 1 possible rules; even d = 6 gives 602.)
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
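The six rules above can be generated mechanically by enumerating every binary partition of the frequent itemset {Milk, Diaper, Beer}. A small sketch over the same five transactions (helper names are mine):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    return sum(1 for t in db if itemset <= t)

itemset = {"Milk", "Diaper", "Beer"}
n = len(transactions)
# Every non-empty proper subset X yields one rule X -> itemset \ X.
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        X = set(lhs)
        Y = itemset - X
        s = sigma(itemset, transactions) / n                   # same for all six rules
        c = sigma(itemset, transactions) / sigma(X, transactions)
        print(f"{X} -> {Y} (s={s:.1f}, c={c:.2f})")
```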
Mining Association Rules
Two-step approach:
1.Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2.Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
Brute-force approach:
◦ Each itemset in the lattice is a candidate frequent
itemset
◦ Count the support of each candidate by scanning the
database
◦ Transactions:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
◦ Match each transaction against every candidate
◦ Complexity ~ O(NMw): N = number of transactions, M = number of candidate itemsets, w = maximum transaction width. Expensive, since M = 2^d − 1 for d items!
A Naïve Algorithm
Transaction ID Items
100 Bread, Cheese
200 Bread, Cheese, Juice
300 Bread, Milk
400 Cheese, Juice, Milk
◦ Let k=1
◦ Generate frequent itemsets of length 1
◦ Repeat until no new frequent itemsets are
identified
Generate length (k+1) candidate itemsets from
length k frequent itemsets
Prune candidate itemsets containing subsets of
length k that are infrequent
Count the support of each candidate by
scanning the DB
Eliminate candidates that are infrequent,
leaving only those that are frequent
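These steps are the level-wise strategy behind Apriori. A compact, runnable sketch in plain Python (function and variable names are mine), applied to the four-transaction table above; with a minimum support count of 2 it yields the four frequent items plus {Bread, Cheese} and {Cheese, Juice}:

```python
from itertools import combinations

def apriori_frequent_itemsets(db, minsup_count):
    """Level-wise generation of all frequent itemsets (Apriori)."""
    db = [frozenset(t) for t in db]
    items = sorted({i for t in db for i in t})
    # L1: frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in db) >= minsup_count}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = sum(1 for t in db if c <= t)
        # Generate (k+1)-candidates by joining frequent k-itemsets...
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # ...prune candidates that have an infrequent k-subset...
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # ...and keep only candidates meeting minsup after a DB scan.
        current = {c for c in candidates
                   if sum(1 for t in db if c <= t) >= minsup_count}
        k += 1
    return frequent

db = [{"Bread", "Cheese"},
      {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"},
      {"Cheese", "Juice", "Milk"}]
result = apriori_frequent_itemsets(db, 2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), tuple(sorted(kv[0])))):
    print(sorted(itemset), count)
```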
Reducing Number of Comparisons
Candidate counting:
◦ Scan the database of transactions to determine
the support of each candidate itemset
◦ To reduce the number of comparisons, store the
candidates in a hash structure
Instead of matching each transaction against
every candidate, match it against candidates
contained in the hashed buckets
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
[Diagram: the N transactions are matched only against the candidate k-itemsets stored in the hashed buckets.]
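A simplified sketch of the bucket idea, using a plain Python dict as the hash structure (a real implementation uses a hash tree; the candidate set and bucket count here are illustrative):

```python
from itertools import combinations
from collections import defaultdict

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
candidates = [frozenset(c) for c in
              [{"Bread", "Milk"}, {"Milk", "Beer"}, {"Diaper", "Beer"}]]
k = 2
NUM_BUCKETS = 7  # illustrative bucket count

# Hash each candidate into a bucket instead of keeping one flat list.
buckets = defaultdict(list)
for c in candidates:
    buckets[hash(c) % NUM_BUCKETS].append(c)

# Each transaction is compared only against candidates in the buckets
# hit by its own k-subsets, not against every candidate.
support = defaultdict(int)
for t in transactions:
    for subset in combinations(sorted(t), k):
        fs = frozenset(subset)
        for c in buckets[hash(fs) % NUM_BUCKETS]:
            if c == fs:
                support[c] += 1

print(dict(support))  # {Bread, Milk}: 3, {Milk, Beer}: 2, {Diaper, Beer}: 3
```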
APRIORI
Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Cheese, Juice, Milk
500 Cheese, Juice, Milk
Minimum support: 50% (an itemset must appear in at least 3 of the 5 transactions)
Frequent items L1
Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Candidate item pairs C2
Itemsets Frequency
(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2
Frequent itemsets L2
Itemset Frequency
Bread, Juice 3
Cheese, Juice 3
The APRIORI Algorithm
Bread → Juice with confidence of 3/4 = 75%
Juice → Bread with confidence of 3/4 = 75%
Cheese → Juice with confidence of 3/3 = 100%
Juice → Cheese with confidence of 3/4 = 75%
A larger APRIORI Example
Item Number Item Name
1 Biscuits
2 Bread
3 Cereal
4 Cheese
5 Chocolate
6 Coffee
7 Donuts
8 Eggs
9 Juice
10 Milk
11 Newspaper
12 Pastry
13 Rolls
14 Sugar
15 Tea
16 Yogurt
TID Items
1 Biscuits, Bread, Cheese, Coffee, Yogurt
2 Bread, Cereal, Cheese, Coffee
3 Cheese, Chocolate, Donuts, Juice, Milk
4 Bread, Cheese, Coffee, Cereal, Juice
5 Bread, Cereal, Chocolate, Donuts, Juice
6 Milk, Tea
7 Biscuits, Bread, Cheese, Coffee, Milk
8 Eggs, Milk, Tea
9 Bread, Cereal, Cheese, Chocolate, Coffee
10 Bread, Cereal, Chocolate, Donuts, Juice
11 Bread, Cheese, Juice
12 Bread, Cheese, Coffee, Donuts, Juice
13 Biscuits, Bread, Cereal
14 Cereal, Cheese, Chocolate, Donuts, Juice
15 Chocolate, Coffee
16 Donuts
17 Donuts, Eggs, Juice
18 Biscuits, Bread, Cheese, Coffee
19 Bread, Cereal, Chocolate, Donuts, Juice
20 Cheese, Chocolate, Donuts, Juice
21 Milk, Tea, Yogurt
22 Bread, Cereal, Cheese, Coffee
23 Chocolate, Donuts, Juice, Milk, Newspaper
24 Newspaper, Pastry, Rolls
25 Rolls, Sugar, Tea
Minimum support: 25% (an itemset must appear in at least 7 of the 25 transactions)
Frequency counts for candidate 2-itemsets (partial list)
{Cheese, Coffee} 9
{Cheese, Donuts} 3
{Cheese, Juice} 4
{Chocolate, Coffee} 1
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Coffee, Donuts} 1
{Coffee, Juice} 2
{Donuts, Juice} 9
The frequent 2-itemsets or L2
{Bread, Cereal} 9
{Bread, Cheese} 8
{Bread, Coffee} 8
{Cheese, Coffee} 9
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Donuts, Juice} 9
Candidate 3-itemsets or C3
{Bread, Cereal, Cheese} 4
{Bread, Cereal, Coffee} 4
{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7
Frequent 3-itemsets or L3
{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7
Confidence of association rules from {Chocolate, Donuts, Juice}
Rule                           Support of rule   Frequency of LHS   Confidence
Chocolate → Donuts, Juice      7                 9                  0.78
Donuts → Chocolate, Juice      7                 10                 0.70
Juice → Chocolate, Donuts      7                 11                 0.64
Donuts, Juice → Chocolate      7                 9                  0.78
Chocolate, Juice → Donuts      7                 7                  1.0
Chocolate, Donuts → Juice      7                 7                  1.0
Confidence of association rules from {Bread, Cheese, Coffee}
Rule                           Support of rule   Frequency of LHS   Confidence
Bread → Cheese, Coffee         8                 13                 0.61
Cheese → Bread, Coffee         8                 11                 0.72
Coffee → Bread, Cheese         8                 9                  0.89
Cheese, Coffee → Bread         8                 9                  0.89
Bread, Coffee → Cheese         8                 8                  1.0
Bread, Cheese → Coffee         8                 8                  1.0
All association rules
Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread
Closed and Maximal Itemsets
Closed itemset – a frequent itemset X such
that there exists no superset of X with the
same support count as X.
Maximal itemset – a frequent itemset Y is
maximal if it is not a proper subset of any
other frequent itemset.
A maximal itemset is a closed itemset, but a
closed itemset is not necessarily a
maximal itemset.
Maximal vs Closed Itemsets
[Venn diagram: maximal frequent itemsets ⊂ closed frequent itemsets ⊂ all frequent itemsets.]
A transaction database to illustrate closed and maximal itemsets
Transaction ID Items
100 Bread, Cheese, Juice
200 Bread, Cheese, Juice, Milk
300 Cheese, Juice, Egg
400 Bread, Juice, Milk, Egg
500 Milk, Egg
Frequent itemsets for the database
Itemset Support Closed? Maximal? Both?
{Bread} 3 No No No
{Cheese} 3 No No No
{Juice} 4 Yes No No
{Milk} 3 Yes No No
{Egg} 3 Yes No No
{Bread, Cheese} 2 No No No
{Bread, Juice} 3 Yes No No
{Bread, Milk} 2 No No No
{Cheese, Juice} 3 Yes No No
{Juice, Milk} 2 No No No
{Juice, Egg} 2 Yes Yes Yes
{Milk, Egg} 2 Yes Yes Yes
{Bread, Cheese, Juice} 2 Yes Yes Yes
{Bread, Juice, Milk} 2 Yes Yes Yes
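The Closed? and Maximal? columns can be reproduced directly from the definitions. A small sketch over the five-transaction database above (a minimum support count of 2 is implied by the table; names are mine):

```python
from itertools import combinations

db = [{"Bread", "Cheese", "Juice"},
      {"Bread", "Cheese", "Juice", "Milk"},
      {"Cheese", "Juice", "Egg"},
      {"Bread", "Juice", "Milk", "Egg"},
      {"Milk", "Egg"}]
MINSUP = 2

def support(itemset):
    return sum(1 for t in db if itemset <= t)

# Enumerate all frequent itemsets.
items = sorted({i for t in db for i in t})
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= MINSUP}

for x, sup in sorted(frequent.items(),
                     key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [y for y in frequent if x < y]
    closed = all(frequent[y] < sup for y in supersets)  # no superset with equal support
    maximal = not supersets                             # no frequent superset at all
    print(sorted(x), sup,
          "closed" if closed else "-",
          "maximal" if maximal else "-")
```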
Association models
associate a particular conclusion
(such as a decision to buy something)
with a set of conditions.
The Generalized Rule Induction (GRI)
node discovers association rules in
the data. For example, customers
who purchase razors and aftershave
lotion are also likely to purchase
shaving cream. GRI extracts rules
with the highest information content
based on an index that takes both
the generality (support) and
accuracy (confidence) of rules into
account. GRI can handle numeric
and categorical inputs, but the target
must be categorical.
Association models
The Apriori node extracts a set of
rules from the data, pulling out the
rules with the highest information
content.
Apriori offers five different methods of
selecting rules and uses a
sophisticated indexing scheme to
process large datasets efficiently.
For large problems, Apriori is
generally faster to train than GRI; it
has no arbitrary limit on the
number of rules that can be
retained, and it can handle rules
with up to 32 preconditions.
Apriori requires that input and output
fields all be categorical but delivers
better performance because it is
optimized for this type of data.
At the end of the processing, a table of the best rules is presented. This set of
association rules cannot be used directly to make predictions in the way that a
standard model (such as a decision tree or a neural network) can, because of the
many different possible conclusions for the rules. Another level of transformation
is required to turn the association rules into a classification ruleset; hence, the
rules produced by association algorithms are known as unrefined models. Although
the user can browse these unrefined models, they cannot be used explicitly as
classification models unless the user tells the system to generate a classification
model from the unrefined model. This is done from the browser through a Generate
menu option.
Association Rule Mining
X and Y appear in only 10% of the transactions,
but whenever X appears there is an 80% chance
that Y also appears.
The 10% presence is called support (or prevalence).
The 80% chance is called confidence (or predictability).
High level of support – the rule is frequent
enough for the business to be interested in
it.
High level of confidence – the rule is true
often enough to justify a decision based on
it.
Association Rule Mining
Total number of transactions = N
Support(X) = (number of times X appears) / N
= P(X)
Support(XY) = (number of times X and Y appear together) / N
= P(X∩Y)
Confidence of (X → Y)
= Support(XY) / Support(X)
= P(X∩Y) / P(X)
= P(Y∣X)
P(Y∣X) is the probability of Y once X has taken place,
also called the conditional probability of Y given X.
Association Rule Mining
Lift is used to measure the strength of the association
between items that are purchased together:
how much more likely item Y is to be purchased when the
customer has bought item X (an item identified as having
an association with Y), compared to the likelihood of Y
being purchased without the other item being purchased.
Lift(X → Y) = P(Y∣X) / P(Y) = Confidence(X → Y) / Support(Y)
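Putting the three measures together, a minimal sketch on the earlier five-transaction example (standard library only; names are mine):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def p(itemset):
    """Empirical probability = support as a fraction of transactions."""
    return sum(1 for t in transactions if itemset <= t) / N

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = p(X | Y)              # P(X ∩ Y) = 0.4
confidence = p(X | Y) / p(X)    # P(Y | X) = 0.67
lift = confidence / p(Y)        # P(Y | X) / P(Y) = 1.11
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# lift > 1 means buying X makes Y more likely than its baseline frequency.
```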