Frequent Patterns and Association Rule Mining: Outline
Outline
Frequent Itemsets and Association Rules
APRIORI
Post-processing
Applications
Transactional Data
Definitions:
An item: an article in a basket, or an attribute-value pair
A transaction: the set of items purchased in a basket; it may carry a
TID (transaction ID)
A transactional dataset: a set of transactions
Association Rule
Example
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs

[Venn diagram: circles "Customer buys beer" and "Customer buys diaper", overlap labeled "Customer buys both"]
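The support and confidence of a rule on this toy dataset can be computed directly; a minimal sketch (the transaction sets mirror the table above):

```python
# Support and confidence for the rule {Beer} => {Diaper} on the
# three-transaction example above.
transactions = [
    {"Beer", "Nuts", "Diaper"},    # Tid 10
    {"Beer", "Coffee", "Diaper"},  # Tid 20
    {"Beer", "Diaper", "Eggs"},    # Tid 30
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

sup_rule = support({"Beer", "Diaper"}, transactions)    # support(X U Y)
conf_rule = sup_rule / support({"Beer"}, transactions)  # sup(X U Y)/sup(X)
print(f"support = {sup_rule:.0%}, confidence = {conf_rule:.0%}")
# -> support = 100%, confidence = 100%
```

All three transactions contain both Beer and Diaper, so the rule holds with 100% support and confidence here.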
Use of Association Rules
Association rules do not necessarily represent causality
or correlation between the two itemsets.
X ⇒ Y does not mean X causes Y (no causality)
X ⇒ Y can be different from Y ⇒ X, unlike correlation
Association rules assist in basket data analysis, cross-marketing,
catalog design, sale campaign analysis, Web log (click stream)
analysis, and DNA sequence analysis.
What products were often purchased together?— Beer and
diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Computational Complexity
of Frequent Itemset Mining
How many itemsets can potentially be generated in the worst case?
The number of frequent itemsets generated is sensitive to the minsup
threshold
When minsup is low, there can be an exponential number of
frequent itemsets
The worst case is close to M^N, where M = # distinct items and N = max
length of transactions, when M is large.
The worst case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance for this particular set of 10 products to be frequent
10^3 times in 10^9 transactions?
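An expected-count calculation makes the answer concrete; a small sketch of the arithmetic:

```python
# Expected occurrences of one particular 10-item set in 10^9
# transactions, assuming each item appears independently with
# probability 10^-4 (the simplifying assumption of the slide).
p_item = 1e-4
p_set = p_item ** 10           # ~1e-40: chance one transaction has the set
n_transactions = 1e9
expected = p_set * n_transactions
print(expected)                # ~1e-31: seeing the set even once, let
                               # alone 10^3 times, is essentially impossible
```

So the worst-case exponential blow-up is far from the expected behavior on real sparse data.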
Outline
Frequent Itemsets and Association Rules
APRIORI
Post-processing
Applications
The Downward Closure Property
Any subset of a frequent itemset must also be frequent; equivalently,
if an itemset is infrequent, every superset of it is infrequent.

[Itemset lattice over items A, B, C, D: pairs AB, AC, AD, BC, BD, CD
above the singletons A, B, C, D]
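The property can be checked mechanically on a toy dataset; a minimal sketch (the dataset and the absolute minsup threshold are assumptions for illustration):

```python
from itertools import combinations

# Downward closure: every subset of a frequent itemset is frequent.
db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
minsup = 2  # absolute support threshold (assumed)

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db)

frequent = {"A", "B"}          # appears in 2 transactions, meets minsup
assert count(frequent) >= minsup
# Every non-empty proper subset must also meet minsup:
for k in range(1, len(frequent)):
    for sub in combinations(frequent, k):
        assert count(set(sub)) >= minsup
print("downward closure holds for", frequent)
```

Apriori exploits the contrapositive: once a k-itemset is infrequent, no superset of it needs counting.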
APRIORI
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from
length-k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can
be generated
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={{a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d}}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
abcd is kept since abc, abd, acd, and bcd are in L3
acde is removed because cde and ade are not in L3
C4 = {abcd}
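The two steps above can be sketched directly, keeping itemsets as sorted tuples (a minimal sketch, not an optimized implementation):

```python
from itertools import combinations

# Self-join + prune: generating C4 from the L3 of the example above.
L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}

def gen_candidates(Lk):
    k = len(next(iter(Lk)))
    # Step 1: self-join -- merge pairs that share their first k-1 items
    joined = {p[:k-1] + (min(p[-1], q[-1]), max(p[-1], q[-1]))
              for p in Lk for q in Lk
              if p[:k-1] == q[:k-1] and p[-1] != q[-1]}
    # Step 2: prune -- drop candidates having an infrequent k-subset
    return {c for c in joined
            if all(sub in Lk for sub in combinations(c, k))}

print(gen_candidates(L3))  # {('a', 'b', 'c', 'd')}
```

The join produces abcd and acde; pruning removes acde because its subset ade (and cde) is not in L3, matching the example.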
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
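The pseudo-code translates to a short runnable sketch; itemsets are frozensets and minsup is an absolute count (both representation choices are assumptions, and the join here is a simplified union-based variant pruned by downward closure rather than the prefix-based self-join):

```python
from collections import Counter
from itertools import combinations

def apriori(db, minsup):
    """Return all itemsets with absolute support >= minsup."""
    # L1: one scan of the database for frequent 1-itemsets
    counts = Counter(item for t in db for item in t)
    Lk = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    all_frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # C_{k+1}: join L_k with itself, then prune by downward closure
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count each candidate contained in each transaction
        cnt = Counter()
        for t in db:
            for c in Ck1:
                if c <= t:
                    cnt[c] += 1
        Lk = {c for c in Ck1 if cnt[c] >= minsup}
        all_frequent |= Lk
    return all_frequent

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}]
print(apriori(db, minsup=3))  # {Beer}, {Diaper}, {Beer, Diaper}
```

On the beer-and-diapers transactions with minsup = 3, the loop terminates after k = 2 because no 3-itemset candidate survives the join.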
Partition: Scan Database Only Twice
Partition the database into n pieces that each fit in memory. Any
itemset frequent in the whole database must be frequent in at least
one partition, so scan 1 mines local frequent itemsets per partition,
and scan 2 counts the union of local candidates globally.
Transaction Reduction
A transaction that contains no frequent k-itemset cannot contain any
frequent (k+1)-itemset, so it can be removed from subsequent scans.
Outline
Frequent Itemsets and Association Rules
APRIORI
Post-processing
Applications
Example – deriving rules from frequent itemsets
For each frequent itemset L, every split into A and L − A yields a
candidate rule A ⇒ L − A, kept when confidence = support(L)/support(A)
meets minconf.
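Rule derivation from a frequent itemset can be sketched as below; the support counts match the faceplate example later in this deck (red: 6, white: 7, {red, white}: 4), and the minconf threshold is assumed:

```python
from itertools import combinations

# Generate rules A => L-A from a frequent itemset L, keeping those
# whose confidence sup(L)/sup(A) meets minconf.
supports = {
    frozenset({"red"}): 6, frozenset({"white"}): 7,
    frozenset({"red", "white"}): 4,
}

def rules_from(L, minconf):
    out = []
    for k in range(1, len(L)):                       # proper subsets only
        for A in map(frozenset, combinations(L, k)):
            conf = supports[L] / supports[A]
            if conf >= minconf:
                out.append((set(A), set(L - A), conf))
    return out

for A, B, conf in rules_from(frozenset({"red", "white"}), minconf=0.5):
    print(f"{A} => {B}  (conf = {conf:.2f})")
```

With minconf = 0.5 both directions survive: red ⇒ white at 4/6 ≈ 0.67 and white ⇒ red at 4/7 ≈ 0.57.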
Mining Multiple-Level Association Rules
Example
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor”
relationships between items
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second
rule
A rule is redundant if its support is close to the
"expected" value based on the rule's ancestor: e.g., if 2% milk is
about a quarter of all milk sold, the second rule's expected support
is ~2%, so it adds nothing beyond the first.
Mining Quantitative Associations
Categorical Attributes and Quantitative
Attributes
Techniques can be categorized by how
numerical attributes, such as age or salary, are
treated:
Static discretization based on predefined concept
hierarchies
Dynamic discretization based on the data distribution
Clustering: distance-based association
(one-dimensional clustering, then association)
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Interestingness Measure: Correlations (Lift)

             Basketball  Not basketball  Sum (row)
Cereal          2000         1750          3750
Not cereal      1000          250          1250
Sum (col.)      3000         2000          5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
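The lift computation from the contingency table is a one-liner; a minimal sketch using the counts above:

```python
# lift(X, Y) = P(X and Y) / (P(X) * P(Y)), from the 2x2 table above.
N = 5000  # total transactions

def lift(n_xy, n_x, n_y):
    """Lift from joint count n_xy and marginal counts n_x, n_y."""
    return (n_xy / N) / ((n_x / N) * (n_y / N))

print(round(lift(2000, 3000, 3750), 2))  # 0.89: basketball & cereal
print(round(lift(1000, 3000, 1250), 2))  # 1.33: basketball & not cereal
```

Lift below 1 (0.89) means basketball and cereal are negatively correlated despite the rule's high raw support; lift above 1 (1.33) indicates positive correlation.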
Outline
Frequent Itemsets and Association Rules
APRIORI
Post-processing
Applications
Transactions for Purchase of Different-Colored Cellular Phone Faceplates
Transaction Faceplate Colors Purchased
1 Red, white, green
2 White, orange
3 White, blue
4 Red, white, orange
5 Red, blue
6 White, blue
7 White, orange
8 Red, white, blue, green
9 Red, white, blue
10 Yellow
Item Sets with Support Count of At Least Two
(20%)
Item Set Support (Count)
{red} 6
{white} 7
{blue} 6
{orange} 2
{green} 2
{red, white} 4
{red, blue} 4
{red, green} 2
{white, blue} 4
{white, orange} 2
{white, green} 2
Final Results for Phone Faceplate Transactions
[Table of discovered rules: Rule #, Conf.%, X ⇒ Y, Supp.(X), Supp.(Y), Supp.(X∪Y), Lift]
The support of a rule indicates its impact in terms of overall size: what
proportion of transactions is affected?
The confidence indicates the rate at which Y is found when X is present,
and is useful in judging the business or operational usefulness of a rule.
The lift ratio indicates how efficient the rule is in finding Y, compared to
random selection.
The more records a rule is based on, the more solid the conclusion, since
the key evaluative statistics are ratios and proportions.
Yield Map
http://www.cs.washington.edu/research/metip/about/digital.html
Color Image: Data
Preprocessing of Data
Data Discretization: [0,63] [64,127] [128,191] [192,255]
intervals.
[0,31] [32,63] [64,95] [96,127] [128,159] [160,191] [192,225] [226,255]
Pixel  Band1 (Red)  Band2 (Green)  Band3 (Blue)  Band4 (Gray Scale)  Yield
1      B11(40)      B25(140)       B37(200)      B44(240)
2      B11(50)      B25(130)       B37(210)      B44(250)
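Static discretization of a band value into one of the eight equal-width intervals can be sketched as below; the reading of a label like "B25" as band 2, bin 5 is an assumption (not every entry in the table above fits this scheme):

```python
# Map a 0-255 pixel band value to one of eight equal-width bins,
# labeled "B<band><bin>" (label scheme assumed, per the lead-in).
def discretize(band, value):
    assert 0 <= value <= 255
    b = value // 32 + 1            # bin index 1..8 over 32-wide intervals
    return f"B{band}{b}"

print(discretize(2, 140))  # B25: 140 falls in [128,159]
print(discretize(3, 200))  # B37: 200 falls in [192,223]
```

Each discretized pixel then becomes a transaction of interval items, to which Apriori applies unchanged.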
Evaluate Rules