Lect 6
Association Rule
— Chapter 6 —
Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern
Analysis?
Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that
occurs/appears frequently in a data set
First proposed by Agrawal, Imielinski, and Swami
[AIS93] in the context of frequent itemsets and
association rule mining
What Is Frequent Pattern
Analysis?
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer
and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design,
sale campaign analysis, Web log (click stream) analysis,
and DNA sequence analysis.
Why Is Freq. Pattern Mining
Important?
Freq. pattern: an intrinsic and important property
of data sets; frequent patterns are patterns that
appear frequently in a data set.
For example, a set of items, such as milk and
bread, that appears frequently together in a
transaction data set is a frequent itemset.
Association Rule Mining
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Basic Concepts: Frequent
Patterns
Transaction data: supermarket
data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
Support
(absolute) support, or support count, of X, σ(X):
  frequency (number of occurrences) of an itemset X
  E.g. σ({Milk, Bread, Diaper}) = 2
(relative) support, s: the fraction of transactions that contain X
  (i.e., the probability that a transaction contains X)
  E.g. s({Milk, Bread, Diaper}) = 2/5

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Frequent Itemset
An itemset X is frequent if X's support is no less than a minsup threshold.
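As a concrete illustration, here is a minimal Python sketch of these two definitions on the table above (the function and variable names are illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))  # 2
print(support(X, transactions))        # 0.4  (= 2/5)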
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!):
  Beer → Diaper (60%, 100%)
Definition: Association Rule
An association rule is an implication of the form X → Y, where X and Y are disjoint itemsets; its strength is measured by its support and confidence, as defined above.
Association Rule Mining: Brute-force Approach
List all possible association rules
Compute the support and confidence for
each rule
Prune rules that fail the minsup and
minconf thresholds
Computationally prohibitive!
Brute Force approach to
Frequent Itemset Generation
For an itemset with 3 elements, we have 8 subsets.
Each subset is a candidate frequent itemset which
needs to be matched against each transaction.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

1-itemsets:
Itemset    Count
{Milk}     4
{Diaper}   4
{Beer}     3

2-itemsets:
Itemset          Count
{Milk, Diaper}   3
{Diaper, Beer}   3
{Beer, Milk}     2

3-itemsets:
Itemset                Count
{Milk, Diaper, Beer}   2

Important Observation:
Counts of subsets can't be smaller than the count of an itemset!
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent:
  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the
support of its subsets.
This is known as the anti-monotone property
of support.
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Two sub-problems:
Find all itemsets that have transaction support above the minimum support (the large, or frequent, itemsets).
Use the large itemsets to generate rules with at least the minimum confidence (next slide).
Second Sub-problem
Straightforward approach:
For every large itemset l, find all non-empty
subsets of l.
For every such subset a, output a rule of the form
a → (l - a) if its confidence is at least minconf.
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2)
+ … + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns
instead.
An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X.
An itemset X is a max-pattern if X is frequent and
there exists no frequent super-pattern Y ⊃ X.
Closed pattern is a lossless compression of freq.
patterns.
Reducing the # of patterns and rules.
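A tiny Python sketch of the two definitions, assuming the supports of all frequent itemsets are already available in a dict keyed by frozenset (since a super-pattern with the same support as a frequent X is itself frequent, checking only frequent supersets suffices); the function name is illustrative:

def closed_and_max_patterns(supports):
    """supports: dict mapping each frequent itemset (a frozenset) to its support count."""
    closed, maximal = set(), set()
    for X, s in supports.items():
        frequent_supersets = [Y for Y in supports if X < Y]   # proper supersets that are frequent
        if all(supports[Y] != s for Y in frequent_supersets):
            closed.add(X)      # closed: no super-pattern with the same support
        if not frequent_supersets:
            maximal.add(X)     # max-pattern: no frequent super-pattern at all
    return closed, maximal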
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
Min_sup = 1.
What is the set of closed itemset?
<a1, …, a100>: 1
< a1, …, a50>: 2
What is the set of max-pattern?
<a1, …, a100>: 1
What is the set of all patterns?
All 2^100 - 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
Computational Complexity of Frequent
Itemset Mining
How many itemsets may potentially be generated in the worst
case?
The number of frequent itemsets to be generated is sensitive to
the minsup threshold.
When minsup is low, there exist potentially an exponential
number of frequent itemsets.
The worst case: M^N, where M = # distinct items and N = max length
of transactions.
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance this particular set of 10 products is
frequent 10^3 times in 10^9 transactions?
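To make the point explicit, here is a rough back-of-the-envelope check of the Walmart example (a sketch; it assumes independence, as the slide's numbers implicitly do):

p_one_product = 1e-4              # chance a transaction contains one given product
p_set_of_10 = p_one_product**10   # ~1e-40: chance it contains a particular set of 10 products
expected_count = 1e9 * p_set_of_10
print(expected_count)             # ~1e-31 expected occurrences in 10^9 transactions,
                                  # so reaching a support count of 10^3 is essentially impossible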
The Downward Closure Property and
Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be
frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods, three major approaches:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FPGrowth, Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm, Zaki & Hsiao @SDM'02)
Scalable Frequent Itemset Mining
Methods
Apriori: A Candidate Generation & Test
Approach
Two steps:
Join large itemsets Lk-1 with Lk-1.
Prune out all itemsets in the joined result which
contain a (k-1)-subset not found in Lk-1.
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
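A minimal Python sketch of this candidate-generation step (self-join, then prune) on the L3 from this slide; representing itemsets as sorted tuples and the name apriori_gen are illustrative assumptions:

from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Self-join L_{k-1} with itself, then prune candidates having an infrequent (k-1)-subset."""
    prev = set(Lk_minus_1)                 # frequent (k-1)-itemsets as sorted tuples
    k = len(Lk_minus_1[0]) + 1
    candidates = set()
    # Step 1: self-join, merging pairs that agree on the first k-2 items
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a[:-1] + (a[-1], b[-1]))
    # Step 2: prune candidates that have a (k-1)-subset not in L_{k-1}
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}: acde is pruned because ade is not in L3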
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent):  {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2 (frequent):  {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (candidate): {B, C, E}
3rd scan → L3:  {B, C, E}:2
The Apriori Algorithm (Pseudo-Code)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
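A compact Python rendering of the pseudo-code above; this is a sketch (itemsets as frozensets, candidate generation by join-and-prune as on the earlier slide), not the exact implementation of any particular library:

from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise frequent itemset mining, following the pseudo-code above."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {c for c in items
          if sum(1 for t in transactions if c <= t) >= min_support_count}
    frequent = set(Lk)
    k = 1
    while Lk:
        # C_{k+1}: join L_k with L_k, then prune candidates with an infrequent k-subset
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one database scan: count the candidates contained in each transaction
        Lk = {c for c in Ck1
              if sum(1 for t in transactions if c <= t) >= min_support_count}
        frequent |= Lk
        k += 1
    return frequent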
How to Count Supports of Candidates?
Counting Supports of Candidates Using a Hash Tree
Subset function: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third.
[Figure: hash tree whose leaves hold candidate 3-itemsets such as 234, 567, 145, 345, 356, 367, 136, 368, 357, 689, 124, 457, 125, 159, 458; the transaction 1 2 3 5 6 is recursively split (1+2356, 12+356, 13+56, …) and matched against the tree.]
Support Counting Using a
Hash Tree
Create a hash tree and hash all the
candidate k-itemsets to the leaf nodes of
the tree.
For each transaction, generate all of its k-item subsets.
Support Counting Using a
Hash Tree
For each k-item subset,
hash it to a leaf node of the hash tree and
compare it with the candidate itemsets
hashed to the same leaf node.
If it matches a candidate itemset,
increment the support count of that
candidate k-itemset.
Step 2: Generating rules from
frequent itemsets
A → B is an association rule if
confidence(A → B) ≥ minconf, where
support(A → B) = support(A ∪ B), and
confidence(A → B) = support(A ∪ B) /
support(A).
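A minimal Python sketch of this rule-generation step; the function name, the use of frozensets, and the support table are illustrative assumptions (the supports in the example come from the next slide):

from itertools import combinations

def generate_rules(freq_itemset, support, min_conf):
    """Output rules A -> B (with B = itemset - A) whose confidence reaches min_conf.
    `support` maps frozensets to their support values."""
    items = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(items)):                       # every non-empty proper subset A
        for A in map(frozenset, combinations(items, r)):
            B = items - A
            conf = support[items] / support[A]           # confidence(A -> B)
            if conf >= min_conf:
                rules.append((set(A), set(B), conf))
    return rules

# Example with the {2,3,4} itemset of the next slide (supports given as fractions there):
sup = {frozenset(s): v for s, v in [
    ((2, 3, 4), 0.50), ((2, 3), 0.50), ((2, 4), 0.50), ((3, 4), 0.75),
    ((2,), 0.75), ((3,), 0.75), ((4,), 0.75)]}
for A, B, c in generate_rules((2, 3, 4), sup, 0.5):
    print(A, "->", B, round(c, 2))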
Generating rules: an example
Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2},
{3}, {4}, with sup=50%, 50%, 75%, 75%, 75%, 75%
respectively
These yield the following association rules:
2,3 → 4, confidence=100%
2,4 → 3, confidence=100%
3,4 → 2, confidence=67%
2 → 3,4, confidence=67%
3 → 2,4, confidence=67%
4 → 2,3, confidence=67%
All rules have support = 50%
Scalable Frequent Itemset Mining
Methods
DHP (a hash-based technique): reduce the number of candidate itemsets
[Figure: a hash table whose buckets hold 2-itemsets, e.g. {ab, ad, ae} in one bucket and {yz, qs, wt} in another, with a count per bucket]
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of {ab,
ad, ae} is below support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD’95
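A minimal sketch of this hash-based pruning idea (the technique of the Park, Chen & Yu paper cited above): while scanning for 1-itemsets, every 2-item subset of each transaction is hashed into a small table, and a pair can survive as a candidate 2-itemset only if its bucket count reaches the support threshold. The bucket count, hash function, and names are illustrative assumptions:

from itertools import combinations

def bucket_counts(transactions, n_buckets=7):
    """First scan: hash every 2-item subset of every transaction into a small table."""
    counts = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % n_buckets] += 1
    return counts

def may_be_candidate(pair, counts, min_support_count, n_buckets=7):
    """A pair is kept as a candidate 2-itemset only if its bucket count reaches minsup;
    e.g. ab is dropped if the bucket holding {ab, ad, ae} has a total count below the threshold."""
    return counts[hash(tuple(sorted(pair))) % n_buckets] >= min_support_count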
Sampling for Frequent Patterns

DIC (Dynamic Itemset Counting): reduce the number of scans
Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
[Figure: the itemset lattice over {A, B, C, D}, and a timeline of the transaction scan showing that Apriori starts counting each level (1-itemsets, 2-itemsets, …) only after a full pass, while DIC starts counting 2-itemsets and 3-itemsets partway through the scan.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97.
Methods to Improve Apriori’s
Efficiency
Association Rules with Apriori
K:5, E:4, M:3, O:3, Y:3                                          (frequent 1-itemsets)
=> KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2    (candidate 2-itemsets with counts)
=> KE, KM, KO, KY, EO                                            (frequent 2-itemsets)
=> KEO                                                           (candidate 3-itemset)
Scalable Frequent Itemset Mining
Methods
FP-tree
• Built using 2 passes over the data-set.
  – Step 1: Builds a compact data structure called the FP-tree.
  – Step 2: Extracts frequent itemsets directly from the
    FP-tree.
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent
pattern
Step 1: FP-Tree Construction
FP-Tree is constructed using 2 passes over
the data-set:
Pass 1:
– Scan data and find support for each item.
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and
maps it to a path
2. Fixed order is used, so paths can overlap when
transactions share items (when they have the
same prefix ).
– In this case, counters are incremented
3. Pointers are maintained between nodes
containing the same item, creating singly
linked lists (dotted lines)
– The more paths that overlap, the higher the
compression.
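A minimal Python sketch of pass 2 as described above: each transaction, already restricted to its frequent items and sorted in the fixed frequency order, is mapped to one path, shared prefixes only increment counters, and a header table plays the role of the dotted node-links. Class and attribute names are illustrative assumptions:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> child FPNode

class FPTree:
    def __init__(self):
        self.root = FPNode(None, None)
        self.header = defaultdict(list)    # item -> nodes holding that item (node-links)

    def insert(self, ordered_items):
        """Map one transaction (items in the fixed frequency order) to a path."""
        node = self.root
        for item in ordered_items:
            child = node.children.get(item)
            if child is None:              # new branch of the tree
                child = FPNode(item, node)
                node.children[item] = child
                self.header[item].append(child)
            child.count += 1               # shared prefix: just increment the counter
            node = child

# Usage (items must already be in the chosen frequency order):
# tree = FPTree(); tree.insert(["Bread", "Milk", "Diaper"])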
Step 1: FP-Tree Construction (Example)
FP-Tree size
The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
– Best case scenario: all transactions contain the
same set of items, so the tree collapses to a single path.
Step 2: Frequent Itemset Generation
Prefix path sub-trees (Example)
Step 2: Frequent Itemset Generation
Example
Conditional FP-Tree
Find Patterns Having P From P-conditional
Database
[Figure: the conditional pattern base of “cam” is (f:3); the cam-conditional FP-tree is a single node f:3 under the root {}.]
A Special Case: Single Prefix Path in FP-tree
[Figure: an FP-tree whose upper part is a single prefix path ending at a3:n3 is split into that path and the remaining multipath part rooted at r1 with branches C2:k2 and C3:k3; the two parts are mined separately and the results combined.]
Association Rules
Let’s have an example
TID   Items
T100  1, 2, 5
T200  2, 4
T300  2, 3
T400  1, 2, 4
T500  1, 3
T600  2, 3
T700  1, 3
T800  1, 2, 3, 5
T900  1, 2, 3
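These nine transactions are reused on the FP-tree slides that follow; a small self-contained Python sketch of pass 1 (item supports and a fixed frequency-descending order, assuming a minimum support count of 2, which this slide does not state) could look like this:

from collections import Counter

transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]
min_count = 2                                    # assumed threshold

# Pass 1: count item supports and fix a frequency-descending order
support = Counter(i for t in transactions for i in t)
order = [i for i, c in support.most_common() if c >= min_count]
print(dict(support))   # item 2 occurs 7 times, items 1 and 3 six times, items 4 and 5 twice
print(order)           # the global order used to sort each transaction before insertion

# Each transaction, restricted to frequent items and sorted by that order,
# becomes one path of the FP-tree in pass 2.
ordered_transactions = [[i for i in order if i in t] for t in transactions]
print(ordered_transactions[0])   # the path for T100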
FP Tree
Mining the FP tree
Exercise
Association Rules with FP Tree
Frequent items and their counts: K:5, E:4, M:3, O:3, Y:3
Association Rules with FP Tree
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent
pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never larger than the original database (not
counting node-links and the count fields)
The Frequent Pattern Growth Mining
Method
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and
database partition:
For each frequent item, construct its conditional
pattern base, and then its conditional FP-tree
Repeat the process on each newly created
conditional FP-tree
Until the resulting FP-tree is empty, or it contains
only a single path (a single path generates all the
combinations of its sub-paths, each of which is a
frequent pattern)
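A compact recursive Python sketch of the pattern-growth idea just described. For brevity it represents a conditional FP-tree implicitly by its conditional pattern base, a list of (itemset, count) pairs, rather than by the node-and-pointer structure of the slides; the names and the fixed item order are illustrative assumptions:

from collections import Counter

def fpgrowth(pattern_base, min_count, suffix=frozenset()):
    """Grow frequent patterns recursively from (weighted) conditional pattern bases.
    pattern_base is a list of (frozenset_of_items, count) pairs; the initial call
    uses the whole transaction database with count 1 per transaction."""
    counts = Counter()
    for items, c in pattern_base:
        for i in items:
            counts[i] += c
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    order = sorted(frequent)                  # any fixed total order works
    result = {}
    for idx, item in enumerate(order):
        new_suffix = suffix | {item}
        result[new_suffix] = frequent[item]   # support of (suffix + item)
        # Conditional pattern base for `item`: keep only preceding frequent items,
        # so every pattern is grown along exactly one path of the recursion.
        allowed = set(order[:idx])
        cond = [(items & allowed, c) for items, c in pattern_base if item in items]
        cond = [(s, c) for s, c in cond if s]
        result.update(fpgrowth(cond, min_count, new_suffix))
    return result

# Example: the T100..T900 transactions with an assumed minimum support count of 2
db = [(frozenset(t), 1) for t in
      [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]]
print(fpgrowth(db, 2))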
Scaling FP-growth by Database
Projection
What if the FP-tree cannot fit in memory?
DB projection
First partition a database into a set of projected DBs
Then construct and mine FP-tree for each projected DB
Parallel projection vs. partition projection techniques
Parallel projection
Project the DB in parallel for each frequent item
Parallel projection is space costly
All the partitions can be processed in parallel
Partition projection
Partition the DB based on the ordered frequent items
Passing the unprocessed parts to the subsequent
partitions
Partition-Based Projection
[Figure: the transaction DB is partitioned into projected DBs, e.g. an am-proj DB and a cm-proj DB, whose transactions are prefixes such as fc and f.]
Performance of FPGrowth in Large
Datasets
[Figure: two plots of run time (sec.) vs. support threshold (%). Left: data set T25I20D10K (D1), FP-growth runtime vs. Apriori runtime. Right: data set T25I20D100K (D2), FP-growth vs. TreeProjection.]
Advantages of the Pattern Growth
Approach
Divide-and-conquer:
Decompose both the mining task and DB according to the
frequent patterns obtained so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-
tree, no pattern search and matching
A good open-source implementation and refinement of
FPGrowth
FPGrowth+ (Grahne and J. Zhu, FIMI'03)