Mining Frequent Patterns, Association and Correlations: Basic
Concepts and Methods
Basic Concepts
Scalable Frequent Itemset Mining Methods
Interestingness Measures: Lift and Null-Invariant Correlation Measures
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Classification: discriminative frequent pattern analysis
Broad applications
Basic Concepts: Frequent Patterns
itemset: a set of one or more items
k-itemset X = {x1, …, xk}
(absolute) support, or support count, of X: the number of transactions containing X
(relative) support, s: the fraction of transactions that contain X
An itemset X is frequent if X's support is no less than a minsup threshold
Basic Concepts: Association Rules
Tid Items bought Find all the rules X Y with minimum
10 Beer, Nuts, Diaper
support and confidence
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs support, s, probability that a transaction
40 Nuts, Eggs, Milk contains X Y
50 Nuts, Coffee, Diaper, Eggs, Milk
confidence, c, conditional probability that a
Customer
buys both
Customer transaction having X also contains Y
buys diaper
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer,
Diaper}:3
Customer
buys beer Association rules: (many more!)
Beer Diaper (60%, 100%)
Diaper Beer (60%, 75%)
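To make the two measures concrete, here is a minimal sketch (not part of the original slides) that computes support and confidence over the five-transaction table above and reproduces the two rules' numbers:

```python
# Support and confidence over the five transactions above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing every item."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return support(x | y, db) / support(x, db)

print(support({"Beer", "Diaper"}, transactions))       # 0.6
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0  -> Beer -> Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}, transactions))  # 0.75 -> Diaper -> Beer (60%, 75%)
```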
Example
Association rule analysis is a technique for uncovering how items are associated with each other. There are three common ways to measure association. (The examples below refer to an eight-transaction table that is not reproduced in this text.)
Measure 1: Support. This says how popular an itemset is, measured by the proportion of transactions in which the itemset appears. In the table, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.
Measure 2: Confidence. This says how likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. It is measured by the proportion of transactions containing item X in which item Y also appears. In the table, the confidence of {apple -> beer} is 3 out of 4, or 75%.
Measure 3: Lift. This says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. In the table, the lift of {apple -> beer} is 1, which implies no association between the items. A lift greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought.
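As a quick check of the arithmetic, the lift of {apple -> beer} can be recomputed from the proportions stated above; note that support({beer}) = 75% is inferred here from lift = 1 rather than read from the (missing) table:

```python
# Recomputing lift({apple} -> {beer}) from the proportions stated above.
conf_apple_beer = 3 / 4    # 75%, as stated
support_beer = 6 / 8       # assumption: implied by lift = 1 (0.75 / s(beer) = 1)

lift = conf_apple_beer / support_beer
print(lift)  # 1.0 -> no association between apple and beer
```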
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
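A small illustrative sketch (assumed support counts, not from the slides) that applies the two definitions to a toy table of frequent itemsets:

```python
# Hypothetical frequent itemsets with support counts (illustrative only).
freq = {
    frozenset("xy"): 3,
    frozenset("xyz"): 2,
}
min_sup = 2

def is_closed(x, freq):
    # No proper superset with the same support.
    return all(not (x < y) or freq[y] < freq[x] for y in freq)

def is_max(x, freq, min_sup):
    # No proper superset that is itself frequent.
    return all(not (x < y) or freq[y] < min_sup for y in freq)

for x in freq:
    print(set(x), freq[x],
          "closed" if is_closed(x, freq) else "",
          "max" if is_max(x, freq, min_sup) else "")
# xy is closed (xyz has lower support) but not max (xyz is still frequent).
```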
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1.
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all patterns?
All 2^100 − 1 nonempty sub-itemsets of {a1, …, a100}, far too many to enumerate!
Closed Patterns
A closed pattern is:
1. a frequent pattern, so it meets the minimum support criterion;
2. a pattern with no super-pattern of the same support.
Example: suppose a pattern xy has a support count of 3 and a pattern xyz has a support count of 2. Is xy a closed pattern?
Yes: xy is a frequent pattern, and its only super-pattern, xyz, has a lower support count (2 < 3), so xy is closed.
Max-Patterns
A max-pattern is:
1. a frequent pattern, so like a closed pattern it meets the minimum support criterion;
2. a pattern all of whose super-patterns are NOT frequent.
Example 1: suppose there are a total of 3 items: x, y, z, and suppose a pattern xy has a support count of 3 while the pattern xyz has a support count of 1. Is xy a max-pattern?
Yes: xy is a frequent pattern, and its only super-pattern, xyz, is NOT frequent (its count falls below the minimum support), so xy is a max-pattern.
Example 2: again with a total of 3 items a, b, c, suppose a pattern ab has a support count of 3 and the pattern abc has a support count of 2. Is ab a max-pattern?
No: its super-pattern abc is still frequent, so ab cannot be a max-pattern.
Scalable Frequent Itemset Mining Methods
The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
i.e., every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FPGrowth: Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (ECLAT: Zaki et al. @KDD'97)
Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if any subset of an itemset is infrequent, the itemset itself cannot be frequent, so it need not be generated or tested (the contrapositive of the downward closure property above).
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
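A compact, hedged Python rendering of this pseudocode (itemsets as frozensets, min_sup as an absolute support count; a sketch rather than an optimized implementation):

```python
from itertools import combinations

def apriori(db, min_sup):
    """db: list of transactions (sets). Returns {frequent itemset: support count}."""
    counts = {}
    for t in db:                      # one scan for L1
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_sup}
    result, k = dict(L), 1
    while L:
        prev = list(L)
        # Self-join: unions of two frequent k-itemsets that form a (k+1)-itemset.
        cands = {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
                 if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        # One DB scan counts all surviving candidates.
        counts = {c: sum(c <= t for t in db) for c in cands}
        L = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(L)
        k += 1
    return result

# The four-transaction TDB of the worked example below, with min_sup = 2:
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, 2))  # includes frozenset({'B', 'C', 'E'}): 2
```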
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
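The same two steps on this L3, as a short sketch (items kept as sorted strings so that the join condition "share the first k−1 items" is a simple prefix test):

```python
from itertools import combinations

L3 = ["abc", "abd", "acd", "ace", "bcd"]

# Step 1: self-join L3 * L3, merging pairs agreeing on the first 2 items.
joined = [a + b[-1] for a in L3 for b in L3
          if a[:-1] == b[:-1] and a[-1] < b[-1]]
print(joined)  # ['abcd', 'acde']

# Step 2: prune, keeping a candidate only if all of its 3-subsets are in L3.
C4 = [c for c in joined
      if all("".join(s) in L3 for s in combinations(c, 3))]
print(C4)      # ['abcd']; 'acde' is removed because 'ade' is not in L3
```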
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
Exercise
Mine the frequent itemsets for the given transactional database
using the Apriori algorithm.
Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). Association rules can be generated as follows: for each frequent itemset l, generate all nonempty proper subsets s of l; for each such subset, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf. Every such rule automatically satisfies minimum support, since l itself is frequent.
Generating Association Rules from Frequent Itemsets
The AllElectronics data contains the frequent itemset X = {I1, I2, I5}. What association rules can be generated from X? The nonempty proper subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting candidate rules are {I1, I2} ⇒ I5, {I1, I5} ⇒ I2, {I2, I5} ⇒ I1, I1 ⇒ {I2, I5}, I2 ⇒ {I1, I5}, and I5 ⇒ {I1, I2}; each is output as a strong rule only if its confidence meets the minimum confidence threshold.
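A sketch of the procedure on X = {I1, I2, I5}; the support counts below are assumptions chosen for illustration, since the slide does not reproduce them:

```python
from itertools import combinations

sup = {  # hypothetical support counts, for illustration only
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(itemset, sup, min_conf):
    """Yield s => (l - s) for every nonempty proper subset s meeting min_conf."""
    l = frozenset(itemset)
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = sup[l] / sup[s]
            if conf >= min_conf:
                yield set(s), set(l - s), conf

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, sup, min_conf=0.7):
    print(lhs, "=>", rhs, f"(conf = {conf:.0%})")
```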
How to Count Supports of Candidates?
Counting is nontrivial: the total number of candidates can be huge, and a single transaction may contain many candidates. Candidate itemsets are therefore stored in a hash tree, and a subset function finds all candidates contained in a given transaction.
Further Improvement of the Apriori Method
Major ideas, illustrated on the following slides: reduce the number of database scans (Partition, Sampling, DIC) and shrink the number of candidates (hash-based pruning).
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate the global frequent patterns (A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95; see references)
DHP (Direct Hashing and Pruning): Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
Example: candidates a, b, c, d, e; the 2-itemsets of each transaction are hashed into buckets:

  count | itemsets
  35    | {ab, ad, ae}
  …     | …
  88    | {bd, be, de}

ab is not a candidate 2-itemset if the summed count of its bucket {ab, ad, ae} is below the support threshold.
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
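A toy sketch of the bucket-counting idea (the database, bucket count, and hash function here are all illustrative assumptions):

```python
from itertools import combinations

db = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "d", "e"}, {"b", "e"}]
NBUCKETS, min_sup = 4, 2
bucket = [0] * NBUCKETS

def h(pair):
    a, b = pair                      # deterministic toy hash function
    return (ord(a) * 31 + ord(b)) % NBUCKETS

for t in db:
    for pair in combinations(sorted(t), 2):
        bucket[h(pair)] += 1

# Any 2-itemset whose bucket total is below min_sup cannot be frequent,
# so it is pruned from C2 before the counting scan.
print(bucket)
```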
Sampling for Frequent Patterns
Select a sample of the original database, mine frequent patterns within the sample using a lower support threshold, then scan the full database once to verify the result (H. Toivonen, VLDB'96; see references).
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
[Figure: the itemset lattice from {} up to ABCD, with a timeline contrasting Apriori's level-by-level counting (all 1-itemsets, then all 2-itemsets, …) against DIC's dynamic schedule, which starts counting longer itemsets partway through earlier scans.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
Pattern-Growth Approach: Mining Frequent Patterns Without
Candidate Generation
Bottlenecks of the Apriori approach
It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
It may need to repeatedly scan the whole database and check a large set of candidates by
pattern matching. It is costly to go over each transaction in the database to determine the
support of the candidate itemsets.
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Construct FP-tree from a Transaction Database
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in descending frequency order into an f-list
3. Scan the DB again and construct the FP-tree
[Figure: an example transaction DB and the FP-tree built from it; header-table f-list: f-c-a-b-m-p.]
Find Patterns Having p From p-conditional Database
Starting at the frequent-item header table of the FP-tree, traverse the tree by following the node-links of each frequent item p, and accumulate p's prefix paths to form p's conditional pattern base.

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree:
  {}
  ├─ f:4
  │   ├─ c:3
  │   │   └─ a:3
  │   │       ├─ m:2 ─ p:2
  │   │       └─ b:1 ─ m:1
  │   └─ b:1
  └─ c:1
      └─ b:1 ─ p:1

Conditional pattern bases:
  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
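A hedged sketch that builds this FP-tree and derives the conditional pattern bases above. The five transactions are an assumption: they are the standard example DB that produces exactly this tree (the slide's own DB table is not reproduced in this text), and the f-list order f-c-a-b-m-p is taken from the header table above:

```python
from collections import defaultdict

# Assumed transactions: the standard example DB producing exactly this tree.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
min_sup = 3

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

# f-list in descending support (ties broken as in the header table above).
flist = list("fcabmp")
rank = {i: r for r, i in enumerate(flist)}

root, links = Node(None, None), defaultdict(list)
for t in db:
    node = root
    # Keep only frequent items, ordered by the f-list, then insert the path.
    for i in sorted((i for i in t if i in rank), key=rank.get):
        if i not in node.children:
            node.children[i] = Node(i, node)
            links[i].append(node.children[i])  # node-links for the header table
        node = node.children[i]
        node.count += 1

def cond_pattern_base(item):
    """Prefix path of every node carrying `item`, with that node's count."""
    for node in links[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            yield "".join(reversed(path)), node.count

for i in reversed(flist):
    print(i, list(cond_pattern_base(i)))
# p [('fcam', 2), ('cb', 1)], m [('fca', 2), ('fcab', 1)], ... as in the table above
```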
From Conditional Pattern-bases to Conditional FP-trees
For each pattern base, construct the FP-tree over its frequent items, then mine it recursively:
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
[Figure: an FP-tree whose top is a single prefix path (a1:n1 → a2:n2 → a3:n3) before branching into the multipath part (b1:m1, C1:k1, C2:k2, C3:k3); such a tree is split at r1 into the prefix path and the multipath part, which are mined separately and their results combined.]
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more frequently occurring, the
more likely to be shared
Never larger than the original database (not counting node-links and count fields)
The Frequent Pattern Growth Mining Method
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and database partition
Method
For each frequent item, construct its conditional pattern-base, and then its
conditional FP-tree
Repeat the process on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only a single path; such a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
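A compact sketch of this recursion; instead of explicit FP-trees it projects lists of (transaction, count) pairs, which is an equivalent way to form each item's conditional pattern base (an illustrative rewrite, not the authors' FP-tree implementation):

```python
from collections import defaultdict

def fpgrowth(db, min_sup, suffix=frozenset()):
    """db: list of (items, count) with items in a fixed global order.
    Yields (frequent pattern, support count)."""
    counts = defaultdict(int)
    for items, n in db:
        for i in items:
            counts[i] += n
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = suffix | {item}
        yield pattern, sup
        # Conditional base of `item`: the prefix before it in each transaction.
        proj = [(items[:items.index(item)], n) for items, n in db if item in items]
        proj = [(items, n) for items, n in proj if items]
        yield from fpgrowth(proj, min_sup, pattern)  # recurse: grow the pattern

# Transactions must share one global item order (sorted here) for correctness.
db = [(sorted(t), 1) for t in
      [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]]
for pat, sup in fpgrowth(db, 2):
    print(set(pat), sup)  # A:2, B:3, C:3, E:3, AC:2, BC:2, BE:3, CE:2, BCE:2
```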
Scaling FP-growth by Database Projection
What about if FP-tree cannot fit in memory?
DB projection
First partition a database into a set of projected DBs
Then construct and mine FP-tree for each projected DB
Parallel projection vs. partition projection techniques
Parallel projection
Project the DB in parallel for each frequent item
Parallel projection is space costly
All the partitions can be processed in parallel
Partition projection
Partition the DB based on the ordered frequent items
Passing the unprocessed parts to the subsequent partitions
Partition-Based Projection
Project the transaction DB item by item in f-list order, passing unprocessed items along to subsequent projections.
[Figure: the transaction DB and its per-item projected DBs; e.g., the am-projected DB is {fc, fc, fc} and the cm-projected DB is {f, f, f}.]
Performance of FPGrowth in Large Datasets
[Figure: runtime (sec.) vs. support threshold (%). Left: data set T25I20D10K (D1), FP-growth vs. Apriori runtime; right: data set T25I20D100K (D2), FP-growth vs. TreeProjection. In both plots, FP-growth's runtime grows far more slowly as the support threshold drops toward 0%.]
Advantages of the Pattern Growth Approach
Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns obtained
so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-tree, no pattern search and
matching
A good open-source implementation and refinement of FPGrowth
FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
Pattern-growth-based clustering: MaPle (Pei et al., ICDM'03)
Pattern-growth-based classification: mining frequent and discriminative patterns (Cheng et al., ICDE'07)
ECLAT: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
tid-list: the list of transaction ids containing an itemset
Deriving frequent patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): any transaction having X also has Y
Using diffsets to accelerate mining:
only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset(XY, X) = {T2}
Eclat (Zaki et al. @KDD'97)
Mining closed patterns using vertical format: CHARM (Zaki & Hsiao @SDM'02)
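A tiny sketch of tid-list intersection and diffsets, reusing the tids from the bullets above (t(Y) is an assumption, chosen to be consistent with t(XY) = {T1, T3}):

```python
t_X = {"T1", "T2", "T3"}
t_Y = {"T1", "T3", "T4"}      # assumed tid-list for Y

t_XY = t_X & t_Y              # support(XY) = |t(X) intersect t(Y)|
print(sorted(t_XY))           # ['T1', 'T3']

diffset = t_X - t_XY          # diffset(XY, X): tids in t(X) missing from t(XY)
print(diffset)                # {'T2'}, smaller to store than the full tid-list
```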
Mining Frequent Closed Patterns: CLOSET
F-list: list of all frequent items in support-ascending order
F-list: d-a-f-e-c (Min_sup = 2)

  TID | Items
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

Divide the search space:
patterns having d
patterns having d but no a, etc.
Find frequent closed patterns recursively:
every transaction having d also has cfa, so cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00

CLOSET+: Mining Closed Itemsets by Pattern-Growth
Visualization of Association Rules: Rule Graph
[Figure omitted.]
Visualization of Association Rules (SGI/MineSet 3.0)
[Figure omitted.]
Interestingness Measure: Correlations (Lift)
"play basketball ⇒ eat cereal" [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%.
"play basketball ⇒ not eat cereal" [20%, 33.3%] is more accurate, although it has lower support and confidence.
Measure of dependent/correlated events: lift
lift(A, B) = s(A ∪ B) / (s(A) × s(B)) = conf(A ⇒ B) / s(B)
Here s(basketball) = 0.40 / 0.667 ≈ 0.60, so lift(basketball, cereal) = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1: the two are negatively correlated.
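The lift computation, using only the percentages stated on this slide:

```python
# lift(B, C) = s(B and C) / (s(B) * s(C)), from the slide's percentages.
s_bc = 0.40            # support(basketball, cereal) = 40%
conf = 2 / 3           # confidence(basketball -> cereal) = 66.7%
s_b = s_bc / conf      # hence support(basketball) = 60%
s_c = 0.75             # support(cereal) = 75%

lift = s_bc / (s_b * s_c)
print(round(lift, 2))  # 0.89 < 1: playing basketball and eating cereal are
                       # negatively correlated, as the slide argues
```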
Are Lift and χ² Good Measures of Correlation?
Null-Invariant Measures
[Table: the five null-invariant measures, each defined from s(A), s(B), and s(A ∪ B): all-confidence, max-confidence, Kulczynski, cosine, and coherence (Jaccard).]
Comparison of Interestingness Measures
Null-(transaction) invariance is crucial for correlation analysis
Lift and χ² are not null-invariant
The 5 null-invariant measures are subtle: they can disagree with one another on the same data
Analysis of DBLP Coauthor Relationships
Recent DB conferences, removing balanced associations, low sup, etc.
Which Null-Invariant Measure Is Better?
IR (Imbalance Ratio): measure the imbalance of two itemsets A and B in rule
implications
Kulczynski and Imbalance Ratio (IR) together present a clear picture for all the
three datasets D4 through D6
D4 is balanced & neutral
D5 is imbalanced & neutral
D6 is very imbalanced & neutral
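A sketch of both measures from their standard definitions; the support counts are hypothetical, chosen to mimic a balanced (D4-like) and a very imbalanced (D6-like) dataset:

```python
def kulc(s_ab, s_a, s_b):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, a null-invariant measure."""
    return 0.5 * (s_ab / s_a + s_ab / s_b)

def ir(s_ab, s_a, s_b):
    """IR(A, B) = |s(A) - s(B)| / (s(A) + s(B) - s(A, B)); 0 means balanced."""
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

print(kulc(1000, 2000, 2000), ir(1000, 2000, 2000))      # 0.5, 0.0  (balanced, neutral)
print(kulc(1000, 1010, 100000), ir(1000, 1010, 100000))  # ~0.5, ~0.99 (very imbalanced, neutral)
```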
Summary
Basic concepts: frequent patterns, association rules, support and confidence, closed patterns and max-patterns
Scalable mining methods: Apriori (candidate generation and test), FPGrowth (pattern growth without candidate generation), ECLAT (vertical data format)
Interestingness: strong rules can be misleading; lift and χ² are not null-invariant, while measures such as Kulczynski (paired with the Imbalance Ratio) are
Ref: Basic Concepts of Frequent Pattern Mining
(Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets
of items in large databases. SIGMOD'93
(Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
(Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed
itemsets for association rules. ICDT'99
(Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
Ref: Apriori and Its Improvements
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules.
KDD'94
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large
databases. VLDB'95
J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules.
SIGMOD'95
H. Toivonen. Sampling large databases for association rules. VLDB'96
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database
systems: Alternatives and implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP Mining
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J.
Parallel and Distributed Computing, 2002.
G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. FIMI'03
B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining implementations. Proc. ICDM’03
Int. Workshop on Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL, Nov. 2003
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00
J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02
J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support.
ICDM'02
J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets.
KDD'03
Ref: Vertical Format and Row Enumeration Methods
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1997
M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02.
C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with
Constraints. KDD’02.
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long
Biological Datasets. KDD'03.
H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A
Top-Down Row Enumeration Approach, SDM'06.
Ref: Mining Correlations and Interesting Rules
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations.
SIGMOD'97.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large
sets of discovered association rules. CIKM'94.
R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic, 2001.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns.
KDD'02.
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern Mining: A Unified
Framework", Data Mining and Knowledge Discovery, 21(3):371-397, 2010