Lect 6
Association Rule
— Chapter 6 —
Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern
Analysis?
Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that
occurs/appears frequently in a data set
First proposed by Agrawal, Imielinski, and Swami
[AIS93] in the context of frequent itemsets and
association rule mining
What Is Frequent Pattern
Analysis?
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer
and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design,
sale campaign analysis, Web log (click stream) analysis,
and DNA sequence analysis.
Why Is Freq. Pattern Mining
Important?
Freq. pattern: an intrinsic and important property
of data sets; frequent patterns are patterns that
appear frequently in a data set.
For example, a set of items, such as milk and
bread, that appears frequently together in a
transaction data set is a frequent itemset.
Association Rule Mining
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Basic Concepts: Frequent
Patterns
Transaction data: supermarket
data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
Support
(absolute) support, or support count, of X, σ(X):
  frequency (number of occurrences) of an itemset X
  E.g. σ({Milk, Bread, Diaper}) = 2
(relative) support, s: the fraction of transactions that contain X
  (i.e., the probability that a transaction contains X)
  E.g. s({Milk, Bread, Diaper}) = 2/5

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Frequent Itemset
An itemset X is frequent if X's support is no less than a minsup threshold.
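As a concrete illustration, here is a minimal Python sketch of these two definitions on the table above (the function and variable names are illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))  # 2
print(support(X, transactions))        # 0.4  (= 2/5)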
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!):
  Beer → Diaper (60%, 100%)
Definition: Association Rule
An association rule is an implication of the form X → Y, where X and Y are disjoint itemsets; its strength is measured by its support and confidence, as defined above.
Association Rule Mining: Brute-force Approach
List all possible association rules
Compute the support and confidence for
each rule
Prune rules that fail the minsup and
minconf thresholds
Computationally prohibitive!
Brute Force approach to
Frequent Itemset Generation
For an itemset with 3 elements, we have 8 subsets.
Each subset is a candidate frequent itemset which
needs to be matched against each transaction.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

1-itemsets:
Itemset    Count
{Milk}     4
{Diaper}   4
{Beer}     3

2-itemsets:
Itemset          Count
{Milk, Diaper}   3
{Diaper, Beer}   3
{Beer, Milk}     2

3-itemsets:
Itemset                Count
{Milk, Diaper, Beer}   2

Important Observation:
Counts of subsets can't be smaller than the count of an itemset!
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent:
  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the
support of its subsets.
This is known as the anti-monotone property
of support.
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Two sub-problems:
Find all itemsets that have transaction support above the minimum support (the large, or frequent, itemsets).
Use the large itemsets to generate rules with at least the minimum confidence (next slide).
Second Sub-problem
Straightforward approach:
For every large itemset l, find all non-empty
subsets of l.
For every such subset a, output a rule of the form
a → (l - a) if its confidence is at least minconf.
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2)
+ … + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns
instead.
An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X.
An itemset X is a max-pattern if X is frequent and
there exists no frequent super-pattern Y ⊃ X.
Closed pattern is a lossless compression of freq.
patterns.
Reducing the # of patterns and rules.
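A tiny Python sketch of the two definitions, assuming the supports of all frequent itemsets are already available in a dict keyed by frozenset (since a super-pattern with the same support as a frequent X is itself frequent, checking only frequent supersets suffices); the function name is illustrative:

def closed_and_max_patterns(supports):
    """supports: dict mapping each frequent itemset (a frozenset) to its support count."""
    closed, maximal = set(), set()
    for X, s in supports.items():
        frequent_supersets = [Y for Y in supports if X < Y]   # proper supersets that are frequent
        if all(supports[Y] != s for Y in frequent_supersets):
            closed.add(X)      # closed: no super-pattern with the same support
        if not frequent_supersets:
            maximal.add(X)     # max-pattern: no frequent super-pattern at all
    return closed, maximal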
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
Min_sup = 1.
What is the set of closed itemset?
<a1, …, a100>: 1
< a1, …, a50>: 2
What is the set of max-pattern?
<a1, …, a100>: 1
What is the set of all patterns?
All 2^100 - 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
Computational Complexity of Frequent
Itemset Mining
How many itemsets may potentially be generated in the worst
case?
The number of frequent itemsets to be generated is sensitive to
the minsup threshold.
When minsup is low, there exist potentially an exponential
number of frequent itemsets.
The worst case: M^N, where M = # distinct items and N = max length
of transactions.
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance this particular set of 10 products is
frequent 10^3 times in 10^9 transactions?
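To make the point explicit, here is a rough back-of-the-envelope check of the Walmart example (a sketch; it assumes independence, as the slide's numbers implicitly do):

p_one_product = 1e-4              # chance a transaction contains one given product
p_set_of_10 = p_one_product**10   # ~1e-40: chance it contains a particular set of 10 products
expected_count = 1e9 * p_set_of_10
print(expected_count)             # ~1e-31 expected occurrences in 10^9 transactions,
                                  # so reaching a support count of 10^3 is essentially impossible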
The Downward Closure Property and
Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be
frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods, three major approaches:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FPGrowth, Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm, Zaki & Hsiao @SDM'02)
Scalable Frequent Itemset Mining
Methods
Apriori: A Candidate Generation & Test
Approach
Two steps:
Join large itemsets Lk-1 with Lk-1.
Prune out all itemsets in the joined result which
contain a (k-1)-subset not found in Lk-1.
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
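A minimal Python sketch of this candidate-generation step (self-join, then prune) on the L3 from this slide; representing itemsets as sorted tuples and the name apriori_gen are illustrative assumptions:

from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Self-join L_{k-1} with itself, then prune candidates having an infrequent (k-1)-subset."""
    prev = set(Lk_minus_1)                 # frequent (k-1)-itemsets as sorted tuples
    k = len(Lk_minus_1[0]) + 1
    candidates = set()
    # Step 1: self-join, merging pairs that agree on the first k-2 items
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a[:-1] + (a[-1], b[-1]))
    # Step 2: prune candidates that have a (k-1)-subset not in L_{k-1}
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}: acde is pruned because ade is not in L3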
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent):  {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2 (frequent):  {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (candidate): {B, C, E}
3rd scan → L3:  {B, C, E}:2
The Apriori Algorithm (Pseudo-Code)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
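A compact Python rendering of the pseudo-code above; this is a sketch (itemsets as frozensets, candidate generation by join-and-prune as on the earlier slide), not the exact implementation of any particular library:

from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise frequent itemset mining, following the pseudo-code above."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {c for c in items
          if sum(1 for t in transactions if c <= t) >= min_support_count}
    frequent = set(Lk)
    k = 1
    while Lk:
        # C_{k+1}: join L_k with L_k, then prune candidates with an infrequent k-subset
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one database scan: count the candidates contained in each transaction
        Lk = {c for c in Ck1
              if sum(1 for t in transactions if c <= t) >= min_support_count}
        frequent |= Lk
        k += 1
    return frequent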
How to Count Supports of Candidates?
Counting Supports of Candidates Using a Hash Tree
Subset function: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third.
[Figure: hash tree whose leaves hold candidate 3-itemsets such as 234, 567, 145, 345, 356, 367, 136, 368, 357, 689, 124, 457, 125, 159, 458; the transaction 1 2 3 5 6 is recursively split (1+2356, 12+356, 13+56, …) and matched against the tree.]
Support Counting Using a
Hash Tree
Create a hash tree and hash all the
candidate k-itemsets to the leaf nodes of
the tree.
For each transaction, generate all of its k-item subsets.
Support Counting Using a
Hash Tree
For each k-item subset,
hash it to a leaf node of the hash tree and
compare it with the candidate itemsets
hashed to the same leaf node.
If it matches a candidate itemset,
increment the support count of that
candidate k-itemset.
Step 2: Generating rules from
frequent itemsets
A → B is an association rule if
confidence(A → B) ≥ minconf, where
support(A → B) = support(A ∪ B), and
confidence(A → B) = support(A ∪ B) /
support(A).
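A minimal Python sketch of this rule-generation step; the function name, the use of frozensets, and the support table are illustrative assumptions (the supports in the example come from the next slide):

from itertools import combinations

def generate_rules(freq_itemset, support, min_conf):
    """Output rules A -> B (with B = itemset - A) whose confidence reaches min_conf.
    `support` maps frozensets to their support values."""
    items = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(items)):                       # every non-empty proper subset A
        for A in map(frozenset, combinations(items, r)):
            B = items - A
            conf = support[items] / support[A]           # confidence(A -> B)
            if conf >= min_conf:
                rules.append((set(A), set(B), conf))
    return rules

# Example with the {2,3,4} itemset of the next slide (supports given as fractions there):
sup = {frozenset(s): v for s, v in [
    ((2, 3, 4), 0.50), ((2, 3), 0.50), ((2, 4), 0.50), ((3, 4), 0.75),
    ((2,), 0.75), ((3,), 0.75), ((4,), 0.75)]}
for A, B, c in generate_rules((2, 3, 4), sup, 0.5):
    print(A, "->", B, round(c, 2))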
Generating rules: an example
Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2},
{3}, {4}, with sup=50%, 50%, 75%, 75%, 75%, 75%
respectively
These yield the following association rules:
2,3 → 4, confidence=100%
2,4 → 3, confidence=100%
3,4 → 2, confidence=67%
2 → 3,4, confidence=67%
3 → 2,4, confidence=67%
4 → 2,3, confidence=67%
All rules have support = 50%
Scalable Frequent Itemset Mining
Methods
DHP (a hash-based technique): reduce the number of candidate itemsets
[Figure: a hash table whose buckets hold 2-itemsets, e.g. {ab, ad, ae} in one bucket and {yz, qs, wt} in another, with a count per bucket]
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of {ab,
ad, ae} is below support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD’95
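A minimal sketch of this hash-based pruning idea (the technique of the Park, Chen & Yu paper cited above): while scanning for 1-itemsets, every 2-item subset of each transaction is hashed into a small table, and a pair can survive as a candidate 2-itemset only if its bucket count reaches the support threshold. The bucket count, hash function, and names are illustrative assumptions:

from itertools import combinations

def bucket_counts(transactions, n_buckets=7):
    """First scan: hash every 2-item subset of every transaction into a small table."""
    counts = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % n_buckets] += 1
    return counts

def may_be_candidate(pair, counts, min_support_count, n_buckets=7):
    """A pair is kept as a candidate 2-itemset only if its bucket count reaches minsup;
    e.g. ab is dropped if the bucket holding {ab, ad, ae} has a total count below the threshold."""
    return counts[hash(tuple(sorted(pair))) % n_buckets] >= min_support_count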
Sampling for Frequent Patterns

DIC (Dynamic Itemset Counting): reduce the number of scans
Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
[Figure: the itemset lattice over {A, B, C, D}, and a timeline of the transaction scan showing that Apriori starts counting each level (1-itemsets, 2-itemsets, …) only after a full pass, while DIC starts counting 2-itemsets and 3-itemsets partway through the scan.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97.
Methods to Improve Apriori’s
Efficiency
Association Rules with Apriori
K:5, E:4, M:3, O:3, Y:3                                          (frequent 1-itemsets)
=> KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2    (candidate 2-itemsets with counts)
=> KE, KM, KO, KY, EO                                            (frequent 2-itemsets)
=> KEO                                                           (candidate 3-itemset)
Scalable Frequent Itemset Mining
Methods
FP-tree
• Built using 2 passes over the data-set.
  – Step 1: Builds a compact data structure called the FP-tree.
  – Step 2: Extracts frequent itemsets directly from the
    FP-tree.
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent
pattern
Step 1: FP-Tree Construction
FP-Tree is constructed using 2 passes over
the data-set:
Pass 1:
– Scan data and find support for each item.
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and
maps it to a path
2. Fixed order is used, so paths can overlap when
transactions share items (when they have the
same prefix ).
– In this case, counters are incremented
3. Pointers are maintained between nodes
containing the same item, creating singly
linked lists (dotted lines)
– The more paths that overlap, the higher the
compression.
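A minimal Python sketch of pass 2 as described above: each transaction, already restricted to its frequent items and sorted in the fixed frequency order, is mapped to one path, shared prefixes only increment counters, and a header table plays the role of the dotted node-links. Class and attribute names are illustrative assumptions:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> child FPNode

class FPTree:
    def __init__(self):
        self.root = FPNode(None, None)
        self.header = defaultdict(list)    # item -> nodes holding that item (node-links)

    def insert(self, ordered_items):
        """Map one transaction (items in the fixed frequency order) to a path."""
        node = self.root
        for item in ordered_items:
            child = node.children.get(item)
            if child is None:              # new branch of the tree
                child = FPNode(item, node)
                node.children[item] = child
                self.header[item].append(child)
            child.count += 1               # shared prefix: just increment the counter
            node = child

# Usage (items must already be in the chosen frequency order):
# tree = FPTree(); tree.insert(["Bread", "Milk", "Diaper"])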
Step 1: FP-Tree Construction (Example)
FP-Tree size
The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
– Best case scenario: all transactions contain the
same set of items, so the tree collapses to a single path.
Step 2: Frequent Itemset Generation
Prefix path sub-trees (Example)
Step 2: Frequent Itemset Generation
Example
Conditional FP-Tree
Find Patterns Having P From P-conditional
Database
[Figure: the conditional pattern base of “cam” is (f:3); the cam-conditional FP-tree is a single node f:3 under the root {}.]
A Special Case: Single Prefix Path in FP-tree
[Figure: an FP-tree whose upper part is a single prefix path ending at a3:n3 is split into that path and the remaining multipath part rooted at r1 with branches C2:k2 and C3:k3; the two parts are mined separately and the results combined.]
Association Rules
Let’s have an example
TID   Items
T100  1, 2, 5
T200  2, 4
T300  2, 3
T400  1, 2, 4
T500  1, 3
T600  2, 3
T700  1, 3
T800  1, 2, 3, 5
T900  1, 2, 3
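These nine transactions are reused on the FP-tree slides that follow; a small self-contained Python sketch of pass 1 (item supports and a fixed frequency-descending order, assuming a minimum support count of 2, which this slide does not state) could look like this:

from collections import Counter

transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]
min_count = 2                                    # assumed threshold

# Pass 1: count item supports and fix a frequency-descending order
support = Counter(i for t in transactions for i in t)
order = [i for i, c in support.most_common() if c >= min_count]
print(dict(support))   # item 2 occurs 7 times, items 1 and 3 six times, items 4 and 5 twice
print(order)           # the global order used to sort each transaction before insertion

# Each transaction, restricted to frequent items and sorted by that order,
# becomes one path of the FP-tree in pass 2.
ordered_transactions = [[i for i in order if i in t] for t in transactions]
print(ordered_transactions[0])   # the path for T100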
FP Tree
Mining the FP tree
Exercise
Association Rules with FP Tree
Frequent items and their counts: K:5, E:4, M:3, O:3, Y:3
Association Rules with FP Tree
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent
pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never larger than the original database (not
counting node-links and the count fields)
The Frequent Pattern Growth Mining
Method
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and
database partition:
For each frequent item, construct its conditional
pattern base, and then its conditional FP-tree
Repeat the process on each newly created
conditional FP-tree
Until the resulting FP-tree is empty, or it contains
only a single path (a single path generates all the
combinations of its sub-paths, each of which is a
frequent pattern)
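A compact recursive Python sketch of the pattern-growth idea just described. For brevity it represents a conditional FP-tree implicitly by its conditional pattern base, a list of (itemset, count) pairs, rather than by the node-and-pointer structure of the slides; the names and the fixed item order are illustrative assumptions:

from collections import Counter

def fpgrowth(pattern_base, min_count, suffix=frozenset()):
    """Grow frequent patterns recursively from (weighted) conditional pattern bases.
    pattern_base is a list of (frozenset_of_items, count) pairs; the initial call
    uses the whole transaction database with count 1 per transaction."""
    counts = Counter()
    for items, c in pattern_base:
        for i in items:
            counts[i] += c
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    order = sorted(frequent)                  # any fixed total order works
    result = {}
    for idx, item in enumerate(order):
        new_suffix = suffix | {item}
        result[new_suffix] = frequent[item]   # support of (suffix + item)
        # Conditional pattern base for `item`: keep only preceding frequent items,
        # so every pattern is grown along exactly one path of the recursion.
        allowed = set(order[:idx])
        cond = [(items & allowed, c) for items, c in pattern_base if item in items]
        cond = [(s, c) for s, c in cond if s]
        result.update(fpgrowth(cond, min_count, new_suffix))
    return result

# Example: the T100..T900 transactions with an assumed minimum support count of 2
db = [(frozenset(t), 1) for t in
      [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]]
print(fpgrowth(db, 2))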
Scaling FP-growth by Database
Projection
What if the FP-tree cannot fit in memory?
DB projection
First partition a database into a set of projected DBs
Then construct and mine FP-tree for each projected DB
Parallel projection vs. partition projection techniques
Parallel projection
Project the DB in parallel for each frequent item
Parallel projection is space costly
All the partitions can be processed in parallel
Partition projection
Partition the DB based on the ordered frequent items
Passing the unprocessed parts to the subsequent
partitions
Partition-Based Projection
[Figure: the transaction DB is partitioned into projected DBs, e.g. an am-proj DB and a cm-proj DB, whose transactions are prefixes such as fc and f.]
Performance of FPGrowth in Large
Datasets
[Figure: two plots of run time (sec.) vs. support threshold (%). Left: data set T25I20D10K (D1), FP-growth runtime vs. Apriori runtime. Right: data set T25I20D100K (D2), FP-growth vs. TreeProjection.]
Advantages of the Pattern Growth
Approach
Divide-and-conquer:
Decompose both the mining task and DB according to the
frequent patterns obtained so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-
tree, no pattern search and matching
A good open-source implementation and refinement of
FPGrowth
FPGrowth+ (Grahne and J. Zhu, FIMI'03)