
Lecture 6


Mining Frequent Patterns and
Association Rules

— Chapter 6 —

1
Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern

Evaluation Methods

 Summary

2
What Is Frequent Pattern
Analysis?
 Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that
occurs/appears frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami
[AIS93] in the context of frequent itemsets and
association rule mining

3
What Is Frequent Pattern
Analysis?
 Motivation: Finding inherent regularities in data

What products were often purchased together?— Beer
and diapers?!

What are the subsequent purchases after buying a PC?

What kinds of DNA are sensitive to this new drug?

Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design,
sale campaign analysis, Web log (click stream) analysis,
and DNA sequence analysis.

4
Why Is Freq. Pattern Mining
Important?
 Freq. pattern: An intrinsic and important property
of datasets. Frequent patterns are patterns that appear
frequently in a data set.
For example, a set of items, such as milk and
bread, that appears frequently together in a
transaction data set is a frequent itemset.

 Freq. pattern Mining: It searches for recurring


relationships in a given data set.

5
Association Rule Mining
 Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

6
Basic Concepts: Frequent
Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Itemset: a set of one or more items; I = {i1, i2, …, im} is the set of all items.
Example: {Milk, Bread, Diaper}
k-itemset X = {x1, …, xk}: an itemset that contains k items
Transaction t: a set of items, with t ⊆ I
Transaction database D: a set of transactions D = {t1, t2, …, tn}
Given a dataset D, an itemset X has a (frequency) count in D.
An association rule describes a relationship between two disjoint itemsets X and Y,
written X → Y: it presents the pattern that when X occurs, Y also occurs.

7
Transaction data: supermarket
data
 Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
 Concepts:
 An item: an item/article in a basket

 I: the set of all items sold in the store

 A transaction: items purchased in a basket;

it may have TID (transaction ID)


 A transactional dataset: A set of transactions

8
Support
 (absolute) support, or,
support count of X, σ(X):
the frequency or number of occurrences of the itemset X
E.g. σ({Milk, Bread, Diaper}) = 2

(relative) support, s: the fraction of transactions that
contain X (i.e., the probability that a transaction contains X)
E.g. s({Milk, Bread, Diaper}) = 2/5

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Frequent Itemset:
An itemset X is frequent if X's support
is no less than a minsup threshold
9
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

(Figure: customers who buy beer, customers who buy diapers, and the customers who buy both.)

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!): Beer → Diaper (60%, 100%)
10
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
11
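To make the two metrics concrete, here is a minimal Python sketch (the transaction list mirrors the table above; the function and variable names are illustrative, not from any particular library) that computes s and c for {Milk, Diaper} → {Beer}:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    # absolute support: number of transactions that contain the itemset
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")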
Association Rule Mining Task
 Association Rule
– An implication expression of the form X → Y; it semantically
means that the presence of X is a good indicator of the presence
of Y. X and Y are itemsets.
– Example:
{Milk, Diaper} → {Beer}
 Given a set of transactions T, the goal of
association rule mining is to find all rules
having
 support ≥ minsup threshold

 confidence ≥ minconf threshold

12
Association Rule Mining: Brute-
force approach
 List all possible association rules
 Compute the support and confidence for

each rule
 Prune rules that fail the minsup and

minconf thresholds
 Computationally prohibitive!

13
Brute Force approach to
Frequent Itemset Generation
 For an itemset with 3
elements, we have 8 subsets
 Each subset is a candidate frequent itemset which needs to be
matched against each transaction

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

1-itemsets: {Milk}: 4, {Diaper}: 4, {Beer}: 3
2-itemsets: {Milk, Diaper}: 3, {Diaper, Beer}: 3, {Beer, Milk}: 2
3-itemsets: {Milk, Diaper, Beer}: 2

Important Observation:
Counts of subsets can't be smaller than the count of an itemset!
14
Reducing Number of Candidates
 Apriori principle:
 If an itemset is frequent, then all of its

subsets must also be frequent

 Apriori principle holds due to the following


property of the support measure:

∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

 Support of an itemset never exceeds the
support of its subsets
 This is known as the anti-monotone property
of support
15
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
16
Mining Association Rules
 Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥
minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

 Frequent itemset generation is still


computationally expensive
17
Problem Decomposition

Two sub-problems:
 Find all itemsets that have transaction

support above minsup.


 These itemsets are called large itemsets.

 From all the large itemsets, generate the

set of association rules that have


confidence above minconf.

18
Second Sub-problem

Straightforward approach:
 For every large itemset l, find all non-empty

subsets of l.
 For every such subset a, output a rule of

the form a → (l – a) if the ratio of support(l) to
support(a) is at least minconf.

19
Closed Patterns and Max-
Patterns
 A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2)
+ … + C(100,100) = 2^100 – 1 ≈ 1.27×10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and
there exists no frequent super-pattern Y ⊃ X
 Closed pattern is a lossless compression of freq.
patterns
 Reducing the # of patterns and rules
20
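A small illustration of the two definitions, as a Python sketch over a toy set of frequent itemsets (the supports below are my own, corresponding to transactions {a,b,c}, {a,b}, {a,c} with min_sup = 2): an itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all.

freq = {
    frozenset("a"): 3, frozenset("b"): 2, frozenset("c"): 2,
    frozenset("ab"): 2, frozenset("ac"): 2,
}
# closed: no proper superset with the same support
closed = [X for X, s in freq.items()
          if not any(X < Y and freq[Y] == s for Y in freq)]
# maximal: no frequent proper superset at all
maximal = [X for X in freq if not any(X < Y for Y in freq)]
print([set(x) for x in closed])    # {'a'}, {'a','b'}, {'a','c'}
print([set(x) for x in maximal])   # {'a','b'}, {'a','c'}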
Closed Patterns and Max-
Patterns
 Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemset?
 <a1, …, a100>: 1
 < a1, …, a50>: 2
 What is the set of max-pattern?
 <a1, …, a100>: 1
 What is the set of all patterns?
 !!
21
Computational Complexity of Frequent
Itemset Mining
 How many itemsets are potentially to be generated in the worst
case?
The number of frequent itemsets to be generated is sensitive to
the minsup threshold
When minsup is low, there exist potentially an exponential
number of frequent itemsets
The worst case: M^N, where M = # of distinct items and N = max length
of transactions
 The worst-case complexity vs. the expected probability:
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance that this particular set of 10 products is
frequent 10^3 times in 10^9 transactions?
22
The Downward Closure Property and
Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be

frequent
 If {beer, diaper, nuts} is frequent, so is {beer,

diaper}
 i.e., every transaction having {beer, diaper, nuts}

also contains {beer, diaper}


 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)

 Freq. pattern growth (FPgrowth—Han, Pei & Yin

@SIGMOD’00)
 Vertical data format approach (Charm—Zaki &

Hsiao @SDM’02) 23
Scalable Frequent Itemset Mining
Methods

 Apriori: A Candidate Generation-and-Test

Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth

Approach

24
Apriori: A Candidate Generation & Test
Approach

 Apriori pruning principle: If there is any itemset


which is infrequent, its superset should not be
generated/tested!
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from
length k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can
be generated
25
Apriori Candidate Generation
 Takes in Lk-1 and returns Ck.

Two steps:
 Join large itemsets L
k-1 with Lk-1.
 Prune out all itemsets in joined result which
contain a (k-1)-subset not found in Lk-1.

26
Implementation of Apriori
 How to generate candidates?
 Step 1: self-joining Lk

Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
27
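The self-join and prune steps can be written down directly. A minimal sketch (itemsets are kept as sorted tuples; apriori_gen is an illustrative name, not a library function) that reproduces the L3 → C4 example above:

from itertools import combinations

def apriori_gen(L_prev, k):
    # join: two (k-1)-itemsets that agree on their first k-2 items
    # prune: drop candidates with an infrequent (k-1)-subset
    prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                cand = a + (b[k - 2],)
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))   # {('a','b','c','d')} — acde is pruned because ade is not in L3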
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B, C, E}
3rd scan → L3: {B, C, E}:2
28
The Apriori Algorithm (Pseudo-
Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
29
How to Count Supports of Candidates?

 Why is counting supports of candidates a problem?


 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
 Interior node contains a hash table
 Subset function: finds all the candidates
contained in a transaction

30
Counting Supports of Candidates Using Hash
Tree

(Figure: a hash tree for the candidate 3-itemsets {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5},
{4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 6 7}, {3 5 7},
{6 8 9}, {3 6 8}, built with the subset (hash) function 1,4,7 / 2,5,8 / 3,6,9 at each
level. The transaction 1 2 3 5 6 is split recursively — 1+2356, 12+356, 13+56, … — so
that only the leaves it can reach need to be checked.)
31
Support Counting Using a
Hash Tree
 Create a hash tree and hash all the
candidate k-itemsets to the leaf nodes of
the tree
 For each transaction, generate all k-

item subsets of the transaction


 E.g. for a transaction {1,2,3,4}, the 3-item
subsets are {1,2,3}, {1,2,4}, {1,3,4}, and
{2,3,4}

32
Support Counting Using a
HashTree
 For each k-item subset,
 hash it to a leaf node of the hash tree,

 and check it against the candidate k-

itemsets
 hashed to the same leaf node.

 If the k-item subset matches a candidate k-

itemset,
---- increment the support count of the
candidate k-itemset

33
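A hash tree takes some code to set up; the sketch below shows only the surrounding logic — enumerating each transaction's k-item subsets and matching them against the candidate set — with a plain dictionary lookup standing in for the hash-tree search (an assumed simplification for brevity, not how large-scale Apriori implementations do it):

from itertools import combinations

def count_candidates(db, candidates, k):
    # candidates: iterable of sorted k-tuples; returns their support counts
    counts = {c: 0 for c in candidates}
    for t in db:
        for subset in combinations(sorted(t), k):   # all k-item subsets of t
            if subset in counts:
                counts[subset] += 1
    return counts

# the 3-item subsets of transaction {1, 2, 3, 4}:
print(list(combinations(sorted({1, 2, 3, 4}), 3)))
# [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]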
Step 2: Generating rules from
frequent itemsets

 Frequent itemsets → association rules


 One more step is needed to generate
association rules
 For each frequent itemset X,
for each proper nonempty subset A of X:
Let B = X − A.
A → B is an association rule if
confidence(A → B) ≥ minconf, where
support(A → B) = support(A ∪ B) = support(X) and
confidence(A → B) = support(A ∪ B) / support(A)
34
Generating rules: an example
 Suppose {2,3,4} is frequent, with sup=50%
 Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2},
{3}, {4}, with sup=50%, 50%, 75%, 75%, 75%, 75%
respectively
 These generate these association rules:

2,3  4, confidence=100%

2,4  3, confidence=100%

3,4  2, confidence=67%

2  3,4, confidence=67%

3  2,4, confidence=67%

4  2,3, confidence=67%

All rules have support = 50%
35
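The rule-generation step is a straightforward loop over the proper non-empty subsets of each frequent itemset. A minimal sketch (gen_rules and the support dictionary are illustrative names; the supports are those of the {2,3,4} example above, and minconf = 60% is an assumed threshold):

from itertools import combinations

def gen_rules(itemset, support, min_conf):
    # emit rules A -> B with B = itemset - A and confidence support(itemset)/support(A)
    rules = []
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for A in map(frozenset, combinations(items, r)):
            conf = support[items] / support[A]
            if conf >= min_conf:
                rules.append((set(A), set(items - A), conf))
    return rules

support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
for A, B, conf in gen_rules({2, 3, 4}, support, min_conf=0.6):
    print(A, "->", B, f"confidence = {conf:.0%}")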
Scalable Frequent Itemset Mining
Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data

Format

 Mining Closed Frequent Patterns and Max-Patterns


36
Further Improvement of the Apriori Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
37
Partition: Scan Database Only
Twice
 Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB

Scan 1: partition database and find local frequent
patterns

Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95

DB1 + DB2 + … + DBk = DB

If sup1(i) < σ·|DB1|, sup2(i) < σ·|DB2|, …, supk(i) < σ·|DBk|,
then sup(i) < σ·|DB| — an itemset infrequent in every partition
is infrequent in the whole DB
38
DHP: Reduce the Number of
Candidates

 A k-itemset whose corresponding hashing bucket count is


below the threshold cannot be frequent

Candidates: a, b, c, d, e

Hash table (bucket of 2-itemsets → count):
{ab, ad, ae} → 35
{bd, be, de} → 88
…
{yz, qs, wt} → …

Frequent 1-itemsets: a, b, d, e
ab is not a candidate 2-itemset if the sum of the counts of {ab,
ad, ae} (its bucket) is below the support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD’95
39
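A minimal sketch of the DHP idea (the modulo hash below is an illustrative choice, not the function from the paper): during the first scan every 2-itemset of every transaction is hashed into a bucket, and a pair can survive into C2 only if both its items are frequent and its bucket count reaches the threshold — bucket counts can only over-estimate pair counts, so the pruning is safe.

from itertools import combinations

def dhp_candidate_pairs(db, num_buckets, min_sup):
    # first scan: count single items and hash every 2-itemset into a bucket
    item_count, buckets = {}, [0] * num_buckets
    for t in db:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    frequent = {i for i, n in item_count.items() if n >= min_sup}
    # keep only pairs of frequent items whose bucket count reaches the threshold
    return {pair
            for pair in combinations(sorted(frequent), 2)
            if buckets[hash(pair) % num_buckets] >= min_sup}

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
print(dhp_candidate_pairs(db, num_buckets=7, min_sup=2))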
Sampling for Frequent Patterns

 Select a sample of original database, mine


frequent patterns within sample using Apriori
 Scan database once to verify frequent itemsets
found in sample, only borders of closure of
frequent patterns are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan database again to find missed frequent
patterns
 H. Toivonen. Sampling large databases for
association rules. In VLDB’96
40
DIC: Reduce Number of Scans

(Figure: the itemset lattice {}; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD;
ABCD — with a transaction-stream timeline showing that Apriori starts counting 1-itemsets,
2-itemsets, … only at scan boundaries, while DIC starts counting 2-itemsets and
3-itemsets earlier, part-way through a scan.)

 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket data. SIGMOD'97
41
Methods to Improve Apriori’s
Efficiency

 Hash-based itemset counting: A k-itemset whose


corresponding hashing bucket count is below the threshold
cannot be frequent
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
 Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
42
Exercise

 A dataset has five


transactions. Let min_support = 60% and min_confidence = 80%.

TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E

 Find all frequent itemsets using the Apriori algorithm.
43
Association Rules with Apriori
L1: K:5, E:4, M:3, O:3, Y:3

C2 counts: KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2

L2: KE, KM, KO, KY, EO   =>   C3/L3: KEO

44
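The counts above are easy to verify mechanically. A short Python check (the transaction spellings are my reading of the exercise table; min support = 60% of 5 transactions = 3):

from itertools import combinations
from collections import Counter

db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_sup = 3

c1 = Counter(i for t in db for i in t)
L1 = {i for i, n in c1.items() if n >= min_sup}                     # K:5, E:4, M:3, O:3, Y:3
c2 = Counter(p for t in db for p in combinations(sorted(t & L1), 2))
L2 = {p for p, n in c2.items() if n >= min_sup}                     # KE, KM, KO, KY, EO
print(sorted((i, n) for i, n in c1.items() if n >= min_sup))
print(sorted(L2))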
Scalable Frequent Itemset Mining
Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data

Format

 Mining Closed Frequent Patterns and Max-Patterns


45
Introduction
 Apriori: uses a generate-and-test approach –
generates candidate itemsets and tests if they are
frequent
– Generation of candidate itemsets is expensive (in

both space and time)


– Support counting is expensive
• Subset checking (computationally expensive)
• Multiple Database scans (I/O)
 FP-Growth: allows frequent itemset discovery without
candidate itemset generation. Two step approach:
– Step 1: Build a compact data structure called the

FP-tree
• Built using 2 passes over the data-set.
– Step 2: Extracts frequent itemsets directly from the
FP-tree
46
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test

Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local
frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project DB on abc: DB|abc
 “d” is a local frequent item in DB|abc  abcd is a frequent
pattern
47
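The "grow long patterns from short ones" idea can be sketched without the FP-tree itself, by recursively projecting the database on each locally frequent item (the FP-tree is essentially a compressed representation of these projected databases). A minimal, self-contained sketch under that simplification — plain Python sets instead of a tree, with illustrative names:

from collections import Counter

def pattern_growth(db, min_sup, prefix=()):
    # db: list of item sets; returns {pattern tuple: support count}
    result = {}
    counts = Counter(i for t in db for i in t)
    freq = [i for i, n in counts.most_common() if n >= min_sup]   # local f-list
    for j, item in enumerate(freq):
        pattern = (item,) + prefix
        result[pattern] = counts[item]
        # project on `item`: keep only transactions containing it, restricted to
        # items that come before it in the local f-list, then grow recursively
        allowed = set(freq[:j])
        projected = [t & allowed for t in db if item in t]
        result.update(pattern_growth(projected, min_sup, pattern))
    return result

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
print(pattern_growth(db, min_sup=3))   # maps each frequent pattern (as a tuple) to its support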
Step 1: FP-Tree Construction
 FP-Tree is constructed using 2 passes over
the data-set:
Pass 1:
– Scan data and find support for each item.

– Discard infrequent items.

– Sort frequent items in decreasing order

based on their support.


Use this order when building the FP-Tree,
so common prefixes can be shared.

48
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and
maps it to a path
2. Fixed order is used, so paths can overlap when
transactions share items (when they have the
same prefix ).
– In this case, counters are incremented
3. Pointers are maintained between nodes
containing the same item, creating singly
linked lists (dotted lines)
– The more paths that overlap, the higher the

compression. FP-tree may fit in memory.


4. Frequent itemsets extracted from the FP-Tree.

49
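A compact sketch of the two construction passes (FPNode, build_fptree and the header-table layout are illustrative simplifications, not the exact structures of the original paper): pass 1 builds the f-list, pass 2 inserts each transaction in f-list order, sharing prefixes and maintaining the node-links.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> FPNode

def build_fptree(db, min_sup):
    # pass 1: count items, rank the frequent ones in support-descending order
    counts = Counter(i for t in db for i in t)
    rank = {item: r for r, (item, n) in enumerate(counts.most_common()) if n >= min_sup}
    root, header = FPNode(None, None), defaultdict(list)
    # pass 2: insert each transaction, ordered by the f-list, sharing prefixes
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)  # node-link list for this item
            child.count += 1
            node = child
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, min_sup=3)
# counts along each item's node-link list (exact shape depends on f-list tie-breaking)
print({item: [n.count for n in nodes] for item, nodes in header.items()})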
Step 1: FP-Tree Construction (Example)

50
FP-Tree size
 The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
– Best case scenario: all transactions contain the

same set of items.


• 1 path in the FP-tree
– Worst case scenario: every transaction has a
unique set of items (no items in common)
• Size of the FP-tree is at least as large as the original data.
• Storage requirements for the FP-tree are higher - need to
store the pointers between the nodes and the counters.

 The size of the FP-tree depends on how the items are


ordered
 Ordering by decreasing support is typically used but
it does not always lead to the smallest tree (it's a
heuristic).
51
Construct FP-tree from a Transaction
Database

TID  Items bought                 (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o, w}           {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3 (each entry heads a list of node-links)

(Figure: the FP-tree — root {} with main branch f:4 → c:3 → a:3 → m:2 → p:2,
side branches b:1 → m:1 under a:3 and b:1 under f:4, and a second branch
c:1 → b:1 → p:1.)
52
Partition Patterns and Databases

 Frequent patterns can be partitioned into


subsets according to f-list
 F-list = f-c-a-b-m-p

 Patterns containing p

 Patterns having m but no p

 …

 Patterns having c but no a nor b, m, p

 Pattern f

 Completeness and non-redundancy

53
Step 2: Frequent Itemset Generation

 FP-Growth extracts frequent itemsets from


the FP-tree.
 Bottom-up algorithm - from the leaves
towards the root
 Divide and conquer: first look for frequent
itemsets ending in e, then de, etc. . . then
d, then cd, etc. . .
 First, extract prefix path sub-trees ending
in an item(set). (hint: use the linked lists)

54
Prefix path sub-trees (Example)

55
Step 2: Frequent Itemset Generation

 Each prefix path sub-tree is


processed recursively to extract
the frequent itemsets. Solutions
are then merged.
 E.g. the prefix path sub-tree

for e will be used to extract


frequent itemsets ending in e,
then in de, ce, be and ae, then
in cde, bde, ade, etc.
 Divide and conquer approach

56
Example

Let minSup = 2 and extract all frequent


itemsets containing e.
 1. Obtain the prefix path sub-tree for e:

57
Conditional FP-Tree

 The FP-Tree that would be built if we only


consider transactions containing a
particular itemset (and then removing that
itemset from all transactions).
 Example: FP-tree conditional on e.

58
Find Patterns Having P From P-conditional
Database

 Starting at the frequent item header table in the FP-tree


 Traverse the FP-tree by following the link of each frequent
item p
 Accumulate all of transformed prefix paths of item p to
form p's conditional pattern base

Header table: f:4, c:4, a:3, b:3, m:3, p:3 (each entry heads a list of node-links)

(FP-tree: {} → f:4 → c:3 → a:3 → m:2 → p:2, with b:1 → m:1 under a:3,
b:1 under f:4, and a second branch c:1 → b:1 → p:1.)

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
59
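In code, a conditional pattern base can be collected either by walking the node-links upward through the tree, or — equivalently, and simpler to show here — straight from the f-list-ordered transactions: for each occurrence of an item, record the frequent items that precede it, with a count. A minimal sketch under that simplification (names are illustrative; f-list tie-breaking may order items slightly differently from the slide's f-c-a-b-m-p):

from collections import Counter, defaultdict

def conditional_pattern_bases(db, min_sup):
    counts = Counter(i for t in db for i in t)
    rank = {item: r for r, (item, n) in enumerate(counts.most_common()) if n >= min_sup}
    bases = defaultdict(Counter)
    for t in db:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        for pos, item in enumerate(ordered):
            prefix = tuple(ordered[:pos])   # transformed prefix path for this occurrence
            if prefix:
                bases[item][prefix] += 1
    return bases

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
for item, base in conditional_pattern_bases(db, 3).items():
    print(item, dict(base))    # each frequent item's conditional pattern base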
From Conditional Pattern-bases to Conditional FP-
trees

 For each pattern-base


 Accumulate the count for each item in the base

 Construct the FP-tree for the frequent items of

the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree:
{} → f:3 → c:3 → a:3
(b is dropped: its count in the base is only 1, below min_support)

All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
60
Recursion: Mining Each Conditional FP-
tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of "am": (fc:3)
am-conditional FP-tree: {} → f:3 → c:3

Cond. pattern base of "cm": (f:3)
cm-conditional FP-tree: {} → f:3

Cond. pattern base of "cam": (f:3)
cam-conditional FP-tree: {} → f:3
61
A Special Case: Single Prefix Path in FP-
tree

 Suppose a (conditional) FP-tree T has a shared


single prefix-path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one node
 Concatenation of the mining results of the two parts

(Figure: a tree whose single prefix path is {} → a1:n1 → a2:n2 → a3:n3, below which it
branches into b1:m1 and C1:k1, with C2:k2 and C3:k3 under b1:m1. The prefix path is
reduced to a single node r1, so the tree splits into {} → a1:n1 → a2:n2 → a3:n3 plus the
branching part r1 → b1:m1 (→ C2:k2, C3:k3) and r1 → C1:k1; each part is mined separately
and the results are concatenated.)
62
Association Rules
 Let’s have an example

 T100 1,2,5
 T200 2,4
 T300 2,3
 T400 1,2,4
 T500 1,3
 T600 2,3
 T700 1,3
 T800 1,2,3,5
 T900 1,2,3

63
FP Tree

64
Mining the FP tree

65
Exercise

 A dataset has five TID Items_bought


transactions. Let min_support = 60% and min_confidence = 80%.

TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E

 Find all frequent itemsets using an FP-tree.
66
Association Rules with FP Tree

Frequent items (f-list): K:5, E:4, M:3, O:3, Y:3

67
Association Rules with FP Tree

Conditional pattern bases and the patterns they yield:

Y: base KEMO:1, KEO:1, KM:1  →  K:3  →  KY
O: base KEM:1, KE:2          →  K:3, E:3, KE:3  →  KO, EO, KEO
M: base KE:2, K:1            →  K:3  →  KM
E: base K:4                  →  K:4  →  KE

68
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent
pattern mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never be larger than the original database (not
counting node-links and the count field)

69
The Frequent Pattern Growth Mining
Method
 Idea: Frequent pattern growth
 Recursively grow frequent patterns by pattern

and database partition


 Method
 For each frequent item, construct its conditional

pattern-base, and then its conditional FP-tree


 Repeat the process on each newly created

conditional FP-tree
 Until the resulting FP-tree is empty, or it

contains only one path—single path will


generate all the combinations of its sub-paths,
each of which is a frequent pattern

70
Scaling FP-growth by Database
Projection
 What about if FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected DB
 Parallel projection vs. partition projection techniques
 Parallel projection

Project the DB in parallel for each frequent item

Parallel projection is space costly

All the partitions can be processed in parallel
 Partition projection

Partition the DB based on the ordered frequent items

Passing the unprocessed parts to the subsequent
partitions
71
Partition-Based Projection

 Parallel projection needs a


lot of disk space
 Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …

am-proj DB: fc, fc, fc
cm-proj DB: f, f, …
72
Performance of FPGrowth in Large
Datasets

(Figure: run time (sec.) vs. support threshold (%). Left, data set T25I20D10K:
D1 FP-growth run time vs. D1 Apriori run time. Right, data set T25I20D100K:
D2 FP-growth vs. D2 TreeProjection.)

FP-Growth vs. Apriori                    FP-Growth vs. Tree-Projection
73
Advantages of the Pattern Growth
Approach

 Divide-and-conquer:

Decompose both the mining task and DB according to the
frequent patterns obtained so far

Lead to focused search of smaller databases
 Other factors

No candidate generation, no candidate test

Compressed database: FP-tree structure

No repeated scan of entire database

Basic ops: counting local freq items and building sub FP-
tree, no pattern search and matching
 A good open-source implementation and refinement of
FPGrowth

FPGrowth+ (Grahne and J. Zhu, FIMI'03)
74
