Concepts and Techniques: Data Mining
Chapter 5
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
What Is Frequent Pattern Analysis?
■ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
■ Motivation: Finding inherent regularities in data
■ What products were often purchased together?— Beer and diapers?!
■ What are the subsequent purchases after buying a PC?
■ What kinds of DNA are sensitive to this new drug?
■ Can we automatically classify web documents?
■ Applications
■ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
■ Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

■ Find all the rules X → Y with minimum support and confidence
  ■ support, s: probability that a transaction contains X ∪ Y
  ■ confidence, c: conditional probability that a transaction having X also contains Y
[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]
■ Let minsup = 50%, minconf = 50%
■ Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
■ Association rules: (many more!)
  ■ Beer → Diaper (60%, 100%)
  ■ Diaper → Beer (60%, 75%)
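■ A minimal Python sketch (illustrative, not from the book) that checks these numbers against the table above:

    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support(itemset):
        # fraction of transactions containing every item of `itemset`
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y):
        # conditional probability that a transaction having X also contains Y
        return support(x | y) / support(x)

    print(support({"Beer", "Diaper"}))       # 0.60 -> rule support 60%
    print(confidence({"Beer"}, {"Diaper"}))  # 1.00 -> Beer => Diaper (60%, 100%)
    print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper => Beer (60%, 75%)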
Closed Patterns and Max-Patterns
■ A long pattern contains a combinatorial number of sub-patterns, e.g.,
{a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100)
= 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
■ Solution: Mine closed patterns and max-patterns instead
■ An itemset X is closed if X is frequent and there exists no
super-pattern Y ⊃ X with the same support as X
(proposed by Pasquier, et al. @ICDT’99)
■ An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X (proposed by
Bayardo @SIGMOD’98)
■ Closed pattern is a lossless compression of freq. patterns
■ Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
■ Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
■ Min_sup = 1.
■ What is the set of closed itemsets?
  ■ <a1, …, a100>: 1
  ■ <a1, …, a50>: 2
■ What is the set of max-patterns?
  ■ <a1, …, a100>: 1
■ What is the set of all patterns?
  ■ All 2^100 − 1 nonempty subsets of {a1, …, a100}: far too many to enumerate!
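■ A brute-force Python sketch of these definitions (illustrative; it uses a tiny stand-in DB, since enumerating all subsets of {a1, …, a100} is intractable):

    from itertools import combinations

    def frequent_itemsets(db, min_sup):
        # enumerate every itemset and keep those with support >= min_sup
        items = sorted(set().union(*db))
        freq = {}
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                s = frozenset(combo)
                count = sum(s <= t for t in db)
                if count >= min_sup:
                    freq[s] = count
        return freq

    db = [frozenset("abc"), frozenset("ab")]   # shrunken analogue of {<a1..a100>, <a1..a50>}
    freq = frequent_itemsets(db, 1)
    # closed: no proper superset with the same support; max: no frequent proper superset
    closed = {s for s, c in freq.items() if not any(s < t and freq[t] == c for t in freq)}
    maximal = {s for s in freq if not any(s < t for t in freq)}
    print(sorted(map(sorted, closed)))   # [['a', 'b'], ['a', 'b', 'c']]
    print(sorted(map(sorted, maximal)))  # [['a', 'b', 'c']]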
Computational Complexity of Frequent Itemset Mining
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
■ Improving the Efficiency of Apriori
■ FPGrowth: A Frequent Pattern-Growth Approach
■ ECLAT: Frequent Pattern Mining with Vertical Data Format
■ Mining Closed Frequent Patterns and Max-Patterns
The Downward Closure Property and Scalable
Mining Methods
■ The downward closure property of frequent patterns
  ■ Any subset of a frequent itemset must be frequent
  ■ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  ■ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
■ Scalable mining methods: three major approaches
  ■ Apriori (Agrawal & Srikant @VLDB’94)
  ■ Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD’00)
  ■ Vertical data format approach (Charm: Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
Sup_min = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (sup ≥ 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (self-join of L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)
C_k: candidate itemsets of size k; L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1}
        that are contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
end
return ∪_k L_k;
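■ A compact Python rendering of the pseudo-code (an illustrative sketch, not a reference implementation); on the TDB example above it reproduces L1, L2, and L3:

    from itertools import combinations

    def apriori(db, min_sup):
        db = [frozenset(t) for t in db]
        # L1: frequent 1-itemsets
        counts = {}
        for t in db:
            for i in t:
                counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        freq = dict(L)
        k = 1
        while L:
            prev = list(L)
            # join step: unite pairs of frequent k-itemsets into (k+1)-candidates;
            # prune step: keep a candidate only if all its k-subsets are frequent
            cands = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    u = prev[i] | prev[j]
                    if len(u) == k + 1 and all(frozenset(s) in L for s in combinations(u, k)):
                        cands.add(u)
            counts = {c: sum(c <= t for t in db) for c in cands}
            L = {s: n for s, n in counts.items() if n >= min_sup}
            freq.update(L)
            k += 1
        return freq

    print(apriori([{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}], 2))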
Implementation of Apriori
Counting Supports of Candidates Using Hash Tree
[Figure: hash tree for candidate 3-itemsets. The subset function hashes items into three branches (1,4,7 / 2,5,8 / 3,6,9); a transaction such as 1 2 3 5 6 is recursively decomposed (1+2356, 12+356, 13+56, …) so that only the leaves that could contain its 3-subsets, e.g. {1 2 4, 1 2 5, 1 3 6, 2 3 4, 3 5 6, 3 6 7, 6 8 9, …}, are visited]
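■ The same counting idea in Python, with a dict standing in for the hash tree (the candidate 3-itemsets are leaf entries taken from the figure):

    from itertools import combinations

    C3 = {frozenset(s): 0 for s in
          [(1, 2, 4), (1, 2, 5), (1, 3, 6), (2, 3, 4), (3, 5, 6), (3, 6, 7), (6, 8, 9)]}

    def count_candidates(transaction, candidates, k=3):
        # probe every k-subset of the transaction against the candidate table;
        # the hash lookup plays the role of descending the hash tree
        for s in combinations(sorted(transaction), k):
            if frozenset(s) in candidates:
                candidates[frozenset(s)] += 1

    count_candidates({1, 2, 3, 5, 6}, C3)
    print({tuple(sorted(s)): c for s, c in C3.items() if c})
    # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}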
Candidate Generation: An SQL Implementation
■ SQL Implementation of candidate generation
■ Suppose the items in L_{k-1} are listed in an order
■ Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1 and … and p.item_{k-2} = q.item_{k-2}
      and p.item_{k-1} < q.item_{k-1}
■ Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
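■ Worked example: with L3 = {abc, abd, acd, ace, bcd}, the self-join yields abcd (from abc and abd) and acde (from acd and ace); pruning then deletes acde because its 3-subset cde is not in L3, leaving C4 = {abcd}.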
■ Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD’98]
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent
patterns
■ Scan 2: consolidate global frequent patterns
■ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for
mining association rules in large databases. VLDB’95

DHP: Reduce the Number of Candidates
■ A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
  [Figure: hash table of candidate 2-itemset buckets, e.g. {ab, ad, ae}, …, {yz, qs, wt}]
■ Frequent 1-itemset: a, b, d, e
■ ab is not a candidate 2-itemset if the sum of the counts of bucket
{ab, ad, ae} is below the support threshold
■ J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD’95
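■ A toy Python sketch of the bucket filter (the hash function and bucket count here are ours, chosen purely for illustration):

    from itertools import combinations
    from collections import Counter

    db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
    min_sup, n_buckets = 2, 7

    def h(pair):
        # toy deterministic hash, a stand-in for DHP's hash function
        return (31 * ord(pair[0]) + ord(pair[1])) % n_buckets

    # scan 1: while counting 1-itemsets, also hash every 2-itemset into a bucket
    bucket = Counter()
    for t in db:
        for pair in combinations(sorted(t), 2):
            bucket[h(pair)] += 1

    # a pair can remain a candidate only if its entire bucket reaches min_sup
    survivors = {p for t in db for p in combinations(sorted(t), 2)
                 if bucket[h(p)] >= min_sup}
    print(sorted(survivors))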
Sampling for Frequent Patterns
DIC: Reduce Number of Scans
■ Once both A and D are determined frequent, the counting of AD begins
■ Once all length-2 subsets of BCD are determined frequent, the counting
of BCD begins
[Figure: itemset lattice from {} through A, B, C, D up to ABCD, with a transaction stream contrasting Apriori, which counts 1-itemsets, then 2-itemsets, … in separate passes, against DIC, which starts counting 2- and 3-itemsets partway through earlier scans]
■ S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting
and implication rules for market basket data. SIGMOD’97
Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
■ Bottlenecks of the Apriori approach
■ Breadth-first (i.e., level-wise) search
■ Candidate generation and test
■ Often generates a huge number of candidates
■ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
■ Depth-first search
■ Avoid explicit candidate generation
■ Major philosophy: Grow long patterns from short ones using local
frequent items only
■ “abc” is a frequent pattern
■ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
■ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
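■ A minimal Python sketch of this philosophy, using plain projected databases instead of an FP-tree (same divide-and-conquer, without the tree's compression); the DB is the deck's f-c-a-b-m-p example with min_sup = 3:

    def pattern_growth(db, min_sup, prefix=(), out=None):
        if out is None:
            out = {}
        counts = {}
        for t in db:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        for item in sorted(i for i, c in counts.items() if c >= min_sup):
            pattern = prefix + (item,)
            out[pattern] = counts[item]
            # project on `item`: transactions containing it, keeping only items
            # that sort after it (so each pattern is grown exactly once)
            proj = [{j for j in t if j > item} for t in db if item in t]
            pattern_growth(proj, min_sup, pattern, out)
        return out

    db = [{"f", "c", "a", "m", "p"}, {"f", "c", "a", "b", "m"},
          {"f", "b"}, {"c", "b", "p"}, {"f", "c", "a", "m", "p"}]
    print(pattern_growth(db, 3))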
Construct FP-tree from a Transaction Database
■ Frequent patterns are then mined by partitioning them according to the f-list:
  ■ Patterns containing p
  ■ …
  ■ Pattern f
Find Patterns Having P From P-conditional Database
Header Table (item, frequency, head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
[Figure: FP-tree rooted at {} with paths f:4 → c:3 → a:3 → (m:2 → p:2 and b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1]

Conditional pattern bases:
item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
■ m-conditional pattern base: fca:2, fcab:1
■ m-conditional FP-tree: {} → f:3 → c:3 → a:3
■ Mine each conditional FP-tree recursively:
  ■ Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
  ■ Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
  ■ Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3
Benefits of the FP-tree Structure
■ Completeness
■ Preserve complete information for frequent pattern
mining
■ Never break a long pattern of any transaction
■ Compactness
■ Reduce irrelevant info—infrequent items are gone
■ Items in frequency descending order: the more
frequently occurring, the more likely to be shared
■ Never be larger than the original database (not counting
node-links and count fields)
The Frequent Pattern Growth Mining Method
■ Idea: frequent pattern growth
  ■ Recursively grow frequent patterns by pattern and database partition
■ Method
  ■ For each frequent item, construct its conditional pattern base, and
  then its conditional FP-tree
  ■ Repeat the process on each newly created conditional FP-tree
  ■ Until the resulting FP-tree is empty, or it contains only a single
  path, which generates all the combinations of its sub-paths, each of
  which is a frequent pattern
Scaling FP-growth by Database Projection
Partition-Based Projection
[Figure: partition-based projection of the transaction DB into item-projected databases, e.g. am-proj DB: fc, fc, fc; cm-proj DB: f, f, f; …]
Performance of FPGrowth in Large Datasets
Advantages of the Pattern Growth Approach
■ Divide-and-conquer:
■ Decompose both the mining task and DB according to the
frequent patterns obtained so far
■ Lead to focused search of smaller databases
■ Other factors
■ No candidate generation, no candidate test
■ Compressed database: FP-tree structure
■ No repeated scan of entire database
■ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
■ A good open-source implementation and refinement of FPGrowth
■ FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
ECLAT: Mining by Exploring Vertical Data Format
■ Vertical format: t(AB) = {T11, T25, …}
■ tid-list: list of trans.-ids containing an itemset
■ Deriving frequent patterns based on vertical intersections
■ t(X) = t(Y): X and Y always happen together
■ t(X) ⊂ t(Y): transaction having X always has Y
■ Using diffset to accelerate mining
■ Only keep track of differences of tids
■ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
■ Diffset (XY, X) = {T2}
■ Eclat (Zaki et al. @KDD’97)
■ Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
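■ A small Python sketch of tid-list intersection (illustrative only; the diffset optimization is omitted):

    def eclat(prefix, items, min_sup, out):
        # items: list of (item, tidset) pairs, already support-pruned
        while items:
            item, tids = items.pop()
            out[prefix + (item,)] = len(tids)      # support = size of tid-list
            suffix = []
            for other, otids in items:
                inter = tids & otids               # t(XY) = t(X) intersect t(Y)
                if len(inter) >= min_sup:
                    suffix.append((other, inter))
            eclat(prefix + (item,), suffix, min_sup, out)
        return out

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    vertical = {}
    for tid, t in enumerate(db, start=1):          # build the vertical format
        for i in t:
            vertical.setdefault(i, set()).add(tid)
    items = sorted((i, s) for i, s in vertical.items() if len(s) >= 2)
    print(eclat((), items, 2, {}))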
Mining Frequent Closed Patterns: CLOSET
■ Flist: list of all frequent items in support-ascending order
  ■ Flist: d-a-f-e-c (Min_sup = 2)

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f

■ Divide search space
  ■ Patterns having d
  ■ Patterns having d but no a, etc.
■ Find frequent closed patterns recursively
  ■ Every transaction having d also has cfa → cfad is a frequent closed pattern
■ J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets”, DMKD’00
CLOSET+: Mining Closed Itemsets by Pattern-Growth
Visualization of Association Rules: Rule Graph
Visualization of Association Rules
(SGI/MineSet 3.0)
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Interestingness Measure: Correlations (Lift)
■ play basketball ⇒ eat cereal [40%, 66.7%] is misleading
■ The overall % of students eating cereal is 75% > 66.7%.
■ play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
■ Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) P(B)),
i.e., the observed joint support over the value expected if A and B were independent
  ■ Here lift(B, C) = 0.4 / (0.6 × 0.75) = 0.89 and lift(B, ¬C) = 0.2 / (0.6 × 0.25) = 1.33
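■ Checking the arithmetic in Python (probabilities derived from the percentages above):

    # support 40% and confidence 66.7% give P(B) = 0.4 / 0.667 = 0.6; P(C) = 0.75
    p_b, p_c, p_bc = 0.6, 0.75, 0.4
    print(p_bc / (p_b * p_c))                # lift(B, C)  = 0.89 < 1: negatively correlated
    print((p_b - p_bc) / (p_b * (1 - p_c)))  # lift(B, ~C) = 1.33 > 1: positively correlated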
Are lift and χ2 Good Measures of Correlation?
Null-Invariant Measures
Comparison of Interestingness Measures
■ Null-(transaction) invariance is crucial for correlation analysis
■ Lift and χ2 are not null-invariant
■ 5 null-invariant measures
           | m      | ~m      | Sum(row)
Coffee     | m, c   | ~m, c   | c
No Coffee  | m, ~c  | ~m, ~c  | ~c
Sum(col.)  | m      | ~m      | Σ

■ Null-transactions w.r.t. m and c: transactions containing neither m nor c
■ The Kulczynski measure (1927), Kulc(m, c) = ½ (P(m|c) + P(c|m)), is null-invariant
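■ A one-function Python sketch of the Kulczynski measure; the counts are hypothetical, chosen to show that null-transactions never enter the formula:

    def kulc(n_mc, n_m, n_c):
        # Kulc(m, c) = 0.5 * (P(m|c) + P(c|m)); no term depends on the
        # total number of transactions, hence null-invariance
        return 0.5 * (n_mc / n_m + n_mc / n_c)

    # e.g. 10,000 transactions with both m and c; 10,100 with m; 10,100 with c
    print(kulc(10_000, 10_100, 10_100))  # ~0.990, regardless of database size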
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Summary
■ Basic concepts: association rules,
support-confidence framework, closed and
max-patterns
■ Scalable frequent pattern mining methods
■ Apriori (Candidate generation & test)
■ Projection-based (FPgrowth, CLOSET+, ...)
■ Vertical format approach (ECLAT, CHARM, ...)
■ Which patterns are interesting?
  ■ Pattern evaluation methods