Chap4 PatternMiningBasic

Pattern Mining: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
1
What Are Patterns?
What are patterns?
 Patterns: A set of items, subsequences, or substructures that occur
frequently together (or strongly correlated) in a data set
 Patterns represent intrinsic and important properties of datasets
Frequent item set Frequent sequences Frequent structures
2
What Is Pattern Discovery?
 Pattern discovery: Uncovering patterns from massive data sets
 It can answer questions such as:
 What products were often purchased together?
 What are the subsequent purchases after buying an iPad?
3
Pattern Discovery: Why Is It Important?
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Mining sequential, structural (e.g., sub-graph) patterns
 Classification: Discriminative pattern-based analysis
 Cluster analysis: Pattern-based subspace clustering
 Broad applications
 Market basket analysis, cross-marketing, catalog design, sale
campaign analysis, Web log analysis, biological sequence
analysis
 Many types of data: spatiotemporal, multimedia, time-series,
and stream data
4
Basic Concepts: Transactional Database
Transactional Database (TDB)
 Each transaction is associated with an identifier, called a TID.
 May also have counts associated with each item sold
Tid Items bought

1 Beer, Nuts, Diaper
2 Beer, Coffee, Diaper
3 Beer, Diaper, Eggs
4 Nuts, Eggs, Milk
5 Nuts, Coffee, Diaper, Eggs, Milk
5
Basic Concepts: k-Itemsets and Their Supports
Tid Items bought
 Itemset: A set of one or more items
 k-itemset: An itemset containing k items: 3 Beer, Diaper, Eggs
4 Nuts, Eggs, Milk
X = {x1, …, xk}
 Ex. {Beer, Nuts, Diaper} is a 3-itemset
 Relative support
Absolute support (count)
 sup{X} = occurrences of an itemset X
 s{X} = The fraction of transactions
that contains X (i.e., the probability
 Ex. sup{Beer} = 3
that a transaction contains X)
 Ex. sup{Diaper} = 4
 Ex. s{Beer} = 3/5 = 60%
 Ex. sup{Beer, Diaper} = 3
 Ex. s{Diaper} = 4/5 = 80%
 Ex. sup{Beer, Eggs} = 1
 Ex. s{Beer, Eggs} = 1/5 = 20%
6
Basic Concepts: Frequent Itemsets (Patterns)
 An itemset (or a pattern) X is frequent Tid Items bought
if the support of X is no less than a 1 Beer, Nuts, Diaper
minsup threshold σ 2 Beer, Coffee, Diaper
 Let σ = 50% (σ: minsup threshold) 3 Beer, Diaper, Eggs
for the given 5-transaction dataset 4 Nuts, Eggs, Milk

 All the frequent 1-itemsets:
 Beer: 3/5 (60%); Nuts: 3/5 (60%);  Why do these itemsets (shown on the
Diaper: 4/5 (80%); Eggs: 3/5 (60%) left) form the complete set of frequent
 All the frequent 2-itemsets: k-itemsets (patterns) for any k?
 {Beer, Diaper}: 3/5 (60%)  Observation: We may need an
 All the frequent 3-itemsets? efficient method to mine a complete
 None set of frequent patterns
7
From Frequent Itemsets to Association Rules
Compared with itemsets, association rules can be more telling
 Ex. Diaper  Beer
 Buying diapers may likely lead to buying beers
Containing both Containing diaper
{Beer} 
Beer Diaper
{Diaper}
Containing beer
{Beer}  {Diaper} = {Beer, Diaper}
Note: X  Y: the union of two itemsets

 The set contains both X and Y
8
Association Rules
 How do we compute the strength of an association
Tid Items bought
rule X  Y (Both X and Y are itemsets)?
 We first compute the following two metrics, s and c. 2 Beer, Coffee, Diaper
 Support of X  Y 3 Beer, Diaper, Eggs
4 Nuts, Eggs, Milk
 Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
 Confidence of X  Y
 The conditional probability that a transaction
containing X also contains Y:
c = sup(X, Y) / sup(X)
 Ex. c = sup{Diaper, Beer}/sup{Diaper} = ¾ = 0.75
 In pattern analysis, we are often interested in those rules that dominate
the database, and these two metrics ensure the popularity and correlation of X
and Y.
9
Mining Frequent Itemsets and Association Rules
 Association rule mining Tid Items bought
 Given two thresholds: minsup, minconf
 Find all of the rules, X  Y (s, c)
3 Beer, Diaper, Eggs
such that s ≥ minsup and c ≥ minconf 4 Nuts, Eggs, Milk
 Let minsup = 50%
 Freq. 1-itemsets: Beer: 3, Nuts: 3,
Observations:
Diaper: 4, Eggs: 3
 Mining association rules and
 Freq. 2-itemsets: {Beer, Diaper}: 3
mining frequent patterns are
very close problems
 Let minconf = 50%  Scalable methods are needed
 Beer  Diaper (60%, 100%)
for mining large datasets
 Diaper  Beer (60%, 75%)
(Q: Are these all the rules satisfying the two conditions?)
10
Challenge: There Are Too Many Frequent Patterns!
 A long pattern contains a combinatorial number of sub-patterns
 How many frequent itemsets does the following TDB1 contain (minsup = 1)?
 TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
 Let’s have a try
1-itemsets: {a1}: 2, {a2}: 2, …, {a50}: 2, {a51}: 1, …, {a100}: 1,
2-itemsets: {a1, a2}: 2, …, {a1, a50}: 2, {a1, a51}: 1 …, …, {a99, a100}: 1,
…, …, …, …
99-itemsets: {a1, a2, …, a99}: 1, …, {a2, a3, …, a100}: 1
100-itemset: {a1, a2, …, a100}: 1 A too huge set for any
one to compute or store!
 The total number of frequent itemsets:
11
Expressing Patterns in Compressed Form: Closed Patterns
 How to handle such a challenge?
 Solution 1: Closed patterns: A pattern (itemset) X is closed if X is
frequent, and there exists no super-pattern Y ‫ כ‬X, with the same
support as X
 Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
 Suppose minsup = 1. How many closed patterns does TDB1 contain?
 Two: P1: “{a1, …, a50}: 2”; P2: “{a1, …, a100}: 1”
 Closed pattern is a lossless compression of frequent patterns
 Reduces the # of patterns but does not lose the support
information!
 You will still be able to say: “{a2, …, a40}: 2”, “{a5, a51}: 1”
12
Expressing Patterns in Compressed Form: Max-Patterns
 Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent
and there exists no frequent super-pattern Y ‫ כ‬X
 Difference from close-patterns?
 Do not care the real support of the sub-patterns of a max-pattern
 Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
 Suppose minsup = 1. How many max-patterns does TDB1 contain?
 One: P: “{a1, …, a100}: 1”
Max-pattern is a lossy compression!
 We only know {a1, …, a40} is frequent
 But we do not know the real support of {a1, …, a40}, …, any more!
 Thus in many applications, mining close-patterns is more desirable
13
 Basic Concepts
 Summary
14
Efficient Pattern Mining Methods
 The Downward Closure Property of Frequent Patterns
 The Apriori Algorithm

 Extensions or Improvements of Apriori
 Mining Frequent Patterns by Exploring Vertical Data Format
 FPGrowth: A Frequent Pattern-Growth Approach
 Mining Closed Patterns
15
The Downward Closure Property of Frequent Patterns
 Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
 We get a frequent itemset: {a1, …, a50}
 Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
 There are some hidden relationships among frequent patterns!
 The downward closure (also called “Apriori”) property of frequent patterns
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
 Apriori: Any subset of a frequent itemset must be frequent
 Efficient mining methodology
 If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!
16
Apriori Pruning and Scalable Mining Methods
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Scalable mining Methods: Three major approaches
 Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
 Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
 Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin
@SIGMOD’00)
17
Apriori: A Candidate Generation & Test Approach
 Outline of Apriori (level-wise, candidate generation and test)
 Scan DB once to get frequent 1-itemset
 Repeat
 Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
 Test the candidates against DB to find frequent (k+1)-itemsets
 Set k := k +1
 Until no frequent or candidate set can be generated
 Return all the frequent itemsets derived
18
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Fk : Frequent itemset of size k
K := 1;
Fk := {frequent items}; // frequent 1-itemset
While (Fk != ) do { // when Fk is non-empty
Ck+1 := candidates generated from Fk; // candidate generation
Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
k := k + 1
}
return k Fk // return Fk generated at each level
19
The Apriori Algorithm—An Example
minsup = 2 Itemset sup
Database TDB Itemset sup
{A} 2 F1
Tid Items C1 {B} 3
{A} 2
10 A, C, D {B} 3
{C} 3
20 B, C, E 1 scan
st
{C} 3
{D} 1
30 A, B, C, E {E} 3
{E} 3
40 B, E
C2 Itemset sup C2 Itemset
F2 Itemset sup {A, B} 1 {A, B}
{A, C} 2 {A, C} 2
{A, E} 1
2nd scan {A, C}
{B, C} 2 {A, E}
{B, E} 3 {B, C} 2
{B, C}
{C, E} 2 {B, E} 3
{B, E}
{C, E} 2
{C, E}
C3 Itemset
3rd scan F3 Itemset sup
{B, C, E} {B, C, E} 2
20
Apriori: Implementation Tricks
 How to generate candidates?
self-join self-join
 Step 1: self-joining Fk
abc abd acd ace bcd
 Step 2: pruning
 Example of candidate-generation abcd acde
 F3 = {abc, abd, acd, ace, bcd}

pruned
 Self-joining: F3*F3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in F3
 C4 = {abcd}
21
Candidate Generation (Pseudo-Code)
 Suppose the items in Fk-1 are listed in an order
self-join self-join
 // Step 1: Joining
zabc zabd acd ace bcd
for each p in Fk-1
zabcd acde
for each q in Fk-1
if p.item1= q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 {
c = join(p, q)
 // Step 2: pruning
if has_infrequent_subset(c, Fk-1)
continue // prune
else add c to Ck
22
Apriori: Improvements and Alternatives
 Reduce passes of transaction database scans
To be discussed in
 Partitioning (e.g., Savasere, et al., 1995) subsequent slides
 Dynamic itemset counting (Brin, et al., 1997)
 Shrink the number of candidates
To be discussed in
 Hashing (e.g., DHP: Park, et al., 1995) subsequent slides
 Pruning by support lower bounding (e.g., Bayardo 1998)
 Sampling (e.g., Toivonen, 1996)
 Exploring special data structures
 Tree projection (Agarwal, et al., 2001)
 H-miner (Pei, et al., 2001)
 Hypecube decomposition (e.g., LCM: Uno, et al., 2004)
23
Partitioning: Scan Database Only Twice
 Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least
one of the partitions of TDB
Here is the p
roof!
TDB1 + TDB2 + ... + TDBk = TDB

sup1(X) < σ|TDB1| sup2(X) < σ|TDB2| ... supk(X) < σ|TDBk| sup(X) < σ|TDB|
 Method: Scan DB twice (A. Savasere, E. Omiecinski and S. Navathe, VLDB’95)
 Scan 1: Partition database so that each partition can fit in main memory (why?)
 Mine local frequent patterns in this partition
 Scan 2: Consolidate global frequent patterns
 Find global frequent itemset candidates (those frequent in at least one partition)
 Find the true frequency of those candidates, by scanning TDBi one more time
24
Direct Hashing and Pruning (DHP)
 Hashing: v = hash(itemset) Itemsets Count
 1st scan: When counting the 1-itemset, hash 2- {ab, ad, ce} 35
itemset to calculate the bucket count {bd, be, de} 298
V might be same for
 Example: At the 1 scan of TDB, count 1-itemset, different itemset
st …… …
and hash 2-itemsets in the transaction to its {yz, qs, wt} 58
bucket Hash Table
 {ab, ad, ce}
 {bd, be, de} Check the minsup
 …
A key observation: A k-itemset cannot be
 At the end of the first scan,
frequent if its corresponding hashing
 if minsup = 80, remove ab, ad, ce, since bucket count is below the minsup threshold
count{ab, ad, ce} < 80
DHP (Direct Hashing and Pruning): (J. Park, M. Chen, and P. Yu, SIGMOD’95)
25
Exploring Vertical Data Format: ECLAT
 ECLAT (Equivalence Class Transformation): A depth-first search A transaction DB in Horizontal
Data Format
algorithm using set intersection [Zaki et al. @KDD’97] Tid Itemset
 Vertical format 10 a, c, d, e
20 a, b, e
 Properties of Tid-Lists
30 b, c, e
 t(X) = t(Y): X and Y always happen together (e.g., t(ac} = t(d})
The transaction DB in Vertical
 t(X)  t(Y): transaction having X always has Y (e.g., t(ac)  t(ce)) Data Format
Item TidList
 Frequent patterns: vertical intersections t(e) = {T10, T20, T30};
a 10, 20
 Using diffset to accelerate mining t(a) = {T10, T20};
b 20, 30
t(ae) = {T10, T20} c 10, 30
 Only keep track of differences of tids
d 10
 t(e) = {T10, T20, T30}, t(ce) = {T10, T30} → Diffset (ce, e) = {T20} e 10, 20, 30
26
Why Mining Frequent Patterns by Pattern Growth?
 Apriori: A breadth-first search mining algorithm
 First find the complete set of frequent k-itemsets
 Then derive frequent (k+1)-itemset candidates
 Scan DB again to find true frequent (k+1)-itemsets
 Motivation for a different mining methodology
 Can we develop a depth-first search mining algorithm?
 For a frequent itemset ρ, can subsequent search be confined to only those
transactions that containing ρ?
 Such thinking leads to a frequent pattern growth approach: FPGrowth
FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns without Candidate Generation,” SIGMOD 2000)
27
Example: From Transactional DB to Ordered Frequent Itemlist
Example: A Sample Transactional Database TID Items in the Transaction
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o, w}
400 {b, c, k, s, p}
Let min_support = 3 500 {a, f, c, e, l, p, m, n}
 Scan DB once, find single item frequent pattern: f:4, a:3, c:4, b:3, m:3, p:3
 Sort frequent items in frequency descending order, f-list F-list = f-c-a-b-m-p
 Scan DB again, use the ordered frequent itemlist for each transaction to construct an
FP-tree TID Items in the Transaction Ordered, frequent itemlist
100 {f, a, c, d, g, i, m, p} f, c, a, m, p
200 {a, b, c, f, l, m, o} f, c, a, b, m
300 {b, f, h, j, o, w} f, b
400 {b, c, k, s, p} c, b, p
500 {a, f, c, e, l, p, m, n} f, c, a, m, p
28
Example: Construct FP-tree from Transaction DB
TID Ordered, After inserting
After inserting
frequent itemlist the 1st frequent
the 2nd frequent After inserting all the
Itemlist: {} {} {}
100 itemlist frequent itemlists
f, c, a, m, p “f, c, a, m, p”
“f, c, a, b, m”
200 f, c, a, b, m
300 f, b
f:1 f:2 f:4 c:1
Item Frqncy hdr Itm hdr Itm hdr
400 c, b, p
f 4 c:1 f c:2 f c:3 b:1 b:1
500 f, c, a, m, p
c 4 c c
a 3 a:1 a a:2 a:3 p:1
FP-Tree Construction: a
b 3 b b
For each transaction, m:1 m:1 b:1 m:2 b:1
m 3 m m
insert the ordered
frequent itemlist into an p 3 p p
p:1 p:1 m:1 p:2 m:1
FP-tree, with shared
sub-branches merged,
counts accumulated Header Table
29
Mining FP-Tree: Divide and Conquer
Based on Patterns and Data
 Pattern mining can be partitioned according to current patterns
 Patterns containing p: p’s conditional database: fcam:2, cb:1
 p’s conditional database (i.e., the database under the condition that p exists):
 transformed prefix paths of item p
 Patterns having m but no p: m’s conditional database: fca:2, fcab:1
 …… ……
{}
min_support = 3 Conditional database of each pattern
Item Frequency Header f:4 c:1 Item Conditional database
f 4 c f:3
c 4
c:3 b:1 b:1 a fc:3
a 3 b fca:1, f:1, c:1
a:3 p:1
b 3 m fca:2, fcab:1
m 3 m:2 b:1 p fcam:2, cb:1
p 3
p:2 m:1
30
Mine Each Conditional Database Recursively
min_support = 3 For each conditional database
Conditional Data Bases  Mine single-item patterns
item cond. data base
 Construct its FP-tree & mine it
c f:3
a fc:3
b fca:1, f:1, c:1 m’s conditional DB: fca:2, fcab:1 → fca: 3
m fca:2, fcab:1
p fcam:2, cb:1
Actually, for single branch FP-tree, all the
frequent patterns can be generated in one shot
m: 3
fm: 3, cm: 3, am: 3
fcm: 3, fam:3, cam: 3
fcam: 3
31
A Special Case: Single Prefix Path in FP-tree
 Suppose a (conditional) FP-tree T has a shared single prefix-path P
 Mining can be decomposed into two parts
{}  Reduction of the single prefix path into one node
a1:n1  Concatenation of the mining results of the two parts
a2:n2 r1
{}
a3:n3
a1:n1
 r1 = + b1:m1 c1:k1
c1:k1 a2:n2
b1:m1
a3:n3 c2:k2 c3:k3
c2:k2 c3:k3
32
FPGrowth: Mining Frequent Patterns by Pattern Growth
 Essence of frequent pattern growth (FPGrowth) methodology
 Find frequent single items and partition the database based on each such
single item pattern
 Recursively grow frequent patterns by doing the above for each
partitioned database (also called the pattern’s conditional database)
 To facilitate efficient processing, an efficient data structure, FP-tree, can
be constructed
 Mining becomes
 Recursively construct and mine (conditional) FP-trees
 Until the resulting FP-tree is empty, or until it contains only one path—
single path will generate all the combinations of its sub-paths, each of
which is a frequent pattern
33
Scaling FP-growth by Item-Based Data Projection
 What if FP-tree cannot fit in memory?—Do not construct FP-tree
 “Project” the database based on frequent single items
 Construct & mine FP-tree for each projected DB
 Parallel projection vs. partition projection
 Parallel projection: Project the DB on each frequent item
 Space costly, all partitions can be processed in parallel
 Partition projection: Partition the DB in order
 Passing the unprocessed parts to subsequent partitions
Trans. DB Parallel projection Partition projection
f2 f 3 f 4 g h f4-proj. DB f3-proj. DB f4-proj. DB f3-proj. DB
f3 f 4 i j Assume only f’s are f2 f 3 f2 f2 f 3 f1

frequent & the f3 f1 f3 …
f2 f 4 k
frequent item f2 will be projected to f3-proj.
f1 f 3 h f2 … f2
ordering is: f1-f2-f3-f4 DB only when processing f4-
… … … proj. DB
34
CLOSET+: Mining Closed Itemsets by Pattern-Growth
{}  Efficient, direct mining of closed itemsets TID Items
 Intuition: 1 acdef
a1:n1 2 abe
 If an FP-tree contains a single branch as shown 3 cefg
a2:n1 left 4 acdf
a3:n1  “a1,a2, a3” should be merged
Let minsupport = 2
 Itemset merging: If Y appears in every a:3, c:3, d:2, e:3, f:3
b1:m1 c1:k1 occurrence of X, then Y is merged with X
F-List: a-c-e-f-d
 d-proj. db: {acef, acf} → acfd-proj. db: {e}
c2:k2 c3:k3  Final closed itemset: acfd:2
 There are many other tricks developed
 For details, see J. Wang, et al,, “CLOSET+:
Searching for the Best Strategies for Mining
Frequent Closed Itemsets”, KDD'03
35
 Basic Concepts
 Summary
36
How to Judge if a Rule/Pattern Is Interesting?
 Pattern-mining will generate a large set of patterns/rules
 Not all the generated patterns/rules are interesting
 Interestingness measures: Objective vs. subjective
 Objective interestingness measures
 Support, confidence, correlation, …
 Subjective interestingness measures:
 Different users may judge interestingness differently
 Let a user specify
 Query-based: Relevant to a user’s particular request
 Judge against one’s knowledge-base
 unexpected, freshness, timeliness
37
Limitation of the Support-Confidence Framework
 Are s and c interesting in association rules: “A  B” [s, c]? Be careful!
 Example: Suppose one school may have the following statistics on #
of students who may play basketball and/or eat cereal:
play-basketball not play-basketball sum (row)
eat-cereal 400 350 750 2-way
contin
not eat-cereal 200 50 250 gency
ta ble
sum(col.) 600 400 1000
 Association rule mining may generate the following:

 play-basketball  eat-cereal [40%, 66.7%] (higher s & c)
 But this strong association rule is misleading: The overall % of
students eating cereal is 75% > 66.7%, a more telling rule:
 ¬ play-basketball  eat-cereal [35%, 87.5%] (high s & c)
38
Interestingness Measure: Lift
 Measure of dependent/correlated events: lift Lift is more telling than s & c
c( B C ) s ( B C ) B ¬B ∑row
lift ( B, C )  
s (C ) s ( B )  s (C ) C 400 350 750
¬C 200 50 250
 Lift(B, C) may tell how B and C are correlated ∑col. 600 400 1000
 Lift(B, C) = 1: B and C are independent
 > 1: positively correlated
 < 1: negatively correlated
 For our example, 400 / 1000
lift ( B, C )   0.89
600 / 1000  750 / 1000
200 / 1000
lift ( B, C )   1.33
600 / 1000  250 / 1000
 Thus, B and C are negatively correlated since lift(B, C) < 1;

 B and ¬C are positively correlated since lift(B, ¬C) > 1
39
Interestingness Measure: χ2
 Another measure to test correlated events: χ2 B ¬B ∑row
(Observed  Expected ) 2 C 400 (450) 350 (300) 750

 
2
¬C 200 (150) 50 (100) 250
Expected
∑col 600 400 1000
 For the table on the right,
Expected value
Observed value
 Lookup χ2 distribution table → B, C are correlated
 χ2-test shows B and C are negatively correlated since the expected
value is 450 but the observed is only 400
 Thus, χ2 is also more telling than the support-confidence framework
40
Lift and χ2 : Are They Always Good Measures?
B ¬B ∑row
 Null transactions: Transactions that contain
C 100 1000 1100
neither B nor C ¬C 1000 100000 101000
∑col. 1100 101000 102100
 Let’s examine the new dataset D
null transactions
 BC (100) is much rarer than B¬C (1000) and ¬BC
(1000), but there are many ¬B¬C (100000) Contingency table with expected values added
B ¬B ∑row
 Unlikely B & C will happen together!
C 100 (11.85) 1000 1100
 But, Lift(B, C) = 8.44 >> 1 (Lift shows B and C are ¬C 1000 (988.15) 100000 101000
strongly positively correlated!) ∑col. 1100 101000 102100
 χ2 = 670: Observed(BC) >> expected value (11.85)

 Too many null transactions may “spoil the soup”!
41
Interestingness Measures & Null-Invariance
 Null invariance means: The number of null transactions does not matter.
Does not change the measure value.
 A few interestingness measures: Some are null invariant Let
are null invariant
Essentially min,
max, mean variants
of
42
Null Invariance: An Important Property
 Why is null invariance crucial for the analysis of massive transaction data?
 Many transactions may contain neither milk nor coffee!
 Lift and 2 are not null-invariant: not good to
milk vs. coffee contingency table
evaluate data that contain too many or too
few null transactions!
 Many measures are not null-invariant!
Null-transactions
w.r.t. m and c
43
Comparison of Null-Invariant Measures
 Not all null-invariant measures are created equal
 Which one is better? 2-variable contingency table
 D4—D6 differentiate the null-invariant measures
 Kulc (Kulczynski 1927) holds firm and is in balance of
both directional implications
All 5 are null-invariant
Subtle: They disagree on those cases
44
Imbalance Ratio with Kulczynski Measure
 IR (Imbalance Ratio): measure the imbalance of two itemsets A and B in
rule implications:
 Kulczynski and Imbalance Ratio (IR) together present a clear picture for all
the three datasets D4 through D6
 D4 is neutral & balanced; D5 is neutral but imbalanced
 D6 is neutral but very imbalanced
45
Example: Analysis of DBLP Coauthor Relationships
 DBLP: Computer science research publication bibliographic database
 > 3.8 million entries on authors, paper, venue, year, and other information
Advisor-advisee relation: Kulc: high, Jaccard: low,

cosine: middle
 Which pairs of authors are strongly related? Is A the advisor, or the advisee?
 Use Kulc to find Advisor-advisee, close collaborators
46
What Measures to Choose for Effective Pattern Evaluation?
 Null value cases are predominant in many large datasets
 Neither milk nor coffee is in most of the baskets; neither Mike nor Jim is an author
in most of the papers; ……
 Null-invariance is an important property
 Lift, χ2 and cosine are good measures if null transactions are not predominant
 Otherwise, Kulczynski + Imbalance Ratio should be used to judge the
interestingness of a pattern
 Exercise: Mining research collaborations from research bibliographic data
 Find a group of frequent collaborators from research bibliographic data (e.g., DBLP)
 Can you find the likely advisor-advisee relationship and during which years such a
relationship happened?
 Ref.: C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo, "Mining Advisor-
Advisee Relationships from Research Publication Networks", KDD'10
47
 Basic Concepts
 Summary
48
Summary
 Basic Concepts
 What Is Pattern Discovery? Why Is It Important?
 Basic Concepts: Frequent Patterns and Association Rules
 Compressed Representation: Closed Patterns and Max-Patterns
 Efficient Pattern Mining Methods
 The Downward Closure Property of Frequent Patterns
 The Apriori Algorithm
 Extensions or Improvements of Apriori
 Mining Frequent Patterns by Exploring Vertical Data Format
 FPGrowth: A Frequent Pattern-Growth Approach
 Mining Closed Patterns
 Pattern Evaluation
 Interestingness Measures in Pattern Mining
 Interestingness Measures: Lift and χ2
 Null-Invariant Measures
 Comparison of Interestingness Measures
49
Recommended Readings (Basic Concepts)
 R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases”, in Proc. of SIGMOD'93
 R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of
SIGMOD'98
 N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets
for association rules”, in Proc. of ICDT'99
 J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and
Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007
50
Recommended Readings (Efficient Pattern Mining Methods)
 R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
 A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large
databases”, VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”,
SIGMOD'95
 S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database
systems: Alternatives and implications”, SIGMOD'98
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association
rules”, Data Mining and Knowledge Discovery, 1997
 J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
 M. J. Zaki and Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
 J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed
Itemsets”, KDD'03
 C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in
Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
51
Recommended Readings (Pattern Evaluation)
 C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS’98
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94
 E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for
Association Patterns. KDD'02
 T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397,
2010
52

Chap4 PatternMiningBasic

Uploaded by

Copyright:

Available Formats

Chap4 PatternMiningBasic

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chap4 PatternMiningBasic

Uploaded by

Copyright:

Available Formats

Pattern Mining: Basic Concepts and Methods

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

Frequent item set Frequent sequences Frequent structures

Tid Items bought

for the given 5-transaction dataset 4 Nuts, Eggs, Milk

Note: X  Y: the union of two itemsets

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 The Apriori Algorithm

 Mining Frequent Patterns by Exploring Vertical Data Format

 FPGrowth: A Frequent Pattern-Growth Approach

 Mining Closed Patterns

 F3 = {abc, abd, acd, ace, bcd}

TDB1 + TDB2 + ... + TDBk = TDB

f3 f 4 i j Assume only f’s are f2 f 3 f2 f2 f 3 f1

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 Association rule mining may generate the following:

 Thus, B and C are negatively correlated since lift(B, C) < 1;

(Observed  Expected ) 2 C 400 (450) 350 (300) 750

strongly positively correlated!) ∑col. 1100 101000 102100

 χ2 = 670: Observed(BC) >> expected value (11.85)

are null invariant

Subtle: They disagree on those cases

Advisor-advisee relation: Kulc: high, Jaccard: low,

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

You might also like