Chap4 PatternMiningBasic

Pattern Mining: Basic Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 Summary

What Are Patterns?
• Patterns: a set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
• Patterns represent intrinsic and important properties of data sets
• Examples: frequent itemsets, frequent sequences, frequent structures

What Is Pattern Discovery?
 Pattern discovery: Uncovering patterns from massive data sets
 It can answer questions such as:
 What products were often purchased together?
 What are the subsequent purchases after buying an iPad?

Pattern Discovery: Why Is It Important?
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Mining sequential, structural (e.g., sub-graph) patterns
 Classification: Discriminative pattern-based analysis
 Cluster analysis: Pattern-based subspace clustering
 Broad applications
 Market basket analysis, cross-marketing, catalog design, sale
campaign analysis, Web log analysis, biological sequence
analysis
 Many types of data: spatiotemporal, multimedia, time-series,
and stream data
Basic Concepts: Transactional Database
Transactional database (TDB)
• Each transaction is associated with an identifier, called a TID
• May also have counts associated with each item sold

Tid   Items bought
1     Beer, Nuts, Diaper
2     Beer, Coffee, Diaper
3     Beer, Diaper, Eggs
4     Nuts, Eggs, Milk
5     Nuts, Coffee, Diaper, Eggs, Milk

Basic Concepts: k-Itemsets and Their Supports
• Itemset: a set of one or more items
• k-itemset: an itemset containing k items, X = {x1, …, xk}
  • Ex. {Beer, Nuts, Diaper} is a 3-itemset
• Absolute support (count): sup{X} = number of occurrences of an itemset X
  • Ex. sup{Beer} = 3
  • Ex. sup{Diaper} = 4
  • Ex. sup{Beer, Diaper} = 3
  • Ex. sup{Beer, Eggs} = 1
• Relative support: s{X} = the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  • Ex. s{Beer} = 3/5 = 60%
  • Ex. s{Diaper} = 4/5 = 80%
  • Ex. s{Beer, Eggs} = 1/5 = 20%

Tid   Items bought
1     Beer, Nuts, Diaper
2     Beer, Coffee, Diaper
3     Beer, Diaper, Eggs
4     Nuts, Eggs, Milk
5     Nuts, Coffee, Diaper, Eggs, Milk

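The two support definitions can be checked with a few lines of Python. This is a minimal sketch on the 5-transaction database from the slides; the function names `absolute_support` and `relative_support` are illustrative and do not come from the slides.

```python
# The 5-transaction database from the slides, keyed by Tid.
TDB = {
    1: {"Beer", "Nuts", "Diaper"},
    2: {"Beer", "Coffee", "Diaper"},
    3: {"Beer", "Diaper", "Eggs"},
    4: {"Nuts", "Eggs", "Milk"},
    5: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def absolute_support(itemset, tdb):
    """sup{X}: number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in tdb.values() if items <= t)


def relative_support(itemset, tdb):
    """s{X}: fraction of transactions that contain X."""
    return absolute_support(itemset, tdb) / len(tdb)


print(absolute_support({"Beer", "Diaper"}, TDB))  # 3
print(relative_support({"Beer", "Eggs"}, TDB))    # 0.2
```
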
Basic Concepts: Frequent Itemsets (Patterns)
• An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ
• Let σ = 50% (σ: minsup threshold) for the given 5-transaction data set
  • All the frequent 1-itemsets: Beer: 3/5 (60%); Nuts: 3/5 (60%); Diaper: 4/5 (80%); Eggs: 3/5 (60%)
  • All the frequent 2-itemsets: {Beer, Diaper}: 3/5 (60%)
  • All the frequent 3-itemsets? None
• Why do these itemsets form the complete set of frequent k-itemsets (patterns) for any k?
• Observation: we may need an efficient method to mine a complete set of frequent patterns

Tid   Items bought
1     Beer, Nuts, Diaper
2     Beer, Coffee, Diaper
3     Beer, Diaper, Eggs
4     Nuts, Eggs, Milk
5     Nuts, Coffee, Diaper, Eggs, Milk

From Frequent Itemsets to Association Rules
• Compared with itemsets, association rules can be more telling
  • Ex. Diaper → Beer: buying diapers may likely lead to buying beer
• Note: X ∪ Y is the union of two itemsets; the set contains both X and Y
  • Ex. {Beer} ∪ {Diaper} = {Beer, Diaper}
[Figure: Venn diagram of the transactions containing beer, the transactions containing diaper, and their overlap (containing both)]

Association Rules
• How do we compute the strength of an association rule X → Y (both X and Y are itemsets)?
• We first compute the following two metrics, s and c
  • Support of X ∪ Y
    • Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
  • Confidence of X → Y: the conditional probability that a transaction containing X also contains Y
    c = sup(X ∪ Y) / sup(X)
    • Ex. c = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 0.75
• In pattern analysis, we are often interested in rules that dominate the database; these two metrics capture the popularity (support) and the correlation (confidence) of X and Y

Tid   Items bought
1     Beer, Nuts, Diaper
2     Beer, Coffee, Diaper
3     Beer, Diaper, Eggs
4     Nuts, Eggs, Milk
5     Nuts, Coffee, Diaper, Eggs, Milk

Mining Frequent Itemsets and Association Rules
• Association rule mining
  • Given two thresholds: minsup, minconf
  • Find all of the rules X → Y (s, c) such that s ≥ minsup and c ≥ minconf
• Let minsup = 50%
  • Freq. 1-itemsets: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3
  • Freq. 2-itemsets: {Beer, Diaper}: 3
• Let minconf = 50%
  • Beer → Diaper (60%, 100%)
  • Diaper → Beer (60%, 75%)
  (Q: Are these all the rules satisfying the two conditions?)
• Observations:
  • Mining association rules and mining frequent patterns are very close problems
  • Scalable methods are needed for mining large data sets

Tid   Items bought
1     Beer, Nuts, Diaper
2     Beer, Coffee, Diaper
3     Beer, Diaper, Eggs
4     Nuts, Eggs, Milk
5     Nuts, Coffee, Diaper, Eggs, Milk

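The question on this slide can be checked by brute force. The sketch below enumerates every itemset of the 5-transaction database, keeps those meeting minsup, and splits each into X → Y rules meeting minconf. It is deliberately naive (exponential in the number of items) and only meant for this toy data; `mine_rules` and `support_count` are illustrative names.

```python
from itertools import combinations

TDB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, tdb):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in tdb if set(itemset) <= t)


def mine_rules(tdb, minsup, minconf):
    """Return (X, Y, s, c) for every rule X -> Y with s >= minsup, c >= minconf."""
    n = len(tdb)
    items = sorted(set().union(*tdb))
    rules = []
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            sup_z = support_count(combo, tdb)
            if sup_z == 0 or sup_z / n < minsup:
                continue
            # Split the frequent itemset into every non-empty X and Y = Z \ X.
            for r in range(1, k):
                for lhs in combinations(combo, r):
                    conf = sup_z / support_count(lhs, tdb)
                    if conf >= minconf:
                        rhs = tuple(i for i in combo if i not in lhs)
                        rules.append((lhs, rhs, sup_z / n, conf))
    return rules


rules = mine_rules(TDB, minsup=0.5, minconf=0.5)
for lhs, rhs, s, c in rules:
    print(lhs, "->", rhs, f"(s={s:.0%}, c={c:.0%})")
```

On this data the search returns exactly the two rules listed on the slide, because {Beer, Diaper} is the only frequent itemset with more than one item.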
Challenge: There Are Too Many Frequent Patterns!
• A long pattern contains a combinatorial number of sub-patterns
• How many frequent itemsets does the following TDB1 contain (minsup = 1)?
  • TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
  • Let's have a try:
    1-itemsets: {a1}: 2, {a2}: 2, …, {a50}: 2, {a51}: 1, …, {a100}: 1
    2-itemsets: {a1, a2}: 2, …, {a1, a50}: 2, {a1, a51}: 1, …, {a99, a100}: 1
    …
    99-itemsets: {a1, a2, …, a99}: 1, …, {a2, a3, …, a100}: 1
    100-itemset: {a1, a2, …, a100}: 1
  • The total number of frequent itemsets: every non-empty subset of T2 is frequent, so the total is C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1
• Far too huge a set for anyone to compute or store!

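The count for TDB1 can be verified numerically. The sketch below sums the binomial coefficients for the 100-item case and, as a sanity check, brute-forces a scaled-down analogue (T1 = {a1, a2, a3}, T2 = {a1, …, a6}), which is an assumption introduced here only to keep the enumeration small.

```python
from itertools import combinations
from math import comb


def count_frequent(transactions, minsup):
    """Brute-force count of itemsets whose absolute support is >= minsup."""
    items = sorted(set().union(*transactions))
    total = 0
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if sum(1 for t in transactions if set(combo) <= t) >= minsup:
                total += 1
    return total


# Scaled-down analogue of TDB1: items are represented by their indices.
small_tdb = [set(range(1, 4)), set(range(1, 7))]
small_count = count_frequent(small_tdb, minsup=1)

# Full TDB1: every non-empty subset of the 100 items of T2 is frequent.
total_tdb1 = sum(comb(100, k) for k in range(1, 101))

print(small_count)                 # 63, i.e. 2**6 - 1
print(total_tdb1 == 2**100 - 1)    # True
```
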
Expressing Patterns in Compressed Form: Closed Patterns
• How to handle such a challenge?
• Solution 1: Closed patterns: a pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
  • Let transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
  • Suppose minsup = 1. How many closed patterns does TDB1 contain?
    • Two: P1: "{a1, …, a50}: 2"; P2: "{a1, …, a100}: 1"
• A closed pattern is a lossless compression of frequent patterns
  • Reduces the number of patterns but does not lose the support information!
  • You will still be able to say: "{a2, …, a40}: 2", "{a5, a51}: 1"

Expressing Patterns in Compressed Form: Max-Patterns
• Solution 2: Max-patterns: a pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• Difference from closed patterns?
  • Max-patterns do not keep the real support of the sub-patterns of a max-pattern
  • Let transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
  • Suppose minsup = 1. How many max-patterns does TDB1 contain?
    • One: P: "{a1, …, a100}: 1"
• A max-pattern is a lossy compression!
  • We only know that {a1, …, a40} is frequent
  • But we no longer know the real support of {a1, …, a40}, …
• Thus, in many applications, mining closed patterns is more desirable

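The difference between the two compressed forms can be seen on a scaled-down analogue of TDB1. The sketch below assumes T1 = {a1, a2} and T2 = {a1, …, a4} (chosen here so that brute-force enumeration stays tiny) and uses illustrative function names. Supports are recoverable from the closed patterns but not from the max-pattern.

```python
from itertools import combinations


def frequent_itemsets(tdb, minsup):
    """Brute force: {itemset: absolute support} for all frequent itemsets."""
    items = sorted(set().union(*tdb))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            sup = sum(1 for t in tdb if x <= t)
            if sup >= minsup:
                freq[x] = sup
    return freq


def closed_patterns(freq):
    """Frequent X with no proper superset having the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}


def max_patterns(freq):
    """Frequent X with no frequent proper superset."""
    return {x: s for x, s in freq.items() if not any(x < y for y in freq)}


def support_from_closed(x, closed):
    """Support of a frequent X = largest support among its closed supersets."""
    return max(s for y, s in closed.items() if frozenset(x) <= y)


tdb = [{"a1", "a2"}, {"a1", "a2", "a3", "a4"}]
freq = frequent_itemsets(tdb, minsup=1)
closed = closed_patterns(freq)
maximal = max_patterns(freq)

print(len(freq), len(closed), len(maximal))   # 15 2 1
print(support_from_closed({"a2"}, closed))    # 2
```
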
Efficient Pattern Mining Methods
 The Downward Closure Property of Frequent Patterns

 The Apriori Algorithm


 Extensions or Improvements of Apriori

 Mining Frequent Patterns by Exploring Vertical Data Format

 FPGrowth: A Frequent Pattern-Growth Approach

 Mining Closed Patterns

The Downward Closure Property of Frequent Patterns
 Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
 We get a frequent itemset: {a1, …, a50}
 Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
 There are some hidden relationships among frequent patterns!
 The downward closure (also called “Apriori”) property of frequent patterns
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
 Apriori: Any subset of a frequent itemset must be frequent
 Efficient mining methodology
 If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!

Apriori Pruning and Scalable Mining Methods
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Scalable mining Methods: Three major approaches
 Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
 Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
 Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin
@SIGMOD’00)

Apriori: A Candidate Generation & Test Approach
 Outline of Apriori (level-wise, candidate generation and test)
 Scan DB once to get frequent 1-itemset
 Repeat
 Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
 Test the candidates against DB to find frequent (k+1)-itemsets
 Set k := k +1
 Until no frequent or candidate set can be generated
 Return all the frequent itemsets derived
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Fk: frequent itemsets of size k

k := 1;
Fk := {frequent items}; // frequent 1-itemsets
While (Fk ≠ ∅) do { // while Fk is non-empty
    Ck+1 := candidates generated from Fk; // candidate generation
    Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
    k := k + 1
}
return ∪k Fk // return the Fk generated at each level

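The pseudo-code translates fairly directly into Python. The following is a minimal sketch, not a tuned implementation: candidates are formed by taking unions of pairs of frequent k-itemsets and keeping those of size k+1 whose k-subsets are all frequent, and counting is a plain scan over the transactions. The sample database is the 4-transaction TDB used in the Apriori example (minsup = 2).

```python
from itertools import combinations


def apriori(tdb, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in tdb]

    # First scan: count single items to get the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {x: c for x, c in counts.items() if c >= minsup}
    frequent = dict(current)

    k = 1
    while current:
        # Candidate generation with Apriori pruning.
        candidates = set()
        for p, q in combinations(current, 2):
            c = p | q
            if len(c) == k + 1 and all(
                frozenset(s) in current for s in combinations(c, k)
            ):
                candidates.add(c)
        # Test the candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {x: n for x, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
    return frequent


TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(TDB, minsup=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```
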
The Apriori Algorithm: An Example
minsup = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

F1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

F2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from F2): {B, C, E}

3rd scan → F3
Itemset     sup
{B, C, E}   2

Apriori: Implementation Tricks
• How to generate candidates?
  • Step 1: self-joining Fk
  • Step 2: pruning
• Example of candidate generation
  • F3 = {abc, abd, acd, ace, bcd}
  • Self-joining: F3 * F3
    • abcd from abc and abd
    • acde from acd and ace
  • Pruning:
    • acde is removed because ade is not in F3
  • C4 = {abcd}

Candidate Generation (Pseudo-Code)
• Suppose the items in Fk-1 are listed in an order

// Step 1: joining
for each p in Fk-1
    for each q in Fk-1
        if p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 {
            c = join(p, q)
            // Step 2: pruning
            if has_infrequent_subset(c, Fk-1)
                continue // prune
            else add c to Ck
        }

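The join and prune steps can be written compactly in Python. This sketch assumes each itemset is stored as a sorted tuple, which gives the item order the pseudo-code relies on; `apriori_gen` is an illustrative name. It is run on the F3 example from the previous slide.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Generate Ck from F(k-1), where F(k-1) is a set of sorted tuples."""
    prev = sorted(prev_frequent)
    candidates = set()
    for p in prev:
        for q in prev:
            # Step 1, joining: same first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2, pruning: every (k-1)-subset of c must be frequent.
                if all(s in prev_frequent for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


F3 = {tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = apriori_gen(F3)
print(C4)  # {('a', 'b', 'c', 'd')}
```

Here acde is produced by the join of acd and ace but dropped in the prune step, because ade (and cde) is not in F3.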
Apriori: Improvements and Alternatives
• Reduce passes of transaction database scans
  • Partitioning (e.g., Savasere, et al., 1995) [to be discussed in subsequent slides]
  • Dynamic itemset counting (Brin, et al., 1997)
• Shrink the number of candidates
  • Hashing (e.g., DHP: Park, et al., 1995) [to be discussed in subsequent slides]
  • Pruning by support lower bounding (e.g., Bayardo 1998)
  • Sampling (e.g., Toivonen, 1996)
• Exploring special data structures
  • Tree projection (Agarwal, et al., 2001)
  • H-miner (Pei, et al., 2001)
  • Hypercube decomposition (e.g., LCM: Uno, et al., 2004)

Partitioning: Scan Database Only Twice
• Theorem: any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
• Proof (by contraposition): let TDB1 + TDB2 + ... + TDBk = TDB. If X is infrequent in every partition, i.e., sup1(X) < σ|TDB1|, sup2(X) < σ|TDB2|, ..., supk(X) < σ|TDBk|, then summing these inequalities gives sup(X) < σ|TDB|, so X is infrequent in TDB
• Method: scan DB twice (A. Savasere, E. Omiecinski and S. Navathe, VLDB'95)
  • Scan 1: partition the database so that each partition can fit in main memory (why?)
    • Mine local frequent patterns in each partition
  • Scan 2: consolidate global frequent patterns
    • Find global frequent itemset candidates (those frequent in at least one partition)
    • Find the true frequency of those candidates by scanning each TDBi one more time

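The two-scan method can be sketched as follows. This is an illustration of the idea only: the local miner is a brute-force enumeration (a real implementation would run an in-memory miner per partition), the partitions are simple contiguous slices, and the names `local_frequent` and `partition_mine` are assumptions made for this example. It uses the 5-transaction database from the earlier slides with σ = 50%.

```python
from itertools import combinations


def local_frequent(partition, min_ratio):
    """Itemsets whose relative support inside this partition is >= min_ratio."""
    items = sorted(set().union(*partition))
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if sum(1 for t in partition if x <= t) >= min_ratio * len(partition):
                result.add(x)
    return result


def partition_mine(tdb, min_ratio, num_partitions):
    size = -(-len(tdb) // num_partitions)  # ceiling division
    parts = [tdb[i:i + size] for i in range(0, len(tdb), size)]

    # Scan 1: the union of local frequent itemsets forms the global candidates.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_ratio)

    # Scan 2: count each candidate over the whole database.
    counts = {c: sum(1 for t in tdb if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_ratio * len(tdb)}


TDB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
global_frequent = partition_mine(TDB, min_ratio=0.5, num_partitions=2)
for itemset, count in global_frequent.items():
    print(sorted(itemset), count)
```

By the theorem, no globally frequent itemset can be missing from the candidate set, so the second scan only has to remove false positives.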
Direct Hashing and Pruning (DHP)
• DHP (Direct Hashing and Pruning): J. Park, M. Chen, and P. Yu, SIGMOD'95
• Hashing: v = hash(itemset); v might be the same for different itemsets
• 1st scan: when counting the 1-itemsets, hash each 2-itemset to calculate the bucket count
• Example: at the 1st scan of TDB, count the 1-itemsets and hash the 2-itemsets in each transaction to their buckets

Hash table
Itemsets        Count
{ab, ad, ce}    35
{bd, be, de}    298
……              …
{yz, qs, wt}    58

• At the end of the first scan, check the minsup
  • If minsup = 80, remove ab, ad, ce, since count{ab, ad, ce} < 80
• A key observation: a k-itemset cannot be frequent if its corresponding hashing bucket count is below the minsup threshold

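The bucket-count filter can be sketched in Python as below. The hash function `bucket_of` (sum of character codes modulo the table size) and the table size of 7 are arbitrary choices made for this illustration, not the ones used in the DHP paper. The data is the 4-transaction TDB from the Apriori example with minsup = 2.

```python
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Hypothetical deterministic hash: sum of character codes mod table size."""
    return sum(ord(ch) for item in pair for ch in item) % num_buckets


def dhp_first_scan(tdb, num_buckets):
    """Count 1-itemsets and, in the same pass, hash every 2-itemset to a bucket."""
    item_counts = {}
    buckets = [0] * num_buckets
    for t in tdb:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, num_buckets)] += 1
    return item_counts, buckets


def candidate_pairs(item_counts, buckets, minsup):
    """Keep a pair only if both items are frequent and its bucket count >= minsup."""
    frequent_items = sorted(i for i, c in item_counts.items() if c >= minsup)
    return [p for p in combinations(frequent_items, 2)
            if buckets[bucket_of(p, len(buckets))] >= minsup]


TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
item_counts, buckets = dhp_first_scan(TDB, num_buckets=7)
C2 = candidate_pairs(item_counts, buckets, minsup=2)
print(buckets)
print(C2)
```

A bucket count is an upper bound on the support of every pair hashed into it, so the filter can only discard infrequent pairs. With this particular hash, {A, B} and {A, E} are pruned before the second scan.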
Exploring Vertical Data Format: ECLAT
• ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection [Zaki et al. @KDD'97]

A transaction DB in horizontal data format
Tid   Itemset
10    a, c, d, e
20    a, b, e
30    b, c, e

The transaction DB in vertical data format
Item   TidList
a      10, 20
b      20, 30
c      10, 30
d      10
e      10, 20, 30

• Vertical format: t(e) = {T10, T20, T30}; t(a) = {T10, T20}; t(ae) = {T10, T20}
• Properties of tid-lists
  • t(X) = t(Y): X and Y always happen together (e.g., t(ac) = t(d))
  • t(X) ⊂ t(Y): a transaction having X always has Y (e.g., t(ac) ⊂ t(ce))
• Frequent patterns: derived by vertical intersections
• Using diffsets to accelerate mining
  • Only keep track of differences of tids
  • t(e) = {T10, T20, T30}, t(ce) = {T10, T30} → Diffset(ce, e) = {T20}

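A small sketch of the vertical approach, using the 3-transaction database above: the database is converted to tid-lists, and patterns are grown depth-first by intersecting tid-lists. The threshold minsup = 2 and the names `to_vertical` and `eclat` are assumptions made for this example; the diffset at the end is computed directly from its definition rather than used inside the search.

```python
def to_vertical(tdb):
    """Convert {tid: items} into {item: set of tids}."""
    vertical = {}
    for tid, items in tdb.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def eclat(prefix, candidates, minsup, out):
    """Depth-first search; candidates is a list of (item, tidset) pairs."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < minsup:
            continue
        pattern = prefix + (item,)
        out[pattern] = len(tids)
        # Extend the pattern with later items by intersecting tid-lists.
        extensions = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
        eclat(pattern, extensions, minsup, out)
    return out


TDB = {10: "acde", 20: "abe", 30: "bce"}
vertical = to_vertical(TDB)
patterns = eclat((), sorted(vertical.items(), key=lambda kv: kv[0]), 2, {})
print(patterns)

# Diffset(ce, e): tids that contain e but not ce.
t_e = vertical["e"]
t_ce = vertical["c"] & vertical["e"]
diffset_ce_e = t_e - t_ce
print(diffset_ce_e, len(t_e) - len(diffset_ce_e))  # {20} and sup(ce) = 2
```
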
Why Mining Frequent Patterns by Pattern Growth?
 Apriori: A breadth-first search mining algorithm
 First find the complete set of frequent k-itemsets
 Then derive frequent (k+1)-itemset candidates
 Scan DB again to find true frequent (k+1)-itemsets
 Motivation for a different mining methodology
 Can we develop a depth-first search mining algorithm?
 For a frequent itemset ρ, can subsequent search be confined to only those
transactions that contain ρ?
 Such thinking leads to a frequent pattern growth approach: FPGrowth
FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns without Candidate Generation,” SIGMOD 2000)

Example: From Transactional DB to Ordered Frequent Itemlist
Example: a sample transactional database (let min_support = 3)
• Scan DB once, find the single-item frequent patterns: f:4, a:3, c:4, b:3, m:3, p:3
• Sort frequent items in frequency-descending order to get the F-list: F-list = f-c-a-b-m-p
• Scan DB again, and use the ordered frequent itemlist of each transaction to construct an FP-tree

TID   Items in the transaction     Ordered, frequent itemlist
100   {f, a, c, d, g, i, m, p}     f, c, a, m, p
200   {a, b, c, f, l, m, o}        f, c, a, b, m
300   {b, f, h, j, o, w}           f, b
400   {b, c, k, s, p}              c, b, p
500   {a, f, c, e, l, p, m, n}     f, c, a, m, p

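The two scans can be reproduced in a few lines of Python. One caveat: several items tie on frequency (f and c at 4; a, b, m, p at 3), and the slides do not state a tie-breaking rule, so this sketch takes the F-list order from the slide as given and only checks that it is consistent with the counts.

```python
from collections import Counter

TDB = {
    100: "facdgimp",
    200: "abcflmo",
    300: "bfhjow",
    400: "bcksp",
    500: "afcelpmn",
}
MIN_SUPPORT = 3

# Scan 1: count single items and keep the frequent ones.
counts = Counter(item for items in TDB.values() for item in items)
frequent = {item: c for item, c in counts.items() if c >= MIN_SUPPORT}

# F-list as given on the slide (ties broken as in the slide).
F_LIST = ["f", "c", "a", "b", "m", "p"]
rank = {item: pos for pos, item in enumerate(F_LIST)}

# Scan 2: drop infrequent items and sort the rest by F-list position.
ordered = {
    tid: sorted((i for i in items if i in rank), key=rank.get)
    for tid, items in TDB.items()
}
for tid, itemlist in ordered.items():
    print(tid, ", ".join(itemlist))
```
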
Example: Construct FP-tree from Transaction DB
FP-tree construction: for each transaction, insert the ordered frequent itemlist into an FP-tree, with shared sub-branches merged and counts accumulated.

TID   Ordered, frequent itemlist
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

Header table
Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3

After inserting the 1st frequent itemlist "f, c, a, m, p":
{}
  f:1
    c:1
      a:1
        m:1
          p:1

After inserting the 2nd frequent itemlist "f, c, a, b, m":
{}
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1

After inserting all the frequent itemlists:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

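A minimal FP-tree builder, as a sketch of the insertion procedure described above. The class and function names (`FPNode`, `build_fp_tree`, `render`) are illustrative; the header table is kept as a plain dictionary from item to the list of nodes carrying that item, instead of the linked node chain used in the original paper.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(ordered_itemlists):
    """Insert each ordered itemlist, sharing prefixes and accumulating counts."""
    root = FPNode(None)
    header = {}  # item -> list of nodes carrying that item
    for itemlist in ordered_itemlists:
        node = root
        for item in itemlist:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def render(node, depth=0):
    """Indented text view of the tree, one 'item:count' per line."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


ITEMLISTS = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(ITEMLISTS)
print("\n".join(render(root)))
```
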
Mining FP-Tree: Divide and Conquer Based on Patterns and Data
• Pattern mining can be partitioned according to current patterns
  • Patterns containing p: p's conditional database: fcam:2, cb:1
    • p's conditional database (i.e., the database under the condition that p exists): the transformed prefix paths of item p
  • Patterns having m but no p: m's conditional database: fca:2, fcab:1
  • ……
• min_support = 3

Conditional database of each pattern
Item   Conditional database
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree and header table (f:4, c:4, a:3, b:3, m:3, p:3) built from the five ordered frequent itemlists]

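The conditional databases in the table can be reproduced without walking the tree. Since each ordered itemlist corresponds to one root-to-node path, the prefix paths of an item are just the prefixes that precede it in each ordered itemlist. The sketch below uses that shortcut (an assumption of this example, equivalent to following the header-table links for this data); `conditional_database` is an illustrative name.

```python
from collections import Counter

ITEMLISTS = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]


def conditional_database(item, itemlists):
    """Prefixes that precede `item` in each ordered itemlist, with counts."""
    cond = Counter()
    for itemlist in itemlists:
        if item in itemlist:
            prefix = itemlist[:itemlist.index(item)]
            if prefix:  # an empty prefix contributes nothing
                cond[prefix] += 1
    return dict(cond)


for item in "cabmp":
    print(item, conditional_database(item, ITEMLISTS))
```
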
Mine Each Conditional Database Recursively
• min_support = 3
• For each conditional database
  • Mine single-item patterns
  • Construct its FP-tree & mine it

Conditional databases
Item   Conditional database
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

• Example: m's conditional DB: fca:2, fcab:1 → m's conditional FP-tree is the single branch fca:3
• Actually, for a single-branch FP-tree, all the frequent patterns can be generated in one shot:
  m: 3
  fm: 3, cm: 3, am: 3
  fcm: 3, fam: 3, cam: 3
  fcam: 3

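The "one shot" generation for a single-branch tree amounts to enumerating every combination of nodes on the path. The sketch below first derives m's conditional single path from its conditional database (dropping b, whose count of 1 is below min_support = 3) and then enumerates the combinations; the function name and the string representation of patterns are choices made for this example.

```python
from collections import Counter
from itertools import combinations

MIN_SUPPORT = 3


def patterns_from_single_path(path, suffix, suffix_support):
    """path: list of (item, count) from root to leaf of a single-branch FP-tree."""
    patterns = {suffix: suffix_support}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = "".join(item for item, _ in combo)
            # Support of a combination is the smallest count among its nodes.
            patterns[items + suffix] = min(count for _, count in combo)
    return patterns


# m's conditional database from the table above.
cond_db = {"fca": 2, "fcab": 1}
counts = Counter()
for prefix, n in cond_db.items():
    for item in prefix:
        counts[item] += n

# Keep only the locally frequent items: f:3, c:3, a:3 (b:1 is dropped).
path = [(item, c) for item, c in counts.items() if c >= MIN_SUPPORT]
m_patterns = patterns_from_single_path(path, "m", 3)
print(m_patterns)
```
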
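A Special Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts
  • Reduction of the single prefix path into one node
  • Concatenation of the mining results of the two parts
[Figure: an FP-tree whose single prefix path a1:n1, a2:n2, a3:n3 is followed by the branches b1:m1 and c1:k1 (with children c2:k2 and c3:k3) is decomposed into the prefix path {} → a1:n1 → a2:n2 → a3:n3 plus a multi-branch part rooted at r1 with the branches b1:m1 and c1:k1 (c2:k2, c3:k3)]
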
FPGrowth: Mining Frequent Patterns by Pattern Growth
 Essence of frequent pattern growth (FPGrowth) methodology
 Find frequent single items and partition the database based on each such
single item pattern
 Recursively grow frequent patterns by doing the above for each
partitioned database (also called the pattern’s conditional database)
 To facilitate efficient processing, an efficient data structure, FP-tree, can
be constructed
 Mining becomes
 Recursively construct and mine (conditional) FP-trees
 Until the resulting FP-tree is empty, or until it contains only one path—
single path will generate all the combinations of its sub-paths, each of
which is a frequent pattern
Scaling FP-growth by Item-Based Data Projection
• What if the FP-tree cannot fit in memory? Do not construct the FP-tree
  • "Project" the database based on frequent single items
  • Construct & mine an FP-tree for each projected DB
• Parallel projection vs. partition projection
  • Parallel projection: project the DB on each frequent item
    • Space costly, but all partitions can be processed in parallel
  • Partition projection: partition the DB in order
    • Passing the unprocessed parts to subsequent partitions
• Example (assume only the f's are frequent and the frequent item ordering is f1-f2-f3-f4)
  Trans. DB: {f2 f3 f4 g h}, {f3 f4 i j}, {f2 f4 k}, {f1 f3 h}, …
  Parallel projection:   f4-proj. DB: {f2 f3}, {f3}, {f2}, …   f3-proj. DB: {f2}, {f1}, …
  Partition projection:  f4-proj. DB: {f2 f3}, {f3}, {f2}, …   f3-proj. DB: {f1}, …
  (in partition projection, f2 will be projected to the f3-proj. DB only when processing the f4-proj. DB)

CLOSET+: Mining Closed Itemsets by Pattern-Growth
• Efficient, direct mining of closed itemsets
• Intuition: if an FP-tree contains a single branch whose nodes all carry the same count (e.g., a1:n1, a2:n1, a3:n1), then "a1, a2, a3" should be merged
• Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
• Example (let min_support = 2)

  TID   Items
  1     acdef
  2     abe
  3     cefg
  4     acdf

  • Frequent items: a:3, c:3, d:2, e:3, f:3
  • F-list: a-c-e-f-d
  • d-proj. db: {acef, acf} → acfd-proj. db: {e}
  • Final closed itemset: acfd:2
• There are many other tricks developed
  • For details, see J. Wang, et al., "CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets", KDD'03
How to Judge if a Rule/Pattern Is Interesting?
 Pattern-mining will generate a large set of patterns/rules
 Not all the generated patterns/rules are interesting
 Interestingness measures: Objective vs. subjective
 Objective interestingness measures
 Support, confidence, correlation, …
 Subjective interestingness measures:
 Different users may judge interestingness differently
 Let a user specify
 Query-based: Relevant to a user’s particular request
 Judge against one’s knowledge-base
 unexpected, freshness, timeliness
Limitation of the Support-Confidence Framework
• Are s and c interesting in association rules "A → B" [s, c]? Be careful!
• Example: suppose one school has the following statistics on the number of students who play basketball and/or eat cereal (2-way contingency table):

                   play-basketball   not play-basketball   sum (row)
  eat-cereal       400               350                   750
  not eat-cereal   200               50                    250
  sum (col.)       600               400                   1000

• Association rule mining may generate the following:
  • play-basketball → eat-cereal [40%, 66.7%] (higher s & c)
• But this strong association rule is misleading: the overall % of students eating cereal is 75% > 66.7%. A more telling rule:
  • ¬play-basketball → eat-cereal [35%, 87.5%] (high s & c)

Interestingness Measure: Lift
• Measure of dependent/correlated events: lift (lift is more telling than s & c)
  $\mathrm{lift}(B, C) = \dfrac{c(B \to C)}{s(C)} = \dfrac{s(B \cup C)}{s(B) \times s(C)}$
• Lift(B, C) may tell how B and C are correlated
  • Lift(B, C) = 1: B and C are independent
  • > 1: positively correlated
  • < 1: negatively correlated

           B     ¬B    ∑row
  C        400   350   750
  ¬C       200   50    250
  ∑col.    600   400   1000

• For our example:
  $\mathrm{lift}(B, C) = \dfrac{400/1000}{600/1000 \times 750/1000} = 0.89$
  $\mathrm{lift}(B, \neg C) = \dfrac{200/1000}{600/1000 \times 250/1000} = 1.33$
• Thus, B and C are negatively correlated since lift(B, C) < 1; B and ¬C are positively correlated since lift(B, ¬C) > 1

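The support, confidence, and lift figures for the basketball/cereal table can be recomputed from the raw counts. This is a small sketch; the function signature (counts of the joint event, each single event, and the total) is a choice made for this example.

```python
def lift(n_xy, n_x, n_y, n):
    """lift(X, Y) = s(X ∪ Y) / (s(X) * s(Y)), computed from raw counts."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))


N = 1000
N_B, N_C, N_NOT_C = 600, 750, 250      # column and row sums from the table
N_BC, N_B_NOT_C = 400, 200             # joint counts from the table

support_bc = N_BC / N                  # s(B -> C)
confidence_bc = N_BC / N_B             # c(B -> C)
lift_b_c = lift(N_BC, N_B, N_C, N)
lift_b_not_c = lift(N_B_NOT_C, N_B, N_NOT_C, N)

print(f"s = {support_bc:.0%}, c = {confidence_bc:.1%}")
print(f"lift(B, C) = {lift_b_c:.2f}, lift(B, not C) = {lift_b_not_c:.2f}")
```
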
Interestingness Measure: χ2
• Another measure to test correlated events: χ2
  $\chi^2 = \sum \dfrac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
• Contingency table with observed values and, in parentheses, expected values:

           B           ¬B          ∑row
  C        400 (450)   350 (300)   750
  ¬C       200 (150)   50 (100)    250
  ∑col.    600         400         1000

• For this table:
  $\chi^2 = \dfrac{(400 - 450)^2}{450} + \dfrac{(350 - 300)^2}{300} + \dfrac{(200 - 150)^2}{150} + \dfrac{(50 - 100)^2}{100} = 55.56$
• Look up the χ2 distribution table → B and C are correlated
• The χ2-test shows B and C are negatively correlated since the expected value is 450 but the observed value is only 400
• Thus, χ2 is also more telling than the support-confidence framework

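The expected values and the χ2 statistic can be derived from the observed counts alone. The sketch below computes each expected cell as (row sum × column sum) / total and accumulates the statistic; it is written in plain Python so that no statistics library is assumed.

```python
def chi_square(observed):
    """observed: 2-D list of counts. Returns (chi2, expected table)."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, obs in enumerate(row):
            e = row_sums[i] * col_sums[j] / n   # expected count under independence
            exp_row.append(e)
            chi2 += (obs - e) ** 2 / e
        expected.append(exp_row)
    return chi2, expected


# Rows: C, not C. Columns: B, not B.
chi2, expected = chi_square([[400, 350], [200, 50]])
print(expected)         # [[450.0, 300.0], [150.0, 100.0]]
print(round(chi2, 2))   # 55.56
```
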
Lift and χ2: Are They Always Good Measures?
• Null transactions: transactions that contain neither B nor C
• Let's examine the new dataset D:

           B      ¬B       ∑row
  C        100    1000     1100
  ¬C       1000   100000   101000
  ∑col.    1100   101000   102100

  • BC (100) is much rarer than B¬C (1000) and ¬BC (1000), but there are many ¬B¬C (100000), the null transactions
  • It is unlikely that B and C will happen together!
• But Lift(B, C) = 8.44 >> 1 (lift shows B and C are strongly positively correlated!)
• χ2 = 670: Observed(BC) = 100 >> expected value (11.85)

Contingency table with expected values added:

           B                ¬B       ∑row
  C        100 (11.85)      1000     1100
  ¬C       1000 (1088.15)   100000   101000
  ∑col.    1100             101000   102100

• Too many null transactions may "spoil the soup"!

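The effect of null transactions on lift can be made concrete by keeping the three non-null cells of dataset D fixed and varying only the number of ¬B¬C transactions. The alternative null count of 100 below is a hypothetical value chosen for contrast; it does not appear in the slides.

```python
def lift(n_bc, n_b, n_c, n):
    """lift(B, C) from raw counts."""
    return (n_bc / n) / ((n_b / n) * (n_c / n))


def lift_with_nulls(nulls):
    """Dataset D with BC = 100, B¬C = 1000, ¬BC = 1000 and a variable null count."""
    n_bc, n_b_only, n_c_only = 100, 1000, 1000
    n = n_bc + n_b_only + n_c_only + nulls
    return lift(n_bc, n_bc + n_b_only, n_bc + n_c_only, n)


lift_many_nulls = lift_with_nulls(100_000)  # the dataset D from the slide
lift_few_nulls = lift_with_nulls(100)       # hypothetical: same B/C data, few nulls

print(round(lift_many_nulls, 2))  # 8.44
print(round(lift_few_nulls, 2))   # 0.18
```

The relationship between B and C is identical in both cases, yet lift moves from well above 1 to well below 1 purely because of the null count.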
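Interestingness Measures & Null-Invariance
• Null invariance means that the number of null transactions does not matter: it does not change the measure value
• A few interestingness measures: some are null-invariant
[Table of measure definitions not recovered. Surviving annotations: some of the listed measures "are null invariant", and these are "essentially min, max, mean variants of" an expression that is missing from the extracted text]
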
Null Invariance: An Important Property
• Why is null invariance crucial for the analysis of massive transaction data?
  • Many transactions may contain neither milk nor coffee!
• Lift and χ2 are not null-invariant: not good for evaluating data that contain too many or too few null transactions!
• Many measures are not null-invariant!
[Figure: milk vs. coffee contingency table, highlighting the null transactions w.r.t. m and c]

Comparison of Null-Invariant Measures
• Not all null-invariant measures are created equal
• Which one is better?
  • D4 through D6 differentiate the null-invariant measures
  • Kulc (Kulczynski 1927) holds firm and is in balance of both directional implications
[Figure: 2-variable contingency table and the measure values on the example datasets. All 5 measures are null-invariant. Subtle: they disagree on those cases]

Imbalance Ratio with Kulczynski Measure
• IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications
  [formula not recovered]
• Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6
  • D4 is neutral & balanced; D5 is neutral but imbalanced
  • D6 is neutral but very imbalanced

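The formulas on these slides did not survive extraction, so the sketch below assumes the definitions commonly given for these two measures: Kulc(A, B) = ½ (P(A|B) + P(B|A)) and IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)). The two sets of counts are hypothetical and are not the D4 through D6 datasets, whose values are not in the extracted text.

```python
def kulczynski(n_ab, n_a, n_b):
    """Assumed definition: mean of the two conditional probabilities."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)


def imbalance_ratio(n_ab, n_a, n_b):
    """Assumed definition: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)


# Hypothetical counts: (sup(A ∪ B), sup(A), sup(B)).
balanced = (1000, 2000, 2000)
imbalanced = (1000, 11000, 1100)

for name, (n_ab, n_a, n_b) in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name,
          round(kulczynski(n_ab, n_a, n_b), 2),
          round(imbalance_ratio(n_ab, n_a, n_b), 2))
```

Neither function takes the total number of transactions as input, so adding or removing null transactions cannot change either value. In both hypothetical cases Kulc is 0.5 (neutral), and only IR separates the balanced case from the imbalanced one.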
Example: Analysis of DBLP Coauthor Relationships
• DBLP: computer science research publication bibliographic database
  • > 3.8 million entries on authors, paper, venue, year, and other information
• Which pairs of authors are strongly related? Is A the advisor, or the advisee?
  • Advisor-advisee relation: Kulc: high, Jaccard: low, cosine: middle
  • Use Kulc to find advisor-advisee pairs and close collaborators

What Measures to Choose for Effective Pattern Evaluation?
 Null value cases are predominant in many large datasets
 Neither milk nor coffee is in most of the baskets; neither Mike nor Jim is an author
in most of the papers; ……
 Null-invariance is an important property
 Lift, χ2 and cosine are good measures if null transactions are not predominant
 Otherwise, Kulczynski + Imbalance Ratio should be used to judge the
interestingness of a pattern
 Exercise: Mining research collaborations from research bibliographic data
 Find a group of frequent collaborators from research bibliographic data (e.g., DBLP)
 Can you find the likely advisor-advisee relationship and during which years such a
relationship happened?
 Ref.: C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo, "Mining Advisor-
Advisee Relationships from Research Publication Networks", KDD'10
Summary
 Basic Concepts
 What Is Pattern Discovery? Why Is It Important?
 Basic Concepts: Frequent Patterns and Association Rules
 Compressed Representation: Closed Patterns and Max-Patterns
 Efficient Pattern Mining Methods
 The Downward Closure Property of Frequent Patterns
 The Apriori Algorithm
 Extensions or Improvements of Apriori
 Mining Frequent Patterns by Exploring Vertical Data Format
 FPGrowth: A Frequent Pattern-Growth Approach
 Mining Closed Patterns
 Pattern Evaluation
 Interestingness Measures in Pattern Mining
 Interestingness Measures: Lift and χ2
 Null-Invariant Measures
 Comparison of Interestingness Measures
Recommended Readings (Basic Concepts)
 R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases”, in Proc. of SIGMOD'93
 R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of
SIGMOD'98
 N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets
for association rules”, in Proc. of ICDT'99
 J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and
Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007

Recommended Readings (Efficient Pattern Mining Methods)
 R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
 A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large
databases”, VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”,
SIGMOD'95
 S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database
systems: Alternatives and implications”, SIGMOD'98
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association
rules”, Data Mining and Knowledge Discovery, 1997
 J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
 M. J. Zaki and Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
 J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed
Itemsets”, KDD'03
 C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in
Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
Recommended Readings (Pattern Evaluation)
 C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS’98
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94
 E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for
Association Patterns. KDD'02
 T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397,
2010
