Direct Hashing and Pruning (Park-Chen-Yu)
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
• If |L| = n, then there are 2^n – 2 candidate association rules (ignoring L → ∅ and ∅ → L)

• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Data Mining: Association Rules 31 Data Mining: Association Rules 32
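The counting argument and the confidence ordering above can be checked on a small example. The sketch below is illustrative only: the five transactions and the helper names (`support_count`, `candidate_rules`, `confidence`) are assumptions made for this example, not part of the slides.

```python
from itertools import combinations

# Toy transactions, assumed for illustration only.
transactions = [
    {"A", "B"},
    {"B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B", "D"},
    {"A", "B", "C", "D"},
]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def candidate_rules(L):
    """Yield (f, L - f) for every non-empty proper subset f of L."""
    items = sorted(L)
    for size in range(1, len(items)):
        for f in combinations(items, size):
            yield frozenset(f), frozenset(items) - frozenset(f)

def confidence(antecedent, consequent):
    """c(f -> L - f) = s(L) / s(f)."""
    return support_count(antecedent | consequent) / support_count(antecedent)

L = {"A", "B", "C", "D"}
rules = list(candidate_rules(L))
print(len(rules))  # 2**4 - 2 = 14

# Same itemset, consequent growing from D to CD to BCD.
c_abc_d = confidence(frozenset("ABC"), frozenset("D"))
c_ab_cd = confidence(frozenset("AB"), frozenset("CD"))
c_a_bcd = confidence(frozenset("A"), frozenset("BCD"))
print(c_abc_d, c_ab_cd, c_a_bcd)  # 1.0 0.5 0.5
```

All three rules share the numerator s(ABCD), so only the antecedent support in the denominator changes, and it can only grow as the antecedent shrinks.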
Key fact:
Moving items from the antecedent to the consequent
never changes support, and never increases confidence
Algorithm
– For each itemset I with minsup:
• Find all minconf rules with a single consequent, of the form (I – L1 ⇒ L1)
• Repeat:
  – Guess candidate consequents Ck by appending items from I – Lk-1 to Lk-1
  – Verify the confidence of each rule I – Ck ⇒ Ck using known itemset support values
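One possible reading of this algorithm for a single frequent itemset is sketched below. It is a minimal sketch under stated assumptions, not the reference implementation: the function name `generate_rules`, the toy transactions, and the threshold `minconf=0.6` are all hypothetical, and the subset-based pruning step is an addition that follows from the key fact rather than something spelled out on the slide.

```python
from itertools import combinations

def generate_rules(I, support, minconf):
    """Level-wise rule generation for one frequent itemset I.

    `support` maps frozenset -> support count for I and its non-empty subsets.
    Returns a list of (antecedent, consequent, confidence) triples.
    """
    I = frozenset(I)
    rules = []
    if len(I) < 2:
        return rules

    def keep_confident(consequents):
        kept = []
        for c in consequents:
            conf = support[I] / support[I - c]
            if conf >= minconf:
                rules.append((I - c, c, conf))
                kept.append(c)
        return kept

    # L1: single-item consequents whose rule meets minconf.
    level = keep_confident([frozenset([x]) for x in sorted(I)])
    k = 2
    while level and k < len(I):
        survivors = set(level)
        # Guess Ck by appending one item from I - c to each c in L(k-1).
        candidates = {c | {x} for c in level for x in I - c}
        # Added pruning (assumption): a k-item consequent can only pass if
        # every (k-1)-item sub-consequent passed, by the key fact above.
        candidates = {c for c in candidates
                      if all(c - {x} in survivors for x in c)}
        level = keep_confident(sorted(candidates, key=sorted))
        k += 1
    return rules

# Toy data, assumed for illustration only.
transactions = [set("AB"), set("BCD"), set("ABCD"), set("ABD"), set("ABCD")]
I = frozenset("ABCD")
support = {
    frozenset(s): sum(1 for t in transactions if set(s) <= t)
    for r in range(1, len(I) + 1)
    for s in combinations(sorted(I), r)
}
rules = generate_rules(I, support, minconf=0.6)
for antecedent, consequent, conf in rules:
    print("".join(sorted(antecedent)), "=>", "".join(sorted(consequent)), round(conf, 3))
```

Only support counts that are already known from frequent itemset mining are used, so no further pass over the transactions is needed to verify a rule.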
• Choice of minimum support threshold
– lowering the support threshold results in more frequent itemsets
– this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store the support count of each item
– if the number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
– since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
• Average transaction width

• Some itemsets are redundant because they have identical support as their supersets

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1    1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0
2    1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0
3    1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0
4    1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0
5    1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0
6    0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0
7    0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0
8    0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0
9    0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0
10   0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1  0  0  0  0  0  0  0  0  0   0
11   0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1
12   0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1
13   0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1
14   0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1
15   0  0  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0   0  1  1  1  1  1  1  1  1  1   1

• Number of frequent itemsets = 3 × ∑_{k=1}^{10} C(10, k), where C(10, k) is the binomial coefficient "10 choose k"
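The count can be evaluated directly, and cross-checked by brute force on the 15-transaction table. The sketch below assumes a minimum support count of 5 (the slide does not state the threshold); the names `groups`, `minsup_count`, `brute_force`, and `closed_form` are introduced here for illustration. Itemsets that mix items from different groups appear in no transaction, so only subsets within a single group need to be enumerated.

```python
from itertools import combinations
from math import comb

# Three groups of ten items; TIDs 1-5 contain all A items,
# TIDs 6-10 all B items, TIDs 11-15 all C items.
groups = {g: [f"{g}{i}" for i in range(1, 11)] for g in "ABC"}
transactions = [set(groups[g]) for g in "ABC" for _ in range(5)]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

minsup_count = 5  # assumed threshold, not given on the slide

# Enumerate every non-empty subset inside each group and count the frequent ones.
brute_force = sum(
    1
    for items in groups.values()
    for k in range(1, 11)
    for subset in combinations(items, k)
    if support_count(subset) >= minsup_count
)

# Closed form from the slide: 3 * sum_{k=1}^{10} C(10, k) = 3 * (2**10 - 1).
closed_form = 3 * sum(comb(10, k) for k in range(1, 11))

print(brute_force, closed_form)  # 3069 3069
```

Every one of these 3069 itemsets has support 5, the same as the full ten-item set of its group, which is why most of them are redundant.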
• An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: itemset lattice over items A to E, with the maximal itemsets marked and a border separating them from the infrequent itemsets. Lattice levels:]
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset    Support
{A}        4
{B}        5
{C}        3
{D}        4
{A,B}      4
{A,C}      2
{A,D}      3
{B,C}      3
{B,D}      4
{C,D}      3
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
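The closed-itemset definition can be applied mechanically to the five transactions in the table above. This is a minimal sketch; the names `support`, `is_closed`, and `closed` are introduced for this example only.

```python
from itertools import combinations

# The five transactions from the TID / Items table.
transactions = [set("AB"), set("BCD"), set("ABCD"), set("ABD"), set("ABCD")]
items = sorted(set().union(*transactions))

# Support count of every itemset that occurs in at least one transaction.
support = {}
for k in range(1, len(items) + 1):
    for s in combinations(items, k):
        count = sum(1 for t in transactions if set(s) <= t)
        if count > 0:
            support[frozenset(s)] = count

def is_closed(itemset):
    """True if no immediate superset has the same support as `itemset`."""
    return all(
        support.get(itemset | {x}, 0) != support[itemset]
        for x in items
        if x not in itemset
    )

closed = sorted(
    ("".join(sorted(s)) for s in support if is_closed(s)),
    key=lambda name: (len(name), name),
)
print(closed)  # ['B', 'AB', 'BD', 'ABD', 'BCD', 'ABCD']
```

For instance, {A} is not closed because {A,B} has the same support of 4, while {B} is closed because each of its immediate supersets has support below 5.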
Maximal vs. Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over items A to E, each node annotated with the ids of the transactions that contain it]

Itemset  Transaction ids
A        1 2 4
B        1 2 3
C        1 2 3 4
D        2 4 5
E        3 4 5
AB       1 2
AC       1 2 4
AD       2 4
AE       4
BC       1 2 3
BD       2
BE       3
CD       2 4
CE       3 4
DE       4 5
ABC      1 2
ABD      2
ACD      2 4
ACE      4
ADE      4
BCD      2
BCE      3
CDE      4
ABCD     2
ACDE     4

Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE

Maximal vs. Closed Frequent Itemsets

Minimum support = 2

[Figure: the same annotated lattice, with the closed frequent itemsets highlighted]
Closed but not maximal: C, D, E, AC, BC
Closed and maximal: CE, DE, ABC, ACD
# Closed = 9
# Maximal = 4
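The two counts can be reproduced from the five transactions and the minimum support of 2 given on the slide. The sketch below is one straightforward way to do it by brute force; the helper names (`tids`, `immediate_supersets`, `frequent`, `maximal`, `closed`) are assumptions made for this example.

```python
from itertools import combinations

# Transactions from the slide, keyed by TID.
transactions = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
minsup = 2
items = sorted(set("".join(transactions.values())))

def tids(itemset):
    """Ids of the transactions that contain every item of `itemset`."""
    return {tid for tid, t in transactions.items() if set(itemset) <= set(t)}

# Support count of every frequent itemset.
frequent = {}
for k in range(1, len(items) + 1):
    for s in combinations(items, k):
        n = len(tids(s))
        if n >= minsup:
            frequent[frozenset(s)] = n

def immediate_supersets(itemset):
    return [itemset | {x} for x in items if x not in itemset]

# Maximal frequent: no immediate superset is frequent.
maximal = [s for s in frequent
           if not any(sup in frequent for sup in immediate_supersets(s))]

# Closed frequent: no immediate superset has the same support.
# An infrequent superset has support below minsup, so it can never match.
closed = [s for s, n in frequent.items()
          if all(frequent.get(sup, 0) != n for sup in immediate_supersets(s))]

def names(itemsets):
    return sorted(("".join(sorted(s)) for s in itemsets),
                  key=lambda name: (len(name), name))

print(len(closed), names(closed))    # 9 ['C', 'D', 'E', 'AC', 'BC', 'CE', 'DE', 'ABC', 'ACD']
print(len(maximal), names(maximal))  # 4 ['CE', 'DE', 'ABC', 'ACD']
```

Every maximal frequent itemset is also closed, since a superset with equal support would itself be frequent; the converse does not hold, as C, D, E, AC, and BC show.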