
Direct Hashing and Pruning (Park-Chen-Yu)


Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Is there some “magic” way to reduce the size of C2?

Direct Hashing and Pruning

• Hash-based heuristic of generating candidate sets of high likelihood of being large itemsets
• Basic Idea:
– Use hashing to filter out unnecessary itemsets for the next candidate itemset generation
• Implementation:
– Accumulate information about (k+1)-itemsets in advance in such a way so that all possible (k+1)-itemsets of each transaction after some pruning are hashed into a hash table
• Each bucket in the hash table consists of the count of itemsets that have been hashed into the bucket so far
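
A minimal sketch of the bucket-counting idea for k = 1, using database D from the example. The hash function `bucket` and the table size of 7 are assumptions chosen for illustration; the slides do not specify them. The minimum support count of 2 is inferred from L1 dropping {4}.

```python
from collections import Counter
from itertools import combinations

# Database D from the example (TID -> items).
D = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}
MINSUP = 2        # inferred: {4} with support 1 is dropped from L1
NUM_BUCKETS = 7   # assumed hash table size


def bucket(pair):
    """Hypothetical hash function for 2-itemsets (illustration only)."""
    x, y = pair
    return (x * 10 + y) % NUM_BUCKETS


def first_pass(db):
    """Count single items and, in the same scan, hash every 2-itemset of
    each transaction into a bucket and increment that bucket's count."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for items in db.values():
        item_counts.update(items)
        for pair in combinations(sorted(items), 2):
            bucket_counts[bucket(pair)] += 1
    return item_counts, bucket_counts


def candidate_pairs(item_counts, bucket_counts):
    """Build C2 from L1, keeping only pairs whose bucket count reaches MINSUP."""
    l1 = sorted(i for i, c in item_counts.items() if c >= MINSUP)
    return [p for p in combinations(l1, 2) if bucket_counts[bucket(p)] >= MINSUP]


item_counts, bucket_counts = first_pass(D)
print(bucket_counts)                                # [3, 1, 2, 0, 3, 1, 3]
print(candidate_pairs(item_counts, bucket_counts))  # [(1, 3), (2, 3), (2, 5), (3, 5)]
```

With this particular hash function, {1 2} and {1 5} fall into buckets whose counts are below the threshold, so C2 shrinks from six candidates to four. A bucket count is only an upper bound on the support of each itemset hashed into it, so the surviving candidates still have to be counted exactly in the next scan.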

Data Mining: Association Rules 29 Data Mining: Association Rules 30

Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
• If |L| = n, then there are 2^n - 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Rule Generation

• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
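
A small sketch that enumerates the candidate rules of one itemset and checks the confidence chain numerically. The five transactions are borrowed from the Closed Itemset example later in this deck; the function names are illustrative only.

```python
from itertools import combinations

# Transactions taken from the Closed Itemset example in this deck.
TRANSACTIONS = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def candidate_rules(itemset):
    """All rules lhs -> rhs with non-empty lhs and rhs that partition itemset."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules


def confidence(lhs, rhs):
    """c(lhs -> rhs) = support(lhs and rhs together) / support(lhs)."""
    return support(lhs + rhs) / support(lhs)


rules = candidate_rules("ABCD")
print(len(rules))                               # 14, i.e. 2**4 - 2
print(confidence(("A", "B", "C"), ("D",)))      # 1.0
print(confidence(("A", "B"), ("C", "D")))       # 0.5
print(confidence(("A",), ("B", "C", "D")))      # 0.5
```

All three rules share the numerator support(ABCD), and shrinking the antecedent can only keep or raise its support, so the confidence can only stay the same or drop as items move to the right-hand side.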

Rule Pruning

[Figure: lattice of rules for the itemset ABCD, with one rule marked "Low Confidence Rule" and its descendants marked "Pruned Rules"]
ABCD=>{ }
BCD=>A   ACD=>B   ABD=>C   ABC=>D
CD=>AB   BD=>AC   BC=>AD   AD=>BC   AC=>BD   AB=>CD
D=>ABC   C=>ABD   B=>ACD   A=>BCD

Rule Generation

• Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• join(CD → AB, BD → AC) would produce the candidate rule D → ABC
• Prune rule D → ABC if its subset AD → BC does not have high confidence
[Figure: CD → AB and BD → AC merge into D → ABC]
Rule Generation Algorithm

Algorithm to Generate Association Rules

Key fact:
Moving items from the antecedent to the consequent never changes support, and never increases confidence

Algorithm
– For each itemset I with minsup:
• Find all minconf rules with a single consequent of the form (I - L1 ⇒ L1)
• repeat:
  • Guess candidate consequents Ck by appending items from I - Lk-1 to Lk-1
  • Verify confidence of each rule I - Ck ⇒ Ck using known itemset support values
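
The steps above can be sketched as follows. This is one possible reading of the algorithm, not the slides' own code: the subset check on grown consequents is an added pruning step justified by the key fact, the transactions are borrowed from the Closed Itemset example in this deck, and the threshold `minconf=0.75` is an arbitrary choice for the demonstration.

```python
from itertools import combinations

# Transactions taken from the Closed Itemset example in this deck.
TRANSACTIONS = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]


def support_table(transactions):
    """Support count of every non-empty itemset over the items seen."""
    items = sorted(set().union(*transactions))
    table = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            table[s] = sum(1 for t in transactions if s <= t)
    return table


def generate_rules(itemset, support, minconf):
    """Return (antecedent, consequent, confidence) for one frequent itemset,
    growing the consequent one item at a time."""
    I = frozenset(itemset)
    rules = []
    candidates = {frozenset([item]) for item in I}  # single-item consequents
    k = 1
    while candidates and k < len(I):
        confident = set()
        for c in candidates:
            # Verify confidence using known support values only.
            conf = support[I] / support[I - c]
            if conf >= minconf:
                rules.append((I - c, c, conf))
                confident.add(c)
        # Append one item from the antecedent to each confident consequent.
        k += 1
        candidates = set()
        for c in confident:
            for item in I - c:
                grown = c | {item}
                # Assumed pruning: every (k-1)-subset must itself be confident.
                if all(frozenset(s) in confident
                       for s in combinations(grown, k - 1)):
                    candidates.add(grown)
    return rules


support = support_table(TRANSACTIONS)
rules = generate_rules("ABCD", support, minconf=0.75)
for lhs, rhs, conf in sorted(rules, key=lambda r: sorted(r[0])):
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)), round(conf, 2))
# ABC => D 1.0
# AC => BD 1.0
# ACD => B 1.0
```

Only 5 of the 14 candidate rules are ever evaluated here: the four single-consequent rules plus AC ⇒ BD. Consequents containing A or C are never grown because BCD ⇒ A and ABD ⇒ C already fall below the threshold.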


Factors Affecting Complexity

• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
• Average transaction width
– transaction width increases with denser data sets
– may increase max length of frequent itemsets

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

• Number of frequent itemsets = $3 \times \sum_{k=1}^{10} \binom{10}{k}$
• Need a compact representation
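
A quick check of the count, as a sketch. It assumes a support threshold low enough (at most 5 transactions) that every non-empty subset of A1..A10, of B1..B10 and of C1..C10 is frequent; the slide does not state the threshold explicitly.

```python
from math import comb

GROUPS = 3            # item groups A, B and C in the table
ITEMS_PER_GROUP = 10  # A1..A10, B1..B10, C1..C10

# Non-empty subsets of one group: sum over k = 1..10 of C(10, k).
per_group = sum(comb(ITEMS_PER_GROUP, k) for k in range(1, ITEMS_PER_GROUP + 1))
total = GROUPS * per_group

print(per_group)  # 1023, which equals 2**10 - 1
print(total)      # 3069
```

All 1023 frequent itemsets within a group have the same support of 5, which is why storing every one of them is redundant.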

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: itemset lattice over items A to E, with a border separating the frequent itemsets from the infrequent itemsets; the maximal itemsets are the frequent itemsets along that border]
null
A   B   C   D   E
AB   AC   AD   AE   BC   BD   BE   CD   CE   DE
ABC   ABD   ABE   ACD   ACE   ADE   BCD   BCE   BDE   CDE
ABCD   ABCE   ABDE   ACDE   BCDE
ABCDE

Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID   Items
1     {A,B}
2     {B,C,D}
3     {A,B,C,D}
4     {A,B,D}
5     {A,B,C,D}

Itemset   Support
{A}       4
{B}       5
{C}       3
{D}       4
{A,B}     4
{A,C}     2
{A,D}     3
{B,C}     3
{B,D}     4
{C,D}     3

Itemset     Support
{A,B,C}     2
{A,B,D}     3
{A,C,D}     2
{B,C,D}     3
{A,B,C,D}   2
Maximal vs. Closed Itemsets

TID   Items
1     ABC
2     ABCD
3     BCE
4     ACDE
5     DE

[Figure: itemset lattice annotated with the ids of the transactions that contain each itemset]
null
A: 1,2,4   B: 1,2,3   C: 1,2,3,4   D: 2,4,5   E: 3,4,5
AB: 1,2   AC: 1,2,4   AD: 2,4   AE: 4   BC: 1,2,3   BD: 2   BE: 3   CD: 2,4   CE: 3,4   DE: 4,5
ABC: 1,2   ABD: 2   ACD: 2,4   ACE: 4   ADE: 4   BCD: 2   BCE: 3   CDE: 4
ABCD: 2   ACDE: 4
Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE

Maximal vs. Closed Frequent Itemsets

Minimum support = 2
[Figure: the same lattice, with frequent itemsets marked either "Closed but not maximal" or "Closed and maximal"]
# Closed = 9
# Maximal = 4
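
The two counts can be reproduced with a brute-force sketch over the five transactions of this example. It applies the definitions literally (checking immediate supersets only) and is meant as an illustration, not an efficient mining algorithm; all names are illustrative.

```python
from itertools import combinations

# Transactions and threshold from the example.
TRANSACTIONS = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
MINSUP = 2
ITEMS = sorted(set("".join(TRANSACTIONS.values())))


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS.values() if set(itemset) <= set(t))


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one more item."""
    return [itemset | {i} for i in ITEMS if i not in itemset]


# Enumerate every itemset and keep the frequent ones with their support.
frequent = {}
for k in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, k):
        s = support(combo)
        if s >= MINSUP:
            frequent[frozenset(combo)] = s

# Closed: no immediate superset has the same support.
closed = [x for x, s in frequent.items()
          if all(support(sup) < s for sup in immediate_supersets(x))]
# Maximal: no immediate superset is frequent.
maximal = [x for x in frequent
           if all(support(sup) < MINSUP for sup in immediate_supersets(x))]


def label(itemsets):
    return sorted("".join(sorted(x)) for x in itemsets)


print(len(frequent))   # 14
print(label(closed))   # ['ABC', 'AC', 'ACD', 'BC', 'C', 'CE', 'D', 'DE', 'E']
print(label(maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal frequent itemset also appears in the closed list, while C, D, E, AC and BC are closed but not maximal.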


Maximal vs. Closed Itemsets

[Figure: nested sets; Maximal Frequent Itemsets inside Closed Frequent Itemsets inside Frequent Itemsets]

Subsequent Research on Association Rules

• Mining association rules from sequences
e.g. stocks with similar movements in stock prices, grocery items bought over a sequence of visits, etc.
• Finding "interesting" rules
– Low-support, high-correlation mining
• Efficiently handling long itemsets
• Integration with query optimizers
• Adjustments to handle dense/relational databases
• Apply constraints to further filter association rules