1.2 Association Rule Mining: Abdulfetah Abdulahi A
Market-basket data: sales transaction records

transaction id | customer id | products bought
T1 | C33 | p2, p5, p8
T2 | C45 | p5, p8, p11
T3 | C12 | p1, p9
T4 | C14 | p5, p8, p11
T5 | C12 | p2, p9
T6 | C12 | p9
Basic Concepts: Frequent Patterns and
Association Rules
Example rules (support, confidence):
A => C (50%, 66.7%)
C => A (50%, 100%)
Association Rule: Basic Concepts
Definition: Frequent Itemset

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Soda, Eggs
3 | Milk, Diaper, Soda, Coke
4 | Bread, Milk, Diaper, Soda
5 | Bread, Milk, Diaper, Coke

• Itemset
  – A collection of one or more items
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold
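The two support measures above can be checked with a minimal Python sketch over the same 5-transaction table (the helper names `support_count` and `support` are ours, not the slides'):

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Soda", "Eggs"},
    {"Milk", "Diaper", "Soda", "Coke"},
    {"Bread", "Milk", "Diaper", "Soda"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```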
Definition: Association Rule

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Soda, Eggs
3 | Milk, Diaper, Soda, Coke
4 | Bread, Milk, Diaper, Soda
5 | Bread, Milk, Diaper, Coke

• Association rule
  – An implication expression of the form X => Y, where X and Y are itemsets
  – Example: {Milk, Diaper} => {Soda}
• Rule evaluation metrics
  – Support of the rule
    • Either expressed in count form or percentage form
    • E.g. σ({Milk, Diaper, Soda}) = 2, so s = 2/5
  – Confidence (c)
    • In general, for LHS => RHS, confidence = P(RHS | LHS)
    • E.g. c = σ({Milk, Diaper, Soda}) / σ({Milk, Diaper}) = 2/3
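Support and confidence of the example rule can be computed directly from the table (a sketch; the helper name `sigma` is ours):

```python
# Same five transactions as in the definition slide above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Soda", "Eggs"},
    {"Milk", "Diaper", "Soda", "Coke"},
    {"Bread", "Milk", "Diaper", "Soda"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

lhs, rhs = {"Milk", "Diaper"}, {"Soda"}
s = sigma(lhs | rhs) / len(transactions)  # support of the rule
c = sigma(lhs | rhs) / sigma(lhs)         # confidence = P(RHS | LHS)
print(s, round(c, 3))  # 0.4 0.667
```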
Exercise

Outlook | Temperature | Humidity | Play
sunny | hot | high | no
sunny | hot | high | no
overcast | hot | high | yes
rainy | mild | high | yes
rainy | cool | normal | yes
rainy | cool | normal | no
overcast | cool | normal | yes
sunny | mild | high | no
sunny | cool | normal | yes
rainy | mild | normal | yes
sunny | mild | normal | yes
overcast | mild | high | yes
overcast | hot | normal | yes
rainy | mild | high | no

• Minimum support = 2: are these frequent?
  – {sunny, hot, no}
  – {sunny, hot, high, no}
  – {rainy, normal}
• Minimum support = 3:
  – ?
• How strong is {sunny, no}?
  – Count = ?
  – Percentage = ?
• What is the confidence of Outlook=sunny => Play=no?
Types of Association Rules

• Quantitative
  – age(X, "30…39") and income(X, "42K…48K") => buys(X, TV)
• Single vs. multi-dimensional
  – Single: buys(X, computer) => buys(X, "financial software")
  – Multi: the quantitative example above
• Levels of abstraction
  – age(X, ..) => buys(X, "laptop computer")
  – age(X, ..) => buys(X, "computer")
• Extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Max-patterns (a frequent pattern such that no proper super-pattern is frequent) and
    closed itemsets (c is closed if there exists no proper superset c' of c such that any
    transaction containing c also contains c')
Closed and Max Patterns/Itemsets

• Closed itemset: no superset (parent) has the same support as the itemset.
• Maximal itemset: a frequent itemset all of whose supersets (parents) are infrequent.

Min_sup = 2

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F

Example:
• {C} is closed, as the supports of its supersets, e.g. {A, C}:2, {B, C}:2, {C, D}:1,
  {C, E}:2, are not equal to the support of {C}:3. The same holds for {A, C}, {B, E}
  and {B, C, E}.
• {A, C} is maximal, as all its supersets {A, B, C}, {A, C, D}, {A, C, E} are
  infrequent. The same holds for {B, C, E}.
Find frequent Max Patterns

Outlook | Temperature | Humidity | Play
sunny | hot | high | no
sunny | hot | high | no
overcast | hot | high | yes
rainy | mild | high | yes
rainy | cool | normal | yes
rainy | cool | normal | no
overcast | cool | normal | yes
sunny | mild | high | no
sunny | cool | normal | yes
rainy | mild | normal | yes
sunny | mild | normal | yes
overcast | mild | high | yes
overcast | hot | normal | yes
rainy | mild | high | no

• Minimum support = 2
  – {sunny, hot, no} ??
Find closed Patterns/itemsets
TID Items
Minsup=2
10 a, b, c
20 a, b, c Frequent itemset?
30 a, b, d Closed itemsets?
40 a, b, d,
50 c, e, f
12
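A brute-force answer check for this exercise can be sketched in Python (variable and set names are ours; exhaustive enumeration is fine at this toy scale):

```python
from itertools import combinations

# The five transactions from the exercise, with minsup = 2.
transactions = [set("abc"), set("abc"), set("abd"), set("abd"), set("cef")]
min_sup = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every itemset and keep the frequent ones with their supports.
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        fs = frozenset(combo)
        if support(fs) >= min_sup:
            frequent[fs] = support(fs)

# Closed: no proper superset with the same support.
closed = {X for X in frequent
          if not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)}
# Maximal: no proper superset is frequent at all.
maximal = {X for X in frequent if not any(X < Y for Y in frequent)}

print(sorted("".join(sorted(X)) for X in closed))   # ['ab', 'abc', 'abd', 'c']
print(sorted("".join(sorted(X)) for X in maximal))  # ['abc', 'abd']
```

Note that every maximal itemset is also closed, but not vice versa: {a, b} has support 4 while its supersets {a, b, c} and {a, b, d} have support 2, so it is closed yet not maximal.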
Example max and closed itemsets (min_sup s = 3)

Itemset | Support | Maximal (s=3) | Closed
A | 4 | No | No
Association Rule Mining Task
Two methods:
Apriori and FP-growth
Mining Association Rules—an Example

Transaction-id | Items
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
Method 1:
Apriori: A Candidate Generation-and-test Approach
The Apriori Algorithm — An Example

minSup = 0.5

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 counts:
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3: {B, C, E}

3rd scan → L3:
Itemset | sup
{B, C, E} | 2
The Apriori Algorithm

• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
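The pseudo-code above can be rendered as a compact Python sketch (the function and variable names are ours, not the slides'), run here on the 4-transaction database from the worked example:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: generate, prune, and count candidates per pass."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count single items and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(Lk)
    k = 1
    while Lk:
        prev = set(Lk)
        # Candidate generation: join pairs of frequent k-itemsets ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # ... and prune candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # One scan of the database counts all surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_count}
        result.update(Lk)
        k += 1
    return result

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(db, min_count=2)  # minSup = 0.5 over 4 transactions
print(freq[frozenset({"B", "C", "E"})])  # 2
```

With min_count = 2 this recovers exactly the L1, L2, and L3 of the example: 4 frequent single items, 4 frequent pairs, and {B, C, E} with support 2.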
Important Details of Apriori

• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• Example of candidate generation, with L3 = {abc, abd, acd, ace, bcd}:
  – Self-joining: L3*L3
    • abcd from abc and abd
    • acde from acd and ace
  – Pruning:
    • acde is removed because ade is not in L3
  – C4 = {abcd}
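The two steps can be sketched on the same L3 example (a sketch; the function name `generate_candidates` is ours, and itemsets are represented as sorted tuples so the join can compare (k-1)-prefixes):

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk on shared (k-1)-prefixes, then prune by the Apriori property."""
    Lk = sorted(Lk)
    joined = set()
    for i, a in enumerate(Lk):
        for b in Lk[i + 1:]:
            if a[:k - 1] == b[:k - 1]:  # join step: first k-1 items agree
                joined.add(tuple(sorted(set(a) | set(b))))
    # prune step: every k-subset of a candidate must itself be frequent
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 3))  # {('a', 'b', 'c', 'd')}
```

As on the slide, abcd survives (all of abc, abd, acd, bcd are in L3) while acde is pruned because ade is not in L3.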
Derive rules from frequent itemsets
Example – deriving rules from frequent itemsets
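One standard way to derive rules, sketched in Python on the frequent itemset {B, C, E} from the Apriori example (the names `sigma` and `rules_from` are ours): for every non-empty proper subset LHS of a frequent itemset, emit LHS => (itemset − LHS) and keep it if its confidence reaches min_conf.

```python
from itertools import combinations

# The 4-transaction database from the Apriori example.
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset, min_conf):
    """All rules LHS => RHS partitioning the itemset, with conf >= min_conf."""
    itemset = frozenset(itemset)
    out = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = sigma(itemset) / sigma(lhs)
            if conf >= min_conf:
                out.append((set(lhs), set(itemset - lhs), conf))
    return out

for lhs, rhs, conf in rules_from({"B", "C", "E"}, min_conf=1.0):
    print(sorted(lhs), "=>", sorted(rhs), conf)
```

Note the rule support is the same for every partition of the itemset; only the confidence changes with the choice of LHS. Here {B, C} => {E} and {C, E} => {B} reach confidence 1.0, while e.g. {B} => {C, E} only reaches 2/3.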
How to Count Supports of Candidates?
Is Apriori Fast Enough? — Performance Bottlenecks
Mining Frequent Patterns Without Candidate Generation
Example

Items bought
f, a, c, d, g, i, m, p
a, b, c, f, l, m, o
b, f, h, j, o
b, c, k, s, p
a, f, c, e, l, p, m, n
Scan the database
Scanning TID=100

TID | Items
100 | f, a, c, d, g, i, m, p

Keep only the frequent items, sorted in frequency-descending order (f, c, a, m, p), and insert them as one path from the root:

{}
└ f:1
  └ c:1
    └ a:1
      └ m:1
        └ p:1

Header table (item : count): f:1, c:1, a:1, m:1, p:1
Scanning TID=200

Frequent single items: F1 = <f, c, a, b, m, p>

TID=200 (a, b, c, f, l, m, o) reduces to <f, c, a, b, m> after discarding infrequent items and sorting. Along the first branch <f, c, a, m, p>, intersect: the shared prefix is <f, c, a>.
Scanning TID=200

TID | Items
200 | f, c, a, b, m

{}
└ f:2
  └ c:2
    └ a:2
      ├ m:1
      │ └ p:1
      └ b:1
        └ m:1

Header table counts: f:2, c:2, a:2, b:1, m:2, p:1
The final FP-tree

Transaction database:
TID | Items
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

Frequent 1-items in frequency-descending order: f, c, a, b, m, p

{}
├ f:4
│ ├ c:3
│ │ └ a:3
│ │   ├ m:2
│ │   │ └ p:2
│ │   └ b:1
│ │     └ m:1
│ └ b:1
└ c:1
  └ b:1
    └ p:1

Header table: each item (f, c, a, b, m, p) points to its nodes in the tree via node-links.
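A minimal FP-tree builder for this example can be sketched as follows (the class and helper names are ours; ties between equally frequent items may be ordered differently than the slide's f, c, a, b, m, p, which can reorder sibling branches but not the item counts):

```python
from collections import Counter, defaultdict

class Node:
    """One FP-tree node: an item, a count, a parent link, and children by item."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: global item frequencies; keep only frequent items, ranked.
    freq = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, n) in enumerate(freq.most_common()) if n >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> list of its tree nodes (node-links)
    # Second scan: insert each transaction's frequent items in rank order,
    # so transactions sharing a prefix share tree nodes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"),
      list("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)
print(root.children["f"].count)  # 4
```

As on the slide, the root has two children (f and c), the f-node accumulates count 4, and there is a single f-node in the whole tree thanks to prefix sharing.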
FP-Tree Construction
How to Mine an FP-tree?
Conditional Pattern Base

• For each frequent item, collect the prefix paths leading to its nodes, each weighted by that node's count.
• E.g. {p}'s conditional pattern base from the tree above: <f, c, a, m>: 2, <c, b>: 1
Conditional Pattern Tree

• {m}'s conditional pattern base: <f, c, a>: support = 2; <f, c, a, b>: support = 1
• {m}'s conditional pattern tree keeps the items with support ≥ 2: <f:3, c:3, a:3> (b is dropped, support 1)
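{m}'s conditional pattern base can be computed straight from the ordered transactions, a shortcut equivalent to following m's node-links in the tree (a sketch; variable names are ours, and the item order f, c, a, b, m, p is taken from the final FP-tree slide):

```python
from collections import Counter

# The five transactions of the running example.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"),
      list("afcelpmn")]
order = "fcabmp"  # frequency-descending order of the frequent items

# Rewrite each transaction as its frequent items in rank order.
ordered = [[i for i in order if i in t] for t in db]

# {m}'s conditional pattern base: the prefix before each occurrence of m.
base = Counter()
for t in ordered:
    if "m" in t:
        base[tuple(t[:t.index("m")])] += 1
print(dict(base))  # {('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1}

# Summing item counts over the base gives {m}'s conditional FP-tree items.
cond = Counter()
for prefix, n in base.items():
    for i in prefix:
        cond[i] += n
print({i: c for i, c in cond.items() if c >= 2})  # {'f': 3, 'c': 3, 'a': 3}
```

This reproduces the slide: the base is <f, c, a>:2 and <f, c, a, b>:1, and with min support 2 the conditional tree is the single path <f:3, c:3, a:3>, with b pruned.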
Composition of patterns α and β
Why Is Frequent Pattern Growth Fast?
• Divide-and-conquer:
– decompose both the mining task and DB according to the
frequent patterns obtained so far
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops—counting and FP-tree building, not pattern search
and matching
Multi-Dimensional Association:
Concepts

• Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")
• Multi-dimensional rules: 2 or more dimensions or predicates
  – Inter-dimension association rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
• Categorical attributes
  – finite number of possible values, no ordering among values
• Quantitative attributes
  – numeric, implicit ordering among values
From association mining to correlation analysis
• Association rule mining using Weka.
• Next: Classification
Reading from the textbook: