Lec5 Association Rule
Association Rules
Motivation
Discovering relations among transactional data
Example – market basket analysis
Discovery of buying habits of customers: what items are frequently purchased by a customer in a single trip?
Help developing market strategies
Issues:
How to formulate association rules
How to determine interesting association rules
How to discover interesting association rules efficiently in large data sets?
Formulating Association Rules
Example: "a customer that purchases coffee tends to also buy sugar" is represented as:
    coffee => sugar [support = 10%, confidence = 70%]
support = 10%: 10% of all customers purchase both coffee and sugar
confidence = 70%: 70% of the customers who buy coffee also buy sugar
Thresholds: support must be at least r, confidence at least c
Users set thresholds to indicate interestingness

Example transaction database:
    1 coffee, bread
    2 coffee, meat, apple
    3 coffee, sugar, noodle, salt
    4 coffee, sugar, orange, potato
    5 coffee, sugar, tomato
    6 bread, sugar, bean
    7 milk, egg
    8 milk, fish
In this database:
    Total customers: 8
    Customers who bought coffee: 5
    Customers who bought both coffee and sugar: 3
    Support: 3/8 ≈ 37%
    Confidence: 3/5 = 60%
Formulating Association Rule (cont.)
In terms of probability
Let X = (X1, X2) be defined as follows: for a random customer c, X1 = 1 if c buys coffee, and 0 otherwise; X2 = 1 if c buys sugar, and 0 otherwise
coffee => sugar [support = 10%, confidence = 70%] is interpreted as:
    p(X1 = 1, X2 = 1) = 10% and p(X2 = 1 | X1 = 1) = 70%
or simply
    p(coffee, sugar) = 10% and p(sugar | coffee) = 70%
Formulating Association Rule (cont.)
Concepts
I = {i1,…, im} is a set of items
D = {T1,…, Tn} is a set where for all i, Ti ⊆ I (Ti is called a transaction; D is referred to as a transaction database)
An association rule is an implication A => B, where A, B ⊆ I and A ∩ B = ∅
A => B holds in D with support s and confidence r if
    s = |{T : A ∪ B ⊆ T & T ∈ D}| / |D|
    r = |{T : A ∪ B ⊆ T & T ∈ D}| / |{T : A ⊆ T & T ∈ D}|
Formulating Association Rule (cont.)
(Recall: I = {i1,…, im}, D = {T1,…, Tn}, A ⊆ I, B ⊆ I, A ∩ B = ∅)
Association rule A => B is valid with respect to the support threshold r and confidence threshold c if A => B holds with a support s ≥ r and confidence f ≥ c
Additional concepts
    k-itemset: any subset of I that contains exactly k items
    Occurrence frequency of itemset t, denoted frequency(t): the number of transactions in D that contain t (another term used: support count)
    Itemset t is frequent with respect to support threshold r if frequency(t)/|D| ≥ r
Implication: A ∪ B being frequent with respect to r is a necessary condition for A => B to be valid
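These definitions translate directly into set operations. Below is a minimal Python sketch of them (the function names frequency, support, and confidence are ours, chosen to mirror the slide's notation); transactions are represented as Python sets:

def frequency(itemset, D):
    # Occurrence frequency (support count): the number of transactions
    # in D (a list of sets) that contain itemset (a set).
    return sum(1 for T in D if itemset <= T)

def support(A, B, D):
    # Support of A => B: fraction of transactions containing A ∪ B.
    return frequency(A | B, D) / len(D)

def confidence(A, B, D):
    # Confidence of A => B: among transactions containing A,
    # the fraction that also contain B.
    return frequency(A | B, D) / frequency(A, D)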
Formulating Association Rule
Let I = {apple, bread, bean, coffee, egg, fish, milk, meat, noodle, orange, potato, salt, sugar, tomato}
Let D be the transaction set below:
    1 coffee, bread
    2 coffee, meat, apple
    3 coffee, sugar, noodle, salt
    4 coffee, sugar, orange, potato
    5 coffee, sugar, tomato
    6 bread, sugar, bean
    7 milk, egg
    8 milk, fish
Let the support threshold be 30% and the confidence threshold be 60%
Consider association rule {coffee} => {sugar}
    The occurrence frequency of {coffee, sugar} is 3
    {coffee, sugar} is a frequent 2-itemset, since 3/8 ≥ 30%
    The occurrence frequency of {coffee} is 5
    The confidence for {coffee} => {sugar} is 3/5 ≥ 60%
    So {coffee} => {sugar} is a valid association rule w.r.t. the given support and confidence thresholds
Formulating Association Rule
With the same I, D, and thresholds (support 30%, confidence 60%):
Consider association rule {milk} => {egg}
    The occurrence frequency of {milk, egg} is 1
    {milk, egg} is not a frequent 2-itemset, since 1/8 < 30%
    {milk} => {egg} is not a valid association rule w.r.t. the given thresholds
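Both examples can be checked mechanically. Here is a small self-contained Python sketch (variable and function names are ours) that reproduces the two calculations on the 8-transaction database above:

# The 8-transaction database from the example.
D = [
    {"coffee", "bread"},
    {"coffee", "meat", "apple"},
    {"coffee", "sugar", "noodle", "salt"},
    {"coffee", "sugar", "orange", "potato"},
    {"coffee", "sugar", "tomato"},
    {"bread", "sugar", "bean"},
    {"milk", "egg"},
    {"milk", "fish"},
]

def freq(itemset):
    # Occurrence frequency: number of transactions containing itemset.
    return sum(1 for T in D if itemset <= T)

for A, B in [({"coffee"}, {"sugar"}), ({"milk"}, {"egg"})]:
    s = freq(A | B) / len(D)     # support of A => B
    c = freq(A | B) / freq(A)    # confidence of A => B
    print(sorted(A), "=>", sorted(B),
          f"support={s:.1%}", f"confidence={c:.0%}",
          "valid" if s >= 0.30 and c >= 0.60 else "not valid")
# ['coffee'] => ['sugar'] support=37.5% confidence=60% valid
# ['milk'] => ['egg'] support=12.5% confidence=50% not valid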
Mining Association Rules
Goal: discover all the valid association rules with respect to the given support threshold r and confidence threshold c
Steps:
    Find all frequent itemsets w.r.t. r
    Generate association rules from the frequent itemsets w.r.t. c
Approaches to frequent itemset searching
    Naive approach
        scan the itemset space
        for each itemset, count its frequency (scanning all the transactions) and compare with r
        high cost – the number of itemsets is huge
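To make the cost concrete, a naive search must visit every one of the 2^|I| – 1 nonempty itemsets and scan all of D for each. A minimal sketch (our own illustration, not an implementation from the lecture):

from itertools import combinations

def naive_frequent_itemsets(I, D, r):
    # Enumerate every nonempty subset of I (2^|I| - 1 of them) and,
    # for each, scan all transactions to count its frequency.
    # Exponential in |I| - hopeless for realistic item catalogs.
    frequent = []
    for k in range(1, len(I) + 1):
        for itemset in combinations(sorted(I), k):
            s = set(itemset)
            count = sum(1 for T in D if s <= T)   # one full scan of D
            if count / len(D) >= r:
                frequent.append(s)
    return frequent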
A naive approach for finding all frequent itemsets??
[Figure: the itemset lattice over items {A, B, C, D, E} – null at the top, then the 1-itemsets A … E, the 2-itemsets AB … DE, the 3-itemsets ABC … CDE, and so on down to ABCDE; a naive search must count every node in this lattice]
Apriori Algorithm for AR Mining
Apriori property
Let t1 and t2 be any itemsets with t2 ⊆ t1. Then
    t1 is frequent => t2 is frequent
or equivalently, t2 is not frequent => t1 is not frequent
So if we know that an itemset is not frequent, then there is no need to check its supersets
Based on the second (contrapositive) form, we can prune the search space
After pruning, the remaining itemsets are called candidate itemsets
For each candidate itemset, we count the transactions that contain it to determine if it is frequent
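In code, the pruning is a subset check performed during candidate generation. A minimal sketch (the helper name has_infrequent_subset is our own):

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    # Apriori property: if any (k-1)-subset of a k-itemset is not
    # frequent, the k-itemset cannot be frequent and is pruned.
    # frequent_prev is the set of frequent (k-1)-itemsets (frozensets).
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(sorted(candidate), k - 1))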
Illustrating the Apriori principle
[Figure: the same itemset lattice – once one itemset (e.g., AB) is found to be not frequent, all of its supersets (ABC, ABD, ABE, …, up to ABCDE) are pruned from the search]
Apriori Algorithm (cont.)
Assumes the items are ordered within each itemset, as well as within transactions
Works in ascending order of k (the itemset size)
1. Find all the frequent 1-itemsets (by counting)
2. Join (i.e., union) each pair of frequent 1-itemsets into a 2-itemset
3. Join each pair of frequent (k-1)-itemsets into a k-itemset
4. Among them, generate the candidate k-itemsets
5. Get the transaction count for each candidate k-itemset and then collect the frequent ones
6. Repeat this process until the candidate set becomes empty
Issues
    How to join (step 3)?
    How to generate (step 4)?
Apriori Algorithm (cont.)
Let U and V be a pair of (k-1)-itemsets; we join them in the following way
    Condition: they share the first k-2 items
    Keep these k-2 items, then add the remaining two items, one from each set
Example:
    join {1,4,5,7} and {1,4,5,9}: ok, get {1,4,5,7,9}
    join {1,4,5,7} and {1,2,4,8}: no
    join {1,4,5,7} and {4,5,7,9}: no
Let W be the resulting set after joining U and V
    discard W if one of its (k-1)-subitemsets is not frequent (this is where the apriori property is applied)
    all the k-itemsets that have not been discarded constitute the candidate k-itemsets
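A sketch of the join condition in Python (itemsets kept as sorted tuples so "the first k-2 items" is well defined; the extra last-item comparison is our addition, ensuring each pair is joined only once):

def join(U, V):
    # U, V: sorted (k-1)-itemsets as tuples. They join only if they
    # agree on the first k-2 items; the result is a sorted k-itemset.
    if U[:-1] == V[:-1] and U[-1] < V[-1]:
        return U + (V[-1],)
    return None

print(join((1, 4, 5, 7), (1, 4, 5, 9)))   # (1, 4, 5, 7, 9)
print(join((1, 4, 5, 7), (1, 2, 4, 8)))   # None
print(join((1, 4, 5, 7), (4, 5, 7, 9)))   # None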
Apriori Algorithm – an Example
I = {1,2,3,4,5}
D = { {1,2,3,4}, {1,2,4}, {2,4,5}, {1,2,5}, {2,4} }
Support threshold: 40% (min support count: 2)
Steps
1. 1-itemsets: {1}, {2}, {3}, {4}, {5}
2. Frequent 1-itemsets: {1}, {2}, {4}, {5}
3. Join frequent 1-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
4. Candidate 2-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
5. Frequent 2-itemsets: {1,2}, {1,4}, {2,4}, {2,5}
6. Join frequent 2-itemsets: {1,2,4}, {2,4,5}
7. Candidate 3-itemsets: {1,2,4} ({2,4,5} is discarded since its subset {4,5} is not frequent)
8. Frequent 3-itemsets: {1,2,4}
9. Join frequent 3-itemsets: none
10. Candidate 4-itemsets: none
11. Stop
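Putting the pieces together, here is a compact, self-contained Apriori sketch (our own code, not the lecture's reference implementation) that reproduces the trace above when run on this example:

from itertools import combinations

def apriori(D, min_count):
    # Level 1: frequent 1-itemsets by direct counting.
    items = sorted({i for T in D for i in T})
    level = [(i,) for i in items
             if sum(1 for T in D if i in T) >= min_count]
    all_frequent = list(level)
    while level:
        frequent_prev = set(map(frozenset, level))
        candidates = []
        # Join: pairs of frequent (k-1)-itemsets sharing the first k-2 items.
        for a in range(len(level)):
            for b in range(a + 1, len(level)):
                U, V = level[a], level[b]
                if U[:-1] == V[:-1]:
                    W = U + (V[-1],)
                    # Prune: every (k-1)-subset of W must be frequent.
                    if all(frozenset(s) in frequent_prev
                           for s in combinations(W, len(W) - 1)):
                        candidates.append(W)
        # Count each candidate against the transactions.
        level = [c for c in candidates
                 if sum(1 for T in D if set(c) <= T) >= min_count]
        all_frequent.extend(level)
    return all_frequent

D = [{1, 2, 3, 4}, {1, 2, 4}, {2, 4, 5}, {1, 2, 5}, {2, 4}]
print(apriori(D, min_count=2))
# [(1,), (2,), (4,), (5,), (1, 2), (1, 4), (2, 4), (2, 5), (1, 2, 4)]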
Correctness
Does the Apriori algorithm find all frequent itemsets? That is, do the candidate k-itemsets include all the frequent k-itemsets?
We require two (k-1)-itemsets U and V to share the first k-2 items in order to be joined. Does this condition jeopardize correctness?
Suppose U and V do not share the first k-2 items, and let W = U ∪ V be a k-itemset. W will not be generated by joining U and V.
    Case 1, W is not frequent: not a problem.
    Case 2, W is frequent: is W still discovered? Yes – since W is frequent, every (k-1)-subset of W is frequent; in particular, the two (k-1)-subsets that do share the first k-2 items of W are both frequent, and joining that pair generates W.
Generating Association Rules
Let S be any frequent itemset
For each nonempty a ⊂ S, calculate
    freq(S) / freq(a)
If this value is not smaller than the confidence threshold, then output the following association rule:
    a => S – a
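A sketch of this step in Python (function and variable names are ours), assuming we already have the occurrence frequencies of all frequent itemsets, e.g. collected while running Apriori:

from itertools import combinations

def generate_rules(freq_counts, min_conf):
    # freq_counts: dict mapping each frequent itemset (a frozenset)
    # to its occurrence frequency. Every a ⊂ S is itself frequent
    # (apriori property), so freq_counts[a] is always present.
    rules = []
    for S, freq_S in freq_counts.items():
        if len(S) < 2:
            continue
        for k in range(1, len(S)):
            for a in map(frozenset, combinations(S, k)):
                conf = freq_S / freq_counts[a]   # freq(S) / freq(a)
                if conf >= min_conf:
                    rules.append((set(a), set(S - a), conf))
    return rules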
Pattern Evaluation
The support and confidence framework can only help exclude uninteresting rules
But it does not necessarily guarantee the interestingness of the rules generated
How to make a judgement?
❖ Mostly determined by users subjectively
❖ May differ from user to user
❖ Some objective measures may be used in limited contexts
Interestingness Measure: Correlations (Lift)
play basketball => eat cereal [40%, 66.7%] is misleading when the overall % of students eating cereal is 75% > 66.7%
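Lift makes this precise by comparing the rule's confidence against the consequent's overall probability (the standard definition, stated here for completeness):
    lift(A => B) = p(A, B) / (p(A) p(B)) = p(B | A) / p(B)
Here lift(basketball => cereal) = 66.7% / 75% ≈ 0.89 < 1, so the two are negatively correlated despite the seemingly high confidence; lift > 1 indicates positive correlation and lift = 1 independence.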
Multi-level AR
Variation 1: Uniform minimum support for all levels
❖ Pros: simplicity
❖ Cons: lower-level concepts are unlikely to occur with the same frequency as higher-level concepts
Multi-level AR
Variation 2: reduced minimum support at lower levels
❖ Pros: higher flexibility
❖ Cons: increased complexity in mining process
❖ Note: Apriori property may not always hold