Association Rule Mining
• Applications
– Basket data analysis, cross-marketing, catalog design, sales campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
• Discloses an intrinsic and important property of data sets
• Forms the foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
– Classification: associative classification
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum support and confidence
– support, s: the probability that a transaction contains X ∪ Y
– confidence, c: the conditional probability that a transaction containing X also contains Y

confidence(X ⇒ Y) = P(Y | X) = (# trans containing X ∪ Y) / (# trans containing X)

[Venn diagram: the customers buying each side of the rule; the overlap is the customers who buy both.]
Example of Support and Confidence

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

To calculate the support and confidence of rule {Milk, Diaper} ⇒ {Beer}:
• # of transactions: 5
• # of transactions containing {Milk, Diaper, Beer}: 2
• Support: 2/5 = 0.4
• # of transactions containing {Milk, Diaper}: 3
• Confidence: 2/3 ≈ 0.67
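These counts are easy to check in code. Below is a minimal Python sketch (the transaction list and the support_count helper are illustrative names, not from the slides) that reproduces the numbers above:

# Verify the support/confidence arithmetic for {Milk, Diaper} => {Beer}
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y, transactions) / len(transactions)                   # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ~ 0.67
print(support, round(confidence, 2))  # 0.4 0.67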
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Bread, Milk, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– # transactions containing an itemset
– E.g. σ({Bread, Milk, Diaper}) = 2
• Support (s)
– Fraction of transactions containing an itemset
– E.g. s({Bread, Milk, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or
equal to a min_sup threshold
Association Rule Mining Task
• An association rule r is strong if
– Support(r) ≥ min_sup
– Confidence(r) ≥ min_conf
• Given a transactions database D, the goal of
association rule mining is to find all strong rules
• Two-step approach:
1. Frequent Itemset Identification
– Find all itemsets whose support ≥ min_sup
2. Rule Generation
– From each frequent itemset, generate all confident
rules whose confidence ≥ min_conf
Rule Generation
Suppose min_sup = 0.3, min_conf = 0.6,
Support({Beer, Diaper, Milk}) = 0.4

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

All non-empty proper subsets of {Beer, Diaper, Milk}:
{Beer}, {Diaper}, {Milk}, {Beer, Diaper}, {Beer, Milk}, {Diaper, Milk}

All candidate rules:
{Beer} ⇒ {Diaper, Milk} (s=0.4, c=0.67)
{Diaper} ⇒ {Beer, Milk} (s=0.4, c=0.5)
{Milk} ⇒ {Beer, Diaper} (s=0.4, c=0.5)
{Beer, Diaper} ⇒ {Milk} (s=0.4, c=0.67)
{Beer, Milk} ⇒ {Diaper} (s=0.4, c=1.0)
{Diaper, Milk} ⇒ {Beer} (s=0.4, c=0.67)

Strong rules (confidence ≥ 0.6):
{Beer} ⇒ {Diaper, Milk} (s=0.4, c=0.67)
{Beer, Diaper} ⇒ {Milk} (s=0.4, c=0.67)
{Beer, Milk} ⇒ {Diaper} (s=0.4, c=1.0)
{Diaper, Milk} ⇒ {Beer} (s=0.4, c=0.67)
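A hedged sketch of this rule-generation step (helper names are ours, not from the slides): enumerate every non-empty proper subset X of the frequent itemset and keep the rules X ⇒ (itemset − X) whose confidence clears min_conf.

from itertools import combinations

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(itemset, transactions, min_conf):
    """All strong rules X => itemset - X for one frequent itemset."""
    s = support(itemset, transactions)
    items = sorted(itemset)
    strong = []
    for r in range(1, len(items)):            # non-empty proper subsets only
        for antecedent in combinations(items, r):
            X = frozenset(antecedent)
            c = s / support(X, transactions)  # conf = sup(X u Y) / sup(X)
            if c >= min_conf:
                strong.append((set(X), set(itemset - X), s, round(c, 2)))
    return strong

With the five transactions above and min_conf = 0.6, calling rules_from_itemset(frozenset({"Beer", "Diaper", "Milk"}), transactions, 0.6) yields exactly the four strong rules listed on this slide.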
Frequent Itemset Identification: the Itemset Lattice

Level 0: null
Level 1: A B C D E
Level 2: AB AC AD AE BC BD BE CD CE DE
Level 3: ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
Level 4: ABCD ABCE ABDE ACDE BCDE
Level 5: ABCDE

Given I items, there are 2^I - 1 candidate itemsets!
Frequent Itemset Identification: Brute-Force Approach
• Brute-force approach:
– Set up a counter for each itemset in the lattice
– Scan the database once, for each transaction T,
• check for each itemset S whether T ⊇ S
• if yes, increase the counter of S by 1 (a runnable sketch of this loop follows at the end of this slide)
[Figure: each of the N transactions (maximum width w) is matched against the list of M candidate itemsets.]
Disadvantages:
1) Given 20 attributes, the number of combinations is 2^20 - 1 = 1,048,575. With a 4-byte counter per itemset, the array storage requirement is about 4.2 MB.
2) Given a data set with (say) 100 attributes, it is likely that many combinations will not be present in the data set; therefore, store only those combinations present in the data set!
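A minimal Python sketch of this brute-force scheme (the function name is ours):

from itertools import combinations

def brute_force_counts(transactions):
    """One counter per candidate itemset; a single pass over the database.

    Enumerates all 2^I - 1 itemsets, so it is only feasible for tiny I.
    """
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            S = frozenset(candidate)
            counts[S] = sum(1 for T in transactions if S <= T)  # T contains S
    return counts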
How to Get an Efficient Method?
• The complexity of the brute-force method is O(M·N·w)
– M = 2^I - 1, where I is the number of items; N is the number of transactions; w is the maximum transaction width
• How to get an efficient method?
– Reduce the number of candidate itemsets
– Check the supports of candidate itemsets
efficiently
Anti-Monotone Property
• Any subset of a frequent itemset must also be frequent (an anti-monotone property)
– Any transaction containing {beer, diaper, milk}
also contains {beer, diaper}
– {beer, diaper, milk} is frequent ⇒ {beer, diaper}
must also be frequent
• In other words, any superset of an
infrequent itemset must also be infrequent
– No superset of any infrequent itemset should
be generated or tested
– Many item combinations can be pruned!
Illustrating Apriori Principle

[Itemset lattice over {A, B, C, D, E}: once an itemset is found to be infrequent, all of its supersets are pruned and never generated or tested.]
An Example
Min. support 50%, Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that
have minimum support
– A subset of a frequent itemset must also be a
frequent itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B} should be
frequent itemsets
– Iteratively find frequent itemsets with cardinality from
1 to k (k-itemset)
• Use the frequent itemsets to generate
association rules.
Apriori: A Candidate Generation-and-Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated
Introduction to the Apriori Algorithm
• Basic idea of Apriori
– Use the anti-monotone property to reduce candidate
itemsets
– Any subset of a frequent itemset must also be frequent
– In other words, any superset of an infrequent itemset must
also be infrequent
• Basic operations of Apriori
– Candidate generation
– Candidate counting
• How to generate the candidate itemsets?
– Self-joining
– Pruning infrequent candidates
The Apriori Algorithm — Example
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3: {2 3 5}; Scan D → L3:
itemset   sup
{2 3 5}   2
Apriori-based Mining
Min_sup = 0.5 (support count ≥ 2)

Database D:
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

1-candidates (Scan D):
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets:
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates: ab, ac, ae, bc, be, ce

Counting (Scan D):
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets:
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates: bce

Scan D → Freq 3-itemsets:
Itemset   Sup
bce       2
The Apriori Algorithm
• Ck: Candidate itemset of size k
• Lk : frequent itemset of size k
• L1 = {frequent items};
• for (k = 1; Lk ≠ ∅; k++) do
– Candidate Generation: Ck+1 = candidates
generated from Lk;
– Candidate Counting: for each transaction t in the
database, increment the count of all
candidates in Ck+1 that are contained in t
– Lk+1 = candidates in Ck+1 with support ≥ min_sup
• return ∪k Lk;
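Below is one compact, runnable rendering of this pseudocode in Python; it is a sketch under our own naming, not a reference implementation. Candidate generation here uses a simple union-based join; the prefix self-join and pruning refinements appear on the next slides. min_sup is an absolute support count.

from collections import defaultdict

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with count >= min_sup."""
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Candidate generation (unpruned union join; see the next slides)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Candidate counting: one scan of the database
        counts = defaultdict(int)
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {s for s in Ck1 if counts[s] >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

On the four-transaction database of the example slides with min_sup = 2, this returns {2, 3, 5} together with its frequent subsets.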
Candidate-generation: Self-joining
• Given Lk, how to generate Ck+1?
Step 1: self-joining Lk
INSERT INTO Ck+1
SELECT p.item1, p.item2, …, p.itemk, q.itemk
FROM Lk p, Lk q
WHERE p.item1=q.item1 AND … AND p.itemk-1=q.itemk-1 AND p.itemk < q.itemk
• Example
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
– abcd from abc * abd
– acde from acd * ace
C4 = {abcd, acde}
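The same join, sketched in Python (itemsets are kept as sorted tuples so that two k-itemsets join exactly when they share their first k-1 items; the names are ours):

def self_join(Lk):
    """C(k+1) from Lk: join pairs that agree on the first k-1 items."""
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck1.add(p + (q[-1],))
    return Ck1

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(sorted(self_join(L3)))  # [('a','b','c','d'), ('a','c','d','e')]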
Candidate Generation: Pruning
• Can we further reduce the candidates in Ck+1?
For each itemset c in Ck+1 do
For each k-subset s of c do
If (s is not in Lk) Then delete c from Ck+1
End For
End For
• Example
L3={abc, abd, acd, ace, bcd}, C4={abcd, acde}
acde cannot be frequent since ade (and also cde) is
not in L3, so acde can be pruned from C4.
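A matching sketch of the pruning loop (illustrative names, same tuple representation as above):

from itertools import combinations

def prune(Ck1, Lk):
    """Drop any candidate that has a k-subset outside Lk."""
    Lk = set(Lk)
    return {c for c in Ck1
            if all(s in Lk for s in combinations(c, len(c) - 1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
C4 = {("a", "b", "c", "d"), ("a", "c", "d", "e")}
print(prune(C4, L3))  # acde is dropped: ('a','d','e') is not in L3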
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
• Method
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets
and counts
– Interior node contains a hash table
– Subset function: finds all the candidates
contained in a transaction
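A full hash-tree takes more code than fits on a slide; as a hedged stand-in with the same purpose, the sketch below hashes the candidates into a set and enumerates each transaction's k-subsets, so only subsets that really are candidates get counted (all names are ours):

from collections import defaultdict
from itertools import combinations

def count_supports(transactions, Ck, k):
    """Count candidate k-itemsets (sorted tuples) in one database scan."""
    candidates = set(Ck)                      # hash lookup in place of a hash-tree
    counts = defaultdict(int)
    for t in transactions:
        for s in combinations(sorted(t), k):  # all k-subsets of the transaction
            if s in candidates:
                counts[s] += 1
    return counts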
Challenges of Apriori Algorithm
• Challenges
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: the general ideas
– Reduce the number of transaction database scans
• DIC: start counting k-itemsets as early as possible
(S. Brin, R. Motwani, J. Ullman, and S. Tsur, SIGMOD'97)
– Shrink the number of candidates
• DHP: a k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
(J. Park, M. Chen, and P. Yu, SIGMOD'95)
– Facilitate support counting of candidates
Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the
candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
• 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one
needs to generate 2^100 ≈ 10^30 candidates
– Multiple scans of database:
• Needs (n + 1) scans, where n is the length of the longest pattern