
Chapter 5

Frequent Patterns and Association Rule Mining

Outline
- Frequent Itemsets and Association Rules
- APRIORI
- Post-processing
- Applications

Transactional Data

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

- Definitions:
  - An item: an article in a basket, or an attribute-value pair
  - A transaction: the set of items purchased in one basket; it may have a TID (transaction ID)
  - A transactional dataset: a set of transactions

Itemsets & Frequent Itemsets

- Itemset: a set of one or more items
  - k-itemset: X = {x1, ..., xk}
- (Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
- (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold
Association Rules

- Find all rules X => Y with minimum support and confidence
  - Support, s: the probability that a transaction contains X ∪ Y
    s = P(X ∪ Y) = support count(X ∪ Y) / number of all transactions
  - Confidence, c: the conditional probability that a transaction containing X also contains Y
    c = P(Y|X) = support count(X ∪ Y) / support count(X)
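A minimal Python sketch (ours, not from the slides) of computing both measures on the toy table above; support_count and rule_measures are illustrative helper names:

    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support_count(itemset, transactions):
        # Number of transactions containing every item of `itemset`
        return sum(1 for t in transactions if itemset <= t)

    def rule_measures(X, Y, transactions):
        # Return (support, confidence) of the rule X => Y
        n = len(transactions)
        sup_xy = support_count(X | Y, transactions)
        return sup_xy / n, sup_xy / support_count(X, transactions)

    print(rule_measures({"Beer"}, {"Diaper"}, transactions))  # (0.6, 1.0)

The printed pair matches the Beer => Diaper rule (60%, 100%) derived on the next slide.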

Example

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]

- Let minsup = 50%, minconf = 50%
  - Number of all transactions = 5, so min. support count = 5 * 50% = 2.5, rounded up to 3
  - Items: Beer, Nuts, Diaper, Coffee, Eggs, Milk
  - Frequent patterns: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3
- Association rules (support, confidence):
  - Beer => Diaper (60%, 100%)
  - Diaper => Beer (60%, 75%)

Use of Association Rules
- Association rules do not necessarily represent causality or correlation between the two itemsets.
  - X => Y does not mean that X causes Y: association is not causality
  - X => Y can differ from Y => X, unlike correlation, which is symmetric
- Association rules assist in basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
  - What products are often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Computational Complexity of Frequent Itemset Mining
- How many itemsets may be generated in the worst case?
  - The number of frequent itemsets to be generated is sensitive to the minsup threshold
  - When minsup is low, there can be an exponential number of frequent itemsets
  - The worst case is close to M^N, where M is the number of distinct items and N is the maximum transaction length, when M is large
- The worst-case complexity vs. the expected probability
  - Ex. Suppose Walmart sells 10^4 kinds of products
    - The chance of picking one particular product: 10^-4
    - The chance of picking a particular set of 10 products: ~10^-40
    - What is the chance that this particular set of 10 products is frequent, appearing 10^3 times in 10^9 transactions?
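A back-of-envelope answer (our sketch, using the independence assumption above): the expected number of occurrences of that 10-product set in 10^9 transactions is lambda = 10^9 * 10^-40 = 10^-31, so by a Poisson approximation even seeing it once, let alone 10^3 times, has probability on the order of 10^-31. The exponential worst case is therefore far from typical behavior.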

Outline
- Frequent Itemsets and Association Rules
- APRIORI
- Post-processing
- Applications

Association Rule Mining

- Major steps in association rule mining:
  - Frequent itemset computation
  - Rule derivation
  - Use of support and confidence to measure rule strength
The Downward Closure Property

- The downward closure property of frequent patterns:
  - Any subset of a frequent itemset must also be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}

Apriori: A Candidate Generation & Test Approach

- A frequent (formerly called "large") itemset is an itemset whose support is >= minsup.
- Apriori pruning principle:
  - If any itemset is infrequent, its supersets should not be generated or tested!

[Figure: itemset lattice over items A, B, C, D; 1-itemsets A, B, C, D; 2-itemsets AB, AC, AD, BC, BD, CD; 3-itemsets ABC, ABD, ACD, BCD.]
APRIORI

- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

Implementation of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- Example of candidate generation (a runnable sketch follows):
  - L3 = {{a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d}}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning:
    - abcd is kept since abc, abd, acd, and bcd are all in L3
    - acde is removed because cde and ade are not in L3
  - C4 = {abcd}
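The join-and-prune step as runnable Python (our sketch; itemsets are kept as sorted tuples so the lexicographic join condition of the next slide applies):

    from itertools import combinations

    def apriori_gen(L_prev, k):
        # Generate candidate k-itemsets from frequent (k-1)-itemsets.
        # L_prev: set of sorted (k-1)-tuples.
        candidates = set()
        for l1 in L_prev:
            for l2 in L_prev:
                # Joinable: same first k-2 items and l1[-1] < l2[-1]
                if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                    candidates.add(l1 + (l2[-1],))
        # Prune: every (k-1)-subset of a candidate must be in L_prev
        return {c for c in candidates
                if all(sub in L_prev for sub in combinations(c, k - 1))}

    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(apriori_gen(L3, 4))  # {('a', 'b', 'c', 'd')}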

Self-joining

- Ck is generated by joining Lk-1 with itself
- Apriori assumes that items within a transaction or itemset are sorted in lexicographic order
  - l1 from Lk-1 = {l1[1], l1[2], ..., l1[k-2], l1[k-1]}
  - l2 from Lk-1 = {l2[1], l2[2], ..., l2[k-2], l2[k-1]}
  - The two itemsets are joinable only when the first k-2 items of l1 and l2 are identical and l1[k-1] < l2[k-1]

The Apriori Algorithm: An Example

min_sup = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (pruning {D}): {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan -> counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan -> {B,C,E}:2
L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
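A compact runnable version of the loop (our sketch; it reuses the apriori_gen helper from the candidate-generation example and reproduces the worked example above):

    from collections import defaultdict

    def apriori(transactions, min_sup):
        # Return {itemset (sorted tuple): support count} for all frequent itemsets
        transactions = [frozenset(t) for t in transactions]
        counts = defaultdict(int)
        for t in transactions:              # first scan: 1-itemsets
            for item in t:
                counts[(item,)] += 1
        L = {i: c for i, c in counts.items() if c >= min_sup}
        frequent = dict(L)
        k = 2
        while L:
            C = apriori_gen(set(L), k)      # join + prune (defined earlier)
            counts = defaultdict(int)
            for t in transactions:          # one scan per level
                for c in C:
                    if t.issuperset(c):
                        counts[c] += 1
            L = {c: n for c, n in counts.items() if n >= min_sup}
            frequent.update(L)
            k += 1
        return frequent

    db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
    print(apriori(db, 2))
    # {('A',):2, ('B',):3, ('C',):3, ('E',):3, ('A','C'):2, ('B','C'):2,
    #  ('B','E'):3, ('C','E'):2, ('B','C','E'):2}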

Further Improvement of the Apriori Method

- Major computational challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction-database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates
- Completeness: any association rule mining algorithm should find the same set of frequent itemsets.
Partition: Scan Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
    - Since each sub-database is much smaller than the original database, all of its frequent itemsets can be found in one scan
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.
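A sketch of the two-scan scheme (ours; it reuses the apriori function above, and the per-partition threshold scaling is our assumption):

    def partition_apriori(transactions, min_sup, n_parts=2):
        # Two-scan partition algorithm: local mining, then global counting
        parts = [transactions[i::n_parts] for i in range(n_parts)]
        # Scan 1: mine each partition with a proportionally scaled threshold
        candidates = set()
        for part in parts:
            local_min = max(1, min_sup * len(part) // len(transactions))
            candidates |= set(apriori(part, local_min))
        # Scan 2: count every global candidate once over the full database
        sets = [frozenset(t) for t in transactions]
        return {c: n for c in candidates
                if (n := sum(1 for t in sets if t.issuperset(c))) >= min_sup}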

DHP: Reduce the Number of Candidates

- A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Frequent 1-itemsets: a, b, d, e
  - Using some hash function, get hash entries such as {ab, ad, ae} with bucket count 3 and {bd, bd, be, de} with bucket count 4, ...
  - ab is not a candidate 2-itemset if the count of its bucket {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.
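A toy illustration of the bucket-counting idea (our sketch; the hash function and the number of buckets are arbitrary assumptions):

    from collections import defaultdict
    from itertools import combinations

    def dhp_bucket_counts(transactions, n_buckets=7):
        # While scanning for 1-itemsets, also hash every 2-itemset of each
        # transaction into a bucket; a bucket below min_sup later rules out
        # all candidate pairs that hash to it.
        buckets = defaultdict(int)
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        return buckets

    def dhp_filter(pairs, buckets, min_sup, n_buckets=7):
        # Keep only candidate pairs whose bucket count reaches min_sup
        return [p for p in pairs if buckets[hash(p) % n_buckets] >= min_sup]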

Transaction Reduction

- A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset.
- Such a transaction can be marked or removed from further consideration when searching for (k+1)-itemsets.
- Less support counting
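In code, the reduction is a one-line filter between levels (our sketch, matching the Apriori loop above):

    def reduce_transactions(transactions, L_k):
        # Drop transactions containing no frequent k-itemset: they cannot
        # support any (k+1)-itemset either.
        return [t for t in transactions
                if any(t.issuperset(x) for x in L_k)]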

Sampling for Frequent Patterns

- Select a sample of the original database; mine frequent patterns within the sample using Apriori, with a support threshold lower than min. support
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns need to be checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96.

Outline
- Frequent Itemsets and Association Rules
- APRIORI
- Post-processing
- Applications

Derive Rules from Frequent Itemsets

- Frequent itemsets != association rules
- One more step is required to find association rules (a short function follows):
  For each frequent itemset X,
    for each proper nonempty subset A of X:
      - let B = X - A
      - A => B is an association rule if confidence(A => B) >= minConf,
        where support(A => B) = support(A ∪ B) and
        confidence(A => B) = support(A ∪ B) / support(A)
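The derivation step as a short function (our sketch; it consumes the {itemset: count} map produced by the apriori sketch earlier, and downward closure guarantees every subset A is present in that map):

    from itertools import combinations

    def derive_rules(frequent, min_conf):
        # Yield (A, B, support_count, confidence) for every rule A => B with
        # X = A ∪ B frequent, B = X - A, and confidence >= min_conf.
        for X, sup_x in frequent.items():
            for r in range(1, len(X)):
                for A in combinations(X, r):   # proper nonempty subsets of X
                    conf = sup_x / frequent[A]
                    if conf >= min_conf:
                        yield A, tuple(i for i in X if i not in A), sup_x, conf

For X = ('B', 'C', 'E') from the worked Apriori example, this reproduces, e.g., the rule ('B', 'C') => ('E',) with confidence 2/2 = 100%.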

Example: Deriving Rules from Frequent Itemsets

- Suppose {2,3,4} is frequent, with supp = 50%
  - Its proper nonempty subsets {2,3}, {2,4}, {3,4}, {2}, {3}, {4} have supp = 50%, 50%, 75%, 75%, 75%, 75%, respectively
- These generate the following association rules:
  - 2,3 => 4, confidence = 100%
  - 2,4 => 3, confidence = 100%
  - 3,4 => 2, confidence = 67%
  - 2 => 3,4, confidence = 67%
  - 3 => 2,4, confidence = 67%
  - 4 => 2,3, confidence = 67%
- All rules have support = 50%

Mining Various Kinds of Association Rules

- Mining multilevel association
- Mining multidimensional association
- Mining quantitative association
Mining Multiple-Level Association Rules

- Items often form hierarchies
- A top-down strategy is employed; any algorithm can be used for mining at each level
- Flexible support settings:
  - Uniform minimum support for all levels
    - Items at lower levels are expected to have lower support
  - User-specific, item- or group-based minimum support
    - e.g., setting particularly low support thresholds for laptop computers and flash drives

Example

- Uniform support:
  - Level 1 (min_sup = 5%): Milk [support = 10%]
  - Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
- Reduced support:
  - Level 1 (min_sup = 5%): Milk [support = 10%]
  - Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
- Under uniform support, Skim Milk (4% < 5%) is pruned; with reduced support at level 2, it is retained.
Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk => wheat bread [support = 8%, confidence = 70%]
  - 2% milk => wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value based on the rule's ancestor
  - For instance, if 2% milk made up about a quarter of milk sales, the expected support of the second rule would be 8% * 1/4 = 2%, matching the observed 2%, so the second rule adds no information

Mining Multi-Dimensional Association

- Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: >= 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Mining Quantitative Associations

- Data may have categorical attributes and quantitative attributes
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  - Static discretization based on predefined concept hierarchies
  - Dynamic discretization based on the data distribution
  - Clustering: distance-based association
    - e.g., one-dimensional clustering, then association

Id  Age  Income  Student  Credit_Rating  Buy_Computer
1   27   75,000  No       600            No
2   25   72,000  No       730            No
3   33   88,000  No       640            Yes
4   50   55,000  No       620            Yes
5   52   34,000  Yes      640            Yes
6   45   30,000  Yes      720            No
7   32   25,000  Yes      740            Yes
8   25   54,000  No       630            No
9   22   35,000  Yes      640            Yes
10  48   67,000  Yes      660            Yes
11  24   64,000  Yes      715            Yes
12  37   62,000  No       710            Yes
13  33   90,000  Yes      650            Yes
14  45   59,000  No       705            No
The same data after static discretization into categorical values:

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

And coded numerically (age: 1 = <=30, 2 = 31...40, 3 = >40; income: 1 = low, 2 = medium, 3 = high; student and buy_computer: 0 = no, 1 = yes; credit_rating: 1 = fair, 2 = excellent):

age  income  student  credit_rating  buy_computer
1    3       0        1              0
1    3       0        2              0
2    3       0        1              1
3    2       0        1              1
3    1       1        1              1
3    1       1        2              0
2    1       1        2              1
1    2       0        1              0
1    1       1        1              1
3    2       1        1              1
1    2       1        2              1
2    2       0        2              1
2    3       1        1              1
3    2       0        2              0
Interestingness Measure: Correlations (Lift)

- The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) P(B); otherwise, itemsets A and B are dependent and correlated as events.
- Measure of dependent/correlated events: lift

  lift(A, B) = P(A ∪ B) / (P(A) P(B))

  - lift < 1: A is negatively correlated with B
  - lift > 1: A and B are positively correlated
  - lift = 1: A and B are independent

Interestingness Measure: Correlations (Lift)

- play basketball => eat cereal [40%, 66.7%] is misleading
  - The overall share of students eating cereal is 75% > 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

             Basketball  Not basketball  Sum (row)
Cereal       2000        1750            3750
Not cereal   1000        250             1250
Sum (col.)   3000        2000            5000

lift(B, C)  = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33
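A quick check of these numbers (our sketch):

    def lift(p_ab, p_a, p_b):
        # Lift of itemsets A and B from their (relative) supports
        return p_ab / (p_a * p_b)

    print(round(lift(2000/5000, 3000/5000, 3750/5000), 2))  # 0.89
    print(round(lift(1000/5000, 3000/5000, 1250/5000), 2))  # 1.33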
Outline
- Frequent Itemsets and Association Rules
- APRIORI
- Post-processing
- Applications

Synthetic Data on Purchase of Phone Faceplates

- A store that sells accessories for cellular phones runs a promotion on faceplates. Customers who purchase multiple faceplates from a choice of six different colors get a discount.
- The store manager wants to know which colors of faceplates customers are likely to purchase together.
Transactions for Purchase of Different-Colored Cellular Phone Faceplates

Transaction  Faceplate Colors Purchased
1            Red, white, green
2            White, orange
3            White, blue
4            Red, white, orange
5            Red, blue
6            White, blue
7            Red, blue
8            Red, white, blue, green
9            Red, white, blue
10           Yellow

Phone Faceplate Data in Binary Matrix Format

Transaction  Red  White  Blue  Orange  Green  Yellow
1            1    1      0     0       1      0
2            0    1      0     1       0      0
3            0    1      1     0       0      0
4            1    1      0     1       0      0
5            1    0      1     0       0      0
6            0    1      1     0       0      0
7            1    0      1     0       0      0
8            1    1      1     0       1      0
9            1    1      1     0       0      0
10           0    0      0     0       0      1
Item Sets with Support Count of At Least Two (20%)

Item Set             Support (Count)
{red}                6
{white}              7
{blue}               6
{orange}             2
{green}              2
{red, white}         4
{red, blue}          4
{red, green}         2
{white, blue}        4
{white, orange}      2
{white, green}       2
{red, white, blue}   2
{red, white, green}  2

Generating Association Rules

- For itemset {red, white, green}:
  - Rule 1: {red, white} => {green}, conf = sup{red, white, green} / sup{red, white} = 2/4 = 50%
  - Rule 2: {red, green} => {white}, conf = sup{red, white, green} / sup{red, green} = 2/2 = 100%
  - Rule 3: {white, green} => {red}, conf = sup{red, white, green} / sup{white, green} = 2/2 = 100%
  - Rule 4: {red} => {white, green}, conf = sup{red, white, green} / sup{red} = 2/6 = 33%
  - Rule 5: {white} => {red, green}, conf = sup{red, white, green} / sup{white} = 2/7 = 29%
  - Rule 6: {green} => {red, white}, conf = sup{red, white, green} / sup{green} = 2/2 = 100%
- If the desired min_conf is 70%, we keep Rules 2, 3, and 6.
Final Results for Phone Faceplate Transactions

Rule #  Conf.%  X             Y           Supp.(X)  Supp.(Y)  Supp.(X∪Y)  Lift
1       100     Green         Red, White  2 (20%)   4 (40%)   2 (20%)     2.5
2       100     Green         Red         2 (20%)   6 (60%)   2 (20%)     1.67
3       100     Green, White  Red         2 (20%)   6 (60%)   2 (20%)     1.67
4       100     Green         White       2 (20%)   7 (70%)   2 (20%)     1.43
5       100     Green, Red    White       2 (20%)   7 (70%)   2 (20%)     1.43
6       100     Orange        White       2 (20%)   7 (70%)   2 (20%)     1.43

- The support of a rule indicates its impact in terms of overall size: what proportion of transactions is affected?
- The confidence indicates the rate at which Y will be found given X, and is useful in judging the business or operational usefulness of a rule.
- The lift ratio indicates how effective the rule is at finding Y, compared to random selection.
- The more records a rule is based on, the more solid the conclusion, since the key evaluative statistics are ratios and proportions.

Image Processing: Image Scene

[Figure: example remotely sensed image scene.]

Yield Map

[Figure: crop yield map of the same field.]

Image Processing: Data
http://www.cs.washington.edu/research/metip/about/digital.html
Color Image: Data

- A color image can be represented by a two-dimensional array of (Red, Green, Blue) triples. Each number in a triple ranges from 0 to 255, where 0 indicates that none of that primary color is present in the pixel and 255 indicates the maximum amount of that primary color.

Pixel  Band1 (Red)  Band2 (Green)  Band3 (Blue)
1      40           140            200
2      50           130            210

Data

- The three bands of the yield map were converted into a single gray-scale value, stored as a fourth band.

Pixel  Band1 (Red)  Band2 (Green)  Band3 (Blue)  Band4 (Gray Scale) Yield
1      40           140            200           240
2      50           130            210           250

- The problem is to discover associations among band1, band2, band3, and band4. This will help farmers understand what combination of spectral bands corresponds to a high crop yield.
Preprocessing of Data

- Data discretization: divide the range of each continuous attribute into intervals.

Interval  [0,63]  [64,127]  [128,191]  [192,255]
Band1     B11     B12       B13        B14
Band4     B41     B42       B43        B44

Interval  [0,31]  [32,63]  [64,95]  [96,127]  [128,159]  [160,191]  [192,223]  [224,255]
Band2     B21     B22      B23      B24       B25        B26        B27        B28
Band3     B31     B32      B33      B34       B35        B36        B37        B38

Pixel  Band1 (Red)  Band2 (Green)  Band3 (Blue)  Band4 (Gray Scale) Yield
1      B11 (40)     B25 (140)      B37 (200)     B44 (240)
2      B11 (50)     B25 (130)      B37 (210)     B44 (250)
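A small sketch of this equal-width discretization (our code; function and variable names are illustrative):

    def discretize(value, band, n_bins):
        # Map a 0-255 value to its interval label, using equal-width
        # bins of size 256 / n_bins; e.g. B25 for band=2, value=140.
        idx = min(value * n_bins // 256, n_bins - 1) + 1
        return f"B{band}{idx}"

    pixel = {1: 40, 2: 140, 3: 200, 4: 240}   # band -> raw value for pixel 1
    bins = {1: 4, 2: 8, 3: 8, 4: 4}           # bins per band, as in the tables
    print([discretize(v, b, bins[b]) for b, v in pixel.items()])
    # ['B11', 'B25', 'B37', 'B44']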

Improvement of the ARM Algorithm Based on Domain Knowledge

- Dimensions: B11-B14, B21-B28, B31-B38, B41-B44.
- Candidates combining k intervals from the same band should not be generated: each pixel takes exactly one interval per band, so such a combination has support zero.
  - Example: standard Apriori would generate the 2-itemset candidate {B11, B14} whenever B11 and B14 are frequent 1-itemsets, yet it can never be frequent.
Evaluate Rules

- From domain knowledge, we know that band1, band2, and band3 are reflectance data and band4 is yield data. The association rules the user wants to mine are of the form: band1 ∧ band2 ∧ band3 => band4.
- With minsup = 40%, two rules were found:
  - B12 ∧ B26 ∧ B32 => B42
  - B11 ∧ B25 ∧ B37 => B44

Reference

- J. Dong, W. Perrizo, Q. Ding, and J. Zhou. "The Application of Association Rule Mining to Remotely Sensed Data." In SAC'2000.
