
8.

Association Rule Mining


Subject: Machine Learning (3170724)
Dr. Ami Tusharkant Choksi
Associate Professor and HOD, Computer
Department,
C.K.Pithawalla College of Engineering &
Technology, Surat.
Website: www.ckpcet.ac.in

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Contents
Association rules

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


What is association rule mining?
● Association rule mining is a procedure that aims to discover
frequently occurring patterns, correlations, or associations
in datasets stored in various kinds of databases, such as
relational databases, transactional databases, and other
forms of repositories.

3
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Bread - 100

Butter - 97

Minimum count=75

(bread, butter)-77

T1:(butter, chocolates,bread)

T2:(bread, kurkure, butter)

………..

4
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Background
● Proposed by Agrawal et al. in 1993.
● Assume all data are categorical.
● No good algorithm for numeric data.
● Initially used for Market Basket Analysis to find how items
purchased by customers are related.

Bread → Milk [sup = 5%, conf = 100%]

5
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Basket Data
● Retail organizations, e.g.,
supermarkets, collect and store
massive amounts of sales data,
called basket data.
● A record consists of
○ transaction date
○ items bought
● Or, basket data may consist of
items bought by a customer over
a period.
6
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The model: data
■ I = {i1, i2, …, im}: a set of items.
■ Transaction t :
❑ t is a set of items, and t ⊆ I.

■ Transaction Database T: a set of transactions


T = {t1, t2, …, tn}.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 7


Transaction data: supermarket data
■ Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
■ Concepts:
❑ An item: an item/article in a basket
❑ I: the set of all items sold in the store
❑ A transaction: items purchased in a basket; it may
have TID (transaction ID)
❑ A transactional dataset: A set of transactions
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 8
Transaction data: a set of documents
■ A text document data set. Each document is treated as a
“bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
I = {Student, Teach, School, City, Game, Baseball, Basketball,
Player, Spectator, Coach, Team}
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 9
The model: rules
■ A transaction t contains X, a set of items (itemset) in I, if
X ⊆ t.
■ An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩Y = ∅
■ An itemset is a set of items.
❑ E.g., X = {milk, bread, cereal} is an itemset.
■ A k-itemset is an itemset with k items.
❑ E.g., {milk, bread, cereal} is a 3-itemset

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 10


Rule strength measures
■ Support: The rule holds with support sup in T (the
transaction data set) if sup% of transactions contain X ∪
Y.
❑ sup = Pr(X ∪ Y).
■ Confidence: The rule holds in T with confidence conf if
conf% of transactions that contain X also contain Y.
❑ conf = Pr(Y | X)
■ An association rule is a pattern that states when X
occurs, Y occurs with certain probability.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 11


Support and Confidence
■ Support count: The support count of an itemset X,
denoted by X.count, in a data set T is the number of
transactions in T that contain X. Assume T has n
transactions.
■ Then,
sup(X → Y) = (X ∪ Y).count / n
conf(X → Y) = (X ∪ Y).count / X.count

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 12


Goal and key features
■ Goal: Find all rules that satisfy the user-specified
minimum support (minsup) and minimum confidence
(minconf).

■ Key Features
❑ Completeness: find all rules.
❑ No target item(s) on the right-hand side
❑ Mining with data on hard disk (not in memory)

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 13


Example

TID Items
1 Bread, Peanuts, Milk, Fruit, Jam
2 Bread, Jam, Soda, Chips, Milk, Fruit
3 Jam, Soda, Chips, Bread
4 Jam, Soda, Peanuts, Milk, Fruit
5 Jam, Soda, Chips, Milk, Bread
6 Fruit, Soda, Chips, Milk
7 Fruit, Soda, Peanuts, Milk
8 Fruit, Peanuts, Cheese, Yogurt

● Itemset
○ A collection of one or more items, e.g., {milk, bread, jam}
○ k-itemset: an itemset that contains k items
● Support count
○ Frequency of occurrence of an itemset
○ count({Milk, Bread}) = 3
○ count({Soda, Chips}) = 4
● Support
○ Fraction of transactions that contain an itemset
○ s({Milk, Bread}) = 3/8
○ s({Soda, Chips}) = 4/8
● Frequent Itemset
○ An itemset whose support is greater than
or equal to a minsup threshold
14
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
What is an association rule?
● Implication of the form X → Y, where X and Y are itemsets
● Example: {bread} → {milk}
● Rule evaluation metrics: Support and Confidence
● Support (s): Fraction of transactions that contain both X and
Y
● Confidence (c): Measures how often items in Y appear in
transactions that contain X
● support(bread → milk) = count(bread, milk) / number of transactions = 3/8 = 0.38
● confidence(bread → milk) = support(bread, milk) / support(bread) = (3/8)/(4/8) = 0.75
15
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
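The two measures can be computed directly from the transaction table. Below is a minimal Python sketch (not from the slides; the names transactions, support and confidence are illustrative), using the eight transactions of the example above:

# Eight market-basket transactions from the example slide.
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(lhs -> rhs) = support(lhs U rhs) / support(lhs)
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Milk", "Bread"}, transactions))       # 3/8 = 0.375
print(confidence({"Bread"}, {"Milk"}, transactions))   # (3/8)/(4/8) = 0.75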
Support and Confidence

16
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
What is the goal?
● Given a set of transactions T, the goal of association rule
mining is to find all rules having support ≥ minsup and
confidence ≥ minconf, for user-specified thresholds
● Mining Association Rules
● {Bread, Jam}=> {Milk} s=0.4 c=0.75
● {Milk, Jam}=> {Bread} s=0.4 c=0.75
● {Bread} =>{Milk, Jam} s=0.4 c=0.75
● {Jam}=> {Bread, Milk} s=0.4 c=0.6
● {Milk} =>{Bread, Jam} s=0.4 c=0.5
17
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Association Rules
● All the above rules are binary partitions of the same
itemset: {Milk, Bread, Jam}
● Rules originating from the same itemset have identical
support but can have different confidence
● x ⇒ y: confidence = (x ∪ y).count / x.count
● y ⇒ x: confidence = (x ∪ y).count / y.count
● x ⇒ y and y ⇒ x are not the same
● We can decouple the support and confidence
requirements!
18
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Association Rules: Two Step
Approach
● Frequent Itemset Generation: generate all itemsets
whose support ≥ minsup
● Rule Generation: generate high-confidence rules from each
frequent itemset; each rule is a binary partitioning of a
frequent itemset
● Frequent itemset generation is computationally expensive

19
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Applications

● Market Basket Analysis: given a database of customer transactions,
where each transaction is a set of items, the goal is to find groups of
items which are frequently purchased together.
● Telecommunication (each customer is a transaction containing the
set of phone calls)
● Credit Cards/ Banking Services (each card/account is a transaction
containing the set of customer’s payments)
● Medical Treatments (each patient is represented as a transaction
containing the ordered set of diseases)
● Basketball-Game Analysis (each game is represented as a
transaction containing the ordered set of ball passes)
20
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Next Lecture

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Apriori Algorithm
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for
boolean association rules.

Key Concepts :

• Frequent Itemsets: the sets of items that satisfy minimum support (denoted by Li
for the i-itemset level).

• Apriori Property: any subset of a frequent itemset must be frequent.

• Join Operation: To find Lk , a set of candidate k-itemsets is generated by


joining Lk-1 with itself.

22
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
(bread, butter) - the 2-itemset is frequent only if

bread is frequent and butter is frequent

(bread, butter, cheese) is frequent only if (bread, butter), (butter, cheese), (bread, cheese) are frequent

To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with itself.

(bread, butter, cheese) - 3-itemset

{(bread, butter), (butter, cheese), (bread, cheese)} - 2-itemsets

23
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm in a Nutshell
• Find the frequent itemsets: the sets of items that have
minimum support
– A subset of a frequent itemset must also be a frequent
itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must
be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
• Use the frequent itemsets to generate association rules.
24
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Frequent Itemset Generation

25
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm : Pseudo code
Lk: Set of frequent itemsets of size k (with min support)
Ck: Set of candidate itemset of size k (potentially frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in
        Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;

26
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
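The pseudocode above can be turned into a compact runnable sketch. The following Python code is only an illustrative implementation (names such as apriori and min_support_count are assumptions, not from the slides); candidate generation is done with a simple self-join of the frequent (k-1)-itemsets followed by the Apriori prune:

from itertools import combinations

def apriori(transactions, min_support_count):
    # Level-wise frequent-itemset mining following the pseudocode above.
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan over the transactions to count every candidate.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # L1: frequent 1-itemsets.
    counts = count({frozenset([i]) for t in transactions for i in t})
    level = {c for c, n in counts.items() if n >= min_support_count}
    frequent = {c: counts[c] for c in level}

    k = 1
    while level:
        # Join: combine frequent k-itemsets into (k+1)-item candidates;
        # Prune: drop candidates having an infrequent k-subset (Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        counts = count(candidates)
        level = {c for c, n in counts.items() if n >= min_support_count}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent     # union over all levels, with support counts

# The 9-transaction database used in the worked example below, min support count 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
for itemset, sc in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sc)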
The Apriori Algorithm : Pseudo code (snapshot)

27
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm — Example

28
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Explanation
No of transactions=4

Minimum support=50%

Minimum Support count=4*50/100 = 2

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Minimum support 50%

No of transactions = 4

Minimum support count = 50% × 4 = 2

30
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm — Example...
● {1,3,5} is not considered as part of C3 because the Apriori
property is not satisfied.
● According to the Apriori property, {1,3,5} can be part of C3 ONLY
if all of its subsets are part of L2, the frequent 2-itemsets.
● I.e. {1,3}, {3,5} and {1,5} must all be part of L2, which they are not.
● Similarly, {1,2,3} is also not part of C3, because {1,3} and {2,3}
are part of L2 but {1,2} is not part of L2.

31
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
How to Generate Candidates
Input: Li-1 : set of frequent itemsets of size i-1
Output: Ci : set of candidate itemsets of size i
Ci = empty set;
for each itemset J in Li-1 do
    for each itemset K in Li-1 s.t. K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all subsets of {K ∪ J} are in Li-1 then
                Ci = Ci ∪ {K ∪ J}
return Ci;

32
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of Generating Candidates
•L3={abc, abd, acd, ace, bcd}
•Generating C4 from L3
–abcd from abc and abd
–acde from acd and ace
•Pruning:
–acde is removed because ade is not in L3
•C4={abcd}

33
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
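As a sketch of this join-and-prune step, the following hypothetical Python function reproduces the example above (L3 = {abc, abd, acd, ace, bcd} giving C4 = {abcd}); the name generate_candidates is illustrative, not from the slides:

from itertools import combinations

def generate_candidates(L_prev, k):
    # Join frequent (k-1)-itemsets sharing k-2 items, then prune by the Apriori property.
    L_prev = {frozenset(x) for x in L_prev}
    joined = {a | b for a in L_prev for b in L_prev
              if a != b and len(a & b) == k - 2}                       # join step
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}  # prune step

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3, 4)
print(["".join(sorted(c)) for c in C4])   # ['abcd']; acde is pruned because ade is not in L3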
Example of Discovering Rules
Let us consider the 3-itemset {I1, I2, I5}:

I1 ^ I2 => I5

I1 ^ I5 => I2

I2 ^ I5 => I1

I1 => I2 ^ I5

I2 => I1 ^ I5

I5 => I1 ^ I2
34
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Discovering Rules
● If Confidence(A ⇒ B) = support(A ∪ B) / support(A) >= minconf

then the rule A ⇒ B is accepted

● One more example: if the rule's confidence

>= minconf

then the rule is accepted


35
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of Discovering Rules

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

Let us consider the 3-itemset {I1, I2, I5}, with support count 2 (support ≈ 22%).
Let us generate all the association rules from this itemset:
I1 ^ I2 => I5   confidence = 2/4 = 50%
I1 ^ I5 => I2   confidence = 2/2 = 100%
I2 ^ I5 => I1   confidence = 2/2 = 100%
I1 => I2 ^ I5   confidence = 2/6 = 33%
I2 => I1 ^ I5   confidence = 2/7 = 29%
I5 => I1 ^ I2   confidence = 2/2 = 100%
36
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori complete example

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support & min.
confidence. 37
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 1: Generating 1-itemset Frequent Pattern

C1 (candidates with support counts):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

Compare with mini_support: 2/9 ≈ .22 >= mini_support, so all items with
support_count >= 2 have support >= mini_support.

L1 (frequent 1-itemsets):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

38
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 2: Generating 2-itemset Frequent Pattern

C2 (candidates with support counts):
{I1, I2}: 4, {I1, I3}: 4, {I1, I4}: 1, {I1, I5}: 2, {I2, I3}: 4,
{I2, I4}: 2, {I2, I5}: 2, {I3, I4}: 0, {I3, I5}: 1, {I4, I5}: 0

Compare with mini_support: 2/9 ≈ .22 >= mini_support, so all itemsets with
support_count >= 2 have support >= mini_support.

L2 (frequent 2-itemsets):
{I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2

39
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 3: Generating 3-itemset Frequent Pattern

C3 (after join and prune, with support counts): {I1, I2, I3}: 2, {I1, I2, I5}: 2
Compare with mini_support (support_count >= 2):
L3 (frequent 3-itemsets): {I1, I2, I3}: 2, {I1, I2, I5}: 2

The generation of the set of candidate 3-itemsets, C3, involves
use of the Apriori Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I3, I5}, {I2, I4, I5}}.
• Now, the Join step is complete and the Prune step will be used to
reduce the size of C3. The Prune step helps to avoid heavy
computation due to a large Ck.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
40
Step 3: Generating 3-itemset Frequent Pattern
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that four of the six candidates cannot possibly be frequent.
How?
• For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed.
Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
• BUT, {I3, I5} is not a member of L2 and hence it is not frequent, violating the Apriori
property. Thus we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the
Join operation during Pruning.
• Now, the transactions in D are scanned in order to determine
L3, consisting of those candidate 3-itemsets in C3 having minimum support.
41
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 4: Generating 4-itemset Frequent
Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4.
Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset
{I2, I3, I5} is not frequent.

• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
This completes our Apriori Algorithm.

• What’s Next ?

These frequent itemsets will be used to generate strong association rules ( where
strong association rules satisfy both minimum support & minimum confidence).

42
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 5: Generating Association Rules from Frequent
Itemsets
Procedure:
• For each frequent itemset “l”, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule “s → (l − s)” if support_count(l) /
support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
• Back to the example: we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5},
{I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1,I2,I5}.
– Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

43
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
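A short Python sketch of this rule-generation procedure is given below. It assumes the frequent itemsets and their support counts are already available in a dict (filled here with the values from the 9-transaction example); the name generate_rules is illustrative:

from itertools import combinations

def generate_rules(frequent, min_conf):
    # For every frequent itemset l and nonempty proper subset s, output s -> (l - s)
    # when support_count(l) / support_count(s) >= min_conf.
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = l_count / frequent[s]   # every subset of a frequent itemset is frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Frequent itemsets and support counts of the 9-transaction example (min support count 2).
frequent = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I3",): 6, ("I4",): 2, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I3"): 4, ("I1", "I5"): 2, ("I2", "I3"): 4,
    ("I2", "I4"): 2, ("I2", "I5"): 2, ("I1", "I2", "I3"): 2, ("I1", "I2", "I5"): 2,
}.items()}

for lhs, rhs, conf in generate_rules(frequent, 0.7):
    print(lhs, "=>", rhs, f"confidence={conf:.0%}")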
Step 5: Generating Association Rules from Frequent
Itemsets
Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below,
each listed with its confidence.
– R1: I1 ^ I2 => I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
– R2: I1 ^ I5 => I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
– R3: I2 ^ I5 => I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
44
Sampling for Frequent Patterns
■ Select a sample of original database, mine frequent
patterns within sample using Apriori
■ Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
■ Example: check abcd instead of ab, ac, …, etc.
■ Scan database again to find missed frequent patterns
■ H. Toivonen. Sampling large databases for association
rules. In VLDB’96
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 45
Step 5: Generating Association Rules from Frequent
Itemsets
– R4: I1 => I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
– R5: I2 => I1 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is Rejected.
– R6: I5 => I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is Selected.
In this way, We have found three strong
association rules.
46
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Strengths and weaknesses of the Apriori
algorithm

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


End for ML. The remaining slides are extra
material for the chapter.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Multiple minimum class supports
● The multiple minimum support idea can also be applied here.
● The user can specify different minimum supports for different classes,
which effectively assigns a different minimum support to the rules of each
class.
● For example, we have a data set with two classes, Yes and No. We
may want
○ rules of class Yes to have the minimum support of 5% and
○ rules of class No to have the minimum support of 10%.
● By setting minimum class supports to 100% (or more for some
classes), we tell the algorithm not to generate rules of those classes.
● This is a very useful trick in applications.
49
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Advantages and Disadvantages
● Advantages:
- Uses large itemset property.
- Easily parallelized
- Easy to implement.
● Disadvantages:
- Assumes transaction database is memory resident.
- Requires many database scans.

50
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold cannot
be frequent.
• Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans.
• Partitioning: Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB.
• Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
• Dynamic itemset counting: Add new candidate itemsets only
when all of their subsets are estimated to be frequent. 51
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Further Improvement of the Apriori Method
■ Major computational challenges
■ Multiple scans of transaction database
■ Huge number of candidates
■ Tedious workload of support counting for candidates
■ Improving Apriori: general ideas
■ Reduce passes of transaction database scans
■ Shrink number of candidates
■ Facilitate support counting of candidates
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 52
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent

patterns
■ Scan 2: consolidate global frequent patterns

[Figure: DB is partitioned into DB1 + DB2 + … + DBk = DB;
if supj(i) < σ·|DBj| in every partition DBj, then sup(i) < σ·|DB|.]
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 53
Partitioning method example

Transaction Itemset
T1 I1, I5
T2 I2, I4
T3 I4, I5
T4 I2, I3
T5 I5
T6 I2, I3, I4

● Minimum support = 20%,
i.e. support count = ⌈20 × 6 / 100⌉ = ⌈1.2⌉ = 2
● We partition the database into 3
parts, so each partition has 2 transactions and a
minimum support count of ⌈20 × 2 / 100⌉ = ⌈0.4⌉ = 1

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 54


Partitioning method example

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 55
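A minimal Python sketch of the two-scan partitioning scheme follows; the local mining here is done by brute force (fine only for tiny partitions), and all names are illustrative rather than taken from the slides:

import math
from itertools import combinations

def local_frequent(partition, min_support_ratio):
    # Brute-force frequent itemsets of one partition (acceptable for tiny partitions).
    threshold = math.ceil(min_support_ratio * len(partition))
    counts = {}
    for t in partition:
        for r in range(1, len(t) + 1):
            for c in combinations(sorted(t), r):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {c for c, n in counts.items() if n >= threshold}

def partition_mine(transactions, min_support_ratio, n_parts):
    # Scan 1: mine each partition locally; Scan 2: verify the union of local results globally.
    size = math.ceil(len(transactions) / n_parts)
    candidates = set()
    for i in range(0, len(transactions), size):                       # scan 1
        candidates |= local_frequent(transactions[i:i + size], min_support_ratio)
    global_threshold = math.ceil(min_support_ratio * len(transactions))
    return {c for c in candidates                                     # scan 2
            if sum(c <= t for t in transactions) >= global_threshold}

# The 6-transaction example above, minimum support 20%, 3 partitions.
D = [{"I1", "I5"}, {"I2", "I4"}, {"I4", "I5"}, {"I2", "I3"}, {"I5"}, {"I2", "I3", "I4"}]
print(partition_mine(D, 0.20, 3))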


DHP(Direct Hashing with Efficient Pruning)
1. Scan all the transactions and create the possible 2-itemsets.
2. Let the hash table be of size 8.
3. For each candidate pair, apply the hash function to the order values of its
items and put the pair into the corresponding bucket of the hash table.
4. Each bucket in the hash table has a count, which is increased by 1 each time
an itemset is hashed to that bucket.
5. If the bucket count is equal to or above the minimum support count, the bit vector
is set to 1. Otherwise it is set to 0.
6. The candidate pairs that hash to locations where the bit vector bit is not set are
removed.
7. Modify the transaction database to include only these candidate pairs.
8. The support count is calculated from the entries in the buckets of the hash table,
instead of scanning the whole database.

56
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Hash based itemset counting example

Table 1: Database
TID Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E
Minimum support = 2

Table 2: Support count
Item Support count
A 2
B 3
C 3
D 1
E 3

Table 3: 2-itemset generation from Table 1
TID Items 2-itemsets generated
100 A, C, D {A,C}, {A,D}, {C,D}
200 B, C, E {B,C}, {B,E}, {C,E}
300 A, B, C, E {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
400 B, E {B,E}

● Hash function: h(x,y) = ((order of x)*10 + (order of y)) mod 7
(order = sequence number as in Table 2)
● e.g. h(A,B) = ((1*10) + 2) mod 7 = 5, count = 1
● Inserting (A,C), (C,D) into the hash table:
h(A,C) = ((1*10) + 3) mod 7 = 6, count = 1
h(C,D) = ((3*10) + 4) mod 7 = 6, count = 2
● Add each 2-itemset into the hash table according to its h
value and increment the count of that bucket 57
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
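The bucket counting on this slide can be sketched in a few lines of Python; the hash function and order values are the ones from the example, while the variable names (bucket_counts, bit_vector, candidates) are illustrative:

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}   # sequence numbers from Table 2
min_support_count = 2

def h(x, y):
    x, y = sorted((x, y), key=order.get)
    return (10 * order[x] + order[y]) % 7

# Hash every 2-itemset of every transaction into one of 7 buckets.
bucket_counts = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t), 2):
        bucket_counts[h(x, y)] += 1

bit_vector = [int(c >= min_support_count) for c in bucket_counts]

# Keep only candidate pairs whose bucket passed the threshold
# (a pair's true support count can never exceed its bucket count).
candidates = {frozenset(p) for t in transactions for p in combinations(sorted(t), 2)
              if bit_vector[h(*p)]}
print(bucket_counts, bit_vector, candidates)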
Generation of C2

- {A,C}: order of A = 1, order of C = 3
  h(x,y) = (1*10 + 3) mod 7 = 6
- {A,B}: order of A = 1, order of B = 2
  h(x,y) = (1*10 + 2) mod 7 = 5
- {B,E}: order of B = 2, order of E = 5
  h(x,y) = (2*10 + 5) mod 7 = 4
- {B,C}: order of B = 2, order of C = 3
  h(x,y) = (2*10 + 3) mod 7 = 2

- L1 = {A, B, C, E}
- L2 = {{A,C}, {B,C}, {B,E}, {C,E}}
58
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
- Counting support based on the contents of the hash table.
- Using the Apriori property to generate 3-itemsets.

59
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Table 1: Database
TID Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E

Table 4: Support count (from the hash table)
Itemset Support count
A,C 2
B,C 2
B,E 3
C,E 2

Table 5: 3-itemset generation from Table 1
TID Items 3-itemset generation
100 A, C, D {A,C,D} - x - fails the Apriori property
200 B, C, E {B,C,E} - v - right
300 A, B, C, E {A,B,C}, {A,B,E}, {A,C,E} - x - fail the Apriori property
400 B, E empty

v - right, x - wrong
- Whatever hash function is applied, an itemset's support count cannot be more than its bucket count.
- Scanning the database is reduced here, as there is no need to find the support count from the
database every time; it can be found from the hash table.
- The support count of {B,C,E} can be found from the transactions.
- We have reduced the support-count checking for 2-itemset generation in this
example.
60
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Hash based itemset counting: one more
example

61
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
62
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
63
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
DHP: Reduce the Number of Candidates
■ DHP (Direct Hashing with Efficient Pruning)
■ A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
■ Candidates: a, b, c, d, e
■ Hash table entries (count : itemsets), e.g.
35 {ab, ad, ae}
88 {bd, be, de}
...
102 {yz, qs, wt}
■ Frequent 1-itemsets: a, b, d, e
■ ab is not a candidate 2-itemset
if the sum of the counts of {ab, ad, ae} is below the support threshold

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 64


Transaction Reduction Method
A transaction that does not contain any frequent k-itemset cannot
contain any frequent (k+1)-itemset. Such a transaction can be
removed from further consideration.

TID Items       A B C D E  sum
100 A, C, D     1 0 1 1 0   3 > minsup
200 B, C, E     0 1 1 0 1   3 > minsup
300 A, B, C, E  1 1 1 0 1   4 > minsup
400 B, E        0 1 0 0 1   2 = minsup
sum: A = 2 (= minsup), B = 3 (> minsup), C = 3 (> minsup), D = 1 (< minsup), E = 3 (> minsup)

Minimum support = 2
65
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Transaction Reduction Method
Remaining table after removing column ‘D’, which doesn’t satisfy the
minimum support.

TID Items       A B C E  sum
100 A, C, D     1 0 1 0   2 = minsup
200 B, C, E     0 1 1 1   3 > minsup
300 A, B, C, E  1 1 1 1   4 > minsup
400 B, E        0 1 0 1   2 = minsup
sum: A = 2 (= minsup), B = 3 (> minsup), C = 3 (> minsup), E = 3 (> minsup)

Minimum support = 2
Frequent items are
{A, B, C, E}
66
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Transaction Reduction Method
2-itemset generation

TID Items       A,B A,C A,E B,C B,E C,E  sum
100 A, C, D      0   1   0   0   0   0    1 < minsup
200 B, C, E      0   0   0   1   1   1    3 > minsup
300 A, B, C, E   1   1   1   1   1   1    6 > minsup
400 B, E         0   0   0   0   1   0    1 < minsup
sum: A,B = 1 (< minsup), A,C = 2 (= minsup), A,E = 1 (< minsup), B,C = 2 (= minsup), B,E = 3 (> minsup), C,E = 3 (> minsup)

Minimum support = 2
L2 = {{A,C}, {B,C}, {B,E}, {C,E}}

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 67


Transaction Reduction Method
After removal, the remaining rows and columns are:

TID Items       A,C B,C B,E C,E  sum
200 B, C, E      0   1   1   1    3 > minsup
300 A, B, C, E   1   1   1   1    6 > minsup
sum: A,C = 2 (= minsup), B,C = 2 (= minsup), B,E = 3, C,E = 3

Minimum support = 2

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 68


Transaction Reduction Method
3-itemset generation

TID Items       B,C,E
200 B, C, E      1
300 A, B, C, E   1
sum: 2 = minsup

Minimum support = 2

So, for 3-itemset generation,
ONLY {B,C,E} is frequent.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 69
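A brief Python sketch of this reduction step follows. The check used here is the slightly stronger condition implied by the worked example: a transaction must contain at least k+1 frequent k-itemsets to be able to contain any frequent (k+1)-itemset. The function name reduce_transactions is illustrative:

from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    # Keep only transactions that could still contain a frequent (k+1)-itemset:
    # every k-subset of a frequent (k+1)-itemset is frequent, so at least k+1
    # frequent k-itemsets must be present in the transaction.
    return [t for t in transactions
            if sum(frozenset(c) in frequent_k for c in combinations(sorted(t), k)) >= k + 1]

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(reduce_transactions(D, L2, 2))   # only {B,C,E} and {A,B,C,E} remain for the 3-itemset pass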


Transaction reduction one more example

You can see on


https://www.youtube.com/watch?v=asWqVHex9kY

70
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Dynamic Itemset Counting:Algorithm
■ It reduces the number of passes made over the
data while keeping the number of itemsets which
are counted in any pass relatively low.
■ The technique can add new candidate itemsets
at any marked start point of the database during
the scanning of the database.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 71


Dynamic Itemset Counting:Algorithm
■ Itemsets are dynamically added and deleted as
transactions are read
■ Relies on the fact that for an itemset to be
frequent, all of its subsets must also be frequent,
so we only examine those itemsets whose
subsets are all frequent

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 72


Dynamic Itemset Counting:Algorithm
■ Algorithm stops after every M transactions to
add more itemsets.
■ Train analogy: There are stations every M
transactions. The passengers are itemsets.
Itemsets can get on at any stop as long as they
get off at the same stop in the next pass around
the database.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 73


Dynamic Itemset Counting:Algorithm
■ Only itemsets on the train are counted when
they occur in transactions. At the very beginning
we can start counting 1-itemsets, at the first
station we can start counting some of the
2-itemsets. At the second station we can start
counting 3-itemsets as well as any more
2-itemsets that can be counted, and so on.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 74


Dynamic Itemset Counting:Algorithm
■ Itemsets are marked in four different ways as
they are counted:
■ Solid box: confirmed frequent
itemset - an itemset we have finished counting
and which exceeds the support threshold minsupp
■ Solid circle: confirmed infrequent
itemset - we have finished counting and it is
below minsupp

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 75


Dynamic Itemset Counting:Algorithm
■ Dashed box: suspected
frequent itemset - an itemset we are still
counting that exceeds minsupp
■ Dashed circle: suspected
infrequent itemset - an itemset we are still
counting that is below minsupp

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 76


Dynamic Itemset Counting:Algorithm
■ Mark the empty itemset with a solid square.
Mark all the 1-itemsets with dashed circles.
Leave all other itemsets unmarked.
■ While any dashed itemsets remain:
■ Read M transactions (if we reach the end of

the transaction file, continue from the


beginning). For each transaction, increment
the respective counters for the itemsets that
appear in the transaction and are marked
with dashes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 77
Dynamic Itemset Counting:Algorithm
■ If a dashed circle's count exceeds minsupp,
turn it into a dashed square. If any immediate
superset of it has all of its subsets as solid or
dashed squares, add a new counter for it and
make it a dashed circle.
■ Once a dashed itemset has been counted
through all the transactions, make it solid and
stop counting it.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 78


Dynamic Itemset Counting:Algorithm
■ Itemset lattices: An itemset lattice contains all of
the possible itemsets for a transaction database.
Each itemset in the lattice points to all of its
supersets. When represented graphically, an
itemset lattice can help us to understand the
concepts behind the DIC algorithm.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 79


Summary of Previous Lecture
● We are learning Dynamic Itemset Counting(DIC):Algorithm
● Solid - confirmed
● Dashed - suspected
● Box - frequent
● Circle- infrequent
● Solid box - confirmed frequent
● Solid circle - confirmed infrequent
● Dashed box - suspected frequent
● Dashed circle - suspected infrequent

80
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Summary of Previous Lecture
● We read M transactions at a time.
● In the 1st pass, typically the 1-itemset counts are updated
● In the 2nd pass, typically the 2-itemset counts are updated
● Transitions go from dashed -> solid and from circle -> box
○ Dashed circle -> dashed box if count >= min support
○ Dashed box -> solid box once it has been counted through all transactions
○ Dashed circle -> solid circle once it has been counted through all transactions and count < min support

81
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Dynamic Itemset Counting:EXAMPLE

TID A B C
T1 1 1 0
T2 1 0 0
T3 0 1 1
T4 0 0 0

Minimum support = 25%, M = 2 (number of transactions read at a time)

Minimum support count = 25/100 * 4 = 1
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 82
Dynamic Itemset Counting:EXAMPLE

Itemset lattice for the


transaction database:

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 83


Dynamic Itemset Counting:EXAMPLE
Itemset lattice before any
transactions are read:

● Counters: A = 0, B = 0, C = 0
● Empty itemset is marked with
a solid box. All 1-itemsets are
marked with dashed circles.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 84


Dynamic Itemset Counting:EXAMPLE
After M transactions are read:
T1: A, B, T2: A
● Counters: A = 2, B = 1, C = 0,
AB = 0
● We change A and B to dashed
boxes because their counters
are at least minsup (1)
and add a counter for AB
because both of its subsets
are boxes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 85
Dynamic Itemset Counting:EXAMPLE
After 2M transactions are read:
T3: B,C, T4: {}
-Counters: A = 2, B = 2, C = 1, AB
= 0, AC = 0, BC = 0
-C changes to a square because
its counter is >= minsup. A, B and
C have been counted all the way
through so we stop counting them
and make their boxes solid. Add
counters for AC and BC because
their subsets are all boxes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 86
Dynamic Itemset Counting:EXAMPLE
After 3M transactions are read:
T1: A, B, T2: A
● Counters: A = 2, B = 2, C = 1,
AB = 1, AC = 0, BC = 0
● AB has been counted all the
way through and its counter
satisfies minsup so we change
it to a solid box. BC remains
a dashed circle.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 87


Dynamic Itemset Counting:EXAMPLE
After 4M transactions are read:
T3:B, C T4: {}

● Counters: A = 2, B = 2, C = 1,
AB = 1, AC = 0, BC = 1
● AC and BC are counted all the
way through. We do not count
ABC because one of its
subsets is a circle. There are
no dashed itemsets left so the
algorithm is done.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 88
Discussion of the Apriori algorithm
● Much faster than the Brute-force algorithm
○ It avoids checking all elements in the lattice
● The running time is in the worst case O(2^d)
○ Pruning really prunes in practice
● It makes multiple passes over the dataset
○ One pass for every level k
● Multiple passes over the dataset is inefficient when we
have thousands of candidates and millions of transactions

89
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
■ Bottlenecks of the Apriori approach
■ Breadth-first (i.e., level-wise) search
■ Candidate generation and test
■ Often generates a huge number of candidates

■ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
■ Depth-first search
■ Avoid explicit candidate generation
■ Major philosophy: Grow long patterns from short ones using local frequent
items only
■ “abc” is a frequent pattern
■ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
■ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
90
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Frequent Patterns Without Candidate
Generation
● Compress a large database into a compact, Frequent-Pattern tree (FP-tree)
structure
○ highly condensed, but complete for frequent pattern mining
○ avoid costly database scans
● Develop an efficient, FP-tree-based frequent pattern mining method
○ A divide-and-conquer methodology: decompose mining tasks into smaller ones
○ Avoid candidate generation: sub-database test only!

91
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method
● First, create the root of the tree, labeled with “null”.
● Scan the database D a second time. (The first time we scanned it to
create the 1-itemsets and then L.)
● The items in each transaction are processed in L order (i.e. sorted
by descending support count), e.g. Bread:5, Butter:3.
● A branch is created for each transaction, with items carrying their
support count separated by a colon, e.g. I2:2.
● Whenever the same node is encountered in another transaction, we
just increment the support count of the common node or prefix.
● To facilitate tree traversal, an item header table is built so that each
item points to its occurrences in the tree via a chain of node-links.
● Now, the problem of mining frequent patterns in the database is
transformed to that of mining the FP-tree.
92
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
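The construction described above can be sketched in Python as follows; this is an illustrative sketch, not a library API, and the class and function names are assumptions. The header table is kept as a list of nodes per item, standing in for the chain of node-links:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}            # item -> Node

def build_fp_tree(transactions, min_support_count):
    # First scan: count items and keep the frequent ones.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    root = Node(None, None)
    header = defaultdict(list)        # item -> list of nodes (node-links)

    # Second scan: insert each transaction with items in L order (descending count).
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1           # shared prefix nodes just get their count incremented
    return root, header

# The 9-transaction example, min support count 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
root, header = build_fp_tree(D, 2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})
# e.g. I2 -> [7], I1 -> [4, 2], I3 -> [2, 2, 2], matching the tree built on the next slides.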
FP-Growth Biggest Advantages
● The biggest advantage found in FP-Growth is the fact that the
algorithm only needs to read the file twice, as opposed to Apriori,
which reads it once for every iteration.
● Another huge advantage is that it removes the need to calculate
the pairs to be counted, which is very processing heavy, because
it uses the FP-tree. This makes it O(n), which is much faster than
Apriori.
● The FP-Growth algorithm stores in memory a compact version of
the database.

93
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method : An Example

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first find the frequent itemsets, this time using the FP-Growth method.
• Then, association rules will be generated using min. support & min.
confidence. 94
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method : An Example
● Step 1: The first step is to count all
the items in all the transactions.
Item Support Count
{I2} 7
{I1} 6
{I3} 6
{I4} 2
{I5} 2
● Step 2: Next we apply the threshold
we had set previously:
2/9 ≈ .22 >= mini_support, so all items with
support_count >= 2 have support >= mini_support.
● Step 3: Now we sort the list according
to the count of each item, in descending order.

Sorted L = [I2:7, I1:6, I3:6, I4:2, I5:2]


95
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4: Now we build the tree. We go
through each of the transactions and
add all the items in the order they
appear in our sorted list.

1. Transaction to add = [I1, I2, I5]

According to the sorted list it will be added as
[I2, I1, I5] - set count = 1 for each item.

[FP-tree: null → I2:1 → I1:1 → I5:1]

96
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.): We go through each of
the transactions and add all the items
in the order they appear in our sorted list.

2. Transaction to add = [I2, I4]

According to the sorted list it will be added as
[I2, I4] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:2 → {I1:1 → I5:1 ; I4:1}]

97
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

3. Transaction to add = [I2, I3]

According to the sorted list it will be added as
[I2, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:3 → {I1:1 → I5:1 ; I4:1 ; I3:1}]

98
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

4. Transaction to add = [I1, I2, I4]

According to the sorted list it will be added as
[I2, I1, I4] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:4 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:1}]

99
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

5. Transaction to add = [I1, I3]

According to the sorted list it will be added as
[I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:4 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:1} ; I1:1 → I3:1}]

100
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

6. Transaction to add = [I2, I3]

According to the sorted list it will be added as
[I2, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:5 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:2} ; I1:1 → I3:1}]

101
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

7. Transaction to add = [I1, I3]

According to the sorted list it will be added as
[I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:5 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

102
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 4 (contd.):

8. Transaction to add = [I1, I2, I3, I5]

According to the sorted list it will be added as
[I2, I1, I3, I5] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:6 → {I1:3 → {I5:1 ; I4:1 ; I3:1 → I5:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

103
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

9. Transaction to add = [I1, I2, I3]

According to the sorted list it will be added as
[I2, I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree (final): null → {I2:7 → {I1:4 → {I5:1 ; I4:1 ; I3:2 → I5:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

104
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5: In order to get the
associations, we now go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[Final FP-tree as above]

[I2:7, I1:4] in the available branch.
Together they appear 4 times in the
branch. So association:
{I2, I1} = 4/9 = 44%

105
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I2:7, I1:4, I3:2] in the available
branch. Together they appear 2
times in the branch. So association:
{I2, I1, I3} = 2/9 = 22%

106
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I2:7, I3:2] in the available branch.
Together they appear 2 times in the
branch. So association:
{I2, I3} = 2/9 = 22%

107
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I1:2, I3:2] in the available branch.
Together they appear 2 times in the
branch. So association:
{I1, I3} = 2/9 = 22%

108
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
Step 5: So the associations are:

{I2, I1} = 4/9 = 44%

{I2, I1, I3} = 2/9 = 22%

{I2, I3} = 2/9 = 22%

{I1, I3} = 2/9 = 22%

109
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Exercise
min_support = 3

TID Items bought               (Ordered) frequent items
100 {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200 {a, b, c, f, l, m, o}      {f, c, a, b, m}
300 {b, f, h, j, o, w}         {f, b}
400 {b, c, k, s, p}            {c, b, p}
500 {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Frequent items (support count): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
Items that didn’t pass the min_support threshold: l: 2, d: 1, g: 1, h: 1, i: 1, j: 1, w: 1

110
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, c, a, m, p}:

[FP-tree: root → f:1 → c:1 → a:1 → m:1 → p:1]

111
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, c, a, b, m}:

[FP-tree: root → f:2 → c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1}]

112
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, b}:

[FP-tree: root → f:3 → {c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1} ; b:1}]

113
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {c, b, p}:

[FP-tree: root → {f:3 → {c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1} ; b:1} ; c:1 → b:1 → p:1}]

114
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example
Check each item’s support count from the tree against the table.

Ordered frequent items:
Item Support Count
f 4
c 4
a 3
b 3
m 3
p 3

[Final FP-tree: root → {f:4 → {c:3 → a:3 → {m:2 → p:2 ; b:1 → m:1} ; b:1} ; c:1 → b:1 → p:1}]
115
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Create Conditional FP-tree to mine association rules: Example
- Create the conditional pattern base: for each item, list the paths used to reach that item's nodes,
writing each path together with the count of the node reached.
- E.g. node p is reached by two paths, f-c-a-m and c-b. The count of p
through f-c-a-m is 2 and through c-b is 1, so we write fcam:2
and cb:1 in the CPB.

[Final FP-tree as on the previous slide]

Item  Conditional Pattern Base (CPB)   Conditional FP-tree
f     Empty                            Empty
c     {f:3}                            {f:3}|c
a     {fc:3}                           {f:3, c:3}|a
b     {fca:1}, {f:1}, {c:1}            Empty (no common prefix)
m     {fca:2}, {fcab:1}                {f:3, c:3, a:3}|m
p     {fcam:2}, {cb:1}                 {c:3}|p

For the conditional FP-tree: take whatever is common in more
than one path reaching a particular node and add their
counts, e.g. {fca:2}, {fcab:1} -> fca is common, and 2+1=3,
giving {f:3, c:3, a:3}|m.

Explanation: https://www.youtube.com/watch?v=y8iHL6vKgIo

116
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth DisAdvantages
● The biggest problem is the interdependency of the data. The interdependency
problem is that, for the parallelization of the algorithm, some data still needs to
be shared, which creates a bottleneck in the shared memory.

117
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori vs FP-Growth

118
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori
- Technique: generate candidate singletons, pairs, triplets, etc.
- Runtime: candidate generation is extremely slow; runtime increases
exponentially depending on the number of different items.
- Memory usage: saves candidate singletons, pairs, triplets, etc.
- Parallelizability: candidate generation is very parallelizable.

FP-Growth
- Technique: insert items, sorted by frequency, into a pattern tree.
- Runtime: increases linearly, depending on the number of transactions and items.
- Memory usage: stores a compact version of the database.
- Parallelizability: data are very interdependent; each node needs the root.
119
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Associative Classification Mining
● Given a labeled training data set, the problem is to derive a set of
class association rules (CARs) from the training data set which
satisfy certain user-constraints, i.e support and confidence
thresholds.
● Common Associative Algorithms:
○ CBA : Class based Association
○ CPAR : Classification based on Predictive Association Rule
○ CMAR : Classification Based on Multiple Class-Association Rules
○ MCAR : Multi-class Classification based on Association Rule

120
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
In detail few algorithms
● CBA is a Class based Association Rule Mining (CARM) algorithm
developed by Bing Liu, Wynne Hsu and Yiming Ma (Liu et al. 1998).
● CBA operates using a two-stage approach to generate a classifier:
○ Generating a complete set of CARs (Classification Association Rules).
○ Pruning the set of CARs to produce a classifier.
● Uses the Apriori algorithm to generate candidate sets.
● CMAR: Classification Based on Multiple Class-Association Rules
○ uses approaches based on the frequent pattern (FP)-growth method to discover rules
○ CBA and CMAR are time consuming
● CPAR: Classification based on Predictive Association Rules
○ CPAR and other predictive mining algorithms overcome this problem by generating
a small set of predictive rules directly from the dataset based on rule prediction and
coverage analysis, as opposed to generating candidate rules.

121
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Associative Classification Mining-
algorithms
● MCAR- Multi-class Classification based on Association Rule
○ MCAR uses an efficient technique for discovering frequent items and
employs a rule ranking method which ensures detailed rules with high
confidence are part of the classifier.

122
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Incremental ARM
● In incremental association rule mining, as time goes on, new
transactions are added and old transactions become obsolete.
● Old rules may be dropped and new rules may appear.
● Incremental algorithms are:
○ FUP (fast update)
○ FUP2
○ UPDATE WITH EARLY PRUNING(UWEP)
○ Negative Border

123
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FUP
● FUP is the first algorithm of incremental association rule mining.
● It works with inserted transactions only.
● It cannot work with deleted transactions.
● It performs multiple scans of the database, i.e. it scans the incremented
database as well as the old database.
● It performs a similar operation for each k-itemset.
● Original database D and its corresponding frequent itemsets L = {L1 ...
Lk}.
● The goal is to find the frequent itemsets of D’ = D ∪ Δ+, where Δ+ is the increment (the newly added transactions).

124
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FUP2
● It is an extension of FUP algorithm.
● It works with incremented database as well as decremented
database.
● So i.e. it will handle deletion of transaction from old database also.
● FUP2 is equivalent to FUP for the case of insertion, and is, however,
a complementary algorithm of FUP for the case of deletion.
● For a general case that transactions are added and deleted, algorithm
FUP2 can work smoothly with both the deleted portion Δ− and the
added portion Δ+ of the whole dataset.
● It gives poor results if it is used with a temporal database. [A temporal
database stores data relating to time instances. It offers temporal data
types and stores information relating to past, present and future time.]
125
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
UPDATE WITH EARLY

PRUNING (UWEP)
● It is a variant of the FUP algorithm.
● The Update With Early Pruning algorithm prunes an itemset of the original dataset
as soon as it becomes infrequent in the updated database D’.
● It does not wait until the kth iteration is completed, so it reduces the candidate
set generation in the incremented database.

126
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Negative Border
● Negative border algorithm is used for improving efficiency of FUP-based
algorithm
● Given a collection of frequent itemsets L, the negative border Bd−(L) of L consists
of the itemsets R which are not in L.
● In other words, the negative border consists of all itemsets that were
candidates of the level-wise method which did not have enough support.
● This algorithm first scans incremented part of database and then whole
database is scanned if and only if itemset outside of negative border gets
added to frequent itemset. This may result into increasing size of candidate
set generation.

127
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Next Lecture

128
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Multiple-Level Association Rules
■ Items often form hierarchies
■ Flexible support settings
■ Items at the lower level are expected to have lower support
■ Exploration of shared multi-level mining (Agrawal &
Srikant@VLDB’95, Han & Fu@VLDB’95)

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Example: Level 1: Milk [support = 10%];
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 129


Multi-level Association: Flexible Support and
Redundancy filtering
■ Flexible min-support thresholds: Some items are more valuable but
less frequent
■ Use non-uniform, group-based min-support

■ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …

■ Redundancy Filtering: Some rules may be redundant due to


“ancestor” relationships between items
■ milk ⇒ wheat bread [support = 8%, confidence = 70%]

■ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

The first rule is an ancestor of the second rule


■ A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 130
References:
1. Apriori, http://www.cs.uic.edu/~liub/teach/cs583-fall-11/CS583-association-sequential-patterns.ppt
2. Apriori, http://cse.iitkgp.ac.in/~bivasm/uc_notes/07apriori.pdf
3. Apriori, https://en.wikipedia.org/wiki/Apriori_algorithm
4. Apriori, https://project.dke.maastrichtuniversity.nl/datamining/material/lecture07.ppt
5. Apriori, https://www.cs.sjsu.edu/~lee/cs157b/Gaurang%20Negandhi--Apriori%20Algorithm%20Presentation.ppt
6. Apriori, http://cs-people.bu.edu/evimaria/cs565-12/lect2.pptx
7. Apriori video, https://www.youtube.com/watch?v=l7n4K12EjY0
8. Full calculation of apriori and fp-algorithms, http://www3.cs.stonybrook.edu/~cse634/lecture_notes/07apriori.pdf
9. FP Growth, http://www.singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining
10. FP conditional tree, http://www.cis.hut.fi/Opinnot/T-61.6020/2008/fptree.pdf
11. Conditional FP tree video, https://www.youtube.com/watch?v=LXx1xKF9oDg
12. FP tree video, https://www.youtube.com/watch?v=W2Cp0uuFO1s&t=159s
13. FP Growth, https://www.youtube.com/watch?v=UbR1qXuIeJY
14. Associative Classification Mining, https://pdfs.semanticscholar.org/6145/5083a199a3844d209d19636248d63a0fec9f.pdf
15. Wei-Guang Teng and Ming-Syan Chen, “Incremental Mining on Association Rules”,
https://pdfs.semanticscholar.org/e0bd/1ea23f79e427d933505ef2899ccf148874b6.pdf
16. Ms. Anju k.kakkad, Ms. Anita Zala, “Incremental Association Rule Mining by Modified Approach of Promising Frequent Itemset
Algorithm Based on Bucket Sort Approach”, http://www.ijarcce.com/upload/2013/november/45-s-anju_kakkad-incremental.pdf
17. Jong Soo Park, Ming-syan Chen, Philip S. Yu, “An effective hash based algorithm”,1995,
http://user.it.uu.se/~kostis/Teaching/DM-01/Handouts/PCY.pdf
18. Wenmin Li Jiawei Han Jian Pei, “CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules”,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.219&rep=rep1&type=pdf
131
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
References:
19. Data characterization and discrimination, https://www.youtube.com/watch?v=SW7-o86iL3w&t=181s
20. Data characterization and discrimination exampleas, https://www.sciencedirect.com/topics/computer-science/data-
characterization#:~:text=1.4.,customers%20include%20bigSpenders%20and%20budgetSpenders.
21. Apriori algorithm improvement methods, https://www.youtube.com/watch?v=asWqVHex9kY
22. Dynamic Itemset Counting example,
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/DIC.html#:~:text=Each%20itemset%20in%20the%20lattice,25%25%20an
d%20M%20%3D%202.

132
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
HASH BASED APRIORI ALGORITHM
1. Scan all the transactions and create the possible 2-itemsets.
2. Let the hash table be of size 8.
3. For each candidate pair, apply the hash function to the ASCII
values of its items and assign the pair to a bucket.
4. Each bucket in the hash table has a count, which is increased by
1 each time an itemset is hashed to that bucket.
5. If the bucket count is equal to or above the minimum support count,
the bit vector is set to 1. Otherwise it is set to 0.
6. The candidate pairs that hash to locations where the bit vector
bit is not set are removed.
7. Modify the transaction database to include only these candidate
pairs. 133
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

134
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

135
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

136
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Can be added
Lift

Drawing the conditional FP-tree and deriving rules from it.

137
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
MODIFY THE NEXT EXAMPLE

138
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
● Transaction TMario= [ [beer, bread, butter, milk] , [beer, milk, butter],
[beer, milk, cheese] , [beer, butter, diapers, cheese] , [beer, cheese,
bread] ]
● Step 1: The first step is we count all the items in all the transactions
TMario= [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3, diapers: 1]
● Step 2: Next we apply the threshold. For this example let's say we have a
threshold of 30% so each item has to appear at least twice.
TMario= [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3, diapers: 1]
● Step 3: Now we sort the list according to the count of each item.
TMarioSorted = [ beer: 5, butter: 3, milk: 3, cheese: 3, bread: 2], removed
diapers:1

139
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4: Now we build the tree. We go


through each of the transactions and
add all the items in the order they
appear in our sorted list.

1.Transaction to add= [beer, bread,


butter, milk]

a/c sorted list it will be added as

[beer, butter, milk, bread] - set


count=1 for each item
140
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

2.Transaction 2: [beer, milk, butter]

a/c sorted list it will be added as

[beer, butter, milk] - increment count


if revisited the node, else set count=1
for each item

141
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

3.Transaction 3=[beer, milk, cheese]

a/c sorted list it will be added as

[beer, milk, cheese] - increment


count if revisited the node, else set
count=1 for each item

142
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

4.Transaction 4=[beer, butter,


diapers, cheese]

-diaper has not passed the minimum


support threshold, so removed it

-a/c sorted list it will be added as

[beer, butter, cheese] - increment


count if revisited the node, else set
count=1 for each item

143
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

5.Transaction 5= [beer, cheese,


bread]

a/c sorted list it will be added as

[beer, cheese, bread].

- increment count if revisited the


node, else set count=1 for each item

144
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
Step 5: In order to get the
associations now we go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[beer:4, butter:2, milk:2] in the


available branch. Together they
appear 2 times in branch. So
association,

{beer, butter, milk}=⅖ = 40%


145
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5: In order to get the


associations now we go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[beer:4, cheese:2] in the available


branch. Together they appear 2
times in branch. So association,

{beer, cheese}=⅖ = 40%


146
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
