
8.

Association Rule Mining


Subject: Machine Learning (3170724)
Dr. Ami Tusharkant Choksi
Associate Professor and HOD, Computer
Department,
C.K.Pithawalla College of Engineering &
Technology, Surat.
Website: www.ckpcet.ac.in

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Contents
Association rules

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


What is association rule mining?
● Association rule mining is a procedure that aims to discover
frequently occurring patterns, correlations, or associations
in datasets stored in various kinds of databases, such as
relational databases, transactional databases, and other
forms of repositories.

3
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Bread - 100

Butter - 97

Minimum count=75

(bread, butter)-77

T1:(butter, chocolates,bread)

T2:(bread, kurkure, butter)

………..

4
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Background
● Proposed by Agrawal et al. in 1993.
● Assume all data are categorical.
● No good algorithm for numeric data.
● Initially used for Market Basket Analysis to find how items
purchased by customers are related.

Bread → Milk [sup = 5%, conf = 100%]

5
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Basket Data
● Retail organizations, e.g.,
supermarkets, collect and store
massive amounts of sales data,
called basket data.
● A record consists of
○ transaction date
○ items bought
● Or, basket data may consist of
items bought by a customer over
a period.
6
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The model: data
■ I = {i1, i2, …, im}: a set of items.
■ Transaction t :
❑ t is a set of items, and t ⊆ I.

■ Transaction Database T: a set of transactions


T = {t1, t2, …, tn}.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 7


Transaction data: supermarket data
■ Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
■ Concepts:
❑ An item: an item/article in a basket
❑ I: the set of all items sold in the store
❑ A transaction: items purchased in a basket; it may
have TID (transaction ID)
❑ A transactional dataset: A set of transactions
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 8
Transaction data: a set of documents
■ A text document data set. Each document is treated as a
“bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
I = {Student, Teach, School, City, Game, Baseball, Basketball,
Player, Spectator, Coach, Team}
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 9
The model: rules
■ A transaction t contains X, a set of items (itemset) in I, if
X ⊆ t.
■ An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩Y = ∅
■ An itemset is a set of items.
❑ E.g., X = {milk, bread, cereal} is an itemset.
■ A k-itemset is an itemset with k items.
❑ E.g., {milk, bread, cereal} is a 3-itemset

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 10


Rule strength measures
■ Support: The rule holds with support sup in T (the
transaction data set) if sup% of transactions contain X ∪
Y.
❑ sup = Pr(X ∪ Y).
■ Confidence: The rule holds in T with confidence conf if
conf% of transactions that contain X also contain Y.
❑ conf = Pr(Y | X)
■ An association rule is a pattern that states when X
occurs, Y occurs with certain probability.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 11


Support and Confidence
■ Support count: The support count of an itemset X,
denoted by X.count, in a data set T is the number of
transactions in T that contain X. Assume T has n
transactions.
■ Then,
sup(X → Y) = (X ∪ Y).count / n
conf(X → Y) = (X ∪ Y).count / X.count

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 12


Goal and key features
■ Goal: Find all rules that satisfy the user-specified
minimum support (minsup) and minimum confidence
(minconf).

■ Key Features
❑ Completeness: find all rules.
❑ No target item(s) on the right-hand side
❑ Mining with data on hard disk (not in memory)

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 13


Example

TID Items
1 Bread, Peanuts, Milk, Fruit, Jam
2 Bread, Jam, Soda, Chips, Milk, Fruit
3 Jam, Soda, Chips, Bread
4 Jam, Soda, Peanuts, Milk, Fruit
5 Jam, Soda, Chips, Milk, Bread
6 Fruit, Soda, Chips, Milk
7 Fruit, Soda, Peanuts, Milk
8 Fruit, Peanuts, Cheese, Yogurt

● Itemset
○ A collection of one or more items, e.g., {milk, bread, jam}
○ k-itemset: an itemset that contains k items
● Support count
○ Frequency of occurrence of an itemset
○ count({Milk, Bread}) = 3
○ count({Soda, Chips}) = 4
● Support
○ Fraction of transactions that contain an itemset
○ s({Milk, Bread}) = 3/8
○ s({Soda, Chips}) = 4/8
● Frequent Itemset
○ An itemset whose support is greater than
or equal to a minsup threshold
14
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
What is an association rule?
● Implication of the form X → Y, where X and Y are itemsets
● Example: {bread} → {milk}
● Rule evaluation metrics: Support and Confidence
● Support (s): Fraction of transactions that contain both X and
Y
● Confidence (c): Measures how often items in Y appear in
transactions that contain X
● support(bread → milk) = count(bread, milk) / number of transactions = 3/8 = 0.38
● confidence(bread → milk) = support(bread, milk) / support(bread) = (3/8)/(4/8) = 0.75
15
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
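The two measures can be computed directly from the transaction table. Below is a minimal Python sketch (not from the slides; the names transactions, support and confidence are illustrative), using the eight transactions of the example above:

# Eight market-basket transactions from the example slide.
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(lhs -> rhs) = support(lhs U rhs) / support(lhs)
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Milk", "Bread"}, transactions))       # 3/8 = 0.375
print(confidence({"Bread"}, {"Milk"}, transactions))   # (3/8)/(4/8) = 0.75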
Support and Confidence

16
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
What is the goal?
● Given a set of transactions T, the goal of association rule
mining is to find all rules having support ≥ minsup and
confidence ≥ minconf, for user-specified thresholds
● Mining Association Rules
● {Bread, Jam}=> {Milk} s=0.4 c=0.75
● {Milk, Jam}=> {Bread} s=0.4 c=0.75
● {Bread} =>{Milk, Jam} s=0.4 c=0.75
● {Jam}=> {Bread, Milk} s=0.4 c=0.6
● {Milk} =>{Bread, Jam} s=0.4 c=0.5
17
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Association Rules
● All the above rules are binary partitions of the same
itemset: {Milk, Bread, Jam}
● Rules originating from the same itemset have identical
support but can have different confidence
● x ⇒ y: confidence = (x ∪ y).count / x.count
● y ⇒ x: confidence = (x ∪ y).count / y.count
● x ⇒ y and y ⇒ x are not the same
● We can decouple the support and confidence
requirements!
18
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Association Rules: Two Step
Approach
● Frequent Itemset Generation: generate all itemsets
whose support ≥ minsup
● Rule Generation: generate high-confidence rules from each
frequent itemset; each rule is a binary partitioning of a
frequent itemset
● Frequent itemset generation is computationally expensive

19
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Applications

● Market Basket Analysis: given a database of customer transactions,
where each transaction is a set of items, the goal is to find groups of
items which are frequently purchased together.
● Telecommunication (each customer is a transaction containing the
set of phone calls)
● Credit Cards/ Banking Services (each card/account is a transaction
containing the set of customer’s payments)
● Medical Treatments (each patient is represented as a transaction
containing the ordered set of diseases)
● Basketball-Game Analysis (each game is represented as a
transaction containing the ordered set of ball passes)
20
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Next Lecture

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Apriori Algorithm
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for
boolean association rules.

Key Concepts :

• Frequent Itemsets: the sets of items that satisfy minimum support (denoted by Li
for the i-itemset level).

• Apriori Property: any subset of a frequent itemset must be frequent.

• Join Operation: To find Lk , a set of candidate k-itemsets is generated by


joining Lk-1 with itself.

22
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
(bread, butter) - the 2-itemset is frequent only if

bread is frequent and butter is frequent

(bread, butter, cheese) is frequent only if (bread, butter), (butter, cheese), (bread, cheese) are frequent

To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with itself.

(bread, butter, cheese) - 3-itemset

{(bread, butter), (butter, cheese), (bread, cheese)} - 2-itemsets

23
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm in a Nutshell
• Find the frequent itemsets: the sets of items that have
minimum support
– A subset of a frequent itemset must also be a frequent
itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must
be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
• Use the frequent itemsets to generate association rules.
24
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Frequent Itemset Generation

25
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm : Pseudo code
Lk: Set of frequent itemsets of size k (with min support)
Ck: Set of candidate itemset of size k (potentially frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in
        Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;

26
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
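The pseudocode above can be turned into a compact runnable sketch. The following Python code is only an illustrative implementation (names such as apriori and min_support_count are assumptions, not from the slides); candidate generation is done with a simple self-join of the frequent (k-1)-itemsets followed by the Apriori prune:

from itertools import combinations

def apriori(transactions, min_support_count):
    # Level-wise frequent-itemset mining following the pseudocode above.
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan over the transactions to count every candidate.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # L1: frequent 1-itemsets.
    counts = count({frozenset([i]) for t in transactions for i in t})
    level = {c for c, n in counts.items() if n >= min_support_count}
    frequent = {c: counts[c] for c in level}

    k = 1
    while level:
        # Join: combine frequent k-itemsets into (k+1)-item candidates;
        # Prune: drop candidates having an infrequent k-subset (Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        counts = count(candidates)
        level = {c for c, n in counts.items() if n >= min_support_count}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent     # union over all levels, with support counts

# The 9-transaction database used in the worked example below, min support count 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
for itemset, sc in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sc)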
The Apriori Algorithm : Pseudo code (snapshot)

27
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm — Example

28
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Explanation
No of transactions=4

Minimum support=50%

Minimum Support count=4*50/100 = 2

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Minimum support 50%

No of transactions = 4

Minimum support count = 50% × 4 = 2

30
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
The Apriori Algorithm — Example...
● {1,3,5} is not considered as part of C3 because the Apriori
property is not satisfied.
● According to the Apriori property, {1,3,5} can be part of C3 ONLY
if all of its subsets are part of L2, the frequent 2-itemsets.
● I.e. {1,3}, {3,5} and {1,5} must all be part of L2, which they are not.
● Similarly, {1,2,3} is also not part of C3, because {1,3} and {2,3}
are part of L2 but {1,2} is not part of L2.

31
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
How to Generate Candidates
Input: Li-1 : set of frequent itemsets of size i-1
Output: Ci : set of candidate itemsets of size i
Ci = empty set;
for each itemset J in Li-1 do
    for each itemset K in Li-1 s.t. K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all subsets of {K ∪ J} are in Li-1 then
                Ci = Ci ∪ {K ∪ J}
return Ci;

32
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of Generating Candidates
•L3={abc, abd, acd, ace, bcd}
•Generating C4 from L3
–abcd from abc and abd
–acde from acd and ace
•Pruning:
–acde is removed because ade is not in L3
•C4={abcd}

33
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
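As a sketch of this join-and-prune step, the following hypothetical Python function reproduces the example above (L3 = {abc, abd, acd, ace, bcd} giving C4 = {abcd}); the name generate_candidates is illustrative, not from the slides:

from itertools import combinations

def generate_candidates(L_prev, k):
    # Join frequent (k-1)-itemsets sharing k-2 items, then prune by the Apriori property.
    L_prev = {frozenset(x) for x in L_prev}
    joined = {a | b for a in L_prev for b in L_prev
              if a != b and len(a & b) == k - 2}                       # join step
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}  # prune step

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3, 4)
print(["".join(sorted(c)) for c in C4])   # ['abcd']; acde is pruned because ade is not in L3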
Example of Discovering Rules
Let us consider the 3-itemset {I1, I2, I5}:

I1 ^ I2 => I5

I1 ^ I5 => I2

I2 ^ I5 => I1

I1 => I2 ^ I5

I2 => I1 ^ I5

I5 => I1 ^ I2
34
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Discovering Rules
● If Confidence(A ⇒ B) = support(A ∪ B) / support(A) >= minconf

then the rule A ⇒ B is accepted

● One more example: if the rule's confidence

>= minconf

then the rule is accepted


35
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of Discovering Rules

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

Let us consider the 3-itemset {I1, I2, I5}, with support count 2 (support ≈ 22%).
Let us generate all the association rules from this itemset:
I1 ^ I2 => I5   confidence = 2/4 = 50%
I1 ^ I5 => I2   confidence = 2/2 = 100%
I2 ^ I5 => I1   confidence = 2/2 = 100%
I1 => I2 ^ I5   confidence = 2/6 = 33%
I2 => I1 ^ I5   confidence = 2/7 = 29%
I5 => I1 ^ I2   confidence = 2/2 = 100%
36
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori complete example

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support & min.
confidence. 37
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 1: Generating 1-itemset Frequent Pattern

C1 (candidates with support counts):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

Compare with mini_support: 2/9 ≈ .22 >= mini_support, so all items with
support_count >= 2 have support >= mini_support.

L1 (frequent 1-itemsets):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

38
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 2: Generating 2-itemset Frequent Pattern

C2 (candidates with support counts):
{I1, I2}: 4, {I1, I3}: 4, {I1, I4}: 1, {I1, I5}: 2, {I2, I3}: 4,
{I2, I4}: 2, {I2, I5}: 2, {I3, I4}: 0, {I3, I5}: 1, {I4, I5}: 0

Compare with mini_support: 2/9 ≈ .22 >= mini_support, so all itemsets with
support_count >= 2 have support >= mini_support.

L2 (frequent 2-itemsets):
{I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2

39
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 3: Generating 3-itemset Frequent Pattern

C3 (after join and prune, with support counts): {I1, I2, I3}: 2, {I1, I2, I5}: 2
Compare with mini_support (support_count >= 2):
L3 (frequent 3-itemsets): {I1, I2, I3}: 2, {I1, I2, I5}: 2

The generation of the set of candidate 3-itemsets, C3, involves
use of the Apriori Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I3, I5}, {I2, I4, I5}}.
• Now, the Join step is complete and the Prune step will be used to
reduce the size of C3. The Prune step helps to avoid heavy
computation due to a large Ck.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
40
Step 3: Generating 3-itemset Frequent Pattern
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that four of the six candidates cannot possibly be frequent.
How?
• For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed.
Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
• BUT, {I3, I5} is not a member of L2 and hence it is not frequent, violating the Apriori
property. Thus we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the
Join operation during Pruning.
• Now, the transactions in D are scanned in order to determine
L3, consisting of those candidate 3-itemsets in C3 having minimum support.
41
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 4: Generating 4-itemset Frequent
Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4.
Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset
{I2, I3, I5} is not frequent.

• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
This completes our Apriori Algorithm.

• What’s Next ?

These frequent itemsets will be used to generate strong association rules ( where
strong association rules satisfy both minimum support & minimum confidence).

42
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 5: Generating Association Rules from Frequent
Itemsets
Procedure:
• For each frequent itemset “l”, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule “s → (l − s)” if support_count(l) /
support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
• Back to the example: we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5},
{I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1,I2,I5}.
– Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

43
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
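A short Python sketch of this rule-generation procedure is given below. It assumes the frequent itemsets and their support counts are already available in a dict (filled here with the values from the 9-transaction example); the name generate_rules is illustrative:

from itertools import combinations

def generate_rules(frequent, min_conf):
    # For every frequent itemset l and nonempty proper subset s, output s -> (l - s)
    # when support_count(l) / support_count(s) >= min_conf.
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = l_count / frequent[s]   # every subset of a frequent itemset is frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Frequent itemsets and support counts of the 9-transaction example (min support count 2).
frequent = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I3",): 6, ("I4",): 2, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I3"): 4, ("I1", "I5"): 2, ("I2", "I3"): 4,
    ("I2", "I4"): 2, ("I2", "I5"): 2, ("I1", "I2", "I3"): 2, ("I1", "I2", "I5"): 2,
}.items()}

for lhs, rhs, conf in generate_rules(frequent, 0.7):
    print(lhs, "=>", rhs, f"confidence={conf:.0%}")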
Step 5: Generating Association Rules from Frequent
Itemsets
Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below,
each listed with its confidence.
– R1: I1 ^ I2 => I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
– R2: I1 ^ I5 => I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
– R3: I2 ^ I5 => I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
44
Sampling for Frequent Patterns
■ Select a sample of original database, mine frequent
patterns within sample using Apriori
■ Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
■ Example: check abcd instead of ab, ac, …, etc.
■ Scan database again to find missed frequent patterns
■ H. Toivonen. Sampling large databases for association
rules. In VLDB’96
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 45
Step 5: Generating Association Rules from Frequent
Itemsets
– R4: I1 => I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
– R5: I2 => I1 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is Rejected.
– R6: I5 => I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is Selected.
In this way, We have found three strong
association rules.
46
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Strengths and weaknesses of the Apriori
algorithm

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


End for ML. The remaining slides are extra
material for the chapter.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)


Multiple minimum class supports
● The multiple minimum support idea can also be applied here.
● The user can specify different minimum supports for different classes,
which effectively assigns a different minimum support to the rules of each
class.
● For example, we have a data set with two classes, Yes and No. We
may want
○ rules of class Yes to have the minimum support of 5% and
○ rules of class No to have the minimum support of 10%.
● By setting minimum class supports to 100% (or more for some
classes), we tell the algorithm not to generate rules of those classes.
● This is a very useful trick in applications.
49
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Advantages and Disadvantages
● Advantages:
- Uses large itemset property.
- Easily parallelized
- Easy to implement.
● Disadvantages:
- Assumes transaction database is memory resident.
- Requires many database scans.

50
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold cannot
be frequent.
• Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans.
• Partitioning: Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB.
• Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
• Dynamic itemset counting: Add new candidate itemsets only
when all of their subsets are estimated to be frequent. 51
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Further Improvement of the Apriori Method
■ Major computational challenges
■ Multiple scans of transaction database
■ Huge number of candidates
■ Tedious workload of support counting for candidates
■ Improving Apriori: general ideas
■ Reduce passes of transaction database scans
■ Shrink number of candidates
■ Facilitate support counting of candidates
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 52
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent

patterns
■ Scan 2: consolidate global frequent patterns

[Figure: DB is partitioned into DB1 + DB2 + … + DBk = DB;
if supj(i) < σ·|DBj| in every partition DBj, then sup(i) < σ·|DB|.]
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 53
Partitioning method example

Transaction Itemset
T1 I1, I5
T2 I2, I4
T3 I4, I5
T4 I2, I3
T5 I5
T6 I2, I3, I4

● Minimum support = 20%,
i.e. support count = ⌈20 × 6 / 100⌉ = ⌈1.2⌉ = 2
● We partition the database into 3
parts, so each partition has 2 transactions and a
minimum support count of ⌈20 × 2 / 100⌉ = ⌈0.4⌉ = 1

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 54


Partitioning method example

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 55
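A minimal Python sketch of the two-scan partitioning scheme follows; the local mining here is done by brute force (fine only for tiny partitions), and all names are illustrative rather than taken from the slides:

import math
from itertools import combinations

def local_frequent(partition, min_support_ratio):
    # Brute-force frequent itemsets of one partition (acceptable for tiny partitions).
    threshold = math.ceil(min_support_ratio * len(partition))
    counts = {}
    for t in partition:
        for r in range(1, len(t) + 1):
            for c in combinations(sorted(t), r):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {c for c, n in counts.items() if n >= threshold}

def partition_mine(transactions, min_support_ratio, n_parts):
    # Scan 1: mine each partition locally; Scan 2: verify the union of local results globally.
    size = math.ceil(len(transactions) / n_parts)
    candidates = set()
    for i in range(0, len(transactions), size):                       # scan 1
        candidates |= local_frequent(transactions[i:i + size], min_support_ratio)
    global_threshold = math.ceil(min_support_ratio * len(transactions))
    return {c for c in candidates                                     # scan 2
            if sum(c <= t for t in transactions) >= global_threshold}

# The 6-transaction example above, minimum support 20%, 3 partitions.
D = [{"I1", "I5"}, {"I2", "I4"}, {"I4", "I5"}, {"I2", "I3"}, {"I5"}, {"I2", "I3", "I4"}]
print(partition_mine(D, 0.20, 3))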


DHP(Direct Hashing with Efficient Pruning)
1. Scan all the transactions and create the possible 2-itemsets.
2. Let the hash table be of size 8.
3. For each candidate pair, apply the hash function to the order values of its
items and put the pair into the corresponding bucket of the hash table.
4. Each bucket in the hash table has a count, which is increased by 1 each time
an itemset is hashed to that bucket.
5. If the bucket count is equal to or above the minimum support count, the bit vector
is set to 1. Otherwise it is set to 0.
6. The candidate pairs that hash to locations where the bit vector bit is not set are
removed.
7. Modify the transaction database to include only these candidate pairs.
8. The support count is calculated from the entries in the buckets of the hash table,
instead of scanning the whole database.

56
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Hash based itemset counting example

Table 1: Database
TID Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E
Minimum support = 2

Table 2: Support count
Item Support count
A 2
B 3
C 3
D 1
E 3

Table 3: 2-itemset generation from Table 1
TID Items 2-itemsets generated
100 A, C, D {A,C}, {A,D}, {C,D}
200 B, C, E {B,C}, {B,E}, {C,E}
300 A, B, C, E {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
400 B, E {B,E}

● Hash function: h(x,y) = ((order of x)*10 + (order of y)) mod 7
(order = sequence number as in Table 2)
● e.g. h(A,B) = ((1*10) + 2) mod 7 = 5, count = 1
● Inserting (A,C), (C,D) into the hash table:
h(A,C) = ((1*10) + 3) mod 7 = 6, count = 1
h(C,D) = ((3*10) + 4) mod 7 = 6, count = 2
● Add each 2-itemset into the hash table according to its h
value and increment the count of that bucket 57
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
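The bucket counting on this slide can be sketched in a few lines of Python; the hash function and order values are the ones from the example, while the variable names (bucket_counts, bit_vector, candidates) are illustrative:

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}   # sequence numbers from Table 2
min_support_count = 2

def h(x, y):
    x, y = sorted((x, y), key=order.get)
    return (10 * order[x] + order[y]) % 7

# Hash every 2-itemset of every transaction into one of 7 buckets.
bucket_counts = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t), 2):
        bucket_counts[h(x, y)] += 1

bit_vector = [int(c >= min_support_count) for c in bucket_counts]

# Keep only candidate pairs whose bucket passed the threshold
# (a pair's true support count can never exceed its bucket count).
candidates = {frozenset(p) for t in transactions for p in combinations(sorted(t), 2)
              if bit_vector[h(*p)]}
print(bucket_counts, bit_vector, candidates)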
Generation of C2

- {A,C}: order of A = 1, order of C = 3
  h(x,y) = (1*10 + 3) mod 7 = 6
- {A,B}: order of A = 1, order of B = 2
  h(x,y) = (1*10 + 2) mod 7 = 5
- {B,E}: order of B = 2, order of E = 5
  h(x,y) = (2*10 + 5) mod 7 = 4
- {B,C}: order of B = 2, order of C = 3
  h(x,y) = (2*10 + 3) mod 7 = 2

- L1 = {A, B, C, E}
- L2 = {{A,C}, {B,C}, {B,E}, {C,E}}
58
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
- Counting support based on the contents of the hash table.
- Using the Apriori property to generate 3-itemsets.

59
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Table 1: Database
TID Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E

Table 4: Support count (from the hash table)
Itemset Support count
A,C 2
B,C 2
B,E 3
C,E 2

Table 5: 3-itemset generation from Table 1
TID Items 3-itemset generation
100 A, C, D {A,C,D} - x - fails the Apriori property
200 B, C, E {B,C,E} - v - right
300 A, B, C, E {A,B,C}, {A,B,E}, {A,C,E} - x - fail the Apriori property
400 B, E empty

v - right, x - wrong
- Whatever hash function is applied, an itemset's support count cannot be more than its bucket count.
- Scanning the database is reduced here, as there is no need to find the support count from the
database every time; it can be found from the hash table.
- The support count of {B,C,E} can be found from the transactions.
- We have reduced the support-count checking for 2-itemset generation in this
example.
60
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Hash based itemset counting: one more
example

61
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
62
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
63
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
DHP: Reduce the Number of Candidates
■ DHP (Direct Hashing with Efficient Pruning)
■ A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
■ Candidates: a, b, c, d, e
■ Hash table entries (count : itemsets), e.g.
35 {ab, ad, ae}
88 {bd, be, de}
...
102 {yz, qs, wt}
■ Frequent 1-itemsets: a, b, d, e
■ ab is not a candidate 2-itemset
if the sum of the counts of {ab, ad, ae} is below the support threshold

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 64


Transaction Reduction Method
A transaction that does not contain any frequent k-itemset cannot
contain any frequent (k+1)-itemset. Such a transaction can be
removed from further consideration.

TID Items       A B C D E  sum
100 A, C, D     1 0 1 1 0   3 > minsup
200 B, C, E     0 1 1 0 1   3 > minsup
300 A, B, C, E  1 1 1 0 1   4 > minsup
400 B, E        0 1 0 0 1   2 = minsup
sum: A = 2 (= minsup), B = 3 (> minsup), C = 3 (> minsup), D = 1 (< minsup), E = 3 (> minsup)

Minimum support = 2
65
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Transaction Reduction Method
Remaining table after removing column ‘D’, which doesn’t satisfy the
minimum support.

TID Items       A B C E  sum
100 A, C, D     1 0 1 0   2 = minsup
200 B, C, E     0 1 1 1   3 > minsup
300 A, B, C, E  1 1 1 1   4 > minsup
400 B, E        0 1 0 1   2 = minsup
sum: A = 2 (= minsup), B = 3 (> minsup), C = 3 (> minsup), E = 3 (> minsup)

Minimum support = 2
Frequent items are
{A, B, C, E}
66
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Transaction Reduction Method
2-itemset generation

TID Items       A,B A,C A,E B,C B,E C,E  sum
100 A, C, D      0   1   0   0   0   0    1 < minsup
200 B, C, E      0   0   0   1   1   1    3 > minsup
300 A, B, C, E   1   1   1   1   1   1    6 > minsup
400 B, E         0   0   0   0   1   0    1 < minsup
sum: A,B = 1 (< minsup), A,C = 2 (= minsup), A,E = 1 (< minsup), B,C = 2 (= minsup), B,E = 3 (> minsup), C,E = 3 (> minsup)

Minimum support = 2
L2 = {{A,C}, {B,C}, {B,E}, {C,E}}

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 67


Transaction Reduction Method
After removal, the remaining rows and columns are:

TID Items       A,C B,C B,E C,E  sum
200 B, C, E      0   1   1   1    3 > minsup
300 A, B, C, E   1   1   1   1    6 > minsup
sum: A,C = 2 (= minsup), B,C = 2 (= minsup), B,E = 3, C,E = 3

Minimum support = 2

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 68


Transaction Reduction Method
3-itemset generation

TID Items       B,C,E
200 B, C, E      1
300 A, B, C, E   1
sum: 2 = minsup

Minimum support = 2

So, for 3-itemset generation,
ONLY {B,C,E} is frequent.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 69
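A brief Python sketch of this reduction step follows. The check used here is the slightly stronger condition implied by the worked example: a transaction must contain at least k+1 frequent k-itemsets to be able to contain any frequent (k+1)-itemset. The function name reduce_transactions is illustrative:

from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    # Keep only transactions that could still contain a frequent (k+1)-itemset:
    # every k-subset of a frequent (k+1)-itemset is frequent, so at least k+1
    # frequent k-itemsets must be present in the transaction.
    return [t for t in transactions
            if sum(frozenset(c) in frequent_k for c in combinations(sorted(t), k)) >= k + 1]

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(reduce_transactions(D, L2, 2))   # only {B,C,E} and {A,B,C,E} remain for the 3-itemset pass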


Transaction reduction one more example

You can see on


https://www.youtube.com/watch?v=asWqVHex9kY

70
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Dynamic Itemset Counting:Algorithm
■ It reduces the number of passes made over the
data while keeping the number of itemsets which
are counted in any pass relatively low.
■ The technique can add new candidate itemsets
at any marked start point of the database during
the scanning of the database.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 71


Dynamic Itemset Counting:Algorithm
■ Itemsets are dynamically added and deleted as
transactions are read
■ Relies on the fact that for an itemset to be
frequent, all of its subsets must also be frequent,
so we only examine those itemsets whose
subsets are all frequent

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 72


Dynamic Itemset Counting:Algorithm
■ Algorithm stops after every M transactions to
add more itemsets.
■ Train analogy: There are stations every M
transactions. The passengers are itemsets.
Itemsets can get on at any stop as long as they
get off at the same stop in the next pass around
the database.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 73


Dynamic Itemset Counting:Algorithm
■ Only itemsets on the train are counted when
they occur in transactions. At the very beginning
we can start counting 1-itemsets, at the first
station we can start counting some of the
2-itemsets. At the second station we can start
counting 3-itemsets as well as any more
2-itemsets that can be counted, and so on.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 74


Dynamic Itemset Counting:Algorithm
■ Itemsets are marked in four different ways as
they are counted:
■ Solid box: confirmed frequent
itemset - an itemset we have finished counting
and which exceeds the support threshold minsupp
■ Solid circle: confirmed infrequent
itemset - we have finished counting and it is
below minsupp

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 75


Dynamic Itemset Counting:Algorithm
■ Dashed box: suspected
frequent itemset - an itemset we are still
counting that exceeds minsupp
■ Dashed circle: suspected
infrequent itemset - an itemset we are still
counting that is below minsupp

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 76


Dynamic Itemset Counting:Algorithm
■ Mark the empty itemset with a solid square.
Mark all the 1-itemsets with dashed circles.
Leave all other itemsets unmarked.
■ While any dashed itemsets remain:
■ Read M transactions (if we reach the end of

the transaction file, continue from the


beginning). For each transaction, increment
the respective counters for the itemsets that
appear in the transaction and are marked
with dashes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 77
Dynamic Itemset Counting:Algorithm
■ If a dashed circle's count exceeds minsupp,
turn it into a dashed square. If any immediate
superset of it has all of its subsets as solid or
dashed squares, add a new counter for it and
make it a dashed circle.
■ Once a dashed itemset has been counted
through all the transactions, make it solid and
stop counting it.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 78


Dynamic Itemset Counting:Algorithm
■ Itemset lattices: An itemset lattice contains all of
the possible itemsets for a transaction database.
Each itemset in the lattice points to all of its
supersets. When represented graphically, an
itemset lattice can help us to understand the
concepts behind the DIC algorithm.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 79


Summary of Previous Lecture
● We are learning Dynamic Itemset Counting(DIC):Algorithm
● Solid - confirmed
● Dashed - suspected
● Box - frequent
● Circle- infrequent
● Solid box - confirmed frequent
● Solid circle - confirmed infrequent
● Dashed box - suspected frequent
● Dashed circle - suspected infrequent

80
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Summary of Previous Lecture
● We read M transactions at a time.
● In the 1st pass, typically the 1-itemset counts are updated
● In the 2nd pass, typically the 2-itemset counts are updated
● Transitions go from dashed -> solid and from circle -> box
○ Dashed circle -> dashed box if count >= min support
○ Dashed box -> solid box once it has been counted through all transactions
○ Dashed circle -> solid circle once it has been counted through all transactions and count < min support

81
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Dynamic Itemset Counting:EXAMPLE

TID A B C
T1 1 1 0
T2 1 0 0
T3 0 1 1
T4 0 0 0

Minimum support = 25%, M = 2 (number of transactions read at a time)

Minimum support count = 25/100 * 4 = 1
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 82
Dynamic Itemset Counting:EXAMPLE

Itemset lattice for the


transaction database:

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 83


Dynamic Itemset Counting:EXAMPLE
Itemset lattice before any
transactions are read:

● Counters: A = 0, B = 0, C = 0
● Empty itemset is marked with
a solid box. All 1-itemsets are
marked with dashed circles.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 84


Dynamic Itemset Counting:EXAMPLE
After M transactions are read:
T1: A, B, T2: A
● Counters: A = 2, B = 1, C = 0,
AB = 0
● We change A and B to dashed
boxes because their counters
are at least minsup (1)
and add a counter for AB
because both of its subsets
are boxes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 85
Dynamic Itemset Counting:EXAMPLE
After 2M transactions are read:
T3: B,C, T4: {}
-Counters: A = 2, B = 2, C = 1, AB
= 0, AC = 0, BC = 0
-C changes to a square because
its counter is >= minsup. A, B and
C have been counted all the way
through so we stop counting them
and make their boxes solid. Add
counters for AC and BC because
their subsets are all boxes.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 86
Dynamic Itemset Counting:EXAMPLE
After 3M transactions are read:
T1: A, B, T2: A
● Counters: A = 2, B = 2, C = 1,
AB = 1, AC = 0, BC = 0
● AB has been counted all the
way through and its counter
satisfies minsup so we change
it to a solid box. BC remains
a dashed circle.

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 87


Dynamic Itemset Counting:EXAMPLE
After 4M transactions are read:
T3:B, C T4: {}

● Counters: A = 2, B = 2, C = 1,
AB = 1, AC = 0, BC = 1
● AC and BC are counted all the
way through. We do not count
ABC because one of its
subsets is a circle. There are
no dashed itemsets left so the
algorithm is done.
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 88
Discussion of the Apriori algorithm
● Much faster than the Brute-force algorithm
○ It avoids checking all elements in the lattice
● The running time is in the worst case O(2^d)
○ Pruning really prunes in practice
● It makes multiple passes over the dataset
○ One pass for every level k
● Multiple passes over the dataset is inefficient when we
have thousands of candidates and millions of transactions

89
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
■ Bottlenecks of the Apriori approach
■ Breadth-first (i.e., level-wise) search
■ Candidate generation and test
■ Often generates a huge number of candidates

■ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
■ Depth-first search
■ Avoid explicit candidate generation
■ Major philosophy: Grow long patterns from short ones using local frequent
items only
■ “abc” is a frequent pattern
■ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
■ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
90
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Frequent Patterns Without Candidate
Generation
● Compress a large database into a compact, Frequent-Pattern tree (FP-tree)
structure
○ highly condensed, but complete for frequent pattern mining
○ avoid costly database scans
● Develop an efficient, FP-tree-based frequent pattern mining method
○ A divide-and-conquer methodology: decompose mining tasks into smaller ones
○ Avoid candidate generation: sub-database test only!

91
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method
● First, create the root of the tree, labeled with “null”.
● Scan the database D a second time. (The first time we scanned it to
create the 1-itemsets and then L.)
● The items in each transaction are processed in L order (i.e. sorted
by descending support count), e.g. Bread:5, Butter:3.
● A branch is created for each transaction, with items carrying their
support count separated by a colon, e.g. I2:2.
● Whenever the same node is encountered in another transaction, we
just increment the support count of the common node or prefix.
● To facilitate tree traversal, an item header table is built so that each
item points to its occurrences in the tree via a chain of node-links.
● Now, the problem of mining frequent patterns in the database is
transformed to that of mining the FP-tree.
92
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
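The construction described above can be sketched in Python as follows; this is an illustrative sketch, not a library API, and the class and function names are assumptions. The header table is kept as a list of nodes per item, standing in for the chain of node-links:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}            # item -> Node

def build_fp_tree(transactions, min_support_count):
    # First scan: count items and keep the frequent ones.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    root = Node(None, None)
    header = defaultdict(list)        # item -> list of nodes (node-links)

    # Second scan: insert each transaction with items in L order (descending count).
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1           # shared prefix nodes just get their count incremented
    return root, header

# The 9-transaction example, min support count 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
root, header = build_fp_tree(D, 2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})
# e.g. I2 -> [7], I1 -> [4, 2], I3 -> [2, 2, 2], matching the tree built on the next slides.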
FP-Growth Biggest Advantages
● The biggest advantage found in FP-Growth is the fact that the
algorithm only needs to read the file twice, as opposed to Apriori,
which reads it once for every iteration.
● Another huge advantage is that it removes the need to calculate
the pairs to be counted, which is very processing heavy, because
it uses the FP-tree. This makes it O(n), which is much faster than
Apriori.
● The FP-Growth algorithm stores in memory a compact version of
the database.

93
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method : An Example

TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first find the frequent itemsets, this time using the FP-Growth method.
• Then, association rules will be generated using min. support & min.
confidence. 94
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Method : An Example
● Step 1: The first step is to count all
the items in all the transactions.
Item Support Count
{I2} 7
{I1} 6
{I3} 6
{I4} 2
{I5} 2
● Step 2: Next we apply the threshold
we had set previously:
2/9 ≈ .22 >= mini_support, so all items with
support_count >= 2 have support >= mini_support.
● Step 3: Now we sort the list according
to the count of each item, in descending order.

Sorted L = [I2:7, I1:6, I3:6, I4:2, I5:2]


95
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4: Now we build the tree. We go
through each of the transactions and
add all the items in the order they
appear in our sorted list.

1. Transaction to add = [I1, I2, I5]

According to the sorted list it will be added as
[I2, I1, I5] - set count = 1 for each item.

[FP-tree: null → I2:1 → I1:1 → I5:1]

96
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.): We go through each of
the transactions and add all the items
in the order they appear in our sorted list.

2. Transaction to add = [I2, I4]

According to the sorted list it will be added as
[I2, I4] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:2 → {I1:1 → I5:1 ; I4:1}]

97
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

3. Transaction to add = [I2, I3]

According to the sorted list it will be added as
[I2, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:3 → {I1:1 → I5:1 ; I4:1 ; I3:1}]

98
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

4. Transaction to add = [I1, I2, I4]

According to the sorted list it will be added as
[I2, I1, I4] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → I2:4 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:1}]

99
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

5. Transaction to add = [I1, I3]

According to the sorted list it will be added as
[I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:4 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:1} ; I1:1 → I3:1}]

100
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

6. Transaction to add = [I2, I3]

According to the sorted list it will be added as
[I2, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:5 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:2} ; I1:1 → I3:1}]

101
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

7. Transaction to add = [I1, I3]

According to the sorted list it will be added as
[I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:5 → {I1:2 → {I5:1 ; I4:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

102
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Step 4 (contd.):

8. Transaction to add = [I1, I2, I3, I5]

According to the sorted list it will be added as
[I2, I1, I3, I5] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree: null → {I2:6 → {I1:3 → {I5:1 ; I4:1 ; I3:1 → I5:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

103
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4 (contd.):

9. Transaction to add = [I1, I2, I3]

According to the sorted list it will be added as
[I2, I1, I3] - increment the count if the node is
revisited, else set count = 1 for each item.

[FP-tree (final): null → {I2:7 → {I1:4 → {I5:1 ; I4:1 ; I3:2 → I5:1} ; I4:1 ; I3:2} ; I1:2 → I3:2}]

104
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5: In order to get the
associations, we now go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[Final FP-tree as above]

[I2:7, I1:4] in the available branch.
Together they appear 4 times in the
branch. So association:
{I2, I1} = 4/9 = 44%

105
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I2:7, I1:4, I3:2] in the available
branch. Together they appear 2
times in the branch. So association:
{I2, I1, I3} = 2/9 = 22%

106
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I2:7, I3:2] in the available branch.
Together they appear 2 times in the
branch. So association:
{I2, I3} = 2/9 = 22%

107
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5 (contd.):

[Final FP-tree as above]

[I1:2, I3:2] in the available branch.
Together they appear 2 times in the
branch. So association:
{I1, I3} = 2/9 = 22%

108
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
Step 5: So the associations are:

{I2, I1} = 4/9 = 44%

{I2, I1, I3} = 2/9 = 22%

{I2, I3} = 2/9 = 22%

{I1, I3} = 2/9 = 22%

109
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Exercise
min_support = 3

TID Items bought               (Ordered) frequent items
100 {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200 {a, b, c, f, l, m, o}      {f, c, a, b, m}
300 {b, f, h, j, o, w}         {f, b}
400 {b, c, k, s, p}            {c, b, p}
500 {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Frequent items (support count): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
Items that didn’t pass the min_support threshold: l: 2, d: 1, g: 1, h: 1, i: 1, j: 1, w: 1

110
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, c, a, m, p}:

[FP-tree: root → f:1 → c:1 → a:1 → m:1 → p:1]

111
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, c, a, b, m}:

[FP-tree: root → f:2 → c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1}]

112
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {f, b}:

[FP-tree: root → f:3 → {c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1} ; b:1}]

113
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example

For ordered transaction {c, b, p}:

[FP-tree: root → {f:3 → {c:2 → a:2 → {m:1 → p:1 ; b:1 → m:1} ; b:1} ; c:1 → b:1 → p:1}]

114
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth Example
Check each item’s support count from the tree against the table.

Ordered frequent items:
Item Support Count
f 4
c 4
a 3
b 3
m 3
p 3

[Final FP-tree: root → {f:4 → {c:3 → a:3 → {m:2 → p:2 ; b:1 → m:1} ; b:1} ; c:1 → b:1 → p:1}]
115
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Create Conditional FP-tree to mine association rules: Example
- Create the conditional pattern base: for each item, list the paths used to reach that item's nodes,
writing each path together with the count of the node reached.
- E.g. node p is reached by two paths, f-c-a-m and c-b. The count of p
through f-c-a-m is 2 and through c-b is 1, so we write fcam:2
and cb:1 in the CPB.

[Final FP-tree as on the previous slide]

Item  Conditional Pattern Base (CPB)   Conditional FP-tree
f     Empty                            Empty
c     {f:3}                            {f:3}|c
a     {fc:3}                           {f:3, c:3}|a
b     {fca:1}, {f:1}, {c:1}            Empty (no common prefix)
m     {fca:2}, {fcab:1}                {f:3, c:3, a:3}|m
p     {fcam:2}, {cb:1}                 {c:3}|p

For the conditional FP-tree: take whatever is common in more
than one path reaching a particular node and add their
counts, e.g. {fca:2}, {fcab:1} -> fca is common, and 2+1=3,
giving {f:3, c:3, a:3}|m.

Explanation: https://www.youtube.com/watch?v=y8iHL6vKgIo

116
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP-Growth DisAdvantages
● The biggest problem is the interdependency of the data. The interdependency
problem is that, for the parallelization of the algorithm, some data still needs to
be shared, which creates a bottleneck in the shared memory.

117
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori vs FP-Growth

118
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Apriori
- Technique: generate candidate singletons, pairs, triplets, etc.
- Runtime: candidate generation is extremely slow; runtime increases
exponentially depending on the number of different items.
- Memory usage: saves candidate singletons, pairs, triplets, etc.
- Parallelizability: candidate generation is very parallelizable.

FP-Growth
- Technique: insert items, sorted by frequency, into a pattern tree.
- Runtime: increases linearly, depending on the number of transactions and items.
- Memory usage: stores a compact version of the database.
- Parallelizability: data are very interdependent; each node needs the root.
119
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Associative Classification Mining
● Given a labeled training data set, the problem is to derive a set of
class association rules (CARs) from the training data set which
satisfy certain user-constraints, i.e support and confidence
thresholds.
● Common Associative Algorithms:
○ CBA : Class based Association
○ CPAR : Classification based on Predictive Association Rule
○ CMAR : Classification Based on Multiple Class-Association Rules
○ MCAR : Multi-class Classification based on Association Rule

120
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
In detail few algorithms
● CBA is a Class based Association Rule Mining (CARM) algorithm
developed by Bing Liu, Wynne Hsu and Yiming Ma (Liu et al. 1998).
● CBA operates using a two-stage approach to generate a classifier:
○ Generating a complete set of CARs (Classification Association Rules).
○ Pruning the set of CARs to produce a classifier.
● Uses the Apriori algorithm to generate candidate sets.
● CMAR: Classification Based on Multiple Class-Association Rules
○ uses approaches based on the frequent pattern (FP)-growth method to discover rules
○ CBA and CMAR are time consuming
● CPAR: Classification based on Predictive Association Rules
○ CPAR and other predictive mining algorithms overcome this problem by generating
a small set of predictive rules directly from the dataset based on rule prediction and
coverage analysis, as opposed to generating candidate rules.

121
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Associative Classification Mining-
algorithms
● MCAR- Multi-class Classification based on Association Rule
○ MCAR uses an efficient technique for discovering frequent items and
employs a rule ranking method which ensures detailed rules with high
confidence are part of the classifier.

122
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Incremental ARM
● In incremental association rule mining, as time goes on, new
transactions are added and old transactions become obsolete.
● Old rules may be dropped and new rules may appear.
● Incremental algorithms are:
○ FUP (fast update)
○ FUP2
○ UPDATE WITH EARLY PRUNING(UWEP)
○ Negative Border

123
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FUP
● FUP is the first algorithm of incremental association rule mining.
● It works with inserted transactions only.
● It cannot work with deleted transactions.
● It performs multiple scans of the database, i.e. it scans the incremented
database as well as the old database.
● It performs a similar operation for each k-itemset.
● Original database D and its corresponding frequent itemsets L = {L1 ...
Lk}.
● The goal is to find the frequent itemsets of D’ = D ∪ Δ+, where Δ+ is the increment (the newly added transactions).

124
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FUP2
● It is an extension of FUP algorithm.
● It works with incremented database as well as decremented
database.
● So i.e. it will handle deletion of transaction from old database also.
● FUP2 is equivalent to FUP for the case of insertion, and is, however,
a complementary algorithm of FUP for the case of deletion.
● For a general case that transactions are added and deleted, algorithm
FUP2 can work smoothly with both the deleted portion Δ− and the
added portion Δ+ of the whole dataset.
● It gives poor results if it is used with a temporal database. [A temporal
database stores data relating to time instances. It offers temporal data
types and stores information relating to past, present and future time.]
125
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
UPDATE WITH EARLY

PRUNING (UWEP)
● It is a variant of the FUP algorithm.
● The Update With Early Pruning algorithm prunes an itemset of the original dataset
as soon as it becomes infrequent in the updated database D’.
● It does not wait until the kth iteration is completed, so it reduces the candidate
set generation in the incremented database.

126
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Negative Border
● Negative border algorithm is used for improving efficiency of FUP-based
algorithm
● Given a collection of frequent itemsets L, the negative border Bd−(L) of L consists
of the itemsets R which are not in L.
● In other words, the negative border consists of all itemsets that were
candidates of the level-wise method which did not have enough support.
● This algorithm first scans incremented part of database and then whole
database is scanned if and only if itemset outside of negative border gets
added to frequent itemset. This may result into increasing size of candidate
set generation.

127
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Next Lecture

128
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Mining Multiple-Level Association Rules
■ Items often form hierarchies
■ Flexible support settings
■ Items at the lower level are expected to have lower support
■ Exploration of shared multi-level mining (Agrawal &
Srikant@VLDB’95, Han & Fu@VLDB’95)

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Example: Level 1: Milk [support = 10%];
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 129


Multi-level Association: Flexible Support and
Redundancy filtering
■ Flexible min-support thresholds: Some items are more valuable but
less frequent
■ Use non-uniform, group-based min-support

■ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …

■ Redundancy Filtering: Some rules may be redundant due to


“ancestor” relationships between items
■ milk ⇒ wheat bread [support = 8%, confidence = 70%]

■ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

The first rule is an ancestor of the second rule


■ A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724) 130
References:
1. Apriori, http://www.cs.uic.edu/~liub/teach/cs583-fall-11/CS583-association-sequential-patterns.ppt
2. Apriori, http://cse.iitkgp.ac.in/~bivasm/uc_notes/07apriori.pdf
3. Apriori, https://en.wikipedia.org/wiki/Apriori_algorithm
4. Apriori, https://project.dke.maastrichtuniversity.nl/datamining/material/lecture07.ppt
5. Apriori, https://www.cs.sjsu.edu/~lee/cs157b/Gaurang%20Negandhi--Apriori%20Algorithm%20Presentation.ppt
6. Apriori, http://cs-people.bu.edu/evimaria/cs565-12/lect2.pptx
7. Apriori video, https://www.youtube.com/watch?v=l7n4K12EjY0
8. Full calculation of apriori and fp-algorithms, http://www3.cs.stonybrook.edu/~cse634/lecture_notes/07apriori.pdf
9. FP Growth, http://www.singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining
10. FP conditional tree, http://www.cis.hut.fi/Opinnot/T-61.6020/2008/fptree.pdf
11. Conditional FP tree video, https://www.youtube.com/watch?v=LXx1xKF9oDg
12. FP tree video, https://www.youtube.com/watch?v=W2Cp0uuFO1s&t=159s
13. FP Growth, https://www.youtube.com/watch?v=UbR1qXuIeJY
14. Associative Classification Mining, https://pdfs.semanticscholar.org/6145/5083a199a3844d209d19636248d63a0fec9f.pdf
15. Wei-Guang Teng and Ming-Syan Chen, “Incremental Mining on Association Rules”,
https://pdfs.semanticscholar.org/e0bd/1ea23f79e427d933505ef2899ccf148874b6.pdf
16. Ms. Anju k.kakkad, Ms. Anita Zala, “Incremental Association Rule Mining by Modified Approach of Promising Frequent Itemset
Algorithm Based on Bucket Sort Approach”, http://www.ijarcce.com/upload/2013/november/45-s-anju_kakkad-incremental.pdf
17. Jong Soo Park, Ming-syan Chen, Philip S. Yu, “An effective hash based algorithm”,1995,
http://user.it.uu.se/~kostis/Teaching/DM-01/Handouts/PCY.pdf
18. Wenmin Li Jiawei Han Jian Pei, “CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules”,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.219&rep=rep1&type=pdf
131
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
References:
19. Data characterization and discrimination, https://www.youtube.com/watch?v=SW7-o86iL3w&t=181s
20. Data characterization and discrimination exampleas, https://www.sciencedirect.com/topics/computer-science/data-
characterization#:~:text=1.4.,customers%20include%20bigSpenders%20and%20budgetSpenders.
21. Apriori algorithm improvement methods, https://www.youtube.com/watch?v=asWqVHex9kY
22. Dynamic Itemset Counting example,
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/DIC.html#:~:text=Each%20itemset%20in%20the%20lattice,25%25%20an
d%20M%20%3D%202.

132
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
HASH BASED APRIORI ALGORITHM
1. Scan all the transactions and create the possible 2-itemsets.
2. Let the hash table be of size 8.
3. For each candidate pair, apply the hash function to the ASCII
values of its items and assign the pair to a bucket.
4. Each bucket in the hash table has a count, which is increased by
1 each time an itemset is hashed to that bucket.
5. If the bucket count is equal to or above the minimum support count,
the bit vector is set to 1. Otherwise it is set to 0.
6. The candidate pairs that hash to locations where the bit vector
bit is not set are removed.
7. Modify the transaction database to include only these candidate
pairs. 133
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

134
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

135
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Example of the hash-tree for C3

136
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
Can be added
Lift

Drawing the conditional FP-tree and deriving rules from it.

137
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
MODIFY THE NEXT EXAMPLE

138
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
● Transaction TMario= [ [beer, bread, butter, milk] , [beer, milk, butter],
[beer, milk, cheese] , [beer, butter, diapers, cheese] , [beer, cheese,
bread] ]
● Step 1: The first step is we count all the items in all the transactions
TMario= [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3, diapers: 1]
● Step 2: Next we apply the threshold. For this example let's say we have a
threshold of 30% so each item has to appear at least twice.
TMario= [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3, diapers: 1]
● Step 3: Now we sort the list according to the count of each item.
TMarioSorted = [ beer: 5, butter: 3, milk: 3, cheese: 3, bread: 2], removed
diapers:1

139
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 4: Now we build the tree. We go


through each of the transactions and
add all the items in the order they
appear in our sorted list.

1.Transaction to add= [beer, bread,


butter, milk]

a/c sorted list it will be added as

[beer, butter, milk, bread] - set


count=1 for each item
140
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

2.Transaction 2: [beer, milk, butter]

a/c sorted list it will be added as

[beer, butter, milk] - increment count


if revisited the node, else set count=1
for each item

141
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

3.Transaction 3=[beer, milk, cheese]

a/c sorted list it will be added as

[beer, milk, cheese] - increment


count if revisited the node, else set
count=1 for each item

142
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

4.Transaction 4=[beer, butter,


diapers, cheese]

-diaper has not passed the minimum


support threshold, so removed it

-a/c sorted list it will be added as

[beer, butter, cheese] - increment


count if revisited the node, else set
count=1 for each item

143
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

5.Transaction 5= [beer, cheese,


bread]

a/c sorted list it will be added as

[beer, cheese, bread].

- increment count if revisited the


node, else set count=1 for each item

144
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example
Step 5: In order to get the
associations now we go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[beer:4, butter:2, milk:2] in the


available branch. Together they
appear 2 times in branch. So
association,

{beer, butter, milk}=⅖ = 40%


145
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
FP Growth Method Example

Step 5: In order to get the


associations now we go through
every branch of the tree and only
include in the association all the
nodes whose count passed the
threshold.

[beer:4, cheese:2] in the available


branch. Together they appear 2
times in branch. So association,

{beer, cheese}=⅖ = 40%


146
Dr. Ami Tusharkant Choksi@CKPCET Machine Learning (3170724)
