Module 4 DM
Border Set
An itemset is a border set if it is not a frequent set, but all its proper
subsets are frequent sets.
Example over the items (A1, A2, A3, A4, A5, A6, A7, A8, A9), assuming minimum support = 20%:
• {A1} – not a frequent set
• {A3} – a frequent set
• {A5, A6, A7} – a border set
• {A5, A6} – a maximal frequent set
• {A2, A4} – also a maximal frequent set
• But there is no border set having {A2, A4} as a proper subset
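To make the definition concrete, here is a minimal Python sketch of a border-set check. The toy database below is a hypothetical example (the (A1, ..., A9) database itself is not reproduced in these notes): every pair drawn from {A5, A6, A7} is frequent, but the triple never occurs together.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def is_border_set(itemset, transactions, min_count):
    """True iff `itemset` is NOT frequent but every proper subset IS."""
    if support_count(itemset, transactions) >= min_count:
        return False                      # frequent, so not a border set
    return all(support_count(frozenset(sub), transactions) >= min_count
               for k in range(1, len(itemset))
               for sub in combinations(itemset, k))

# Hypothetical toy database over A5, A6, A7
db = [frozenset(t) for t in ({"A5","A6"}, {"A5","A7"}, {"A6","A7"},
                             {"A5","A6"}, {"A5","A7"}, {"A6","A7"})]
print(is_border_set(frozenset({"A5","A6","A7"}), db, min_count=2))   # True
```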
APRIORI ALGORITHM
• It is also called the level-wise algorithm
• Apriori algorithm may be simply described by a two-step approach:
• Step 1 ─ discover all frequent itemsets that have support above
the minimum support required
• Step 2 ─ use the frequent itemsets to generate the association rules
that meet a minimum confidence threshold
Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk
Example
First find L1. 50% support requires that each frequent item appear in
at least three transactions. Therefore L1 is given by:
Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Example
The candidate 2-itemsets, or C2, therefore contain six pairs
(why? C2 consists of all pairs of the four frequent items: C(4, 2) = 6). These pairs and their frequencies are:
(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2
With the minimum support count of 3, L2 = {(Bread, Juice), (Cheese, Juice)}.
For the rule Bread → Juice:
confidence(Bread → Juice) = support(Bread ∪ Juice) / support(Bread) = 3/4 = 75%
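The computation above can be reproduced with a short sketch of the level-wise idea; this is an illustrative brute-force version of the counting, not the full Apriori candidate-generation procedure:

```python
from itertools import combinations

transactions = [
    {"Bread", "Cheese", "Eggs", "Juice"},   # 100
    {"Bread", "Cheese", "Juice"},           # 200
    {"Bread", "Milk", "Yogurt"},            # 300
    {"Bread", "Juice", "Milk"},             # 400
    {"Cheese", "Juice", "Milk"},            # 500
]
min_count = 3                               # 50% of 5 transactions

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Step 1: frequent items (L1)
items = sorted({i for t in transactions for i in t})
L1 = [i for i in items if count({i}) >= min_count]
print("L1:", L1)        # ['Bread', 'Cheese', 'Juice', 'Milk']

# C2 = all pairs of L1 items, C(4, 2) = 6 of them; keep those with count >= 3
C2 = list(combinations(L1, 2))
L2 = [p for p in C2 if count(p) >= min_count]
print("L2:", L2)        # [('Bread', 'Juice'), ('Cheese', 'Juice')]

# Step 2: rule confidence, e.g. confidence(Bread -> Juice)
print(count({"Bread", "Juice"}) / count({"Bread"}))   # 3/4 = 0.75
```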
Partition Algorithm
• During the first scan, a superset of the actual large itemsets is generated (i.e.
false positives may be generated, but no false negatives)
• During the second scan, counters are set up for each potentially large itemset and
their actual supports are computed
• The Partition Algorithm executes in two phases:
• Phase I: the algorithm logically divides the database into a number of
non-overlapping partitions. The partitions are considered one at a time and all large
itemsets for that partition are generated. At the end of Phase I, these large
itemsets are merged to generate a set of all potentially large itemsets.
• Phase II: the actual supports for these itemsets are computed and the large
itemsets are identified.
The partition sizes are chosen such that each partition can be accommodated in
the main memory so that the partitions are read only once in each phase.
[Figures: Partition Algorithm pseudocode; worked problem and solution]
Pincer Search Algorithm
• Apriori algorithm operates in a bottom-up, breadth-first search method.
• The computation starts from the smallest frequent itemsets and moves
upward till it reaches the largest frequent itemset.
• The number of database passes is equal to the size of the largest
frequent itemset.
• Pincer search adds a top-down search to this bottom-up search: it maintains
a set of Maximal Frequent Candidate Sets (MFCS), which is updated whenever an
infrequent itemset is discovered.
• Initially COUNT(MFCS) = 0 and MFCS = {{I1,I2,I3,I4,I5}}
• L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}
• S2 = {{I1,I4}, {I3,I4}, {I3,I5}, {I4,I5}}
• S2 is not empty, so update the MFCS
• Delete {I1,I4} from each member of MFCS that contains it, removing one of
its elements at a time:
{I1,I2,I3,I4,I5}
{I2,I3,I4,I5} {I1,I2,I3,I5}
• Next delete {I3,I4} from {I2,I3,I4,I5} in the same way:
{I2,I3,I5} {I2,I4,I5}
• {I3,I4} is not a subset of {I1,I2,I3,I5}, so that member is unchanged
• New MFCS = {{I2,I3,I5}, {I2,I4,I5}, {I1,I2,I3,I5}}
• {I2,I3,I5} is a proper subset of {I1,I2,I3,I5}, so only the maximal sets
are kept: MFCS = {{I2,I4,I5}, {I1,I2,I3,I5}}
• MFCS = {{I2,I4,I5}, {I1,I2,I3,I5}}
• Delete {I3,I5} from each member of MFCS that contains it, removing one of
its elements at a time:
{I1,I2,I3,I5}
{I1,I2,I5} {I1,I2,I3}
• {I3,I5} is not a subset of {I2,I4,I5}, so that member is unchanged
• New MFCS = {{I1,I2,I5}, {I1,I2,I3}, {I2,I4,I5}}
• Delete {I4,I5} from each member of MFCS that contains it, removing one of
its elements at a time:
{I2,I4,I5}
{I2,I4} {I2,I5}
• {I4,I5} is not a subset of {I1,I2,I5} or {I1,I2,I3}, so those members are
unchanged
• {I2,I5} is a proper subset of {I1,I2,I5}, so it is dropped
• New MFCS = {{I1,I2,I5}, {I1,I2,I3}, {I2,I4}}
• Final MFCS = {{I1,I2,I5}, {I1,I2,I3}, {I2,I4}}
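The MFCS updates traced above follow one rule: for every member of the MFCS that contains the infrequent itemset, generate all sets obtained by removing one element of that itemset, then keep only the maximal results. A minimal sketch of this step, with items abbreviated so that '1' stands for I1, and so on:

```python
def update_mfcs(mfcs, infrequent):
    """Split every member of MFCS containing `infrequent`,
    then keep only the maximal sets."""
    new = set()
    for m in mfcs:
        if infrequent <= m:
            for e in infrequent:
                new.add(m - {e})   # drop one element of the infrequent set
        else:
            new.add(m)             # not a superset: member is unchanged
    return {s for s in new if not any(s < t for t in new)}

# Replaying the trace above
mfcs = {frozenset("12345")}
for s2_member in (frozenset("14"), frozenset("34"),
                  frozenset("35"), frozenset("45")):
    mfcs = update_mfcs(mfcs, s2_member)
print(sorted(sorted(m) for m in mfcs))
# [['1', '2', '3'], ['1', '2', '5'], ['2', '4']]
```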
FP Growth Algorithm
Problem: apply FP growth to the following transaction database.
Transaction ID Items
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
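A minimal sketch of the two-pass FP-tree construction this problem calls for, run on the transactions listed above with min_sup = 3 (the class and function names are illustrative):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count single items, keep the frequent ones
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    # Pass 2: insert each transaction with its frequent items sorted by
    # descending support, sharing prefixes (this keeps the tree compact)
    root = Node(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, frequent

# Transactions T2–T6 from the table above, min_sup = 3
db = [{"I2","I3","I4"}, {"I4","I5"}, {"I1","I2","I4"},
      {"I1","I2","I3","I5"}, {"I1","I2","I3","I4"}]
root, frequent = build_fp_tree(db, min_sup=3)
print(frequent)   # I5 is pruned: its count (2) is below min_sup
```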
Advantages Of FP Growth Algorithm
• Candidate generation: FP Growth has no candidate generation; Apriori uses
candidate generation.
• Process: FP Growth is faster, and its runtime increases linearly with the
number of itemsets; Apriori is comparatively slower, and its runtime increases
exponentially with the number of itemsets.
• Memory usage: FP Growth saves a compact version of the database (the
FP-tree); Apriori saves the candidate combinations in memory.
Dynamic Itemset Counting Algorithm
• Alternative to Apriori Itemset Generation
• Itemsets are dynamically added and deleted as transactions are read
• Relies on the fact that for an itemset to be frequent, all of its subsets
must also be frequent, so we only examine those itemsets whose
subsets are all frequent
Itemsets are marked in four different ways as they are counted:
• Solid box: confirmed frequent itemset - an itemset we have finished
counting and exceeds the support threshold minsupp
• Solid Circle: confirmed infrequent itemset - we have finished
counting and it is below minsupp
• Dashed box: suspected frequent itemset - an itemset we are still
counting that exceeds minsupp
• Dashed Circle: suspected infrequent itemset - an itemset we are
still counting that is below minsupp
DIC Algorithm
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with
dashed circles. Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
1. Read M transactions (if we reach the end of the transaction file, continue from the
beginning). For each transaction, increment the respective counters for the itemsets
that appear in the transaction and are marked with dashes.
2. If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any
immediate superset of it has all of its subsets as solid or dashed squares, add a new
counter for it and make it a dashed circle.
3. Once a dashed itemset has been counted through all the transactions, make it solid
and stop counting it.
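A simplified in-memory sketch of these steps; the state codes, helper names, and the toy database at the bottom are illustrative assumptions:

```python
from itertools import combinations

def dic(transactions, min_count, M):
    """Simplified DIC sketch. Each tracked itemset carries
    [support count, transactions seen, state]; the state encodes the four
    markings above: 'dc'/'db' dashed circle/box, 'sc'/'sb' solid circle/box."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Step 1: all 1-itemsets start as dashed circles
    info = {frozenset([i]): [0, 0, "dc"] for i in items}
    pos = 0
    # Step 2: read M transactions at a time while dashed itemsets remain
    while any(rec[2] in ("dc", "db") for rec in info.values()):
        for _ in range(M):
            t = transactions[pos % n]          # wrap to the start of the file
            pos += 1
            for iset, rec in info.items():
                if rec[2] in ("dc", "db"):     # only dashed itemsets are counted
                    rec[1] += 1
                    if iset <= t:
                        rec[0] += 1
        for iset, rec in list(info.items()):
            # 2.2: a dashed circle reaching minsupp becomes a dashed box ...
            if rec[2] == "dc" and rec[0] >= min_count:
                rec[2] = "db"
                # ... and any immediate superset whose subsets are all boxes
                # starts being counted as a new dashed circle
                for extra in items:
                    sup = iset | {extra}
                    if extra in iset or sup in info:
                        continue
                    if all(info.get(frozenset(s), [0, 0, ""])[2] in ("db", "sb")
                           for s in combinations(sup, len(sup) - 1)):
                        info[sup] = [0, 0, "dc"]
            # 2.3: counted through the whole file -> make it solid
            if rec[1] >= n and rec[2] in ("dc", "db"):
                rec[2] = "sb" if rec[2] == "db" else "sc"
    return {iset for iset, rec in info.items() if rec[2] == "sb"}

# Hypothetical 4-transaction database matching the example below:
# minsupp = 25% (support count 1) and M = 2
db = [frozenset(t) for t in ({"a","b"}, {"b","c"}, {"a","c"}, {"a","b","c"})]
print(dic(db, min_count=1, M=2))
```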
• Itemset lattices: an itemset lattice contains all of the possible
itemsets for a transaction database. Each itemset in the lattice points to
all of its supersets. When represented graphically, an itemset lattice can
help us to understand the concepts behind the DIC algorithm.
Problem
• Example: minsupp = 25% (support count = 0.25 × 4 = 1) and M = 2.