
MODULE 4

Association Rule Analysis


Association Rules - Introduction
• Association rule mining finds interesting associations and relationships among large sets of data items.
• Such a rule shows how frequently an itemset occurs in a transaction.
• A typical example is Market Basket Analysis.
• It allows retailers to identify relationships between the items that people frequently buy together.
• Association Rule – an implication expression of the form X -> Y, where X and Y are any two itemsets.

• Measures for Rule Creation
1. The support of X → Y is the probability of both X and Y appearing together, that is P(X ∪ Y):
   support(X → Y) = P(X ∪ Y)
2. The confidence of X → Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as "P of Y given X".
Example – Find the support value of each itemset

Support(item) = frequency of item / number of transactions

Answer: (the transaction table and the answer are shown as figures in the original)
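As an illustration (the slide's own answer is the figure noted above), here is a minimal sketch of how support and confidence are computed, reusing the bread/cheese/juice transactions from the Apriori example later in this module:

```python
# Minimal sketch: transactions are sets of items; support is the fraction
# of transactions containing the whole itemset.
transactions = [
    {"Bread", "Cheese", "Eggs", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Yogurt"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Bread"}, transactions))                 # 0.8  (4 of 5)
print(support({"Bread", "Juice"}, transactions))        # 0.6  (3 of 5)
print(confidence({"Bread"}, {"Juice"}, transactions))   # 0.75
```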
Methods to discover association rules
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
Frequent Item Set
• Let T be the transaction database and σ be the user-specified minimum support.
• An itemset X ⊆ A (where A is the set of all items) is said to be a frequent itemset in T with respect to σ if
  Support(X) ≥ σ
• Downward Closure Property: any subset of a frequent set is a frequent set.
• Upward Closure Property: any superset of an infrequent set is an infrequent set.
Maximal Frequent Set
• A frequent set is a maximal frequent set if it is a frequent set and no
superset of this is a frequent set.

Border Set
An itemset is a border set if it is not a frequent set, but all its proper subsets are frequent sets.
Example: a transaction database over items A1, A2, ..., A9 (table omitted in the original); assume minimum support = 20%.
• {1} – Not a frequent set
• {3} - is a frequent set
• {5, 6, 7} - is a border set
• {5, 6} - is a maximal frequent set
• {2, 4} - is also a maximal frequent set
• But there is no border set having {2, 4} as a proper subset
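A hedged sketch of how these two definitions could be checked programmatically, given the collection of all frequent itemsets (function names are illustrative, not from the slides):

```python
from itertools import combinations

def is_maximal_frequent(itemset, frequent_sets, all_items):
    """Frequent, and no immediate superset (one extra item) is frequent."""
    s = frozenset(itemset)
    if s not in frequent_sets:
        return False
    # By downward closure, checking one-item extensions is sufficient.
    return all((s | {x}) not in frequent_sets for x in set(all_items) - s)

def is_border_set(itemset, frequent_sets):
    """Not frequent itself, but every non-empty proper subset is frequent."""
    s = frozenset(itemset)
    if s in frequent_sets:
        return False
    return all(frozenset(c) in frequent_sets
               for k in range(1, len(s))
               for c in combinations(s, k))
```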
APRIORI ALGORITHM
• It is also called the level-wise algorithm.
• The Apriori algorithm may be described by a two-step approach:
• Step 1 – discover all frequent (single) items that have support above the minimum support required.
• Step 2 – use the set of frequent items to generate the association rules that have a high enough confidence level.
• The candidate-generation process and the pruning process are the most important parts of this algorithm.
• The candidate-generation method
• Given Lk-1, the set of all frequent (k-1)-itemsets, we want to generate a superset of the set of all frequent k-itemsets.
• Using this method: from L3 := { {1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}, {2, 3, 4} },
  C4 := { {1, 2, 3, 5}, {2, 3, 4, 5} }
• {1, 2, 3, 5} is generated from {1, 2, 3} and {1, 2, 5}.
• Similarly, {2, 3, 4, 5} is generated from {2, 3, 4} and {2, 3, 5}.
• Pruning
• The pruning step eliminates extensions of (k-1)-itemsets that are not frequent from being considered for support counting.
• For example, from C4 the itemset {2, 3, 4, 5} is pruned, since not all of its 3-subsets are in L3 (e.g. {2, 4, 5} and {3, 4, 5} are missing).
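The candidate-generation and pruning steps can be sketched in Python as follows (a minimal illustration rather than the textbook prefix-based join; the function name is an assumption):

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Join frequent (k-1)-itemsets whose union has exactly k items,
    then prune candidates with any infrequent (k-1)-subset."""
    prev = set(map(frozenset, prev_frequent))
    k = len(next(iter(prev))) + 1
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: every (k-1)-subset must be frequent (downward closure).
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

L3 = [{1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}, {2, 3, 4}]
print(apriori_gen(L3))
# {frozenset({1, 2, 3, 5})} -- {2, 3, 4, 5} is generated by the join but
# pruned because {2, 4, 5} and {3, 4, 5} are not in L3
```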
Example
Find association rules with minimum support of 50% and minimum confidence of 75%.

Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk
Example
First find L1. 50% support requires that each frequent item appears in at least three transactions. Therefore L1 is given by:

Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Example
The candidate 2-itemsets, C2, therefore consist of six pairs (the C(4,2) = 6 pairs of the four frequent items). These pairs and their frequencies are:

Item Pairs Frequency


(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2
Deriving Rules
L2 has only two frequent item pairs: {Bread, Juice} and {Cheese, Juice}.
There are no candidate 3-itemsets: the only possible candidate, {Bread, Cheese, Juice}, is pruned by the Apriori property (all subsets of a frequent itemset must also be frequent) because {Bread, Cheese} is not frequent.

The two frequent pairs lead to the following possible rules:


Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese
• The confidence of these rules is obtained by dividing the support for both
items in the rule by the support of the item on the left hand side of the
rule.
• The confidences of the four rules therefore are:

(Bread → Juice) ----- 3/4 = 75%
   confidence(Bread → Juice) = support(Bread → Juice) / support(Bread) = 3/4
(Juice → Bread) ------ 3/4 = 75%
(Cheese → Juice) ----- 3/3 = 100%
(Juice → Cheese) ----- 3/4 = 75%
• Since all of them have a minimum 75% confidence, they all qualify.
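As an illustration (not part of the original slides), the rules and confidences for one frequent pair could be computed with the support()/confidence() helpers sketched earlier:

```python
# Assumes the support(), confidence() helpers and `transactions` list from
# the earlier sketch. Derives both directions of a rule from a frequent pair
# and keeps those meeting the minimum confidence.
def rules_from_pair(pair, transactions, min_conf):
    a, b = tuple(pair)
    rules = []
    for lhs, rhs in (({a}, {b}), ({b}, {a})):
        conf = confidence(lhs, rhs, transactions)
        if conf >= min_conf:
            rules.append((lhs, rhs, conf))
    return rules

print(rules_from_pair({"Bread", "Juice"}, transactions, 0.75))
# both Bread -> Juice and Juice -> Bread qualify, each with confidence 0.75
```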
Problem 2
• Find association rules from the following data (table shown as a figure in the original) using the Apriori algorithm, with a minimum support count of 2 and a confidence of 70%.

Solution
(The intermediate steps are shown as figures in the original.)
If the minimum confidence threshold is 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong.
Partition Algorithm
• Apriori scans the database (the set of transactions) several times in order to compute the supports of the candidate frequent k-itemsets.
• The Partition Algorithm scans the database only twice:
  – First scan: generate a set of all potentially large itemsets.
  – Second scan: set up counters for each potentially large itemset and compute their actual supports.
• During the first scan, a superset of the actual large itemsets is generated (i.e. false positives may be generated, but no false negatives).
• The Partition Algorithm executes in two phases:
  – Phase I: the algorithm logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time and all large itemsets for that partition are generated. At the end of Phase I, these large itemsets are merged to form a set of all potentially large itemsets.
  – Phase II: the actual supports for these itemsets are counted over the whole database and the large itemsets are identified.
The partition sizes are chosen such that each partition can be accommodated in main memory, so that each partition is read only once per phase.
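A hedged sketch of the two-phase idea (brute-force counting inside each partition, purely for illustration; the function names and the max_len bound are assumptions):

```python
from itertools import combinations

def locally_large(part, min_frac, max_len=3):
    """All itemsets (up to max_len items) that are frequent within one partition."""
    items = sorted({i for t in part for i in t})
    large = set()
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            if sum(1 for t in part if set(cand) <= t) / len(part) >= min_frac:
                large.add(frozenset(cand))
    return large

def partition_algorithm(transactions, min_frac, n_parts=2, max_len=3):
    size = -(-len(transactions) // n_parts)                    # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I: union of locally large itemsets -- a superset of the answer
    # (no false negatives: a globally large itemset is large in some partition).
    candidates = set().union(*(locally_large(p, min_frac, max_len) for p in parts))
    # Phase II: one more full scan to compute the actual (global) supports.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) / len(transactions) >= min_frac}

db = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
print(partition_algorithm(db, 0.5))    # itemsets frequent at 50% support
```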
Algorithm
(The algorithm pseudocode is shown as figures in the original.)

Problem / Solution
(The worked example for the Partition Algorithm is shown as figures in the original.)
Pincer Search Algorithm
• The Apriori algorithm operates in a bottom-up, breadth-first search manner.
• The computation starts from the smallest frequent itemsets and moves upward until it reaches the largest frequent itemset.
• The number of database passes is equal to the size of the largest frequent itemset.
• As a result, the performance decreases when the frequent itemsets are long.
• The Pincer Search algorithm is based on bi-directional search, which takes advantage of both the bottom-up and the top-down processes.
• It attempts to find the frequent itemsets in a bottom-up manner but, at the same time, maintains a list of maximal frequent candidate itemsets obtained by pruning in the top-down direction.
• In each pass, in addition to counting the supports of the candidates in the bottom-up direction, it also counts the supports of some itemsets using a top-down approach.
• These itemsets form the Maximal Frequent Candidate Set (MFCS).
• This helps in pruning the candidate sets very early in the algorithm.
• MFCS is initialized to contain one itemset, which contains all of the
database items.
• MFCS is updated whenever new infrequent itemsets are found
• Method and Recovery steps (shown as figures in the original).
Problem
C1 = { {I1}, {I2}, {I3}, {I4}, {I5} }
MFCS = { {I1, I2, I3, I4, I5} }
MFS = Ø

Pass 1: The database is read to count supports as follows:
ITEM COUNT
I2 7
I1 6
I3 6
I4 2
I5 2
• The MFCS candidate {I1, I2, I3, I4, I5} has count 0.
• So MFCS = { {I1, I2, I3, I4, I5} } and MFS = Ø.
• L1 = { {I1}, {I2}, {I3}, {I4}, {I5} }
• S1 = Ø (no infrequent 1-itemsets), so we do not need to update the MFCS.
• C2 = all ten 2-item subsets of {I1, I2, I3, I4, I5}.
• Pass 2: Read the database to count the supports of the elements in C2 and the MFCS (the individual counts are shown as a figure in the original).
• COUNT(MFCS) = 0
• L2 = { {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} }
• S2 = { {I1,I4}, {I3,I4}, {I3,I5}, {I4,I5} }
S2 is not empty, so update the MFCS.
• MFCS = { {I1,I2,I3,I4,I5} }
• S2 = { {I1,I4}, {I3,I4}, {I3,I5}, {I4,I5} }
Remove the items of {I1, I4} from the MFCS member, one item at a time:
  {I1,I2,I3,I4,I5} → {I2,I3,I4,I5} and {I1,I2,I3,I5}
New MFCS = { {I2,I3,I4,I5}, {I1,I2,I3,I5} }
• New MFCS = { {I2,I3,I4,I5}, {I1,I2,I3,I5} }
• S2 = { {I1,I4}, {I3,I4}, {I3,I5}, {I4,I5} }
Remove the items of {I3, I4} from each MFCS member that contains it, one item at a time:
  {I2,I3,I4,I5} → {I2,I3,I5} and {I2,I4,I5}
• {I3,I4} is not a subset of {I1,I2,I3,I5}, so that member is unchanged.
• New MFCS = { {I2,I3,I5}, {I2,I4,I5}, {I1,I2,I3,I5} }
• {I2,I3,I5} is dropped because it is a subset of {I1,I2,I3,I5}, giving MFCS = { {I2,I4,I5}, {I1,I2,I3,I5} }
• New MFCS = { {I2,I4,I5}, {I1,I2,I3,I5} }
• S2 = { {I1,I4}, {I3,I4}, {I3,I5}, {I4,I5} }
Remove the items of {I3, I5} from each MFCS member that contains it, one item at a time:
  {I1,I2,I3,I5} → {I1,I2,I5} and {I1,I2,I3}
• {I3,I5} is not a subset of {I2,I4,I5}, so that member is unchanged.
• New MFCS = { {I1,I2,I5}, {I1,I2,I3}, {I2,I4,I5} }
• New MFCS = { {I1,I2,I5}, {I1,I2,I3}, {I2,I4,I5} }
• S2 = { {I1,I4}, {I3,I4}, {I3,I5}, {I4,I5} }
Remove the items of {I4, I5} from each MFCS member that contains it, one item at a time:
  {I2,I4,I5} → {I2,I4} and {I2,I5}
• {I4,I5} is not a subset of {I1,I2,I5} or {I1,I2,I3}, so those members are unchanged.
• {I2,I5} is dropped because it is a subset of {I1,I2,I5}.
• New MFCS = { {I1,I2,I5}, {I1,I2,I3}, {I2,I4} }
• Final MFCS = { {I1,I2,I5}, {I1,I2,I3}, {I2,I4} }
• Final output: the maximal frequent itemsets; the largest of them contain three items.
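The MFCS update used in the passes above can be sketched as follows (an illustrative sketch of just this step, not the full Pincer Search; names are assumptions). Running it on the S2 from this example reproduces the final MFCS:

```python
# Each newly found infrequent itemset s splits every MFCS member m that
# contains it into the sets m - {item}, one per item of s; only maximal
# members are kept afterwards.
def update_mfcs(mfcs, infrequent_sets):
    for s in infrequent_sets:
        split = set()
        for m in mfcs:
            if s <= m:
                split.update(m - {item} for item in s)
            else:
                split.add(m)
        # keep only maximal members (drop any set strictly contained in another)
        mfcs = {m for m in split if not any(m < other for other in split)}
    return mfcs

mfcs = {frozenset({"I1", "I2", "I3", "I4", "I5"})}
S2 = [frozenset(p) for p in ({"I1", "I4"}, {"I3", "I4"}, {"I3", "I5"}, {"I4", "I5"})]
print(update_mfcs(mfcs, S2))
# -> { {I1,I2,I5}, {I1,I2,I3}, {I2,I4} }, matching the final MFCS above
```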


Frequent Pattern Growth Algorithm

• This algorithm is an improvement to the Apriori method.


• A frequent pattern is generated without the need for
candidate generation.
• FP growth algorithm represents the database in the form
of a tree called a frequent pattern tree or FP tree.
• This tree structure will maintain the association between
the itemsets.
• FP Tree
• Frequent Pattern Tree is a tree-like structure that is made
with the initial itemsets of the database.
• The purpose of the FP tree is to mine the most frequent patterns.
• Each node of the FP tree represents an item of an itemset.
• The root node represents null, while the lower nodes represent the itemsets.
• The associations between the nodes (that is, between the itemsets) are maintained while forming the tree.
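A minimal sketch of the FP-tree structure just described (class and function names are illustrative, not from the slides):

```python
# Each node stores an item, a count, a parent link and its children;
# transactions share common prefixes as paths from the null root.
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                       # item -> FPNode

def insert(root, ordered_items):
    """Insert one transaction whose items are already sorted by frequency."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = node.children[item] = FPNode(item, node)
        child.count += 1
        node = child

root = FPNode()                                   # the null root
for t in [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"], ["I2", "I1", "I4"]]:
    insert(root, t)
print(root.children["I2"].count)                  # 4 -- shared I2 prefix
```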
Problem 1 – Solution
ITEM COUNT
I2 7
I1 6
I3 6
I4 2
I5 2

Eliminate items whose support count is below the minimum support value; here the minimum support count = 2.
Transactions with items ordered by descending frequency:
I2, I1, I5
I2, I4
I2, I3
I2, I1, I4
I1, I3
I2, I3
I1, I3
I2, I1, I3, I5
I2, I1, I3
Item: I5
  Conditional Pattern Base: { {I2, I1: 1}, {I2, I1, I3: 1} }
  Conditional FP-Tree (min support count >= 2): <I2: 2, I1: 2>
  Frequent Patterns Generated: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}

Item: I4
  Conditional Pattern Base: { {I2, I1: 1}, {I2: 1} }
  Conditional FP-Tree: <I2: 2>
  Frequent Patterns Generated: {I2, I4: 2}

Item: I3
  Conditional Pattern Base: { {I2, I1: 2}, {I2: 2}, {I1: 2} }
  Conditional FP-Tree: <I2: 4, I1: 2> (left subtree), <I1: 2> (right subtree)
  Frequent Patterns Generated: {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}

Item: I1
  Conditional Pattern Base: { {I2: 4} }
  Conditional FP-Tree: <I2: 4>
  Frequent Patterns Generated: {I2, I1: 4}
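A hedged sketch of how the conditional pattern base of one item could be collected from such a tree (it assumes the FPNode class from the previous sketch and a list of the nodes holding that item, e.g. gathered through a header table of node links):

```python
# Walk from every node holding the item up to the null root, recording the
# prefix path together with that node's count.
def conditional_pattern_base(nodes_for_item):
    base = []
    for node in nodes_for_item:
        path, p = [], node.parent
        while p is not None and p.item is not None:   # stop at the null root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base
```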
Problem 2
• Support threshold = 50%, confidence = 60%

Transaction List of items


T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
Advantages of the FP Growth Algorithm
1. This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm, which makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
• Let the minimum support be 3 (the transaction table is shown as a figure in the original).
Ordered-Item set
FP-tree construction figures (omitted): a) inserting the set {K, E, M, O, Y}; b) inserting the set {K, E, O, Y}; c) inserting the set {K, E, M}; d) inserting the set {K, M, Y}; e) inserting the set {K, E, O}.
Conditional Pattern Base, Conditional Frequent Pattern Tree, and Frequent Pattern rules (tables shown as figures in the original).
Disadvantages of the FP-Growth Algorithm
1. The FP tree is more difficult to build than the Apriori data structures.
2. It may be expensive.
3. When the database is large, the FP tree may not fit in main memory.
FP Growth vs Apriori

Pattern Generation:
  FP Growth – generates patterns by constructing an FP tree.
  Apriori – generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation:
  FP Growth – there is no candidate generation.
  Apriori – uses candidate generation.

Process:
  FP Growth – faster; the runtime increases linearly with the number of itemsets.
  Apriori – comparatively slower; the runtime increases exponentially with the number of itemsets.

Memory Usage:
  FP Growth – a compact version of the database is saved.
  Apriori – the candidate combinations are saved in memory.
Dynamic Itemset Counting Algorithm
• Alternative to Apriori Itemset Generation
• Itemsets are dynamically added and deleted as transactions are read
• Relies on the fact that for an itemset to be frequent, all of its subsets
must also be frequent, so we only examine those itemsets whose
subsets are all frequent
Itemsets are marked in four different ways as they are counted:
• Solid box: confirmed frequent itemset - an itemset we have finished
counting and exceeds the support threshold minsupp
• Solid Circle: confirmed infrequent itemset - we have finished
counting and it is below minsupp
• Dashed box: suspected frequent itemset - an itemset we are still
counting that exceeds minsupp
• Dashed Circle: suspected infrequent itemset - an itemset we are
still counting that is below minsupp
DIC Algorithm
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with
dashed circles. Leave all other itemsets unmarked.
2.While any dashed itemsets remain:
1. Read M transactions (if we reach the end of the transaction file, continue from the
beginning). For each transaction, increment the respective counters for the itemsets
that appear in the transaction and are marked with dashes.
2. If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any
immediate superset of it has all of its subsets as solid or dashed squares, add a new
counter for it and make it a dashed circle.
3. Once a dashed itemset has been counted through all the transactions, make it solid
and stop counting it.
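A much-simplified sketch of this bookkeeping (illustrative only; it restricts itemset size, assumes M divides the number of transactions so every dashed itemset sees each transaction exactly once before turning solid, and uses plain Python sets rather than the lattice described next):

```python
from itertools import combinations

def dic(transactions, min_count, M=2, max_size=3):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # dashed itemsets: itemset -> [support count, transactions seen, is "square"]
    dashed = {frozenset([i]): [0, 0, False] for i in items}   # step 1: dashed circles
    solid_square, solid_circle = set(), set()
    pos = 0
    while dashed:                                  # step 2: dashed itemsets remain
        block = [transactions[(pos + j) % n] for j in range(M)]
        pos = (pos + M) % n
        for s, st in dashed.items():               # 2.1: count this block
            st[0] += sum(1 for t in block if s <= t)
            st[1] += M
            if st[0] >= min_count:
                st[2] = True                       # 2.2: dashed circle -> dashed square
        # 2.2: start counting supersets whose immediate subsets are all squares
        squares = solid_square | {s for s, st in dashed.items() if st[2]}
        for s in list(squares):
            for i in items:
                cand = s | {i}
                if (i in s or len(cand) > max_size or cand in dashed
                        or cand in solid_square or cand in solid_circle):
                    continue
                if all(frozenset(sub) in squares
                       for sub in combinations(cand, len(cand) - 1)):
                    dashed[cand] = [0, 0, False]
        # 2.3: itemsets counted over one full pass become solid
        for s in [s for s, st in dashed.items() if st[1] >= n]:
            (solid_square if dashed[s][2] else solid_circle).add(s)
            del dashed[s]
    return solid_square                            # the frequent itemsets found

db = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"A", "C"}]
print(dic(db, min_count=2, M=2))   # all singletons and pairs; {A,B,C} is infrequent
```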
• Itemset lattices: an itemset lattice contains all of the possible itemsets for a transaction database. Each itemset in the lattice points to all of its supersets. When represented graphically, an itemset lattice can help us understand the concepts behind the DIC algorithm.
Problem
• Example: minsupp = 25% (support count = 0.25 * 4 = 1) and M = 2. (The worked example is shown as figures in the original.)
