

Module 3

Association Analysis: Basic Concepts and Algorithms


3.1 Introduction

• Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. Such data, commonly known as market basket
transactions, is shown in Table 3.1.
• Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer. Retailers are
interested in analyzing the data to learn about the purchasing behavior of their
customers. Such valuable information can be used to support a variety of business-
related applications such as marketing promotions, inventory management, and
customer relationship management.
Table 3.1: Example of Market Basket transactions.

TID Items

1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}

3 {Milk, Diapers, Beer, Cola}

4 {Bread, Milk, Diapers, Beer}


5 {Bread, Milk, Diapers, Cola}

What is Association Analysis?


Association analysis is useful for discovering interesting relationships hidden in large
amounts of data. From Table 3.1, for example, one can observe that customers who buy Bread
tend to buy Milk as well, which can be expressed as the rule:
➔{Bread} → {Milk}
There are two key Issues that need to be addressed when applying association analysis to
market basket data.
1. First, discovering patterns from a large transaction data set can be computationally
expensive.


2. Second, some of the discovered patterns are potentially spurious (fake) because they
may happen simply by chance.

3.1.1 Problem Definition


Basic terminology used in association analysis is:

1. Binary Representation: Market basket data can be represented in a binary format


where each row corresponds to a transaction and each column corresponds to an item.
An item can be treated as a binary variable whose value is one if the item is present in
a transaction and zero otherwise.
Table 3.2: Binary representation of market basket data

TID Bread Milk Diapers Beer Eggs Cola


1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1

2. Itemset: In association analysis, a collection of zero or more items is termed an itemset.
If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is
an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.
3. Transaction Width: It is defined as the number of items present in a transaction.
An important property of an itemset is its support count, which refers to the number of
transactions that contain the particular itemset.
4. Association Rule: An association rule is an implication expression of the form X → Y,
where X and Y are disjoint itemsets. The strength of an association rule can be
measured in terms of its support and confidence.
5. Support determines how often a rule is applicable to a given data set.
6. Confidence determines how frequently items in Y appear in transactions that contain X.
In other words, confidence measures the reliability of the inference made by a rule.


Consider the association rule {Milk, Diapers} → {Beer}.

From the market basket data in Table 3.1, the support count for {Milk, Diapers, Beer} is 2
and the total number of transactions is 5.
The rule's support is therefore 2/5 = 0.4, i.e., 40%.
The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer}
by the support count for {Milk, Diapers}.
Since there are 3 transactions that contain both Milk and Diapers,
the confidence for this rule is 2/3 ≈ 0.67, i.e., 67%.
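The same numbers can be reproduced with a few lines of code. The sketch below is only illustrative (it is not part of the original notes); it simply scans the transactions of Table 3.1, and the helper name support_count is made up for this example.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = support_count(X | Y, transactions) / len(transactions)                   # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)   # 2/3
print(f"support = {support:.0%}, confidence = {confidence:.0%}")                   # 40%, 67%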

3.1.2 Association Rule Mining Task


Definition: Given a set of transactions T, the goal of association rule mining is to find all
rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold

A brute-force approach for mining association rules:


• List all possible association rules
• A brute-force approach for mining association rules is to compute the support and
confidence for every possible rule. This approach is prohibitively expensive because there
are exponentially many rules that can be extracted from a data set. More specifically, the
total number of possible rules that can be extracted from a data set that contains d items is
R = 3^d − 2^(d+1) + 1.


• Even for the small data set shown in Table 3.1, this approach requires us to compute the
support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded
after applying minsup = 20% and minconf = 50%, so most of the computation is wasted (a quick
check of the rule count is sketched after this list). To avoid performing needless computations,
it would be useful to prune the rules early without having to compute their support and
confidence values.
• A common strategy adopted by many association rule mining algorithms is to decompose
the problem into two major subtasks.
• The initial step toward improving the performance of association rule mining algorithms is
to decouple the support and confidence requirements. The support of a rule X → Y depends
only on the support of its corresponding itemset, X ∪ Y. For example, the following rules have
identical support because they involve items from the same itemset {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Milk} → {Beer, Diapers}, {Diapers} → {Beer, Milk}
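As a quick sanity check (not part of the notes), the rule-count formula and the 602 figure for the six items of Table 3.1 can be verified directly:

d = 6                          # number of distinct items in Table 3.1
R = 3**d - 2**(d + 1) + 1      # total number of possible rules
print(R)                       # 602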
Two-step approach:
❖ Frequent Itemset Generation: The objective is to find all the itemsets that satisfy
the minsup threshold, i.e., generate all itemsets whose support ≥ minsup. These itemsets
are called frequent itemsets.
❖ Rule Generation: Generate high-confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset.
3.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 3.1
shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can
potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very
large in many practical applications, the search space of itemsets that needs to be explored is
exponentially large.
A brute-force approach for finding frequent itemsets is to determine the support count for
every candidate itemset in the lattice structure. To do this, we need to compare each candidate
against every transaction. Such an approach can be very expensive because it requires
O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of
candidate itemsets, and w is the maximum transaction width.


Figure 3.1: An itemset lattice structure

Figure 3.2: Counting the support of candidate itemsets


There are several ways to reduce the computational complexity of frequent itemset generation.
1. Reduce the number of candidate itemsets (M).The Apriori principle, described in the next
section, is an effective way to eliminate some of the candidate itemsets without counting their
support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate itemsets or to compress the data set.

3.2.1 Apriori Principle

– If an itemset is frequent, then all of its subsets must also be frequent


• Example: Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains
{c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result,
if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent, as shown in
Figure 3.3.
• Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too, as shown in Figure 3.4.

Figure 3.3: An illustration of the Apriori principle for frequent Itemset


Figure 3.4: An illustration of the Apriori principle for an infrequent itemset

3.2.2 Frequent Itemset Generation in the Apriori Algorithm

• Apriori is the first association rule mining algorithm that pioneered the use of
support based pruning to systematically control the exponential growth of candidate
itemsets.


Figure 3.5: Illustration of frequent itemset generation in the Apriori algorithm


Note: Support threshold is 60%, which is equivalent to a minimum support count = 3.
• Initially, every item is considered as a candidate 1-itemset. After counting their
supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in
fewer than three transactions.
• In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets because the Apriori principle ensures that all supersets of the infrequent 1-
itemsets must be infrequent.
• Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found
to be infrequent after computing their support values. The remaining four candidates
are frequent, and thus will be used to generate candidate 3-itemsets.
• With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets
are frequent. The only candidate that has this property is {Bread, Diapers, Milk}.


Pseudo code for Frequent Itemset generation of the Apriori Algorithm:


Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
➢ Generate length (k+1) candidate itemsets from length k frequent itemsets
➢ Prune candidate itemsets containing subsets of length k that are infrequent
➢ Count the support of each candidate by scanning the DB
➢ Eliminate candidates that are infrequent, leaving only those that are
frequent
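The pseudocode above can be turned into a short, runnable sketch. The version below is only illustrative: it counts support by rescanning the transaction list on every pass (no hash tree), and the function name apriori_frequent_itemsets is invented for this example.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count every individual item
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s for s, c in counts.items() if c >= minsup_count}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        # generate (k+1)-itemset candidates from the frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune candidates that contain an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # eliminate infrequent candidates
        frequent = {c for c, n in counts.items() if n >= minsup_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

For the transactions of Table 3.1 and a minimum support count of 3, the call apriori_frequent_itemsets(transactions, 3) reproduces the frequent 1-itemsets and 2-itemsets discussed above.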

3.2.3 Candidate Generation and Candidate Pruning

• Candidate Generation: This operation generates new candidate k-itemsets based on
the frequent (k−1)-itemsets found in the previous iteration.
• Candidate Pruning: This operation eliminates some of the candidate k-itemsets
using the support-based pruning strategy.

Principles to generate candidate itemsets:


1. It should avoid generating too many unnecessary candidates.
2. It must ensure that the candidate set is complete, i.e., no frequent itemset is left out.
3. It should not generate the same candidate itemset more than once.

Fk-1×F1 Method
An alternative method for candidate generation is to extend each frequent (k−1)-itemset with
other frequent items. Figure 3.6 illustrates how a frequent 2-itemset such as {Beer, Diapers} can
be augmented with a frequent item such as Bread to produce the candidate 3-itemset {Beer,
Diapers, Bread}.


Figure 3.6: Candidate generation for the Fk-1×F1 method
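A minimal sketch of this strategy, with illustrative names only: each frequent (k−1)-itemset is extended with every frequent item it does not already contain. Note that the same candidate may be produced from several different (k−1)-itemsets, which is one reason the Fk-1×Fk-1 method below is usually preferred.

def candidates_fk1_x_f1(frequent_k_minus_1, frequent_items):
    # extend each frequent (k-1)-itemset with every frequent item not already in it
    candidates = set()
    for itemset in frequent_k_minus_1:
        for item in frequent_items:
            if item not in itemset:
                candidates.add(frozenset(itemset) | {item})
    return candidates

print(candidates_fk1_x_f1([{"Beer", "Diapers"}], ["Bread", "Milk"]))
# e.g. {frozenset({'Beer', 'Diapers', 'Bread'}), frozenset({'Beer', 'Diapers', 'Milk'})}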

Fk-1×Fk-1 Method
The candidate generation procedure in the apriori-gen function merges a pair of frequent
(k−1)-itemsets only if their first k−2 items are identical. Let A = {a1, a2, ..., a(k−1)} and
B = {b1, b2, ..., b(k−1)} be a pair of frequent (k−1)-itemsets, with items kept in lexicographic
order. A and B are merged if they satisfy the following conditions: ai = bi for i = 1, 2, ..., k−2,
and a(k−1) ≠ b(k−1).
For example, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form the
candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not merge {Beer, Diapers} with
{Diapers, Milk} because the first item in the two itemsets is different. Indeed, if {Beer, Diapers,
Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with
{Beer, Milk} instead, as shown in Figure 3.7.


Figure 3.7: Candidate generation for the Fk-1×Fk-1 method
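The merge condition above can be sketched in a few lines (names are illustrative; itemsets are kept as lexicographically sorted tuples so that "first k−2 items" is well defined):

def merge_fk1_x_fk1(frequent_k_minus_1):
    freq = sorted(tuple(sorted(s)) for s in frequent_k_minus_1)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            # merge only if the first k-2 items agree and the last items differ
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.append(a[:-1] + tuple(sorted((a[-1], b[-1]))))
    return candidates

f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(merge_fk1_x_fk1(f2))
# [('Bread', 'Diapers', 'Milk')]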

3.2.4 Support Counting


• Support counting is the process of determining the frequency of occurrence for
every candidate itemset that survives the candidate pruning step of the apriori-gen
function.
• One approach for doing this is to compare each transaction against every candidate
itemset. This approach is computationally expensive, especially when the numbers of
transactions and candidate itemsets are large.
• An alternative approach is to enumerate the itemsets contained in each transaction and
use them to update the support counts of their respective candidate itemsets.
• To illustrate, consider a transaction t = {1, 2, 3, 5, 6}. There are 10 itemsets of size 3
contained in this transaction. Some of them may correspond to the candidate 3-itemsets
under investigation, in which case their support counts are incremented. Other subsets of t
that do not correspond to any candidate can be ignored. Figure 3.8 below shows a systematic
way of enumerating the 3-itemsets contained in t.
• Assuming that each itemset keeps its items in increasing lexicographic order, an itemset


can be enumerated by specifying the smallest item first, followed by the larger items.
• For instance, given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item
1, 2, or 3.
• It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are
only two items in t whose labels are greater than or equal to 5.
• The number of ways to specify the first item of a 3-itemset contained in t is illustrated by
the Level 1 prefix structures in Figure 3.8. For instance, 1 2 3 5 6 represents a 3-itemset that
begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.
• After fixing the first item, the prefix structures at Level 2 represent the number of ways
to select the second item.

Figure 3.8: Enumerating subsets of three items from a transaction t

For example, 1 2 3 5 6 corresponds to itemsets that begin with the prefix {1 2} and are followed
by items 3, 5, or 6. Finally, the prefix structures at Level 3 represent the complete set of 3-


itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1,2,3},
{1,2,5}, and {1,2,6}, while those that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
• The prefix structures shown in Figure 3.8 demonstrate how itemsets contained in
a transaction can be systematically enumerated, i.e., by specifying their items one by one,
from the leftmost item to the rightmost item. We still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the
candidates, then the support count of the corresponding candidate is incremented. In the
next section, we illustrate how this matching operation can be performed efficiently
using a hash tree structure.
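A small illustrative sketch of this enumeration, using itertools.combinations on t = {1, 2, 3, 5, 6}; the candidate itemsets shown are made up just to demonstrate the matching step:

from itertools import combinations

t = (1, 2, 3, 5, 6)
candidate_counts = {(1, 2, 3): 0, (1, 5, 6): 0, (2, 3, 6): 0}   # illustrative candidates

subsets = list(combinations(t, 3))          # 10 subsets, in lexicographic order
for s in subsets:
    if s in candidate_counts:
        candidate_counts[s] += 1            # support count update

print(len(subsets))          # 10
print(candidate_counts)      # {(1, 2, 3): 1, (1, 5, 6): 1, (2, 3, 6): 1}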

3.2.5 Support Counting using a Hash tree


In the Apriori algorithm, candidate itemsets are partitioned into different buckets and
stored in a hash tree. During support counting, itemsets contained in each transaction are
also hashed into their appropriate buckets. That way, instead of comparing each itemset in
the transaction with every candidate itemset, it is matched only against candidate itemsets
that belong to the same bucket. Figure 3.9 shows an example of a hash tree structure.
• Each internal node of the tree uses the hash function h(p) = p mod 3 to
determine which branch of the current node should be followed next. For example, items
1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the
same remainder after dividing the number by 3. All candidate itemsets are stored at the
leaf nodes of the hash tree. The hash tree shown in Figure 3.9 contains 15 candidate 3-
itemsets, distributed across 9 leaf nodes.
• Consider a transaction t = {1, 2, 3, 5, 6}. To update the support counts of the candidate
itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing
candidate 3-itemsets belonging to t are visited at least once. Recall that the 3-


itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix
structures shown in Figure 3.8.
At Level 1, consider the candidate {1, 4, 5}: its first (leftmost) item is 1, so the candidate is
hashed to the left child of the root node (the branch for items 1, 4, 7).
Similarly, for {2, 3, 4} the first item is 2, so it is placed in the middle child bucket
(items 2, 5, 8), and for {3, 4, 5} the first item is 3, so it is placed in the right child bucket
(items 3, 6, 9).
At Level 2, the middle item of the candidate is hashed to the appropriate left, middle, or
right child; for example, for {1, 2, 4} the item 2 is hashed.
At Level 3, the rightmost item of the candidate is hashed to the appropriate left, middle,
or right child; for example, for {1, 3, 6} the item 6 is hashed.

Figure 3.9: Hashing a transaction at the root node of a hash tree.
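The bucketing idea can be sketched as follows. This is a deliberately simplified illustration (a real hash tree splits a leaf only when it overflows, and a transaction may follow several branches at each level); the hash function h(p) = p mod 3 matches the one described above, while the candidate itemsets and names are made up for the example:

from collections import defaultdict

def h(p):
    return p % 3

def leaf_key(itemset):
    """Sequence of branch choices followed from the root for this 3-itemset."""
    return tuple(h(p) for p in sorted(itemset))

candidates = [(1, 4, 5), (2, 3, 4), (3, 4, 5), (1, 2, 4), (1, 3, 6), (4, 5, 7)]
leaves = defaultdict(list)
for c in candidates:
    leaves[leaf_key(c)].append(c)

# During support counting, a transaction only needs to be compared against the
# candidates stored in the leaves it hashes to, not against every candidate.
print(dict(leaves))   # (1, 2, 4) and (4, 5, 7) share a bucket, for instance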


3.2.6 Computational Complexity

The computational complexity of the Apriori algorithm can be affected by the


following factors:

• Support Threshold: Lowering the support threshold often results in more itemsets
being declared as frequent. This has an adverse effect on the computational complexity of
the algorithm because more candidate itemsets must be generated and counted.
• Number of Items (Dimensionality): As the number of items increases, more space will
be needed to store the support counts of items. If the number of frequent items also grows
with the dimensionality of the data, the computation and I/O costs will increase because of
the larger number of candidate itemsets generated by the algorithm.
• Number of Transactions: Since the Apriori algorithm makes repeated passes over the
data set, its run time increases with a larger number of transactions.
• Average Transaction Width: For dense data sets, the average transaction width can be
very large. This affects the complexity of the Apriori algorithm in two ways.
o First, the maximum size of frequent itemsets tends to increase as the
average transaction width increases.
o Second as the transaction width increases, more itemsets are contained
in the transaction.
• Generation of frequent 1-itemsets: For each transaction, we need to update the
support count for every item present in the transaction. Assuming that w is the average
transaction width, this operation requires O(Nw) time, where N is the total number of
transactions.
• Candidate generation: To generate candidate k-itemsets, pairs of frequent (k−1)-
itemsets are merged to determine whether they have their first k−2 items in common. Each
merging operation requires at most k−2 equality comparisons. In the best-case
scenario, every merging step produces a viable candidate k-itemset. In the worst-case
scenario, the algorithm must merge every pair of frequent (k−1)-itemsets found in the
previous iteration.

• Support counting: Each transaction of length |t| produces (|t| choose k) itemsets of size k.
This is also the effective number of hash tree traversals performed for each transaction.


3.3 Rule Generation

An association rule can be extracted by partitioning a frequent itemset Y into two non-empty
subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.

• Note that all such rules must have already met the support threshold because they
are generated from a frequent itemset.

Example:
• Let Y = {1,2,3} be a frequent itemset. There are six candidate association rules that
can be generated from Y: {1,2} → {3}, {1,3} → {2}, {2,3} → {1}, {1} → {2,3},
{2} → {1,3}, and {3} → {1,2}.
As the support of each rule is identical to the support of Y, all of these rules automatically
satisfy the support threshold.
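A minimal sketch of this step, assuming the support counts of the frequent itemset and all of its subsets are already available from the frequent itemset generation phase (the dictionary below is filled with the counts taken from Table 3.1; the function name is illustrative):

from itertools import combinations

def rules_from_itemset(Y, support_count, minconf):
    """support_count: dict mapping frozensets to their support counts."""
    Y = frozenset(Y)
    rules = []
    for r in range(1, len(Y)):                      # every non-empty proper subset X of Y
        for X in map(frozenset, combinations(Y, r)):
            conf = support_count[Y] / support_count[X]
            if conf >= minconf:
                rules.append((set(X), set(Y - X), conf))
    return rules

counts = {frozenset(s): c for s, c in [
    (("Bread",), 4), (("Diapers",), 4), (("Milk",), 4),
    (("Bread", "Diapers"), 3), (("Bread", "Milk"), 3), (("Diapers", "Milk"), 3),
    (("Bread", "Diapers", "Milk"), 2)]}
for X, Y_minus_X, conf in rules_from_itemset(("Bread", "Diapers", "Milk"), counts, 0.6):
    print(X, "->", Y_minus_X, f"(confidence {conf:.2f})")
# only the rules with a two-item antecedent (confidence 2/3) survive minconf = 0.6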

3.3.1 Confidence Based Pruning


Theorem: If a rule X → Y − X does not satisfy the confidence threshold, then any rule
X' → Y − X', where X' is a subset of X, cannot satisfy the confidence threshold either.

3.3.2 Rule Generation in Apriori Algorithm

• The Apriori algorithm uses a level-wise approach for generating association rules,
where each level corresponds to the number of items that belong to the rule consequent.
• Initially, all the high-confidence rules that have only one item in the rule
consequent are extracted. These rules are then used to generate new candidate rules.
For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
• If any node in the lattice has low confidence, then according to the confidence-based
pruning theorem, the entire subgraph spanned by that node can be pruned immediately.
• Suppose the confidence for {bcd} → {a} is low. Then all the rules containing item a
in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} →
{abc}, can be discarded, as shown in Figure 3.10.


Figure 3.10: Pruning of Association rules using the confidence measure

3.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be
very large. It is therefore useful to identify a small representative set of itemsets from which all
other frequent itemsets can be derived. Two such representations are presented in this section:
maximal and closed frequent itemsets.

3.4.1 Maximal Frequent Itemsets

Definition: A maximal frequent itemset is defined as a frequent itemset for which none of
its immediate supersets is frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 3.11

The itemsets in the lattice are divided into two groups: those that are frequent and those that are
infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in
the diagram. Every itemset located above the border is frequent, while those located below the
border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d},


{a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate
supersets are infrequent. An itemset such as {a, d} is maximal frequent because all of its
immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is not
maximal because one of its immediate supersets, {a, c, e}, is frequent. Maximal frequent itemsets
effectively provide a compact representation of frequent itemsets.

Figure 3.11: Maximal frequent itemsets
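Given the full collection of frequent itemsets, the maximal ones can be picked out with a simple check: keep an itemset only if no other frequent itemset is a proper superset of it. The frequent collection below is a small made-up illustration, not the exact contents of Figure 3.11:

def maximal_frequent(frequent_itemsets):
    frequent = [frozenset(s) for s in frequent_itemsets]
    # keep s only if no other frequent itemset strictly contains it
    return [s for s in frequent if not any(s < other for other in frequent)]

frequent = [{"a"}, {"c"}, {"d"}, {"e"}, {"a", "c"}, {"a", "d"},
            {"c", "e"}, {"a", "c", "e"}]
print(maximal_frequent(frequent))
# [frozenset({'a', 'd'}), frozenset({'a', 'c', 'e'})]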

3.4.2 Closed Frequent Itemsets


Closed itemsets provide a minimal representation of itemsets without losing their support
information. A formal definition of a closed itemset is presented below.
Closed Itemset: An itemset X is closed if none of its immediate supersets has exactly the
same support count as X.
For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its
support count is equal to three. From the transactions shown in Figure 3.12, notice that
every transaction that contains b also contains c.


Consequently, the support for {b} is identical to that of {b, c}, and {b} should not be
considered a closed itemset. Similarly, since c occurs in every transaction that contains both a
and d, the itemset {a, d} is not closed.
On the other hand, {b, c} is a closed itemset because it does not have the same support
count as any of its supersets, as shown in Figure 3.12.
Closed Frequent Itemset:
An itemset is a closed frequent itemset if it is closed and its support is greater than or
equal to minsup.
Example: Assuming that the support threshold is 40%, {b, c} is a closed frequent itemset
because its support is 60%. The rest of the closed frequent itemsets are indicated by the
shaded nodes in Figure 3.12.

Figure 3.12: Closed Frequent itemset
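The closure check can also be expressed directly against the transactions. The sketch below is illustrative; the five transactions are chosen so that they match the properties quoted above (b occurs in transactions 1, 2, and 3 and always together with c, and every transaction containing both a and d also contains c), but they are not guaranteed to be exactly the data behind Figure 3.12:

from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def closed_itemsets(transactions, items):
    """Itemsets whose support strictly drops when any single item is added."""
    closed = []
    for r in range(1, len(items) + 1):
        for s in map(frozenset, combinations(items, r)):
            sc = support_count(s, transactions)
            if sc == 0:
                continue
            supersets = (s | {i} for i in items if i not in s)
            if all(support_count(sup, transactions) < sc for sup in supersets):
                closed.append((set(s), sc))
    return closed

transactions = [frozenset(t) for t in ["abc", "abcd", "bce", "acde", "de"]]
minsup_count = 2
closed_frequent = [(s, c) for s, c in closed_itemsets(transactions, "abcde")
                   if c >= minsup_count]
print(closed_frequent)   # includes ({'b', 'c'}, 3); {b} and {a, d} are not closed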


3.5 Alternative Methods for Generating Frequent Itemsets


The performance of the Apriori algorithm may degrade significantly for dense data sets
because of the increasing width of transactions. Several alternative methods have been developed
to overcome these limitations and improve upon the efficiency of the Apriori algorithm.
The following is a high-level description of these methods.

Traversal of itemset lattice: A search for frequent itemsets can be conceptually viewed as a
traversal of the itemset lattice. The search strategy employed by an algorithm dictates how the
lattice structure is traversed during frequent itemset generation. Some search strategies are
discussed below.
❖ General to specific vs specific to general
The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent
(k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search
strategy is effective, provided the maximum length of a frequent itemset is not too long. The
configuration of frequent item-sets that works best with this strategy is shown in Figure 3.13(a),
where the darker nodes represent infrequent itemsets.
Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets
first, before finding the more general frequent itemsets. This strategy is useful for discovering
maximal frequent itemsets in dense transaction data sets, where the frequent itemset border is
located near the bottom of the lattice, as shown in Figure 3.13(b).
Another approach is to combine both general-to-specific and specific-to-general search
strategies. This bidirectional approach requires more space to store the candidate itemsets, but it
can help to rapidly identify the frequent itemset border

Figure 3.13: General to Specific, Specific to general and Bidirectional search


❖ Equivalence classes
Another way to envision the traversal is to first partition the lattice into disjoint groups
of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent
itemsets within a particular equivalence class first before moving to another equivalence class.
Equivalence classes can also be defined according to the prefix or suffix labels of an itemset.
In this case, two itemsets belong to the same equivalence class if they share a common prefix or
suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets
starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both
prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like
structure shown in Figure 3.14.

Figure 3.14: Equivalence classes based on the prefix and suffix labels on itemsets

❖ BFS and DFS (Breadth-first search and depth-first search)

The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure
3.15(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so
on, until no new frequent itemsets are generated.

The lattice can also be traversed in a depth-first manner, as shown in Figure 3.15(b). The
algorithm can start from, say, node a in Figure 3.16, and count its support to determine whether it


is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and
so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say,
abce, and continues the search from there. The depth-first approach is often used by algorithms
designed to find maximal frequent itemsets. This approach allows the frequent itemset border to
be detected more quickly than using a breadth-first approach.

Figure 3.15: Representation of BFS and DFS

Figure 3.16: Generating candidate itemsets using the depth-first approach


Representation of Database:
There are many ways to represent a transaction data set. The choice of representation can
affect the I/O costs incurred when computing the support of candidate itemsets. Figure 3.17
shows two different ways of representing market basket transactions. The representation on the
left is called a horizontal data layout, which is adopted by many association rule mining
algorithms, including Apriori. Another possibility is to store the list of transaction identifiers
(TID-list) associated with each item. Such a representation is known as the vertical data layout.

Figure 3.17: Horizontal and Vertical data layout
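A small illustrative sketch of the vertical layout: the transactions below are made up for the example, and the support of an itemset is obtained by intersecting the TID-lists of its items rather than rescanning the data:

from functools import reduce

horizontal = {1: {"a", "b"}, 2: {"b", "c", "d"}, 3: {"a", "c", "d", "e"},
              4: {"a", "d", "e"}, 5: {"a", "b", "c"}}

# invert the horizontal layout into item -> TID-list
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def support(itemset):
    return len(reduce(set.intersection, (vertical[i] for i in itemset)))

print(vertical["a"])            # TID-list of item a: {1, 3, 4, 5}
print(support({"a", "d"}))      # |{3, 4}| = 2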

3.6 FP Growth Algorithm


Definition: FP-growth is an algorithm that takes a radically different approach to discovering
frequent itemsets: it encodes the data set using a compact data structure called an FP-tree
and extracts frequent itemsets directly from this structure.


Apriori vs FP-Growth Algorithm

Apriori
  Technique: Generate singletons, pairs, triplets, etc.
  Runtime: Candidate generation is extremely slow; runtime increases exponentially with the
  number of different items.
  Memory usage: Stores singletons, pairs, triplets, etc.
  Parallelizability: Candidate generation is very parallelizable.

FP-Growth
  Technique: Insert items, sorted by frequency, into a pattern tree.
  Runtime: Runtime increases linearly, depending on the number of transactions and items.
  Memory usage: Stores a compact version of the database.
  Parallelizability: Data are very interdependent; each node needs the root.

3.6.1 FP-Tree Representation

• An FP-tree is a compressed representation of the input data. It is constructed by
reading the data set one transaction at a time and mapping each transaction onto a path in
the FP-tree. As different transactions can have several items in common, their paths may
overlap.
• The more the paths overlap with one another, the more compression we can achieve
using the FP-tree structure.
• Figure 3.18 shows a data set that contains ten transactions and five items. The
structure of the FP-tree after reading the first three transactions is also depicted in the
diagram. Each node in the tree contains the label of an item along with a counter that
shows the number of transactions mapped onto the given path. Initially, the FP-tree
contains only the root node, represented by the null symbol.

The FP-tree is subsequently extended in the following way:

1. The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing order of their
support counts. For the data set shown in Figure 3.18, a is the most frequent item,
followed by b, c, d, and e.


2. The algorithm makes a second pass over the data to construct the FP-tree. After
reading the first transaction, {a, b}, the nodes labeled a and b are created. A path
is then formed from null → a → b to encode the transaction. Every node along the path
has a frequency count of 1.

Figure 3.18: FP Tree Construction


3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b,
c, and d. A path is then formed to represent the transaction by connecting the nodes null
→ b → c → d. Every node along this path also has a frequency count equal to one.
Although the first two transactions have an item in common, which is
b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the
first transaction. As a result, the path for the third transaction, null → a → c → d → e,
overlaps with the path for the first transaction, null → a → b. Because of their overlapping


path, the frequency count for node a is incremented to two, while the frequency counts
for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree. The resulting FP-tree after reading all the transactions is
shown at the bottom of Figure 3.18
• The size of an FP-tree is typically smaller than the size of the uncompressed data
because many transactions in market basket data often share a few items in common.
• In the best-case scenario, where all the transactions have the same set of items, the
FP-tree contains only a single branch of nodes.
• The worst-case scenario happens when every transaction has a unique set of items. As
none of the transactions have any items in common, the size of the FP-tree is effectively
the same as the size of the original data.
• However, the physical storage requirement for the FP-tree is higher because it requires
additional space to store pointers between nodes and counters for each item.
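The construction procedure described above can be sketched compactly as follows (class and function names are illustrative; no node-link or header-table pointers are maintained, so this shows only the tree-building part):

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, minsup_count):
    # first pass: count item supports and keep only frequent items
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= minsup_count}
    root = FPNode(None)
    # second pass: insert each transaction as a path of frequent items,
    # sorted by decreasing support count
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

Calling show(build_fp_tree(transactions, minsup_count)) on any transaction list prints the resulting tree, one node per line, with each node's item label and counter.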

3.6.2 Frequent Itemset Generation in FP- Growth Algorithm


• FP-growth is an algorithm that generates frequent itemsets from an FP-tree by
exploring the tree in a bottom-up fashion.
Consider the following dataset,

TID ITEM SETS

1 {M,O,N,K,E,Y}

2 {D,O,N,K,E,Y}

3 {M,A,K,E}

4 {M,U,C,K,Y}

5 {C,O,K,I,E}


Step 1: Find the support count of every item in the transactions and remove the items whose
support count is less than 3. In this example, the items N, D, A, U, C, and I have low support
counts, so they are discarded for the next step, and the remaining items are arranged in
decreasing order of support count.

ITEM  SUPPORT COUNT
M     3
O     3
N     2
K     5
E     4
Y     3
D     1
A     1
U     1
C     2
I     1

Frequent items sorted by support count:

ITEM  SUPPORT COUNT
K     5
E     4
M     3
O     3
Y     3

Step 2: Using the frequent-item order obtained above, i.e., K E M O Y, compare it with the
original data set and rewrite each transaction keeping only its frequent items in this order,
producing the ordered itemsets below.

No ORDERED ITEMSETS
1 KEMOY
2 KEOY
3 KEM
4 KMY
5 KEO
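Steps 1 and 2 can be reproduced with a short sketch (the transactions are written as strings of item letters for brevity; ties in the support ordering keep the first-seen order, which here matches K, E, M, O, Y):

from collections import Counter

transactions = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COKIE"]
minsup_count = 3

counts = Counter(item for t in transactions for item in t)
frequent = [i for i, c in counts.most_common() if c >= minsup_count]
print(frequent)                         # ['K', 'E', 'M', 'O', 'Y']

ordered = ["".join(i for i in frequent if i in t) for t in transactions]
print(ordered)                          # ['KEMOY', 'KEOY', 'KEM', 'KMY', 'KEO']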


Step 3: Construct the FP tree for the above ordered itemsets.

ITEM  SUPPORT COUNT
K     5
E     4
M     3
O     3
Y     3

[Figure: FP tree built from the ordered itemsets. Starting from the null root, the ordered
itemsets map onto the paths null → K → E → M → O → Y (from KEMOY, shared by KEM up to M),
null → K → E → O → Y (from KEOY, shared by KEO up to O), and null → K → M → Y (from KMY),
with each node's counter recording how many ordered itemsets pass through it (K:5, E:4, and
so on).]

Step 4: Process the items in reverse order of support (Y first). Find the frequent itemsets
ending in Y, then the algorithm proceeds to look for frequent itemsets ending in O, and so on,
until all the paths are processed. For each item, construct the conditional FP tree by counting
the repeated items in its conditional pattern base.

ITEM  CONDITIONAL PATTERN BASE             CONDITIONAL FP TREE
Y     {{KEMO:1}, {KEO:1}, {KM:1}}          {K:3}
O     {{KEM:1}, {KE:2}}                    {KE:3}
M     {{KE:2}, {K:1}}                      {K:3}
E     {{K:4}}                              {K:4}
K     -                                    -
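A minimal sketch of Step 4 on the ordered itemsets (illustrative function name). For each item it aggregates the prefixes of the ordered transactions that pass through that item; because every ordered transaction corresponds to one path of the FP tree, this reproduces the conditional pattern bases in the table above:

from collections import Counter

ordered = ["KEMOY", "KEOY", "KEM", "KMY", "KEO"]

def conditional_pattern_base(item, ordered):
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = t[:t.index(item)]     # items appearing before `item` on this path
            if prefix:
                base[prefix] += 1
    return dict(base)

for item in "YOME":
    print(item, conditional_pattern_base(item, ordered))
# Y {'KEMO': 1, 'KEO': 1, 'KM': 1}
# O {'KEM': 1, 'KE': 2}
# M {'KE': 2, 'K': 1}
# E {'K': 4}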
***************************END***************************************

