Association Analysis: Basic Concepts and Algorithms
Module 3
• Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. Such data, commonly known as market basket
transactions, is shown in Table 3.1.
• Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer. Retailers are
interested in analyzing the data to learn about the purchasing behavior of their
customers. Such valuable information can be used to support a variety of business-
related applications such as marketing promotions, inventory management, and
customer relationship management.
Table 3.1: Example of Market Basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
2. Second, some of the discovered patterns are potentially spurious (fake) because they
may happen simply by chance.
• Even for the small data set shown in Table 3.1, this approach requires us to compute the
support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded
after applying minsup = 20% and minconf = 50%, so most of the computations are wasted.
To avoid performing needless computations, it would be useful to prune the rules early,
without having to compute their support and confidence values (see the worked count below).
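For reference, the count of 602 follows from the standard expression for the total number of possible rules over d items; with d = 6 items, the evaluation is:

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^{d} - 2^{d+1} + 1,
\qquad R\big|_{d=6} = 3^{6} - 2^{7} + 1 = 729 - 128 + 1 = 602.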
• A common strategy adopted by many association rule mining algorithms is to decompose
the problem into two major subtasks.
• The initial step toward improving the performance of association rule mining algorithms is
to decouple the support and confidence requirements. The support of a rule X→Y depends
only on the support of its corresponding itemset, X ∪ Y. The following rules have identical
support because they involve items from the same itemset {Beer, Diapers, Milk}:
{Beer, Diapers}→{Milk}, {Beer, Milk}→{Diapers}, {Diapers, Milk}→{Beer},
{Beer}→{Diapers, Milk}, {Milk}→{Beer, Diapers}, {Diapers}→{Beer, Milk}
Two-step approach:
❖ Frequent Itemset Generation: The objective is to find all the itemsets that satisfy
the minsup threshold, i.e., to generate all itemsets whose support ≥ minsup. These
itemsets are called frequent itemsets.
❖ Rule Generation: Generate high-confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset.
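To make the two quantities concrete, the following minimal Python sketch computes the support of an itemset and the confidence of a rule X → Y over a list of transactions; the five transactions used here are made up for illustration and are not claimed to be the full Table 3.1:

# Support and confidence over market basket transactions (illustrative data).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Confidence of X -> Y is sigma(X u Y) / sigma(X).
    return support_count(X | Y, transactions) / support_count(X, transactions)

# Every rule formed from {Beer, Diapers, Milk} has the same support:
print(support({"Beer", "Diapers", "Milk"}, transactions))        # 0.4 on this data
print(confidence({"Beer", "Diapers"}, {"Milk"}, transactions))   # about 0.67 on this data

Because every rule derived from the same itemset shares the quantity σ(X ∪ Y), decoupling the two requirements lets the support computation be done once per itemset.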
3.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 3.1
shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can
potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very
large in many practical applications, the search space of itemsets that need to be explored is
exponentially large.
A brute-force approach for finding frequent itemsets is to determine the support count for
every candidate itemset in the lattice structure. To do this, we need to compare each candidate
against every transaction. Such an approach can be very expensive because it requires
O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of
candidate itemsets, and w is the maximum transaction width.
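A minimal Python sketch of this brute-force strategy (assuming the transactions are given as a list of sets) shows where the O(NMw) cost comes from: every one of the M = 2^k − 1 candidates is checked against all N transactions.

from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    # Enumerate every non-empty candidate itemset in the lattice and count
    # its support by scanning all transactions.
    items = sorted(set().union(*transactions))
    N = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cset = set(candidate)
            count = sum(1 for t in transactions if cset <= t)   # one pass over N transactions
            if count / N >= minsup:
                frequent[candidate] = count
    return frequent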
There are several ways to reduce the computational complexity of frequent itemset generation.
1. Reduce the number of candidate itemsets (M).The Apriori principle, described in the next
section, is an effective way to eliminate some of the candidate itemsets without counting their
support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate itemsets or to compress the data set.
• Apriori is the first association rule mining algorithm that pioneered the use of
support based pruning to systematically control the exponential growth of candidate
itemsets.
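The sketch below is one simplified way to realize this idea in Python (it is not the textbook pseudocode): candidate k-itemsets are built only from frequent (k−1)-itemsets, and a candidate is dropped immediately if any of its (k−1)-subsets is infrequent, which is exactly the support-based (Apriori) pruning step.

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [set(t) for t in transactions]
    N = len(transactions)

    def count(itemsets):
        # Support counting by scanning the whole data set once per level.
        return {c: sum(1 for t in transactions if c <= t) for c in itemsets}

    # Frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n / N >= minsup}
    all_frequent, k = dict(frequent), 2

    while frequent:
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # Keep the candidate only if it has size k and every
                # (k-1)-subset is frequent (Apriori principle).
                if len(union) == k and all(frozenset(s) in frequent
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)
        frequent = {c: n for c, n in count(candidates).items() if n / N >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent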
Fk-1×F1 Method
An alternative method for candidate generation is to extend each frequent (k−1)-itemset with
other frequent items. Figure 3.5 illustrates how a frequent 2-itemset such as {Beer, Diapers} can
be augmented with a frequent item such as Bread to produce the candidate 3-itemset
{Beer, Diapers, Bread}, as shown in Figure 3.6.
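A small sketch of this F(k−1)×F(1) idea in Python, under the assumption that itemsets are kept as lexicographically sorted tuples so that each candidate is produced only once:

def candidates_fk1_x_f1(freq_k_minus_1, freq_items):
    # Extend each frequent (k-1)-itemset with every frequent item that is
    # lexicographically larger than its last item; keeping itemsets sorted
    # avoids generating the same candidate more than once.
    candidates = []
    for itemset in freq_k_minus_1:          # e.g. ('Beer', 'Bread')
        for item in sorted(freq_items):
            if item > itemset[-1]:
                candidates.append(itemset + (item,))
    return candidates

Note that under this sorted-tuple convention the candidate {Beer, Bread, Diapers} is produced once, from ('Beer', 'Bread') extended with 'Diapers', rather than from ('Beer', 'Diapers') as drawn in the figure.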
Fk-1×Fk-1 Method
The candidate generation procedure in the apriori-gen function merges a pair of frequent
(k−1)-itemsets only if their first k−2 items are identical. Let A = {a1, a2, ..., a_{k−1}} and
B = {b1, b2, ..., b_{k−1}} be a pair of frequent (k−1)-itemsets. A and B are merged if they satisfy the
following conditions: ai = bi (for i = 1, 2, ..., k−2) and a_{k−1} ≠ b_{k−1}.
The frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form the candidate 3-
itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with
{Diapers, Milk} because the first item in the two itemsets is different. Indeed, if {Beer, Diapers,
Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with
{Beer, Milk} instead, as shown in Figure 3.7.
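A corresponding Python sketch of the F(k−1)×F(k−1) merge (again treating itemsets as sorted tuples):

def candidates_fk1_x_fk1(freq_k_minus_1):
    # Merge two frequent (k-1)-itemsets only if their first k-2 items are
    # identical and their last items differ.
    candidates = []
    prev = sorted(freq_k_minus_1)
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.append(a[:-1] + tuple(sorted((a[-1], b[-1]))))
    return candidates

print(candidates_fk1_x_fk1([("Bread", "Diapers"), ("Bread", "Milk"),
                            ("Beer", "Diapers"), ("Diapers", "Milk")]))
# [('Bread', 'Diapers', 'Milk')]  -- {Beer, Diapers} and {Diapers, Milk} are not merged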
The itemsets contained in a transaction t can be enumerated by specifying the smallest item first, followed by the larger items.
• For instance, given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item
1, 2, or 3.
• It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are
only two items in t whose labels are greater than or equal to 5.
• The number of ways to specify the first item of a 3-itemset contained in t is shown in
Figure 3.8. For instance, 1 2 3 5 6 represents a 3-itemset that begins with item 1, followed by
two more items chosen from the set {2, 3, 5, 6}.
• After fixing the first item, the prefix structures at Level 2 represent the number of ways
to select the second item.
For example, 1 2 3 5 6 corresponds to itemsets that begin with prefix (1 2) and are followed by
items 3, 5, or 6. Finally, the prefix structures at Level 3 represent the complete set of 3-
itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1,2,3},
{1,2,5}, and {1,2,6}, while those that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
• The prefix structures shown in Figure 3.9 demonstrate how itemsets contained in
a transaction can be systematically enumerated, i.e., by specifying their items one by one,
from the leftmost item to the rightmost item. We still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the
candidates, then the support count of the corresponding candidate is incremented. In the
next section, we illustrate how this matching operation can be performed efficiently
using a hash tree structure.
The 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix
structures shown in Figure 3.9.
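In Python, this enumeration is exactly what itertools.combinations produces when the transaction is kept in sorted order; a quick check for t = {1, 2, 3, 5, 6}:

from itertools import combinations

t = (1, 2, 3, 5, 6)
# The 3-itemsets are emitted in the prefix order described above:
# those starting with 1 first, then 2, then 3 (C(5, 3) = 10 itemsets in total).
for subset in combinations(t, 3):
    print(subset)   # (1, 2, 3), (1, 2, 5), ..., (3, 5, 6)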
At Level 1, for the candidate {1, 4, 5}, the leftmost item is 1, so the candidate is hashed to the
left child of the root node (the child covering items 1, 4, and 7).
Similarly, for {2, 3, 4} the leftmost item is 2, so it is hashed to the middle child (covering items
2, 5, and 8), and for {3, 4, 5} the leftmost item is 3, so it is hashed to the right child (covering
items 3, 6, and 9).
At Level 2, the middle item of each candidate is hashed to the appropriate left, middle, or right
child; for example, {1, 2, 4} follows its second item, 2, to the middle child.
At Level 3, the rightmost item of each candidate determines the left, middle, or right child; for
example, {1, 3, 6} follows its third item, 6, to the right child.
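The hash function implied by this description sends items 1, 4, 7 to the left child, 2, 5, 8 to the middle child, and 3, 6, 9 to the right child, i.e. it branches on the item value modulo 3. The short Python sketch below is a simplified view that routes every candidate down three levels (it ignores the bucket-overflow rule a real hash tree uses to decide when to split a node):

def branch(item):
    # 1, 4, 7 -> left; 2, 5, 8 -> middle; 3, 6, 9 -> right.
    return {1: "left", 2: "middle", 0: "right"}[item % 3]

def route(candidate):
    # Hash one item per level: first item at level 1, second at level 2, ...
    return [branch(item) for item in candidate]

print(route((1, 4, 5)))   # ['left', 'left', 'middle']
print(route((2, 3, 4)))   # ['middle', 'right', 'left']
print(route((1, 3, 6)))   # ['left', 'right', 'right']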
• Support Threshold: Lowering the support threshold often results in more itemsets
being declared as frequent. This has an adverse effect on the computational complexity of
the algorithm because more candidate itemsets must be generated and counted.
• Number of Items (Dimensionality): As the number of items increases, more space will
be needed to store the support counts of items. If the number of frequent items also grows
with the dimensionality of the data, the computation and I/O costs will increase because of
the larger number of candidate itemsets generated by the algorithm.
• Number of Transactions: Since the Apriori algorithm makes repeated passes over the
data set, its run time increases with a larger number of transactions.
• Average Transaction Width: For dense data sets, the average transaction width can be
very large. This affects the complexity of the Apriori algorithm in two ways.
o First, the maximum size of frequent itemsets tends to increase as the
average transaction width increases.
o Second, as the transaction width increases, more itemsets are contained
in the transaction.
• Generation of frequent 1-itemsets: For each transaction, we need to update the
support count for every item present in the transaction. Assuming that w is the average
transaction width, this operation requires O(Nw) time, where N is the total number of
transactions.
• Candidate generation: To generate candidate k-itemsets, pairs of frequent (k−1)-
itemsets are merged to determine whether they have at least k−2 items in common. Each
merging operation requires at most k−2 equality comparisons. In the best-case
scenario, every merging step produces a viable candidate k-itemset. In the worst-case
scenario, the algorithm must merge every pair of frequent (k−1)-itemsets found in the
previous iteration.
• Support counting: Each transaction of length |t| produces C(|t|, k) itemsets of size k. This is
also the effective number of hash tree traversals performed for each transaction.
An association rule can be extracted by partitioning the itemset Y into two non-empty
subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
• Note that all such rules must have already met the support threshold because they
are generated from a frequent itemset.
Example:
• Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that
can be generated from X: {1,2} → {3}, {1,3} → {2}, {2,3} → {1}, {1} → {2,3},
{2} → {1,3}, and {3} → {1,2}.
Because the support of each rule is identical to the support of X, all of these rules must
satisfy the support threshold.
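A quick numerical check of this rule-generation step in Python, using hypothetical support counts (the values below are assumed purely for illustration), evaluates the confidence of all six candidate rules from X = {1, 2, 3}:

from itertools import combinations

# Hypothetical support counts, assumed only for this example.
support = {
    frozenset([1]): 5, frozenset([2]): 6, frozenset([3]): 5,
    frozenset([1, 2]): 4, frozenset([1, 3]): 4, frozenset([2, 3]): 4,
    frozenset([1, 2, 3]): 3,
}

X = frozenset([1, 2, 3])
for r in (1, 2):
    for antecedent in combinations(sorted(X), r):
        A = frozenset(antecedent)
        conf = support[X] / support[A]   # confidence = sigma(X) / sigma(antecedent)
        print(sorted(A), "->", sorted(X - A), "confidence =", round(conf, 2))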
• The Apriori algorithm uses a level-wise approach for generating association rules,
where each level corresponds to the number of items that belong to the rule consequent.
• Initially, all the high-confidence rules that have only one item in the rule
consequent are extracted. These rules are then used to generate new candidate rules.
For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
• If any node in the lattice has low confidence, then according to the confidence-based
pruning theorem, the entire subgraph spanned by the node can be pruned immediately.
• Suppose the confidence for {bcd} → {a} is low. All the rules containing item a
in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} →
{abc}, can be discarded, as shown in Figure 3.11.
In practice, the number of frequent itemsets produced from a transaction data set can be very
large. It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived. Two such representations are presented in this section:
maximal and closed frequent itemsets.
Definition: A maximal frequent itemset is defined as a frequent itemset for which none of
its immediate supersets are frequent.
To illustrate this concept, consider the itemset lattice shown in Figure 3.11
The itemsets in the lattice are divided into two groups: those that are frequent and those that are
infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in
the diagram. Every itemset located above the border is frequent, while those located below the
border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d},
{a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate
supersets are infrequent. An itemset such as {a, d} is maximal frequent because all of its
immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-
maximal because one of its immediate supersets, {a, c, e}, is frequent. Maximal frequent itemsets
effectively provide a compact representation of frequent itemsets.
Consequently, the support for {b} is identical to that of {b, c}, and {b} should not be
considered a closed itemset. Similarly, since c occurs in every transaction that contains both a
and d, the itemset {a, d} is not closed.
On the other hand, {b, c} is a closed itemset because it does not have the same support
count as any of its supersets, as shown in Figure 3.12.
Closed Frequent Itemset:
An itemset is a closed frequent itemset if it is closed and its support is greater than or
equal to minsup
Ex: Assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because
its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded nodes.
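Assuming a dictionary that maps itemsets (frozensets) to their support counts, including the counts of the relevant immediate supersets, the two definitions can be checked with a short Python sketch:

def immediate_supersets(itemset, all_items):
    # All itemsets obtained by adding exactly one new item.
    return [itemset | {x} for x in all_items if x not in itemset]

def is_maximal(itemset, frequent_itemsets, all_items):
    # Maximal frequent: frequent, and no immediate superset is frequent.
    return (itemset in frequent_itemsets and
            all(s not in frequent_itemsets
                for s in immediate_supersets(itemset, all_items)))

def is_closed(itemset, support, all_items):
    # Closed: no immediate superset has the same support count.
    return all(support.get(s, 0) < support[itemset]
               for s in immediate_supersets(itemset, all_items))

Checking only immediate supersets suffices for closedness because support is anti-monotone: if some larger superset had the same support as the itemset, every intermediate superset would as well.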
Traversal of itemset lattice: A search for frequent itemsets can be conceptually viewed as a
traversal on the itemset lattice. The search algorithm decides how the lattice structure is traversed
during the frequent itemset generation process. Some search strategies are discussed below.
❖ General to specific vs specific to general
The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent
(k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search
strategy is effective, provided the maximum length of a frequent itemset is not too long. The
configuration of frequent itemsets that works best with this strategy is shown in Figure 3.13(a),
where the darker nodes represent infrequent itemsets.
Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets
first, before finding the more general frequent itemsets. This strategy is useful for discovering
maximal frequent itemsets in dense transactions, where the frequent itemset border is located
near the bottom of the lattice, as shown in Figure 3.13(b).
Another approach is to combine both general-to-specific and specific-to-general search
strategies. This bidirectional approach requires more space to store the candidate itemsets, but it
can help to rapidly identify the frequent itemset border
❖ Equivalence classes
Another way to envision the traversal is to first partition the lattice into disjoint groups
of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent
itemsets within a particular equivalence class first before moving to another equivalence class.
Equivalence classes can also be defined according to the prefix or suffix labels of an itemset.
In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix
of length k . In the prefix-based approach,the algorithm can search for frequent itemsets starting with
the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and
suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure
3.14
Figure 3.14: Equivalence classes based on the prefix and suffix labels on itemsets
❖ BFS and DFS (Breadth-first search and depth-first search)
The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure
3.15(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so
on, until no new frequent itemsets are generated.
The lattice can also be traversed in a depth-first manner, as shown in Figure 3.15(b). The
algorithm can start from, say, node a in Figure 3.16, and count its support to determine whether it
is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab ,abc, and
so on, until an infrequent node is reached, say, abcd . It then backtracks to another branch, say,
abce , and continues the search from there. The depth-first approach is often used by algorithms
designed to find maximal frequent itemsets. This approach allows the frequent itemset border to
be detected more quickly than using a breadth-first approach.
Figure 3.16: Generating candidate itemsets using the depth-first approach
Representation of Database:
There are many ways to represent a transaction data set. The choice of representation can
affect the I/O costs incurred when computing the support of candidate itemsets. Figure 3.17
shows two different ways of representing market basket transactions. The representation on the
left is called a horizontal data layout, which is adopted by many association rule mining
algorithms, including Apriori . Another possibility is to store the list of transaction identifiers
(TID-list) associated with each item. Such a representation is known as the vertical data layout.
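A small Python sketch contrasting the two layouts (using a made-up four-transaction data set): the vertical layout keeps a TID-list per item, and the support count of an itemset is the size of the intersection of its items' TID-lists.

# Horizontal layout: one row (set of items) per transaction.
horizontal = {
    1: {"a", "b", "e"},
    2: {"b", "c", "d"},
    3: {"a", "b", "d", "e"},
    4: {"a", "c", "e"},
}

# Vertical layout: one TID-list per item, derived from the horizontal form.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# Support count of {a, e}: intersect the TID-lists of a and e.
print(len(vertical["a"] & vertical["e"]))   # 3 (transactions 1, 3 and 4)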
Comparison of Apriori and FP-Growth:
• Apriori
o Technique: generate candidate singletons, pairs, triplets, etc.
o Memory usage: saves the candidate singletons, pairs, triplets, etc.
o Runtime: candidate generation is extremely slow; runtime increases exponentially with the number of different items.
o Parallelizability: candidate generation is very parallelizable.
• FP-Growth
o Technique: insert items, sorted by frequency, into a pattern tree.
o Memory usage: stores a compact version of the database.
o Runtime: increases linearly, depending on the number of transactions and items.
o Parallelizability: data are very interdependent; each node needs the root.
1. The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing order of their support
counts. For the data set shown in Figure 3.18, a is the most frequent item, followed by b,
c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After
reading the first transaction, {a, b}, the nodes labeled a and b are created. A path
is then formed from null → a → b to encode the transaction. Every node along the path
has a frequency count of 1.
When a later transaction shares the prefix a with this existing path, the frequency count for
node a is incremented to two, while the frequency counts for the newly created nodes, c, d,
and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree. The resulting FP-tree after reading all the transactions is
shown at the bottom of Figure 3.18
• The size of an FP-tree is typically smaller than the size of the uncompressed data
because many transactions in market basket data often share a few items in common.
• In the best-case scenario, where all the transactions have the same set of items, the
FP-tree contains only a single branch of nodes.
• The worst-case scenario happens when every transaction has a unique set of items. As
none of the transactions have any items in common, the size of the FP-tree is effectively
the same as the size of the original data.
• However, the physical storage requirement for the FP-tree is higher because it requires
additional space to store pointers between nodes and counters for each item.
1 {M,O,N,K,E,Y}
2 {D,O,N,K,E,Y}
3 {M,A,K,E}
4 {M,U,C,K,Y}
5 {C,O,K,I,E}
Step 1: Find the support count of every item in the transactions and remove the items having
support count < 3. In the example below, the items D, A, U, N, C, and I have low support counts,
so discard them for the next step and arrange the remaining items in decreasing order of support count.
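Step 1 can be verified directly in Python for the five transactions above:

from collections import Counter

transactions = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COKIE"]
counts = Counter(item for t in transactions for item in t)
frequent = {i: c for i, c in counts.items() if c >= 3}   # K:5, E:4, M:3, O:3, Y:3 remain
order = sorted(frequent, key=lambda i: -frequent[i])
print(order)   # ['K', 'E', 'M', 'O', 'Y']  (ties among M, O, Y broken by first appearance)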
Step 2: Consider the above frequent pattern, i.e., "KEMOY", and compare the pattern with the
original data set to generate the ordered itemsets.
No ORDERED ITEMSETS
1 KEMOY
2 KEOY
3 KEM
4 KMY
5 KEO
Step 3: Insert each ordered itemset into the FP-tree. (Figure: the header table lists K: 5, E: 4,
M: 3, O: 3, Y: 3; in the tree, rooted at null, node K has count 5, its child E has count 4, its other
child M has count 1, the O nodes have counts 2 and 1, and each of the three Y leaf nodes has count 1.)
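A compact Python sketch of the FP-tree construction for the ordered itemsets of Step 2 (the node counts it reports should match the tree described above):

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(ordered_itemsets):
    # Insert each ordered transaction, incrementing counts along shared prefixes.
    root = FPNode(None)
    for itemset in ordered_itemsets:
        node = root
        for item in itemset:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

tree = build_fp_tree(["KEMOY", "KEOY", "KEM", "KMY", "KEO"])
print(tree.children["K"].count)                   # 5
print(tree.children["K"].children["E"].count)     # 4
print(tree.children["K"].children["M"].count)     # 1  (the K -> M -> Y branch)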
Step 4: Process the items in reverse order: first find the frequent itemsets ending in Y, then the
algorithm proceeds to look for frequent itemsets ending in O, and so on. This process continues
until all the paths are processed; the conditional FP-tree is constructed by counting the repeated
items in each conditional pattern base.