
Mining Frequent Patterns

Unit 3
Introduction
• Frequent pattern mining in data mining is the process of identifying
patterns or associations within a dataset that occur frequently.
• This is typically done by analyzing large datasets to find items or sets
of items that appear together frequently.
• Frequent pattern extraction is a core task in data mining that aims to
uncover recurring patterns or itemsets in a given dataset.
• It involves identifying sets of items that frequently occur together in a
transactional or relational database.
• This process can offer valuable insight into the relationships and
associations among different items or attributes within the data.
• These frequent item sets are identified based on a minimum support
threshold, and from them, association rules can be generated.
• Popular algorithms for this task include Apriori, FP-growth, and Eclat.
• This technique finds applications in market basket analysis, product
recommendations, and network traffic analysis, providing valuable
insights from large datasets.
Algorithms Used For Frequent Pattern Mining
A few of the commonly used algorithms for mining frequent
patterns include the following -
• Apriori -
• Apriori is a classic algorithm for mining frequent patterns in large
datasets.
• It works by iteratively generating candidate itemsets of increasing size
and pruning those that do not meet the minimum support threshold.
• This approach significantly reduces the search space and makes it
possible to handle datasets with a large number of items.
• However, Apriori can be computationally expensive for dense datasets or
low support thresholds, since many of the candidate itemsets it generates
turn out to be infrequent.
• FP-growth -
• FP-growth is an algorithm for mining frequent patterns that uses a
divide-and-conquer approach.
• It constructs a tree-like data structure called the frequent pattern
(FP) tree, in which each node represents an item and each path from the
root corresponds to a frequency-ordered transaction prefix, so
transactions that share a prefix share nodes.
• By scanning the dataset only twice, FP-growth can efficiently mine
all frequent itemsets without generating candidate itemsets
explicitly.
• It is particularly suitable for datasets with long patterns and
relatively low support thresholds.
• Eclat -
• Eclat is a depth-first search algorithm that mines the same frequent
itemsets as Apriori.
• However, instead of generating candidate itemsets of increasing size, Eclat
uses a vertical representation of the dataset, in which each item is mapped
to the set of transaction IDs (its tidset) that contain it, and identifies
frequent itemsets recursively.
• It computes the support of an itemset by intersecting the tidsets of its
members, which reduces the search space and makes it efficient for datasets
with many short, frequent itemsets.
• However, Eclat may perform poorly for datasets with long itemsets or low
support thresholds.
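
To make the vertical representation concrete, below is a minimal Python sketch of Eclat's depth-first tidset intersection. The dataset, function name, and item labels are illustrative assumptions, not taken from these notes.

```python
def eclat(prefix, items, min_support, results):
    """Minimal Eclat sketch over a vertical layout: each item is paired
    with its tidset (the set of transaction IDs containing it), and the
    support of an itemset is the size of the intersection of the
    tidsets of its members."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            itemset = prefix | {item}
            results[frozenset(itemset)] = len(tids)
            # Depth-first: extend the current itemset by intersecting
            # its tidset with those of the remaining items.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items
                      if len(tids & other_tids) >= min_support]
            eclat(itemset, suffix, min_support, results)

# Hypothetical vertical database: item -> IDs of transactions containing it.
vertical = {'A': {1, 2, 3, 4}, 'B': {1, 2, 4}, 'C': {2, 3}, 'D': {1, 2, 4}}
results = {}
eclat(set(), list(vertical.items()), min_support=3, results=results)
print(results)  # includes {A}: 4, {B}: 3, {A, B}: 3, {A, B, D}: 3, ...
```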
Apriori Algorithm
• The Apriori algorithm in data mining is a popular algorithm used for finding
frequent itemsets in a dataset.
• It is widely used in association rule mining to discover relationships
between items in a dataset.
• The Apriori algorithm was developed by R. Agrawal and R. Srikant in 1994.
• The Apriori algorithm is used to implement frequent pattern mining (FPM).
• Frequent pattern mining is a data mining technique to discover frequent
patterns or relationships between items in a dataset.
• Frequent pattern mining involves finding sets of items or itemsets that occur
together frequently in a dataset.
• These sets of items or itemsets are called frequent patterns, and their
frequency is measured by the number of transactions in which they occur.
• In data mining, an itemset is a collection of one or more items that
appear together in a transaction or dataset.
• An itemset can be either a single item, also known as a 1-itemset, or a
set of k items, also known as a k-itemset.
• For example, in the sales transactions of a retail store, an itemset can
refer to products purchased together, such as bread and milk, which would
be a 2-itemset.
• The Apriori algorithm can be used to discover frequent itemsets in the
sales transactions of a retail store.
• For instance, the algorithm might discover that customers who
purchase bread and milk together often also purchase eggs. This
information can be used to recommend eggs to customers who
purchase bread and milk in the future.
• The Apriori algorithm is called "apriori" because it uses prior knowledge
about the frequent itemsets.
• The algorithm uses the concept of "apriori property," which states that if an
itemset is frequent, then all of its subsets must also be frequent.
Apriori Property
• The Apriori property is a fundamental property of frequent itemsets used in
the Apriori algorithm: if an itemset is frequent, then all of its non-empty
subsets must also be frequent.
• In other words, if an itemset appears frequently enough in the dataset to be
considered significant, then all of its subsets must also appear frequently
enough to be significant.
• For example, if the itemset {A, B, C} frequently appears in a dataset, then
the subsets {A, B}, {A, C}, {B, C}, {A}, {B}, and {C} must also appear
frequently in the dataset.
• The Apriori property allows the Apriori algorithm in data mining to
efficiently search for frequent itemsets by eliminating candidate itemsets
containing infrequent subsets, as they cannot be frequent.
• This search space pruning reduces the time and memory required to find
frequent itemsets in large datasets.
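
As a small hedged illustration of this pruning, the subset check can be written in a few lines of Python; the function name and the example itemsets are hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_itemsets):
    """Return True if some (k-1)-subset of the candidate is not frequent.
    By the Apriori property, such a candidate cannot itself be frequent,
    so it can be pruned without counting its support."""
    k = len(candidate)
    return any(frozenset(subset) not in frequent_itemsets
               for subset in combinations(candidate, k - 1))

# {B, C} is not frequent, so the candidate {A, B, C} is pruned.
frequent = {frozenset({'A'}), frozenset({'B'}), frozenset({'A', 'B'})}
print(has_infrequent_subset(frozenset({'A', 'B', 'C'}), frequent))  # True
```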
Apriori Algorithm Components
• The various terminologies used in the Apriori algorithm are:-
➢Support
• In the Apriori algorithm, support refers to the frequency or occurrence
of an item set in a dataset.
• It is defined as the proportion of transactions in the dataset that contain
the itemset.
• To calculate the support of an itemset, count the number of
transactions in which the itemset appears and divide it by the total
number of transactions in the dataset.
• For instance, if the itemset {milk, bread} appears in 5 transactions out
of 10 transactions in the dataset, then its support is 5/10=0.5 or 50%.
• In the Apriori algorithm, itemsets with a support value above the
minimum defined support threshold are considered frequent and are
used to generate candidate itemsets for the next iteration of the
algorithm.
Support(A) = (Number of transactions in which A occurs) / (Total number of transactions)
➢Lift
• Lift measures the strength of the association between two items.
• It is defined as the ratio of the support of the two items occurring
together to the support of the individual items multiplied together.
• Lift for any two items can be calculated using the below formula -
Lift(A→B) = Support(A ∪ B) / (Support(A) × Support(B))
• If the lift value is greater than 1, then it indicates a positive association
between the two items, which means that the two items are more likely
to be bought together.
• A lift value of exactly 1 indicates that the two items are independent
and there is no association between the two items, while a value less
than 1 indicates a negative association, meaning that two items are
more likely to be bought separately.
➢Confidence
• In the Apriori algorithm, confidence is also a measure of the strength
of the association between two items in an itemset.
• It is defined as the conditional probability that item B appears in a
transaction, given that another item A appears in the same transaction.
• Confidence for two items can be calculated using the below formula -
Confidence(A⇒B) = P(B|A) = sup(A ∪ B) / sup(A)
• If the confidence value exceeds a specified threshold, it indicates that
item B is likely to be purchased with item A.
• For instance, if the confidence of the association between "bread" and
"butter" is 0.8, it means that when a customer buys "bread", there is
an 80% chance that they will also buy "butter".
• This can be useful in recommending to customers or optimizing
product placement in a store.
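
All three metrics fall directly out of the definitions above. The following minimal Python sketch computes them from raw transactions; the toy dataset is a hypothetical illustration, not the example used elsewhere in these notes.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(a, b, transactions):
    """Confidence(A => B) = P(B | A) = sup(A ∪ B) / sup(A)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Lift(A -> B) = sup(A ∪ B) / (sup(A) * sup(B)); > 1 means the
    items co-occur more often than if they were independent."""
    return (support(set(a) | set(b), transactions)
            / (support(a, transactions) * support(b, transactions)))

# Hypothetical mini-dataset for illustration only.
transactions = [{'milk', 'bread'}, {'milk', 'bread', 'butter'},
                {'bread', 'butter'}, {'milk', 'sugar'}, {'milk', 'bread'}]
print(support({'milk', 'bread'}, transactions))       # 0.6
print(confidence({'milk'}, {'bread'}, transactions))  # 0.75
print(lift({'milk'}, {'bread'}, transactions))        # 0.9375
```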
Steps in Apriori Algorithm
1.Define minimum support threshold - This is the minimum number
of times an item set must appear in the dataset to be considered as
frequent. The support threshold is usually set by the user based on the
size of the dataset and the domain knowledge.
2.Generate a list of frequent 1-item sets - Scan the entire dataset to
identify the items that meet the minimum support threshold. These
item sets are known as frequent 1-item sets.
3.Generate candidate item sets - In this step, the algorithm generates a
list of candidate item sets of length k+1 from the frequent k-item sets
identified in the previous step.
4. Count the support of each candidate item set - Scan the dataset
again to count the number of times each candidate item set appears
in the dataset.
5. Prune the candidate item sets - Remove the item sets that do not
meet the minimum support threshold.
6. Repeat steps 3-5 until no more frequent item sets can be generated.
7. Generate association rules - Once the frequent item sets have been
identified, the algorithm generates association rules from them.
Association rules are rules of the form A -> B, where A and B are item
sets. The rule indicates that if a transaction contains A, it is also
likely to contain B.
8. Evaluate the association rules - Finally, the association rules are
evaluated based on metrics such as confidence and lift.
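
Putting steps 2-6 together, here is a compact, self-contained Python sketch of the candidate-generation loop. It is a minimal sketch under stated assumptions: the `apriori` function name and the transactions are hypothetical (the worked example's table from the following slides is not reproduced here).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count} for itemsets appearing
    in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Step 2: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Step 3: join frequent (k-1)-itemsets into candidate k-itemsets,
        # pruning any candidate with an infrequent subset (Apriori property).
        prev = list(frequent)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev)) for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k
                      and all(frozenset(s) in frequent
                              for s in combinations(prev[i] | prev[j], k - 1))}

        # Steps 4-5: count support of each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1  # Step 6: repeat until no new frequent itemsets appear.
    return result

# Hypothetical transactions; with min_support=3 this yields {milk}, {bread},
# {butter}, {milk, bread}, and {bread, butter}.
data = [{'milk', 'bread', 'butter'}, {'milk', 'bread'}, {'milk', 'sugar'},
        {'bread', 'butter'}, {'milk', 'bread', 'butter', 'sugar'}]
for itemset, count in apriori(data, 3).items():
    print(set(itemset), count)
```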
Apriori Algorithm Example
• Use a minimum support threshold of 3. This means an item set must
appear in at least three transactions to be considered frequent.
• Let’s consider the transaction dataset of a retail store as shown in the
below table.
• Let’s calculate the support for each item present in the dataset. As shown
in the below table, the support of every item meets the minimum threshold
of 3, so all items are frequent 1-itemsets and will be used to generate
candidate 2-itemsets.
• Below table represents all candidates generated from frequent 1-
itemsets identified from the previous step and their support value.
• Now remove the candidate itemsets that do not meet the minimum support
threshold of 3. After this step, the frequent 2-itemsets are {milk, bread},
{milk, sugar}, {milk, butter}, and {bread, butter}.
• In the next step, let’s generate candidates for 3-itemsets and calculate
their respective support values. It is shown in the below table.
• As we can see in the above table, only one candidate itemset meets the minimum
defined support threshold - {milk, bread, butter}. As there is only one frequent
3-itemset, candidates for 4-itemsets cannot be generated. So, in the next step,
we can write the association rules and their respective metrics, as shown in
the below table.

• Based on association rules mentioned in the above table, we can recommend products
to the customer or optimize product placement in retail stores.
Advantages
• Apriori algorithm is simple and easy to implement, making it
accessible even to those without a deep understanding of data mining
or machine learning.
• Apriori algorithm can handle large datasets and run on distributed
systems, making it scalable for large-scale applications.
• Apriori algorithm is one of the most widely used algorithms for
association rule mining and is supported by many popular data mining
tools.
Disadvantages
• Apriori algorithm can be computationally expensive, especially for
large datasets with many itemsets.
• Apriori algorithm can generate a large number of rules, making it
difficult to sift through and identify the most important ones.
• The algorithm requires multiple database scans to generate frequent
itemsets, which can be a limitation in systems where data access is
slow or expensive.
• Apriori algorithm is sensitive to data sparsity, meaning it may not
perform well on datasets with a low frequency of itemsets.
FP Growth algorithm
• The FP Growth algorithm in data mining is a popular method for frequent pattern mining.
• The algorithm is efficient for mining frequent item sets in large datasets.
• It works by constructing a frequent pattern tree (FP-tree) from the input dataset.
• The FP Growth algorithm was developed by Han et al. in 2000 and is a powerful
tool for frequent pattern mining in data mining.
• It is widely used in various applications such as market basket analysis, bioinformatics, and web
usage mining.
• The algorithm first scans the dataset to count item frequencies and then, in a second scan, maps each transaction to a path in the tree.
• Items are ordered in each transaction based on their frequency, with the most frequent items
appearing first.
• Once the FP tree is constructed, frequent itemsets can be generated by recursively mining the tree.
• This is done by starting at the bottom of the tree and working upwards, finding all combinations of
itemsets that satisfy the minimum support threshold.
• The FP Growth algorithm requires only two scans of the data and a relatively small amount of
memory to construct the FP tree. It can also be parallelized to improve performance.
Steps of the FP Growth Algorithm
• Scan the database:-In this step, the algorithm scans the input dataset to determine
the frequency of each item. This determines the order in which items are added to
the FP tree, with the most frequent items added first.
• Sort items:-In this step, the items in the dataset are sorted in descending order of
frequency. The infrequent items that do not meet the minimum support threshold
are removed from the dataset. This helps to reduce the dataset's size and improve
the algorithm's efficiency.
• Construct the FP-tree:- In this step, the FP-tree is constructed. The FP-tree is a
compact data structure that stores the frequent itemsets and their support counts.
• Generate frequent itemsets:-Once the FP-tree has been constructed, frequent
itemsets can be generated by recursively mining the tree. Starting at the bottom of
the tree, the algorithm finds all combinations of frequent item sets that satisfy the
minimum support threshold.
• Generate association rules:-Once all frequent item sets have been generated, the
algorithm post-processes the generated frequent item sets to generate association
rules, which can be used to identify interesting relationships between the items in
the dataset.
FP Tree
• The FP-tree (Frequent Pattern tree) is a data structure used in the FP
Growth algorithm for frequent pattern mining. It represents the frequent
itemsets in the input dataset compactly and efficiently. The FP tree consists
of the following components:
• Root Node:-The root node of the FP-tree represents an empty set. It has no
associated item and serves as the starting point of every transaction path
inserted into the tree.
• Item Node:-Each item node in the FP-tree represents a unique item in the dataset. It
stores the item name and the frequency count of the item in the dataset.
• Header Table:-The header table lists all the unique items in the dataset, along with
their frequency count. It is used to track each item's location in the FP tree.
• Child Node:-Each child node of an item node represents an item that co-occurs with
the item the parent node represents in at least one transaction in the dataset.
• Node Link:-The node-link is a pointer that connects each item in the header table to
the first node of that item in the FP-tree. It is used to traverse the conditional pattern
base of each item during the mining process.
• The FP tree is constructed by scanning the input dataset and inserting
each transaction into the tree one at a time.
• For each transaction, the items are sorted in descending order of
frequency count and then added to the tree in that order.
• If the next item already exists as a child of the current node, its
frequency count is incremented and the existing path is followed.
• If it does not, a new node is created for that item, and a new branch is
added to the tree.
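
The insertion logic just described can be sketched compactly in Python. This is a minimal illustration under assumptions: the `FPNode` class and `insert_transaction` function are hypothetical names, and transactions are assumed to be pre-sorted in descending frequency order.

```python
class FPNode:
    """A node in the FP-tree: an item, its support count, and links to its
    parent, its children, and the next node carrying the same item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}     # item -> FPNode
        self.node_link = None  # next FPNode holding the same item

def insert_transaction(root, transaction, header_table):
    """Insert one frequency-ordered transaction, sharing any prefix the
    tree already contains."""
    node = root
    for item in transaction:
        if item in node.children:
            # Shared prefix: increment the existing node's count.
            node.children[item].count += 1
        else:
            # New branch: create the node and append it to the
            # node-link chain kept in the header table.
            child = FPNode(item, node)
            node.children[item] = child
            if item not in header_table:
                header_table[item] = child
            else:
                link = header_table[item]
                while link.node_link is not None:
                    link = link.node_link
                link.node_link = child
        node = node.children[item]
```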
Example
• Suppose we have a dataset of transactions as shown below:
• Let’s scan the above database and compute the frequency of each item as shown in the below table.
• Let’s consider minimum support as 3.
• After removing all the items below minimum support in the above table, we
would be left with these items - {K: 5, E: 4, M: 3, O: 3, Y: 3}.
• Let’s re-order the transaction database based on the items above minimum support.
• In this step, in each transaction, we will remove the infrequent items and
re-order the remaining items in descending order of their frequency, as shown
in the table below.
• Now we will use the ordered itemset in each transaction to build the FP tree. Each
transaction will be inserted individually to build the FP tree, as shown below -
• First Transaction {K, E, M, O, Y}:
In this transaction, all items are simply linked, and their support count is initialized as 1.
• Second Transaction {K, E, O, Y}:
• In this transaction, increase the support count of K and E in the tree to 2.
• As E has no existing child node O, insert a new branch for O and Y and
initialize their support counts as 1.
• Third Transaction {K, E, M}:
• After inserting this transaction, the tree will look as shown below, with the
support counts of K and E increased to 3 and of M increased to 2.
• Fourth Transaction {K, M, Y} and Fifth Transaction {K, E, O}:
• After inserting the last two transactions, the FP-tree will look like as shown below:
• Create a Conditional Pattern Base for all the items.
• The conditional pattern base is the path in the tree ending at the given frequent
item.
• For example, for item O, the prefix paths {K, E, M} and {K, E} end at item O.
• The conditional pattern base for all items will look like as shown below table:
• Now for each item, build a conditional frequent pattern tree.
• It is computed by identifying the set of elements common in all the paths in the
conditional pattern base of a given frequent item and computing its support count
by summing the support counts of all the paths in the conditional pattern base.
• The conditional frequent pattern tree will look as shown in the below table:
• From the above conditional FP tree, generate the frequent itemsets as shown in the
below table:
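
Continuing the hedged Python sketch from the FP-tree construction section, the conditional pattern base of an item can be collected by following its node-links and climbing each occurrence up to the root. Run on the five ordered transactions of this example, it reproduces the pattern base for O shown above.

```python
def conditional_pattern_base(item, header_table):
    """Collect (prefix path, count) pairs for every node holding the item,
    by walking its node-link chain and climbing to the root."""
    base = []
    node = header_table.get(item)
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
        node = node.node_link
    return base

# The five ordered transactions from the example above.
ordered = [['K', 'E', 'M', 'O', 'Y'], ['K', 'E', 'O', 'Y'], ['K', 'E', 'M'],
           ['K', 'M', 'Y'], ['K', 'E', 'O']]
root, header = FPNode(None, None), {}
root.count = 0  # the root represents the empty set
for t in ordered:
    insert_transaction(root, t, header)
print(conditional_pattern_base('O', header))
# [(['K', 'E', 'M'], 1), (['K', 'E'], 2)]
```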
Advantages
• Efficiency:-FP Growth algorithm is faster and more memory-efficient than other
frequent itemset mining algorithms such as Apriori, especially on large datasets
with high dimensionality. This is because it generates frequent itemsets by
constructing the FP-Tree, which compresses the database and requires only two
scans.
• Scalability:-FP Growth algorithm scales well with increasing database size and
itemset dimensionality, making it suitable for mining frequent itemsets in large
datasets.
• Resistant to noise:-FP Growth algorithm is more resistant to noise in the data
than other frequent itemset mining algorithms, as it generates only frequent
itemsets and ignores infrequent itemsets that may be caused by noise.
• Parallelization:-FP Growth algorithm can be easily parallelized, making it
suitable for distributed computing environments and allowing it to take advantage
of multi-core processors.
Disadvantages
• Memory consumption:-
• Although the FP Growth algorithm is more memory-efficient than other
frequent itemset mining algorithms, storing the FP-Tree and the conditional
pattern bases can still require a significant amount of memory, especially for
large datasets.
• Complex implementation:-
• The FP Growth algorithm is more complex than other frequent itemset mining
algorithms, making it more difficult to understand and implement.
FP Growth Algorithm Vs. Apriori Algorithm

| Factor | FP Growth Algorithm | Apriori Algorithm |
| --- | --- | --- |
| Working | Uses the FP-tree to mine frequent itemsets. | Mines frequent itemsets in an iterative manner - 1-itemsets, 2-itemsets, 3-itemsets, etc. |
| Candidate Generation | Generates frequent itemsets by constructing the FP-tree and recursively generating conditional pattern bases. | Generates candidate itemsets by joining and pruning. |
| Data Scanning | Scans the database only twice, to construct the FP-tree and generate conditional pattern bases. | Scans the database multiple times for frequent itemsets. |
| Memory Usage | Requires less memory than Apriori, as the FP-tree compresses the database. | Requires a large amount of memory to store candidate itemsets. |
| Speed | Faster, due to efficient data compression and generation of frequent itemsets. | Slower, due to multiple database scans and candidate generation. |
| Scalability | Performs well on large datasets due to efficient data compression. | Performs poorly on large datasets due to the large number of candidate itemsets. |
Association Rule Mining
• Association rule mining is a technique in data mining that aims to
discover interesting patterns and relationships among items in a
dataset.
• It involves identifying frequent itemsets and generating association
rules describing the relationship between them.
• An association rule is an implication of the form X -> Y,
where X and Y are itemsets.
• The rule indicates that if a transaction contains all the items in X, it is
likely to also contain all the items in Y.
• For example, consider a dataset of customer transactions at a grocery
store.
• We can use association rule mining to generate rules describing the
relationship between the items.
• For example, association rule {bread, milk} -> {eggs} with support
of 50% and confidence of 75% indicates that customers who purchase
bread and milk are also likely to buy eggs with a probability of 0.75.
• Association rule mining can be used in various applications, such as
market basket analysis, cross-selling, and recommendation systems.
• By identifying patterns and relationships among items, it can provide
insights into customer behavior and help businesses make data-driven
decisions.
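
As a final hedged sketch, association rules can be generated by splitting each frequent itemset into an antecedent and a consequent and keeping the splits whose confidence clears a threshold. The supports below are hypothetical counts over 12 transactions, chosen to be consistent with the {bread, milk} -> {eggs} example (support 6/12 = 50%, confidence 6/8 = 75%).

```python
from itertools import combinations

def generate_rules(frequent_itemsets, min_confidence):
    """Generate rules A -> B from each frequent itemset by trying every
    non-empty antecedent A with consequent B = itemset - A, keeping rules
    with confidence sup(A ∪ B) / sup(A) >= min_confidence. Supports may
    be counts or fractions; the ratio is the same either way."""
    rules = []
    for itemset, sup_ab in frequent_itemsets.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                conf = sup_ab / frequent_itemsets[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Hypothetical support counts over 12 transactions.
supports = {
    frozenset({'bread'}): 9, frozenset({'milk'}): 10, frozenset({'eggs'}): 8,
    frozenset({'bread', 'milk'}): 8, frozenset({'bread', 'eggs'}): 6,
    frozenset({'milk', 'eggs'}): 7, frozenset({'bread', 'milk', 'eggs'}): 6,
}
for a, b, conf in generate_rules(supports, 0.75):
    print(a, '->', b, f'(confidence {conf:.2f})')
```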
Advantages
• Frequent pattern mining helps identify correlations between different
items in a dataset, which can be helpful in various applications, such
as market basket analysis, recommendation systems, and cross-selling.
• Frequent pattern mining can help businesses make data-driven
decisions by identifying patterns in data, such as optimizing marketing
strategies, identifying trends, and improving customer satisfaction.
• Frequent pattern mining provides insights into customer behavior,
which can be useful for businesses to improve the consumer
experience.
Disadvantages
• Frequent pattern mining can be computationally expensive, especially
for large datasets or complex patterns.
• Frequent pattern mining can sometimes produce patterns that are not
relevant or useful, leading to noise and decreased accuracy.
• Interpreting frequent patterns can be challenging, as it requires domain
knowledge and an understanding of the underlying data.
Applications
• Market basket analysis - Identifying frequently co-occurring products in a customer's
basket or transaction history.
• Recommendation systems - Generating recommendations based on patterns of behavior
or purchases.
• Cross-selling and up-selling - Identifying related products to recommend or suggest to
customers.
• Fraud detection - Identifying patterns of fraudulent behavior or transactions.
• Web usage mining - Analyzing user behavior and navigation patterns on a website.
• Social network analysis - Identifying common patterns of connections and relationships
between individuals or groups.
• Healthcare - Analyzing patient data and identifying common patterns or risk factors.
• Quality control - Analyzing production data and identifying patterns of defects or errors.
