
Unit 4 Association Rule Mining


Association Rule Mining: Importance and Steps


As the name implies, association rules are simple If/Then statements that aid in the discovery of relationships between seemingly unrelated data in relational databases and other data repositories.

What is Association Rule Mining?


Association rule mining is a procedure used to discover frequent patterns, correlations, associations, or causal
structures in data sets stored in various types of databases such as relational databases, transactional databases,
and other types of data repositories.
The goal of association rule mining, given a set of transactions, is to find the rules that allow us to predict the
occurrence of a specific item based on the occurrences of the other items in the transaction. The data mining
process of discovering the rules that govern associations and causal objects between sets of items is known as
association rule mining.

So, in a given transaction involving multiple items, it attempts to identify the rules that govern how or why such
items are frequently purchased together. For example, peanut butter and jelly are frequently purchased together
because many people enjoy making PB&J sandwiches.

Association Rule Mining is a Data Mining technique for discovering patterns in data. Association Rule Mining
patterns represent relationships between items. When combined with sales data, this is known as Market Basket
Analysis.

Fast-food restaurants, for example, discovered early on that people who eat fast food tend to be thirsty due to the
high salt content and end up buying Coke. They took advantage of this by creating combo meals that combine
food that is sure to make you thirsty with Coke as part of the meal.

Importance of Association Rule Mining


Here are some of the reasons why Association Rule Mining is important and such an effective business tool.

Importance of Association Rule Mining

1. Aids businesses in developing sales strategies

The ultimate goal of any business is to become profitable. This entails attracting more customers and
increasing sales. They can develop better strategies by identifying products that sell better together. For
example, knowing that people who buy fries almost always buy Coke can be used to boost sales.

2. Assists businesses in developing marketing strategies

Attracting customers is a critical component of any business. Understanding which products sell well
together and which do not is essential when developing marketing strategies.


This includes sales and advertisement planning, as well as targeted marketing. For example, knowing that some ornaments do not sell as well as others during the holiday season may enable the manager to offer a discount on the slower-selling ornaments.

3. It aids in shelf-life planning

Knowledge of association rules can help store managers plan their inventory and avoid losing money by
overstocking low-selling perishables.

For example, if olives aren't selling well, the manager won't stock up on them. However, he wishes to
ensure that the existing stock is sold before the expiration date. Given that people who buy pizza dough also
buy olives, the olives can be sold at a lower price when purchased with the pizza dough.

4. It aids in-store organization

Products that have been shown to increase the sales of other products can be moved closer together in the
store. For example, if butter sales are driven by bread sales, they can be moved to the same aisle in the
store.

Association Rule Mining is also used in media recommendations (movies, music, etc.), webpage analysis
(people who visit website A are more likely to visit website B), and so on.

Steps in Association Rule Mining


Association Rules are based on if/then statements. These statements aid in the discovery of associations between
independent data in a database, relational database, or other data repository. These rules are used to determine the
relationships between objects that are commonly used together.

Support and confidence are the two primary measures used by association rules. The method searches for similarities and rules by decomposing data into commonly used if/then patterns. Association rules are typically required to simultaneously satisfy a user-specified minimum support and a user-specified minimum confidence. To implement association rule learning, various algorithms are used.

Association Rule Mining can be described as a two-step process.

Step 1: Locate all frequently occurring itemsets

An itemset is a collection of items found in a shopping basket. It can include many products. For example,
[bread, butter, eggs] is a supermarket database itemset.

A frequent itemset is one that appears often in a database. This raises the question of how frequency is defined, which is where support comes into play. The frequency of an itemset in the dataset is called its support count.

The support count only captures the absolute frequency of an itemset. It does not consider the relative frequency, i.e., the frequency in relation to the total number of transactions. The frequency of an itemset relative to the total number of transactions is referred to as its support.

Step 2: Create strong association rules using the frequently used itemsets

Association rules are created by constructing associations from the frequent itemsets created in step 1. To find
strong associations, this employs a metric known as confidence.
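To make the two-step process concrete, here is a minimal Python sketch; the basket data, thresholds, and names below are hypothetical, not taken from the text:

from itertools import combinations

# Hypothetical transaction data (each row is one shopping basket).
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "eggs"},
    {"bread", "butter", "eggs"},
]
min_support = 3       # minimum support count (assumed)
min_confidence = 0.7  # minimum confidence (assumed)

# Step 1: count every itemset and keep the frequent ones.
counts = {}
for t in transactions:
    for k in (1, 2, 3):
        for itemset in combinations(sorted(t), k):
            counts[itemset] = counts.get(itemset, 0) + 1
frequent = {s: c for s, c in counts.items() if c >= min_support}

# Step 2: form rules X -> Y from each frequent itemset; keep the confident ones.
for itemset, count in frequent.items():
    if len(itemset) < 2:
        continue
    for i in range(1, len(itemset)):
        for antecedent in combinations(itemset, i):
            confidence = count / counts[antecedent]
            if confidence >= min_confidence:
                consequent = tuple(x for x in itemset if x not in antecedent)
                print(antecedent, "->", consequent, f"confidence = {confidence:.2f}")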

The Apriori algorithm is one of the most fundamental Association Rule Mining algorithms. It is based on the
idea that "having prior knowledge of frequent itemsets can generate strong association rules." The term Apriori
refers to prior knowledge.


Apriori discovers frequent itemsets through a process known as candidate itemset generation. This is an iterative
approach that uses k-itemsets to explore (k+1)-itemsets. The set of frequent 1-itemsets is found first, followed by
the set of frequent 2-itemsets, and so on until no more frequent k-itemsets can be found.

An important property known as the Apriori property is used to reduce the search space to improve the efficiency
of the level-wise generation of frequent itemsets. According to the Apriori Property, "all non-empty subsets of a
frequent itemset must also be frequent."

This means that if an itemset is frequent, its subsets will also be frequent. For example, if [Bread, Butter] is a frequent itemset, [Bread] and [Butter] must be frequent individually as well.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some popular applications of association rule learning:

o Market Basket Analysis: It is one of the popular examples and applications of association rule mining. This technique is
commonly used by big retailers to determine the association between items.
o Medical Diagnosis: With the help of association rules, diagnosis can be made more easily, as they help in identifying the probability of illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.
o It is also used for catalog design, loss-leader analysis, and many other applications.

Association rule learning can be divided into three types of algorithms:


1. Apriori
2. Eclat
3. F-P Growth Algorithm

How does Association Rule Learning work?


Association rule learning works on the concept of if/then statements, such as "if A, then B."

Here the IF element is called the antecedent, and the THEN statement is called the consequent. Relationships in which we can find an association between two single items are known as single cardinality. Rule creation is all about finding such associations, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:
o Support
o Confidence
o Lift

Let's understand each of them:

Support

Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For N transactions, it can be written as:

Support(X) = freq(X) / N


Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = freq(X, Y) / freq(X)

Lift

Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:

Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y))

It has three possible ranges of values:

o Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
o Lift > 1: The two itemsets are positively dependent on each other; the higher the value, the stronger the dependence.
o Lift < 1: One item is a substitute for the other, meaning one item has a negative effect on the other.

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions, and it uses a breadth-first search and a Hash Tree to count the itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be bought together. It can also be used
in the healthcare field to find drug reactions for patients.

Eclat Algorithm

The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search technique to find frequent itemsets in a transaction database and executes faster than the Apriori algorithm.

F-P Growth Algorithm

The F-P growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.


Apriori Algorithm for Mining Association


The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected.
The Apriori algorithm is made up of three major components:

 Support

 Confidence

 Lift

With the help of an example, we will explain these three concepts.

Assume we have a database of 1,000 customer transactions and want to find the Support, Confidence, and Lift
for two items, such as burgers and ketchup. One hundred transactions contain ketchup, while 150 contain a
burger. In 50 of the 150 transactions where a burger is purchased, ketchup is also included.

1. Support

Support refers to an item's default popularity and can be calculated by dividing the number of transactions containing a specific item by the total number of transactions. Assume we want to find the support for item B. This can be calculated as follows:

Support(B) = (Transactions with (B))/ (Total Transactions)

2. Confidence

If item A is purchased, confidence refers to the likelihood that item B will be purchased as well. It can be
calculated by dividing the number of transactions in which A and B are purchased together by the total
number of transactions in which A is purchased. It can be expressed mathematically as:

Confidence(A → B) = (Transactions with both A and B) / (Transactions containing A)

3. Lift

Lift(A -> B) denotes the increase in the sale ratio of B when A is sold. Lift(A -> B) is computed by dividing
Confidence(A -> B) by Support (B). It can be expressed mathematically as:

Lift(A → B) = Confidence(A → B) / Support(B)
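Plugging in the numbers above:

Support(ketchup) = 100 / 1,000 = 0.10 (10%)
Confidence(burger → ketchup) = 50 / 150 ≈ 0.33 (33.3%)
Lift(burger → ketchup) = Confidence(burger → ketchup) / Support(ketchup) = 0.33 / 0.10 ≈ 3.3

A lift well above 1 suggests that burgers and ketchup are bought together far more often than chance alone would explain.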

What is Frequent Itemset?

Frequent itemsets are those whose support is greater than the threshold value, i.e., the user-specified minimum support. This means that if {A, B} is a frequent itemset, then A and B must individually be frequent itemsets as well.

Suppose there are two transactions, A = {1, 2, 3, 4, 5} and B = {2, 3, 7}; in these two transactions, 2 and 3 are the frequent items.

Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

Step-1: Determine the support of itemsets in the transactional database, and select the minimum support
and confidence.


Step-2: Select all the itemsets in the transactions with a support value higher than the minimum or selected support value.

Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold or minimum confidence.

Step-4: Sort the rules in decreasing order of lift.
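As a sketch of how these steps might look in code (the transaction data and function names below are hypothetical; the minimum support count of 2 mirrors the example that follows):

from itertools import combinations

def apriori(transactions, min_support):
    # L1: count 1-itemsets and keep those meeting the minimum support count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: all k-item combinations of surviving items.
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Apriori property: prune any candidate with an infrequent (k-1)-subset.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions; prints each frequent itemset with its support count.
transactions = [{"A", "B"}, {"B", "D"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}]
for itemset, count in apriori(transactions, min_support=2).items():
    print(set(itemset), count)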

Apriori Algorithm Working

We will understand the apriori algorithm using an example and mathematical calculation:

Example: Suppose we have the following dataset that has various transactions, and from this dataset, we
need to find the frequent itemsets and generate the association rules using the Apriori algorithm:

Solution:

Step-1: Calculating C1 and L1:


o In the first step, we will create a table that contains support count (The frequency of each itemset individually in the
dataset) of each itemset in the given dataset. This table is called the Candidate set or C1.

o Now, we will take out all the itemsets that have a support count greater than the minimum support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except {E} have a support count greater than or equal to the minimum support, the {E} itemset is removed.


Step-2: Candidate Generation C2, and L2:


o In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the itemsets of L1 in the form of
subsets.
o After creating the subsets, we will again find the support count from the main transaction table of datasets, i.e., how many
times these pairs have occurred together in the given dataset. So, we will get the below table for C2:

o Again, we need to compare the C2 Support count with the minimum support count, and after comparing, the itemset with
less support count will be eliminated from the table C2. It will give us the below table for L2

Step-3: Candidate generation C3, and L3:


o For C3, we will repeat the same two processes, but now we will form the C3 table with candidate itemsets of three items each, and will calculate the support count from the dataset. It will give the below table:

o Now we will create the L3 table. As we can see from the above C3 table, there is only one combination of itemset that has
support count equal to the minimum support count. So, the L3 will have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, we first create a new table with the possible rules from the occurring combination {A, B, C}. For each rule X → Y, we calculate the confidence using the formula sup(X ∪ Y)/sup(X). After calculating the confidence value for all rules, we exclude the rules that have a confidence lower than the minimum threshold (50%).

Consider the below table:

Rules       Support   Confidence
A^B → C     2         sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A     2         sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B     2         sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B     2         sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B^C     2         sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → A^C     2         sup{B^(A^C)}/sup(B) = 2/7 = 0.29 = 28.57%

As the given threshold or minimum confidence is 50%, the first three rules (A^B → C, B^C → A, and A^C → B) can be considered strong association rules for the given problem.
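The same filtering can be reproduced with a short Python sketch using the support counts stated in the table above (the variable names are my own):

from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset("ABC"): 2, frozenset("AB"): 4, frozenset("BC"): 4,
    frozenset("AC"): 4, frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
}
itemset, min_confidence = frozenset("ABC"), 0.5

for i in range(1, len(itemset)):
    for ant in combinations(sorted(itemset), i):
        antecedent = frozenset(ant)
        conf = support[itemset] / support[antecedent]
        status = "strong" if conf >= min_confidence else "rejected"
        print(set(antecedent), "->", set(itemset - antecedent), f"{conf:.0%} {status}")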

Advantages of Apriori Algorithm


o It is an easy-to-understand algorithm.
o The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm


o The Apriori algorithm is slow compared to other algorithms.
o The overall performance can be reduced because it scans the database multiple times.
o The time and space complexity of the Apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width (the number of distinct items) present in the database.

Frequent Pattern Growth Algorithm


The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new association-rule mining algorithm named the Frequent Pattern Growth Algorithm was developed. It overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a Trie data structure. Consider the following data:

The above-given data is a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed:


Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating over the Frequent Pattern set and checking if the current item is contained in the transaction in question. If it is, the item is inserted into the Ordered-Item set for the current transaction. The following table is built for all the transactions:

Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized to 1.

b) Inserting the set {K, E, O, Y}:


Up to the insertion of the elements K and E, the support count of the existing nodes is simply increased by 1. On inserting O, we see that there is no direct link between E and O, so a new node for the item O is initialized with support count 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count 1 and link the new node of O to the new node of Y.


c) Inserting the set {K, E, M}:


Here simply the support count of each element is increased by 1.

d) Inserting the set {K, M, Y}:


Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.

e) Inserting the set {K, E, O}:


Here simply the support counts of the respective elements are increased. Note that the support count of the new
node of item O is increased.
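A minimal Python sketch of this insertion process, using the five Ordered-Item sets listed above (the class and function names are my own, and this covers only the tree-building stage of FP-Growth):

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_items):
    node = root
    for item in ordered_items:
        # Reuse an existing child and bump its count, else create a new node.
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)
for s in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"],
          ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, s)

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # prints K:5 at the top, matching the construction steps above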


Now, for each item, the Conditional Pattern Base is computed, which consists of the path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the below table are arranged in ascending order of their frequencies.

Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is
common in all the paths in the Conditional Pattern Base of that item and calculating its support count by summing
the support counts of all the paths in the Conditional Pattern Base.

From the Conditional Frequent Pattern tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the below table.

For each row, two types of association rules can be inferred; for example, from the first row, which contains the element Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.


Market Basket Analysis in Data Mining

Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analyzing large data sets, such as purchase history,
to reveal product groupings and products that are likely to be purchased together.

The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS) systems.
Compared to handwritten records kept by store owners, the digital records generated by POS systems
made it easier for applications to process and analyze large volumes of purchase data.

One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes transaction data
contained in a spreadsheet and performs market basket analysis. A transaction ID must relate to the items
to be analyzed. The Shopping Basket Analysis tool then creates two worksheets:

o The Shopping Basket Item Groups worksheet, which lists items that are frequently purchased
together,
o And the Shopping Basket Rules worksheet shows how items are related (For example, purchasers of
Product A are likely to buy Product B).

How does Market Basket Analysis Work?

Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {} construct. For example,
IF a customer buys bread, THEN he is likely to buy butter as well.

Association rules are usually represented as: {Bread} -> {Butter}

Some terminologies to familiarize yourself with Market Basket Analysis are:

o Antecedent: Items or 'itemsets' found within the data are antecedents. In simpler words, it's the IF component, written on the left-hand side. In the above example, bread is the antecedent.
o Consequent: A consequent is an item or set of items found in combination with the antecedent. It's the THEN component, written on the right-hand side. In the above example, butter is the consequent.

Types of Market Basket Analysis

Market Basket Analysis techniques can be categorized based on how the available data is utilized. Here are
the following types of market basket analysis in data mining, such as:

1. Descriptive market basket analysis: This type only derives insights from past data and is the most
frequently used approach. The analysis here does not make any predictions but rates the association

between products using statistical techniques. For those familiar with the basics of Data Analysis,
this type of modelling is known as unsupervised learning.
2. Predictive market basket analysis: This type uses supervised learning models like classification and
regression. It essentially aims to mimic the market to analyze what causes what to happen.
Essentially, it considers items purchased in a sequence to determine cross-selling. For example,
buying an extended warranty is more likely to follow the purchase of an iPhone. While it isn't as
widely used as a descriptive MBA, it is still a very valuable tool for marketers.
3. Differential market basket analysis: This type of analysis is beneficial for competitor analysis. It
compares purchase history between stores, between seasons, between two time periods, between
different days of the week, etc., to find interesting patterns in consumer behaviour. For example, it
can help determine why some users prefer to purchase the same product at the same price on
Amazon vs Flipkart. The answer can be that the Amazon reseller has more warehouses and can
deliver faster, or maybe something more profound like user experience.

Algorithms associated with Market Basket Analysis

In market basket analysis, association rules are used to predict the likelihood of products being purchased
together. Association rules count the frequency of items that occur together, seeking to find associations
that occur far more often than expected.

With the help of the Apriori Algorithm, we can further classify and simplify the item sets that the consumer
frequently buys. There are three components in APRIORI ALGORITHM:

o SUPPORT
o CONFIDENCE
o LIFT

For example, suppose 5,000 transactions have been made through a popular e-Commerce website, and we want to calculate the support, confidence, and lift for two products, say a pen and a notebook. Out of the 5,000 transactions, 500 contain a pen, 700 contain a notebook, and 100 contain both.

SUPPORT

It is calculated as the number of transactions containing the item divided by the total number of transactions:

Support = freq(A, B)/N

support(pen) = transactions containing pen / total transactions

i.e., support = 500/5000 = 10 percent

CONFIDENCE

This measures whether products are popular through individual sales or through combined sales. It is calculated as the number of combined transactions divided by the number of transactions for the antecedent item:

Confidence = freq(A, B)/freq(A)

Confidence = combined transactions / transactions containing pen

i.e., confidence = 100/500 = 20 percent


LIFT

Lift is calculated to know the strength of the rule, i.e., how much more often the two items are bought together than expected:

Lift = Confidence(A → B) / Support(B)

Lift = 20% / 14% ≈ 1.43 (since Support(notebook) = 700/5000 = 14 percent)

When the Lift value is below 1, the combination is not frequently bought together by consumers. Here the lift is above 1, which shows that the probability of buying both items together is higher than would be expected if the two items sold independently.
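In practice these metrics are rarely computed by hand. As a sketch, assuming the third-party mlxtend library and pandas are installed (pip install mlxtend pandas) and using the mlxtend ~0.23 interface, a small market basket analysis might look like this; the basket data is hypothetical:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical baskets, one list of items per transaction.
baskets = [["pen", "notebook"], ["pen", "notebook", "eraser"],
           ["notebook"], ["pen"], ["pen", "notebook"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Frequent itemsets by relative support, then rules filtered by confidence.
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])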

Examples of Market Basket Analysis

Here are the following examples that explore Market Basket Analysis by market segment, such as:

o Retail: The most well-known MBA case study is Amazon.com. Whenever you view a product on
Amazon, the product page automatically recommends, "Items bought together frequently." It is
perhaps the simplest and most clean example of an MBA's cross-selling techniques.
Apart from e-commerce formats, MBA is also widely applicable to the in-store retail segment. Grocery stores pay meticulous attention to product placement and shelving optimization. For example, you are almost always likely to find shampoo and conditioner placed very close to each other at the grocery store. Walmart's famous beer-and-diapers association anecdote is also an example of Market Basket Analysis.
o Telecom: With the ever-increasing competition in the telecom sector, companies are paying close attention to the services customers use. For example, telecom providers have started to bundle TV and Internet packages, apart from other discounted online services, to reduce churn.
o IBFS: Tracing credit card history is a hugely advantageous MBA opportunity for IBFS organizations.
For example, Citibank frequently employs sales personnel at large malls to lure potential customers
with attractive discounts on the go. They also associate with apps like Swiggy and Zomato to show
customers many offers they can avail of via purchasing through credit cards. IBFS organizations also
use basket analysis to determine fraudulent claims.
o Medicine: Basket analysis is used to determine comorbid conditions and symptom analysis in the
medical field. It can also help identify which genes or traits are hereditary and which are associated
with local environmental effects.

Benefits of Market Basket Analysis

The market basket analysis data mining technique has the following benefits, such as:


o Increasing market share: Once a company hits peak growth, it becomes challenging to determine
new ways of increasing market share. Market Basket Analysis can be used to put together
demographic and gentrification data to determine the location of new stores or geo-targeted ads.
o Behaviour analysis: Understanding customer behaviour patterns is a cornerstone of marketing. MBA can be used anywhere from a simple catalogue design to UI/UX.
o Optimization of in-store operations: MBA is not only helpful in determining what goes on the
shelves but also behind the store. Geographical patterns play a key role in determining the
popularity or strength of certain products, and therefore, MBA has been increasingly used to
optimize inventory for each store or warehouse.
o Campaigns and promotions: MBA is used not only to determine which products go together but also to identify which products form the keystones of a product line.
o Recommendations: OTT platforms like Netflix and Amazon Prime benefit from MBA by
understanding what kind of movies people tend to watch frequently.

Frequent Item set in Data set (Association Rule Mining)


Association Mining searches for frequent items in the data set. In frequent mining, the interesting associations and correlations between itemsets in transactional and relational databases are found. In short, frequent mining shows which items appear together in a transaction or relation.

Need of Association Mining: Frequent mining is the generation of association rules from a transactional dataset. If two items X and Y are purchased together frequently, then it is good to place them together in stores or to offer a discount on one item with the purchase of the other, which can really increase sales.

For example, it is likely to find that if a customer buys milk and bread, he/she also buys butter. So the association rule is {milk, bread} => {butter}. The seller can then suggest that a customer who buys milk and bread also buy butter.

Important Definitions:

 Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of rules. A support of 5% means that 5% of all the transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
 Confidence: A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
 Support_count(X): The number of transactions in which X appears. If X is A ∪ B, then it is the number of transactions in which A and B are both present.
 Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.


 Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset.
 K-Itemset: An itemset that contains K items is a K-itemset. An itemset is frequent if its support count is greater than or equal to the minimum support count.
 Example on finding frequent itemsets: Consider the given dataset with the given transactions.
 Let's say the minimum support count is 3.
 The relation that holds is: maximal frequent => closed => frequent.

1-frequent:
{A} = 3 // not closed due to {A, C}; not maximal
{B} = 4 // not closed due to {B, D}; not maximal
{C} = 4 // not closed due to {C, D}; not maximal
{D} = 5 // closed itemset, since no immediate superset has the same count; not maximal

2-frequent:
{A, B} = 2 // not frequent because support count < minimum support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed, but not maximal due to {B, C, D}
{C, D} = 4 // closed, but not maximal due to {B, C, D}

3-frequent:
{A, B, C} = 2 // not frequent (support count < minimum support count), ignore
{A, B, D} = 2 // not frequent (support count < minimum support count), ignore
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent

4-frequent:
{A, B, C, D} = 2 // not frequent, ignore
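These labels can be checked mechanically. The sketch below re-derives the closed and maximal status from the support counts of the frequent itemsets stated above (the helper names are my own):

# Support counts of the frequent itemsets from the example above.
support = {
    frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 4, frozenset("D"): 5,
    frozenset("AC"): 3, frozenset("AD"): 3, frozenset("BC"): 3,
    frozenset("BD"): 4, frozenset("CD"): 4,
    frozenset("ACD"): 3, frozenset("BCD"): 3,
}

def is_closed(s):
    # Closed: no frequent proper superset has the same support count.
    return all(support[t] < support[s] for t in support if s < t)

def is_maximal(s):
    # Maximal: no proper superset is frequent at all.
    return not any(s < t for t in support)

for s in sorted(support, key=len):
    print(sorted(s), support[s],
          "closed" if is_closed(s) else "", "maximal" if is_maximal(s) else "")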

Pattern Evaluation Methods in Data Mining


In data mining, pattern evaluation is the process of assessing the quality of discovered patterns. This process is
important in order to determine whether the patterns are useful and whether they can be trusted. There are a
number of different measures that can be used to evaluate patterns, and the choice of measure will depend on the
application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the target values. The
accuracy is measured on a test dataset, which is separate from the training dataset that was used to train the
model. There are a number of ways to measure accuracy, but the most common is to calculate the percentage of
correct predictions. This is known as the accuracy rate.
A model that is 100% accurate on the training data but only 50% accurate on the test data is not a good model. The
model is overfitting the training data and is not generalizable to new data. A model that is 80% accurate on the
training data and 80% accurate on the test data is a good model. The model is generalizable and can be used to
make predictions on new data.
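A minimal illustration of this train/test comparison, assuming scikit-learn is available (the dataset and model choice are arbitrary placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data, split into a training set and a held-out test set.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy :", accuracy_score(y_test, model.predict(X_test)))  # the number that matters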

2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify new data. This is
typically done by taking a set of data that has been labeled with known class labels and then using the discovered
patterns to predict the class labels of the data. The accuracy can then be computed by comparing the predicted
labels to the actual labels.


Classification accuracy is one of the most popular evaluation metrics for classification models, and it is simply the
percentage of correct predictions made by the model. Although it is a straightforward and easy-to-understand
metric, classification accuracy can be misleading in certain situations. For example, if we have a dataset with a very imbalanced class distribution, such as 100 instances of class 0 and 1,000 instances of class 1, then a model that always predicts class 1 will achieve a high classification accuracy of about 91% (1,000 out of 1,100). However, this model is clearly not very useful, since it makes no correct predictions for class 0.
3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster new data. This is
typically done by taking a set of data that has been labeled with known cluster labels and then using the discovered
patterns to predict the cluster labels of the data. The accuracy can then be computed by comparing the predicted
labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm:
 External indices: these indices compare the clusters produced by the algorithm to some known ground truth.
For example, the Rand Index or the Jaccard coefficient can be used if the ground truth is known.
 Internal indices: these indices assess the goodness of clustering without reference to any external information.
The most popular internal index is the Dunn index.
 Stability: this measures how robust the clustering is to small changes in the data. A clustering algorithm is said
to be stable if, when applied to different samples of the same data, it produces the same results.
 Efficiency: this measures how quickly the algorithm converges to the correct clustering.
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm. It can be computed by dividing the number of patterns discovered by the algorithm by the total number of possible patterns. A Coverage Pattern is a type of sequential pattern that is found by looking for items that tend to appear together in sequential order. For example, a coverage pattern might be “customers who purchase item A also tend to purchase item B within the next month.”
To evaluate a coverage pattern, analysts typically look at two things: support and confidence. Support is the percentage of transactions that contain the pattern. Confidence is the number of transactions that contain the pattern divided by the number of transactions that contain the first item (the antecedent) of the pattern.
For example, consider the following coverage pattern: “customers who purchase item A also tend to purchase item
B within the next month.” If the support for this pattern is 0.1%, that means that 0.1% of all transactions contain the
pattern. If the confidence for this pattern is 80%, that means that 80% of the transactions that contain item A also
contain item B.
Generally, a higher support and confidence value indicates a stronger pattern. However, analysts must be careful to
avoid overfitting, which is when a pattern is found that is too specific to the data and would not be generalizable to
other data sets.

5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns to see if they make
sense. In visual inspection, the data is plotted in a graphical format and the pattern is observed. This method is
used when the data is not too large and can be easily plotted. It is also used when the data is categorical in nature.
Visual inspection is a pattern evaluation method in data mining where the data is visually inspected for patterns.
This can be done by looking at a graph or plot of the data, or by looking at the raw data itself. This method is often
used to find outliers or unusual patterns.

6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is typically measured in
seconds or minutes. There are a few different ways to measure the performance of a machine learning algorithm,
but one of the most common is to simply measure the amount of time it takes to train the model and make
predictions. This is known as the running time pattern evaluation.

7. Support
The support of a pattern is the percentage of the total number of records that contain the pattern. Support pattern evaluation is a process of finding interesting and potentially useful patterns in data; its purpose is to identify patterns that may be useful for decision-making. It is typically used in data mining and machine learning applications and can be used to find a variety of interesting patterns, including association rules, sequential patterns, and co-occurrence patterns. Support pattern evaluation is an important part of data mining and machine learning and can help in making better decisions.

8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct. Confidence pattern evaluation is a method of data mining used to assess the quality of patterns found in data. This evaluation is typically performed by calculating the percentage of times a pattern is found in a data set and comparing this percentage to the percentage of times the pattern is expected to be found based on the overall distribution of the data. If the percentage of times a pattern is found is significantly higher than the expected percentage, then the pattern is said to be a strong confidence pattern.

9. Lift
The lift of a pattern is the ratio of the number of times that the pattern is found to be correct to the number of times
that the pattern is expected to be correct. Lift Pattern evaluation is a data mining technique that can be used to
evaluate the performance of a predictive model. The lift pattern is a graphical representation of the model’s
performance and can be used to identify potential problems with the model.
The lift pattern can be a useful tool for identifying potential problems with a predictive model. It is important to
remember, however, that the lift pattern is only a graphical representation of the model’s performance, and should
be interpreted in conjunction with other evaluation measures.

10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct. Prediction Pattern evaluation is a data mining technique used to assess the accuracy of predictive models. It is used to determine how well a model can predict future outcomes based on past data. Prediction Pattern evaluation can be used to compare different models, or to evaluate the performance of a single model.

Prediction Pattern evaluation involves splitting the data set into two parts: a training set and a test set. The training set is used to train the model, while the test set is used to assess the accuracy of the model. To evaluate the accuracy of the model, the prediction error is calculated. Prediction Pattern evaluation can be used to improve the accuracy of predictive models. By using a test set, predictive models can be fine-tuned to better fit the data. This can be done by changing the model parameters or by adding new features to the data set.

11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a variety of sources. This
method can be used to identify patterns and trends in the data, and to evaluate the accuracy of data. Precision
Pattern Evaluation can be used to identify errors in the data, and to determine the cause of the errors. This method
can also be used to determine the impact of the errors on the overall accuracy of the data.
Precision Pattern Evaluation is a valuable tool for data mining and data analysis. This method can be used to
improve the accuracy of data, and to identify patterns and trends in the data.

12. Cross-Validation
This method involves partitioning the data into two sets, training the model on one set, and then testing it on the
other. This can be done multiple times, with different partitions, to get a more reliable estimate of the model’s
performance. Cross-validation is a model validation technique for assessing how the results of a data mining
analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and
one wants to estimate how accurately a predictive model will perform in practice. Cross-validation is also referred to as out-of-sample testing.
Cross-validation is a pattern evaluation method that is used to assess the accuracy of a model. It does this by
splitting the data into a training set and a test set. The model is then fit on the training set and the accuracy is
measured on the test set. This process is then repeated a number of times, with the accuracy being averaged over
all the iterations.
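A minimal cross-validation sketch, assuming scikit-learn is available (the dataset and model are arbitrary placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Five folds: fit on four, test on the fifth, rotate, then average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, "mean accuracy:", scores.mean())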

13. Test Set


This method involves partitioning the data into two sets, training the model on the training set, and then testing it on the held-out test set. This is more reliable than cross-validation but can be more expensive if the data set is large. There are a number of ways to evaluate the performance of a model on a test set. The most common is to simply compare the predicted labels to the true labels and compute the percentage of instances that are correctly classified; this is called accuracy. Another popular metric is precision, which is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. These metrics can be combined into the F1 score, which is the harmonic mean of precision and recall.
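The metrics above can be computed as follows, assuming scikit-learn is available (the label vectors are hypothetical):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of the two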

14. Bootstrapping
This method involves randomly sampling the data with replacement, training the model on the sampled data, and then testing it on the original data. This can be used to get a distribution of the model's performance, which can be useful for understanding how robust the model is. Bootstrapping is a resampling technique used to estimate the accuracy of a model. It involves randomly selecting a sample of data from the original dataset and then training the model on this sample. The model is then tested on another sample of data that is not used in training. This process is repeated a number of times, and the average accuracy of the model is calculated.
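A sketch of this resampling loop, assuming scikit-learn and NumPy are available; this is the out-of-bag variant, in which each model is tested on the rows left out of its bootstrap sample:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
scores = []
for _ in range(100):
    # Sample row indices with replacement; evaluate on the rows left out.
    idx = resample(np.arange(len(X)), random_state=rng)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    model = DecisionTreeClassifier().fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], model.predict(X[oob])))
print("mean bootstrap accuracy:", np.mean(scores))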
