Intro Data Mining
Introduction
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
◦ Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
Mining Large Data Sets - Motivation
[Chart: data growth vs. number of analysts, 1995-1999. The amount of data collected (scale 0 to 3,500,000) grows far faster than the number of analysts available to examine it.]
What is Data Mining?
Many Definitions
◦ Non-trivial extraction of implicit,
previously unknown and potentially
useful information from data
◦ Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
[Diagram: DM sits at the convergence of a growing base of data, increased computing power, statistical and learning algorithms, and improved data collection and management.]
Applications
Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce & Decision Support
Web and text mining
Market Analysis and Management (1)
Where are the data sources for analysis?
◦ Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
Target marketing
◦ Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns over
time
◦ Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Market Analysis and Management (2)
Customer profiling
◦ data mining can tell you what types of customers buy
what products (using techniques such as clustering or
classification)
Identifying customer requirements
◦ identifying the best products for different customers
◦ use prediction to find what factors will attract new
customers
Provides summary information
◦ various multidimensional summary reports
◦ statistical summary information
Fraud Detection and Management (1)
Applications
◦ widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
◦ use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
◦ auto insurance: detect a group of people who stage
accidents to collect on insurance
◦ money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
◦ medical insurance: detect "professional" patients and
rings of doctors and rings of references
Fraud Detection and Management (2)
Detecting telephone fraud
◦ Telephone call model: destination of the call,
duration, time of day or week. Analyze
patterns that deviate from an expected norm.
◦ British Telecom identified discrete groups of
callers with frequent intra-group calls,
especially mobile phones, and broke a
multimillion-dollar fraud. (source: Gartner, 2006)
Retail
◦ Analysts estimate that 38% of retail shrink is
due to dishonest employees. (Business Today, 2006)
Other Applications
Sports
◦ IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain a
competitive advantage for the New York Knicks and
Miami Heat
Astronomy
◦ JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover
customer preferences and behavior, analyze the
effectiveness of Web marketing, improve Web site
organization, etc.
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Object-oriented databases
Time-series data
Text databases and multimedia databases
Heterogeneous and legacy databases
Data Mining Techniques
Association Rules & Sequential Patterns
Classification
Clustering
Similar Images
Text/Web Mining
Outlier analysis (using statistics)
Examples of Data Mining
Consider this complex, tricky query
A sales executive wishes to see all the sales for the past three
years where profitability has been greater than xx percent.
He wishes to see it by month. And where the percentages
have been greater than yy percent, he wants to see whether
the sales team has been in place during this period or
whether there has been personnel turnover; he is looking
for territorial versus personnel factors in sales success. He
also wishes to see trends in profitability, so where all sales
by year have steadily increased by at least zz percent two
years in a row, he wishes to see the top five products ranked
by profitability.
This query requires
Sums
Percentages
Grouping
Trends
Time-based analysis
Comparisons
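Such a request is awkward as a single SQL statement but straightforward as an analysis script. Below is a minimal pandas sketch under assumed names: the sales table, its columns (date, product, revenue, profit_pct), and the threshold values standing in for "xx" and "zz" are all hypothetical placeholders for the query above.

```python
import pandas as pd

# Hypothetical schema; column names and thresholds are illustrative only.
sales = pd.read_csv("sales.csv", parse_dates=["date"])  # date, product, revenue, profit_pct
xx, zz = 15.0, 5.0  # placeholder thresholds for the "xx" and "zz" percentages

# Last three years, profitability above xx percent, grouped by month.
recent = sales[sales["date"] >= sales["date"].max() - pd.DateOffset(years=3)]
profitable = recent[recent["profit_pct"] > xx]
by_month = profitable.groupby(profitable["date"].dt.to_period("M"))["revenue"].sum()

# Trend check: yearly sales up by at least zz percent two years in a row,
# then the top five products ranked by profitability.
yearly = recent.groupby(recent["date"].dt.year)["revenue"].sum()
growth = yearly.pct_change() * 100
if (growth.tail(2) >= zz).all():
    top5 = (recent.groupby("product")["profit_pct"].mean()
                  .sort_values(ascending=False).head(5))
    print(top5)
```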
Data mining - Users
Executives - need top-level insights and
spend far less time with computers than
the other groups.
Analysts may be financial analysts,
statisticians, consultants, or database
designers.
End users are sales people, scientists,
market researchers, engineers, physicians,
etc.
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
◦ Clementine,
◦ IBM’s Intelligent Miner,
◦ SGI’s MineSet,
◦ SAS’s Enterprise Miner.
All offer pretty much the same set of tools
Many embedded products:
◦ fraud detection
◦ electronic commerce applications
◦ health care
◦ customer relationship management: Epiphany
Vertical integration
Mining on the web
Web log analysis for site design:
◦ what are popular pages,
◦ what links are hard to find.
Electronic stores sales enhancements:
◦ recommendations, advertisement:
◦ Collaborative filtering: Net perception,
Wisewire
◦ Inventory control: what was a shopper looking
for and could not find?
Data Mining
Through a variety of techniques, data mining identifies nuggets of
information in bodies of data.
Data mining extracts information in such a way that it can be used in
areas such as decision support, prediction, forecasts, and
estimation. Data is often voluminous but of low value and with little
direct usefulness in its raw form. It is the hidden information in the
data that has value.
In data mining, success comes from combining your (or your expert’s)
knowledge of the data with advanced, active analysis techniques in
which the computer identifies the underlying relationships and
features in the data.
The process of data mining generates models from historical data that
are later used for predictions, pattern detection, and more. The
technique for building these models is called machine learning, or
modeling.
Data Mining Tasks
Prediction Methods
◦ Use some variables to predict unknown or future
values of other variables.
Description Methods
◦ Find human-interpretable patterns that describe
the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Modeling Techniques
Predictive modeling methods include decision trees, neural
networks, and statistical models.
Clustering models focus on identifying groups of similar records
and labeling the records according to the group to which they
belong. Clustering methods include Kohonen, k-means, and
TwoStep.
Association rules associate a particular conclusion (such as the
purchase of a particular product) with a set of conditions (the
purchase of several other products).
Screening models can be used to screen data to locate fields
and records that are most likely to be of interest in modeling
and identify outliers that may not fit known patterns.
Available methods include feature selection and anomaly
detection.
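As a concrete illustration of the clustering family mentioned above, here is a minimal k-means sketch using scikit-learn; the two-feature customer array is invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented toy data: each row is a customer (annual spend, visits per month).
customers = np.array([[120, 2], [150, 3], [900, 10],
                      [950, 12], [400, 5], [430, 6]])

# Group customers into 3 segments and label each record by its segment,
# as the clustering models described above do.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment label for each customer
print(kmeans.cluster_centers_)  # the "profile" of each segment
```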
Typical Applications
Typical applications include the following:
Direct mail. Determine which demographic groups have
the highest response rate. Use this information to
maximize the response to future mailings.
Credit scoring. Use an individual’s credit history to make
credit decisions.
Human resources. Understand past hiring practices and
create decision rules to streamline the hiring process.
Medical research. Create decision rules that suggest
appropriate procedures based on medical evidence.
The Data Mining Process
Data preparation. After cataloging your data
resources, you will need to prepare your data for
mining. Preparation includes selecting, cleaning,
constructing, integrating, and formatting data.
Modeling.
Sophisticated analysis methods are used to
extract information from the data. This phase
involves selecting modeling techniques,
generating test designs, and building and
assessing models.
Evaluation.
Once you have chosen your models, you are
ready to evaluate how the data mining results can
help you to achieve your business objectives.
Elements of this phase include evaluating results,
reviewing the data mining process, and determining
the next steps.
Deployment.
This phase focuses on integrating your new
knowledge into your everyday business processes to
solve your original business problem.
It includes plan deployment, monitoring and
maintenance, producing a final report, and
reviewing the project.
Data Mining
Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Examples of rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Itemset
◦ A collection of one or more items
◦ k-itemset: an itemset that contains k items
Support count (σ)
◦ Frequency of occurrence of an itemset
◦ E.g. σ({Milk, Bread, Diaper}) = 2
Support
◦ Fraction of transactions that
contain an itemset
◦ E.g. s({Milk, Bread, Diaper}) =
2/5
Frequent Itemset
◦ An itemset whose support is
greater than or equal to a
minsup threshold
Definition: Association Rule
Association Rule
◦ An implication expression of the form X → Y, where X and Y are itemsets
◦ Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
◦ Support (s): fraction of transactions that contain both X and Y
◦ Confidence (c): how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
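These two formulas are easy to check in code. A minimal sketch, using the five-transaction table above and only the standard library (function names are mine):

```python
# Transactions from the example table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)        # support of the rule
c = sigma(X | Y, transactions) / sigma(X, transactions)   # confidence
print(f"s = {s:.1f}, c = {c:.2f}")  # s = 0.4, c = 0.67
```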
Association Rule Mining Task
Given a set of transactions T, the goal of
association rule mining is to find all rules
having
◦ support ≥ minsup threshold
◦ confidence ≥ minconf threshold
Brute-force approach:
◦ List all possible association rules
◦ Compute the support and confidence for each
rule
◦ Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive! (For d items there are 3^d − 2^(d+1) + 1 possible rules; even d = 6 gives 602.)
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
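The six rules above can be generated mechanically by enumerating every binary partition of the frequent itemset {Milk, Diaper, Beer}. A small sketch over the same five transactions (helper names are mine):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    return sum(1 for t in db if itemset <= t)

itemset = {"Milk", "Diaper", "Beer"}
n = len(transactions)
# Every non-empty proper subset X yields one rule X -> itemset \ X.
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        X = set(lhs)
        Y = itemset - X
        s = sigma(itemset, transactions) / n                   # same for all six rules
        c = sigma(itemset, transactions) / sigma(X, transactions)
        print(f"{X} -> {Y} (s={s:.1f}, c={c:.2f})")
```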
Mining Association Rules
Two-step approach:
1.Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2.Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
Brute-force approach:
◦ Each itemset in the lattice is a candidate frequent
itemset
◦ Count the support of each candidate by scanning the
database
◦ Transactions:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
◦ Match each transaction against every candidate
◦ Complexity ~ O(NMw): N = number of transactions, M = number of candidate itemsets, w = maximum transaction width. Expensive, since M = 2^d − 1 for d items!
A Naïve Algorithm
Transaction ID Items
100 Bread, Cheese
200 Bread, Cheese, Juice
300 Bread, Milk
400 Cheese, Juice, Milk
◦ Let k=1
◦ Generate frequent itemsets of length 1
◦ Repeat until no new frequent itemsets are
identified
Generate length (k+1) candidate itemsets from
length k frequent itemsets
Prune candidate itemsets containing subsets of
length k that are infrequent
Count the support of each candidate by
scanning the DB
Eliminate candidates that are infrequent,
leaving only those that are frequent
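These steps are the level-wise strategy behind Apriori. A compact, runnable sketch in plain Python (function and variable names are mine), applied to the four-transaction table above; with a minimum support count of 2 it yields the four frequent items plus {Bread, Cheese} and {Cheese, Juice}:

```python
from itertools import combinations

def apriori_frequent_itemsets(db, minsup_count):
    """Level-wise generation of all frequent itemsets (Apriori)."""
    db = [frozenset(t) for t in db]
    items = sorted({i for t in db for i in t})
    # L1: frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in db) >= minsup_count}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = sum(1 for t in db if c <= t)
        # Generate (k+1)-candidates by joining frequent k-itemsets...
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # ...prune candidates that have an infrequent k-subset...
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # ...and keep only candidates meeting minsup after a DB scan.
        current = {c for c in candidates
                   if sum(1 for t in db if c <= t) >= minsup_count}
        k += 1
    return frequent

db = [{"Bread", "Cheese"},
      {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"},
      {"Cheese", "Juice", "Milk"}]
result = apriori_frequent_itemsets(db, 2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), tuple(sorted(kv[0])))):
    print(sorted(itemset), count)
```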
Reducing Number of Comparisons
Candidate counting:
◦ Scan the database of transactions to determine
the support of each candidate itemset
◦ To reduce the number of comparisons, store the
candidates in a hash structure
Instead of matching each transaction against
every candidate, match it against candidates
contained in the hashed buckets
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
[Diagram: the N transactions are matched only against the candidate k-itemsets stored in the hashed buckets.]
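A simplified sketch of the bucket idea, using a plain Python dict as the hash structure (a real implementation uses a hash tree; the candidate set and bucket count here are illustrative):

```python
from itertools import combinations
from collections import defaultdict

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
candidates = [frozenset(c) for c in
              [{"Bread", "Milk"}, {"Milk", "Beer"}, {"Diaper", "Beer"}]]
k = 2
NUM_BUCKETS = 7  # illustrative bucket count

# Hash each candidate into a bucket instead of keeping one flat list.
buckets = defaultdict(list)
for c in candidates:
    buckets[hash(c) % NUM_BUCKETS].append(c)

# Each transaction is compared only against candidates in the buckets
# hit by its own k-subsets, not against every candidate.
support = defaultdict(int)
for t in transactions:
    for subset in combinations(sorted(t), k):
        fs = frozenset(subset)
        for c in buckets[hash(fs) % NUM_BUCKETS]:
            if c == fs:
                support[c] += 1

print(dict(support))  # {Bread, Milk}: 3, {Milk, Beer}: 2, {Diaper, Beer}: 3
```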
APRIORI
Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Cheese, Juice, Milk
500 Cheese, Juice, Milk
Minimum support: 50% (an itemset must appear in at least 3 of the 5 transactions)
Frequent items L1
Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Candidate item pairs C2
Itemsets Frequency
(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2
Frequent itemsets L2
Itemset Frequency
Bread, Juice 3
Cheese, Juice 3
The APRIORI Algorithm
Bread → Juice with confidence of 3/4 = 75%
Juice → Bread with confidence of 3/4 = 75%
Cheese → Juice with confidence of 3/3 = 100%
Juice → Cheese with confidence of 3/4 = 75%
A larger APRIORI Example
Item Number Item Name
1 Biscuits
2 Bread
3 Cereal
4 Cheese
5 Chocolate
6 Coffee
7 Donuts
8 Eggs
9 Juice
10 Milk
11 Newspaper
12 Pastry
13 Rolls
14 Sugar
15 Tea
16 Yogurt
TID Items
1 Biscuits, Bread, Cheese, Coffee, Yogurt
2 Bread, Cereal, Cheese, Coffee
3 Cheese, Chocolate, Donuts, Juice, Milk
4 Bread, Cheese, Coffee, Cereal, Juice
5 Bread, Cereal, Chocolate, Donuts, Juice
6 Milk, Tea
7 Biscuits, Bread, Cheese, Coffee, Milk
8 Eggs, Milk, Tea
9 Bread, Cereal, Cheese, Chocolate, Coffee
10 Bread, Cereal, Chocolate, Donuts, Juice
11 Bread, Cheese, Juice
12 Bread, Cheese, Coffee, Donuts, Juice
13 Biscuits, Bread, Cereal
14 Cereal, Cheese, Chocolate, Donuts, Juice
15 Chocolate, Coffee
16 Donuts
17 Donuts, Eggs, Juice
18 Biscuits, Bread, Cheese, Coffee
19 Bread, Cereal, Chocolate, Donuts, Juice
20 Cheese, Chocolate, Donuts, Juice
21 Milk, Tea, Yogurt
22 Bread, Cereal, Cheese, Coffee
23 Chocolate, Donuts, Juice, Milk, Newspaper
24 Newspaper, Pastry, Rolls
25 Rolls, Sugar, Tea
Minimum support: 25% (an itemset must appear in at least 7 of the 25 transactions)
Frequency counts for candidate 2-itemsets (partial list)
{Cheese, Coffee} 9
{Cheese, Donuts} 3
{Cheese, Juice} 4
{Chocolate, Coffee} 1
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Coffee, Donuts} 1
{Coffee, Juice} 2
{Donuts, Juice} 9
The frequent 2-itemsets or L2
{Bread, Cereal} 9
{Bread, Cheese} 8
{Bread, Coffee} 8
{Cheese, Coffee} 9
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Donuts, Juice} 9
Candidate 3-itemsets or C3
{Bread, Cereal, Cheese} 4
{Bread, Cereal, Coffee} 4
{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7
Frequent 3-itemsets or L3
{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7
Confidence of association rules from {Chocolate, Donuts, Juice}
Rule                           Support of rule   Frequency of LHS   Confidence
Chocolate → Donuts, Juice      7                 9                  0.78
Donuts → Chocolate, Juice      7                 10                 0.70
Juice → Chocolate, Donuts      7                 11                 0.64
Donuts, Juice → Chocolate      7                 9                  0.78
Chocolate, Juice → Donuts      7                 7                  1.0
Chocolate, Donuts → Juice      7                 7                  1.0
Confidence of association rules from {Bread, Cheese, Coffee}
Rule                           Support of rule   Frequency of LHS   Confidence
Bread → Cheese, Coffee         8                 13                 0.61
Cheese → Bread, Coffee         8                 11                 0.72
Coffee → Bread, Cheese         8                 9                  0.89
Cheese, Coffee → Bread         8                 9                  0.89
Bread, Coffee → Cheese         8                 8                  1.0
Bread, Cheese → Coffee         8                 8                  1.0
All association rules
Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread
Closed and Maximal Itemsets
Closed itemset – a frequent itemset X such
that there exists no superset of X with the
same support count as X.
Maximal itemset – a frequent itemset Y is
maximal if it is not a proper subset of any
other frequent itemset.
A maximal itemset is a closed itemset, but a
closed itemset is not necessarily a
maximal itemset.
Maximal vs Closed Itemsets
[Venn diagram: maximal frequent itemsets ⊂ closed frequent itemsets ⊂ all frequent itemsets.]
A transaction database to illustrate closed and maximal itemsets
Transaction ID Items
100 Bread, Cheese, Juice
200 Bread, Cheese, Juice, Milk
300 Cheese, Juice, Egg
400 Bread, Juice, Milk, Egg
500 Milk, Egg
Frequent itemsets for the database
Itemset Support Closed? Maximal? Both?
{Bread} 3 No No No
{Cheese} 3 No No No
{Juice} 4 Yes No No
{Milk} 3 Yes No No
{Egg} 3 Yes No No
{Bread, Cheese} 2 No No No
{Bread, Juice} 3 Yes No No
{Bread, Milk} 2 No No No
{Cheese, Juice} 3 Yes No No
{Juice, Milk} 2 No No No
{Juice, Egg} 2 Yes Yes Yes
{Milk, Egg} 2 Yes Yes Yes
{Bread, Cheese, Juice} 2 Yes Yes Yes
{Bread, Juice, Milk} 2 Yes Yes Yes
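The Closed? and Maximal? columns can be reproduced directly from the definitions. A small sketch over the five-transaction database above (a minimum support count of 2 is implied by the table; names are mine):

```python
from itertools import combinations

db = [{"Bread", "Cheese", "Juice"},
      {"Bread", "Cheese", "Juice", "Milk"},
      {"Cheese", "Juice", "Egg"},
      {"Bread", "Juice", "Milk", "Egg"},
      {"Milk", "Egg"}]
MINSUP = 2

def support(itemset):
    return sum(1 for t in db if itemset <= t)

# Enumerate all frequent itemsets.
items = sorted({i for t in db for i in t})
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= MINSUP}

for x, sup in sorted(frequent.items(),
                     key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [y for y in frequent if x < y]
    closed = all(frequent[y] < sup for y in supersets)  # no superset with equal support
    maximal = not supersets                             # no frequent superset at all
    print(sorted(x), sup,
          "closed" if closed else "-",
          "maximal" if maximal else "-")
```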
Association models
associate a particular conclusion
(such as a decision to buy something)
with a set of conditions.
The Generalized Rule Induction (GRI)
node discovers association rules in
the data. For example, customers
who purchase razors and aftershave
lotion are also likely to purchase
shaving cream. GRI extracts rules
with the highest information content
based on an index that takes both
the generality (support) and
accuracy (confidence) of rules into
account. GRI can handle numeric
and categorical inputs, but the target
must be categorical.
Association models
The Apriori node extracts a set of
rules from the data, pulling out the
rules with the highest information
content.
Apriori offers five different methods of
selecting rules and uses a
sophisticated indexing scheme to
process large datasets efficiently.
For large problems, Apriori is
generally faster to train than GRI; it
has no arbitrary limit on the
number of rules that can be
retained, and it can handle rules
with up to 32 preconditions.
Apriori requires that input and output
fields all be categorical but delivers
better performance because it is
optimized for this type of data.
At the end of the processing, a table of the best rules is presented. This set of
association rules cannot be used directly to make predictions in the way that a
standard model (such as a decision tree or a neural network) can, because of the
many different possible conclusions for the rules. Another level of transformation
is required to turn the association rules into a classification ruleset; hence, the
rules produced by association algorithms are known as unrefined models. Although
the user can browse these unrefined models, they cannot be used explicitly as
classification models unless the user tells the system to generate a classification
model from the unrefined model. This is done from the browser through a Generate
menu option.
Association Rule Mining
X and Y appear in only 10% of the transactions,
but whenever X appears there is an 80% chance
that Y also appears.
The 10% presence is called support (or prevalence).
The 80% chance is called confidence (or predictability).
High level of support – the rule is frequent
enough for the business to be interested in
it.
High level of confidence – the rule is true
often enough to justify a decision based on
it.
Association Rule Mining
Total number of transactions = N
Support(X) = (number of times X appears) / N
= P(X)
Support(XY) = (number of times X and Y appear together) / N
= P(X∩Y)
Confidence of (X → Y)
= Support(XY) / Support(X)
= P(X∩Y) / P(X)
= P(Y∣X)
P(Y∣X) is the probability of Y once X has taken place,
also called the conditional probability of Y given X.
Association Rule Mining
Lift is used to measure the strength of the association
between items that are purchased together:
how much more likely item Y is to be purchased when the
customer has bought item X (an item identified as having
an association with Y), compared to the likelihood of Y
being purchased without the other item being purchased.
Lift(X → Y) = P(Y∣X) / P(Y) = Confidence(X → Y) / Support(Y)
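Putting the three measures together, a minimal sketch on the earlier five-transaction example (standard library only; names are mine):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def p(itemset):
    """Empirical probability = support as a fraction of transactions."""
    return sum(1 for t in transactions if itemset <= t) / N

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = p(X | Y)              # P(X ∩ Y) = 0.4
confidence = p(X | Y) / p(X)    # P(Y | X) = 0.67
lift = confidence / p(Y)        # P(Y | X) / P(Y) = 1.11
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# lift > 1 means buying X makes Y more likely than its baseline frequency.
```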