Lect 2

Data Mining:
Introduction (2)
Chapter 1
Index
1. What is Data Mining?
2. Data Mining Functionalities
   1. Characterization and Discrimination
   2. Mining Frequent Patterns
   3. Classification and Prediction
   4. Cluster Analysis
   5. Outlier Analysis
   6. Evolution Analysis
3. Are all Patterns Interesting?
4. Major Issues in Data Mining

1. What is Data Mining?
Data mining is the process of
discovering interesting patterns (or
knowledge) from large amounts of
data.
The data sources can include
databases, data warehouses, the Web,
other information repositories, or data
that are streamed into the system
dynamically.
What is Data Mining: Architecture of data mining system
- User interface: Communicates between users and the data mining system. Visualizes results or performs exploration on data and schemas.
- Pattern evaluation: Tests for the interestingness of a pattern.
- Data mining engine: Performs functionalities like characterization, association, classification, prediction, etc.
- Database or data warehouse server: Is responsible for fetching relevant data based on the user request.
- Database, data warehouse, or other repository: This is usually the source of data. The data may require cleaning and integration.
- Knowledge base: This is the domain information for what we are mining, such as concept hierarchies, used to organize attributes onto various levels of abstraction. It also contains user beliefs, which can be used to assess the interestingness of a pattern, or thresholds.

2. Data Mining Functionalities
Data Mining functionalities are used to
specify the kind of patterns to be found in
data mining tasks.
Data Mining tasks can be classified into two
categories:
- Descriptive: characterize the general
properties of data in the database
- Predictive: perform inference on data
to make predictions
Data Mining Functionalities
There are a number of data mining
functionalities:
1. Characterization and Discrimination.
2. Mining of frequent patterns, Association
and Correlation.
3. Classification and Regression.
4. Clustering analysis.
5. Outlier analysis
6. Evolution Analysis
1 Data Mining Functionalities: Characterization and Discrimination
Data can be associated with classes or
concepts that can be described in
summarized, concise, and yet precise, terms.
Such descriptions of a concept or class are
called class/concept descriptions.
These descriptions can be derived via
- Data Characterization
- Data Discrimination
1.1 Characterization
• Data characterization is a summarization of the general characteristics or features of a target class of data.
• The data corresponding to the user-specified class are typically collected by a query.
• Several methods can be used for data summarization, such as statistical methods, the data cube based approach, and the attribute-oriented induction technique, which works without step-by-step user interaction.
1.1 Characterization (cont.)
Example: To study the characteristics of software products whose sales increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
• The output of data characterization can be presented in various forms:
- Pie charts
- Bar charts
- Multidimensional data cubes
- Multidimensional tables, etc.
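The characterization example above can be sketched in Python with the standard-library sqlite3 module. This is only an illustration under stated assumptions: the table name sales, its columns, the product names, and the figures are all hypothetical, and "increased by 10%" is read here as "increased by at least 10%". The query collects the target class, and a count per category stands in for the summarization step.

```python
import sqlite3

# Hypothetical sales table: one row per software product (invented data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, category TEXT, "
    "sales_last_year REAL, sales_this_year REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("EditPro", "office", 100.0, 125.0),
        ("CalcMax", "office", 200.0, 230.0),
        ("GameX", "games", 150.0, 140.0),
        ("AntiVir", "security", 80.0, 96.0),
    ],
)

# Step 1: collect the target class with a query
# (assumption: products whose sales grew by at least 10%).
target_rows = conn.execute(
    "SELECT product, category FROM sales "
    "WHERE sales_this_year >= 1.10 * sales_last_year "
    "ORDER BY product"
).fetchall()

# Step 2: summarize the target class, here as a product count per category.
summary = {}
for _product, category in target_rows:
    summary[category] = summary.get(category, 0) + 1

print(target_rows)
print(summary)
```

The summary dictionary is the kind of result that could then be shown as a pie chart or bar chart.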
1.2 Discrimination
• Data discrimination is a comparison
of the general features of the target
class data objects against the
general features of objects from one
or multiple contrasting classes with
respect to customers that share
specified generalized feature(s).
• Discrimination descriptions
expressed in the form of rules are
referred to as discriminant rules.
1.2 Data Mining Functionalities: Discrimination Example
Example: compare the change in sales of software products for customers with a given generalized feature: 40% of “Youth” have sales that increased by more than 10% from last year; 10% of “Youth” have sales that decreased by at least 30% during the same period; the remaining 50% of “Youth” had a change in sales that fell in between. “Youth” describes the generalized tuple, while an increase in sales by > 10% is the target class. The other two amounts of change in sales are the contrasting classes.
The forms of output presentation are similar to those for characteristic descriptions, although discrimination descriptions should include comparative measures that help to distinguish between the target and contrasting classes.
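As a rough sketch of how such comparative measures could be computed, the Python snippet below assigns each record to the target class or one of the two contrasting classes and reports the share of each class within the “Youth” group. The records are invented so that the shares match the 40% / 10% / 50% split of the example; the function name change_class and the class labels are hypothetical.

```python
from collections import Counter

def change_class(pct_change):
    """Map a change in sales (in percent) to one of the three classes."""
    if pct_change > 10:
        return "increase > 10%"     # target class
    if pct_change <= -30:
        return "decrease >= 30%"    # contrasting class
    return "in between"             # contrasting class

# Hypothetical (age_group, percent change in sales) records.
records = [
    ("Youth", 15), ("Youth", 22), ("Youth", 12), ("Youth", 40),
    ("Youth", -35),
    ("Youth", 5), ("Youth", -10), ("Youth", 0), ("Youth", 8), ("Youth", -20),
]

# Class of every record that belongs to the generalized tuple "Youth".
youth = [change_class(change) for group, change in records if group == "Youth"]

# Comparative measure: percentage of "Youth" records in each class.
counts = Counter(youth)
shares = {cls: 100 * n / len(youth) for cls, n in counts.items()}
print(shares)
```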
2 Data Mining Functionalities: Mining Frequent Patterns
Frequent patterns are the patterns
that occur frequently in the data.
Patterns can include itemsets,
sequences and subsequences.
A frequent itemset refers to a set of
items that often appear together in a
transactional data set.
ex: bread and milk
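A minimal Python sketch of finding frequent itemsets of size two is shown below. The transactions and the threshold min_support_count are made up for illustration; a real miner would handle itemsets of any size and far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data set: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
]
min_support_count = 3  # assumed threshold: a pair must occur in 3 transactions

# Count in how many transactions each pair of items appears together.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Keep only the pairs that occur often enough to be called frequent.
frequent_pairs = {
    pair: count
    for pair, count in pair_counts.items()
    if count >= min_support_count
}
print(frequent_pairs)
```

With this data only bread and milk appear together often enough to count as a frequent itemset.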
2.1 Data Mining Functionalities: Mining Frequent Patterns Example
Association Rules
Single-Dimension Association Rule
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
If a customer buys a computer, there is a 50% chance that he will buy software as well; 1% of all the transactions under analysis show that computer and software are purchased together.
Multi-Dimension Association Rule
age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
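The threshold test described above can be sketched in Python as follows. The transactions and the two threshold values are hypothetical, and the helper names support and confidence are introduced only for this illustration. Note that confidence divides by the support of the antecedent, so it fails if the antecedent never occurs.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions (invented for illustration).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"paper"},
]

# Rule under test: buys computer => buys software.
s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})

# Assumed thresholds; a rule failing either one is discarded.
min_support, min_confidence = 0.2, 0.5
is_interesting = s >= min_support and c >= min_confidence
print(s, c, is_interesting)
```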
2.2 Association (correlation and causality)
• age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
• contains(T, “computer”) => contains(T, “software”) [1%, 75%]
• Confidence 60% in the first rule means that if a customer is aged 20..29 with an income of 20..29K, there is a 60% chance that she will buy a PC.
• Support 2% in the first rule means that 2% of all customers under analysis are aged 20..29, have an income of 20..29K, and have bought a PC.
Multi-dimensional vs. single-dimensional association
• An association rule that contains a single predicate is referred to as a single-dimensional association rule.
• If it contains multiple predicates, it is known as a multi-dimensional association rule.
Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived based on the analysis of a set of training data and is used to predict the class label of objects for which the class label is unknown.
Representation of the derived model:
- IF-THEN rules
- Decision tree
- Neural network
Classification and Prediction
Prediction models continuous-valued functions, i.e., it is used to predict missing or unavailable numeric data values rather than class labels.
The term prediction can refer to both numeric prediction and class label prediction.
Regression analysis is a statistical method used for numeric prediction.
Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process. Such attributes will be selected for the classification and regression process. Other attributes, which are irrelevant, can then be excluded from consideration.
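To make numeric prediction concrete, here is a small sketch of simple linear regression (ordinary least squares with one input attribute) in plain Python. The data points are invented and lie exactly on the line y = 2x + 1; fit_line is a hypothetical helper name, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data with a known numeric target.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# Predict the missing numeric value for a new object with x = 5.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```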
3 Classification and Prediction
• Classification is the process of finding a model or function that describes and distinguishes data classes.
• E.g., classify countries based on climate, or classify cars based on gas mileage
• The model is derived based on the analysis of a set of training data.
• The model is used to predict the class label of objects for which the class label is not known.
• Presentation: decision tree, classification rule, neural network
• Prediction: predict some unknown or missing numerical values
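The train-then-predict cycle can be illustrated with a one-level decision tree (a decision stump) for the cars example. This is a deliberately tiny sketch, not a full decision tree learner: the mileage values, the class labels "economy" and "guzzler", and the function names are all hypothetical.

```python
def train_stump(training_data):
    """Pick the mileage threshold that misclassifies the fewest training cars.

    The resulting model is the rule:
    IF mpg >= threshold THEN "economy" ELSE "guzzler".
    """
    values = sorted(mpg for mpg, _ in training_data)
    # Candidate thresholds: midpoints between consecutive mileage values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            ("economy" if mpg >= threshold else "guzzler") != label
            for mpg, label in training_data
        )

    return min(candidates, key=errors)

def predict(threshold, mpg):
    """Apply the learned IF-THEN rule to a car with an unknown class label."""
    return "economy" if mpg >= threshold else "guzzler"

# Hypothetical training data: (miles per gallon, known class label).
training_data = [
    (12, "guzzler"), (15, "guzzler"), (18, "guzzler"),
    (28, "economy"), (33, "economy"), (40, "economy"),
]

threshold = train_stump(training_data)
print(threshold, predict(threshold, 35), predict(threshold, 14))
```

The learned threshold is the derived model; the same rule could equally be drawn as a single-node decision tree.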
4. Cluster analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity
Cluster Analysis
Clustering analyzes data objects without consulting class labels.
Clustering can be used to generate class labels, which did not exist at the beginning, for a group of data.
The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
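One common way to follow this principle is k-means clustering; the slides do not name a specific algorithm, so k-means is used here only as an example. The sketch below works on one-dimensional data with hand-picked starting centers; the house prices and the function name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data with given starting centers."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (keep the old center when a cluster ends up empty).
        centers = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical house prices with no class labels attached.
house_prices = [100, 110, 120, 300, 310, 320]

centers, clusters = kmeans_1d(house_prices, centers=[100, 320])
print(centers)
print(clusters)
```

The index of the cluster an object falls into can then serve as a newly generated class label.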
Outlier Analysis
Outliers are data objects that do not comply with the general behavior or model of the data.
Many data mining techniques discard outliers or exceptions as noise.
However, in some applications these rare events are more interesting than the regularly occurring ones. The analysis of outlier data is referred to as outlier analysis.
ex: fraud detection.
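A simple statistical way to flag outliers is the z-score test sketched below, which marks values lying more than a chosen number of standard deviations from the mean. This is one possible technique among many, picked for brevity; the transaction amounts, the threshold of 2.0, and the function name find_outliers are assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.0):
    """Return the values whose distance from the mean exceeds z_threshold sigmas."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical card transaction amounts; one is far from the usual behavior.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]

print(find_outliers(amounts))
```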

Outlier analysis
• Outlier: a data object that does not comply with the general behavior of the data
• It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
6 Data Mining Functionalities: Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data.
Distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
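As a very small illustration of time-series trend analysis, the sketch below smooths a series with a moving average so that the underlying trend becomes visible. The monthly sales figures and the window size are invented, and a moving average is only one of many possible trend models.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in the series."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales that fluctuate but grow over time.
monthly_sales = [10, 12, 11, 15, 18, 17, 21]

trend = moving_average(monthly_sales, window=3)
print(trend)
```

Although the raw series dips twice, every smoothed value is larger than the one before it, which exposes the upward trend.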
3. Are all Patterns Interesting?
We need to answer three questions to say if patterns are interesting:
1. What makes a pattern interesting?
2. Can a data mining system generate all of the interesting patterns?
3. Can the system generate only the interesting ones?
What makes a pattern interesting?
A pattern is interesting if it is:
- Understandable: easily understood by humans
- Valid: valid on a new set of data with a degree of certainty
- Potentially useful or desired
- Novel: not known before
A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
Interestingness
measures
• Interestingness measures: A
pattern is interesting if it is easily
understood by humans, valid on
new or test data with some degree
of certainty, potentially useful,
novel, or validates some
hypothesis that a user seeks to
confirm
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns, but not all of them are interesting.
• Suggested approach: human-centered, query-based, focused mining
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, etc.
Are all Patterns Interesting?
Many patterns that are interesting by objective standards may represent common sense and, therefore, are actually uninteresting.
So objective measures are coupled with subjective measures that reflect users’ needs and interests.
Subjective interestingness measures are based on user beliefs in the data.
These measures find patterns interesting if the patterns are unexpected (contradicting the user’s belief), actionable (offering strategic information on which the user can act), or expected (confirming a hypothesis).
Objective measures of interestingness (measurable)
Support: The percentage of transactions from a transaction database that the given rule satisfies
support(X => Y) = P(X ∪ Y)
Confidence: The degree of certainty of the detected association
confidence(X => Y) = P(Y | X)
Objective measures of interestingness (measurable)
• Support: it gives the percentage of transactions from a transaction database that the given rule satisfies.
• It is the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y.
• Confidence: it assesses the degree of certainty of the detected association.
• It is taken as the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
• Hence,
• support(X => Y) = P(X ∪ Y),
• confidence(X => Y) = P(Y | X)
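In practice both probabilities are estimated from transaction counts. The worked equations below spell this out, writing N for the total number of transactions and count(·) for the number of transactions containing an itemset; both symbols, and the counts in the numeric example, are assumptions introduced here and chosen to reproduce a rule with 1% support and 50% confidence.

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y) &= P(X \cup Y) \approx \frac{\mathrm{count}(X \cup Y)}{N} \\
\mathrm{confidence}(X \Rightarrow Y) &= P(Y \mid X) = \frac{P(X \cup Y)}{P(X)} \approx \frac{\mathrm{count}(X \cup Y)}{\mathrm{count}(X)} \\[6pt]
\text{With } N = 1000,\ \mathrm{count}(X) = 20,\ \mathrm{count}(X \cup Y) = 10:\quad
\mathrm{support} &= \frac{10}{1000} = 1\%, \qquad
\mathrm{confidence} = \frac{10}{20} = 50\%
\end{align*}
```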
Other interestingness measures
• Accuracy: Accuracy tells us the
percentage of data that are
correctly classified by a rule.
• Coverage: It gives the percentage
of data to which a rule applies.
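Both measures can be sketched for a single IF-THEN rule as below. One assumption should be stated clearly: accuracy is computed here only over the data that the rule covers, which is a common convention for rule-based classifiers but is not spelled out in the definition above. The data rows, the rule, and the function name are hypothetical.

```python
def rule_coverage_and_accuracy(rows, condition, predicted_label):
    """Coverage: share of rows the rule applies to.
    Accuracy: share of covered rows whose label matches the rule's prediction."""
    covered = [row for row in rows if condition(row)]
    coverage = len(covered) / len(rows)
    if not covered:
        return coverage, None  # accuracy is undefined when nothing is covered
    correct = sum(row["label"] == predicted_label for row in covered)
    return coverage, correct / len(covered)

# Hypothetical labeled data.
rows = [
    {"age": 25, "label": "buys"},
    {"age": 28, "label": "buys"},
    {"age": 22, "label": "buys"},
    {"age": 27, "label": "no"},
    {"age": 45, "label": "no"},
    {"age": 52, "label": "buys"},
    {"age": 60, "label": "no"},
    {"age": 38, "label": "no"},
]

# Rule under test: IF age is 20..29 THEN label = "buys".
coverage, accuracy = rule_coverage_and_accuracy(
    rows, lambda row: 20 <= row["age"] <= 29, "buys"
)
print(coverage, accuracy)
```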
Can a data mining system generate all of the interesting patterns?
• A data mining algorithm is complete if it mines all interesting patterns.
• It is often unrealistic and inefficient for data mining systems to generate all possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search.
• For some mining tasks, such as association, this is often sufficient to ensure the completeness of the algorithm.
Can a data mining system generate only interesting patterns?
A data mining algorithm is consistent if it mines only interesting patterns. Achieving this is an optimization problem.
It is highly desirable for data mining systems to generate only interesting patterns. This would be efficient for users and data mining systems because neither would have to search through the patterns generated to identify the truly interesting ones.
Progress has been made in this direction, but it is still a challenging issue in data mining.
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
  • Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
  • Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns: mining query optimization
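The difference between the two approaches can be sketched for frequent item pairs. In the first approach every pair is counted and the infrequent ones are filtered out afterwards; in the second, the minimum-count constraint is applied to single items first, so far fewer candidates are ever generated. The pruning relies on the fact that a pair cannot be frequent unless both of its items are frequent. The data, the threshold, and the helper name count_pairs are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions and threshold.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
]
min_count = 3

def count_pairs(transactions, allowed_items=None):
    """Count item pairs, optionally restricted to a set of allowed items."""
    counts = Counter()
    for items in transactions:
        if allowed_items is not None:
            items = items & allowed_items
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

# Approach 1: generate all pairs, then filter out the uninteresting ones.
all_pairs = count_pairs(transactions)
filtered = {pair for pair, n in all_pairs.items() if n >= min_count}

# Approach 2: push the constraint into the search by pruning infrequent
# single items before any pair is generated.
item_counts = Counter(item for items in transactions for item in items)
frequent_items = {item for item, n in item_counts.items() if n >= min_count}
pruned_pairs = count_pairs(transactions, frequent_items)
focused = {pair for pair, n in pruned_pairs.items() if n >= min_count}

print(len(all_pairs), len(pruned_pairs), filtered == focused)
```

Both approaches return the same interesting pairs, but the second one examines one candidate instead of five.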
4. Major Issues in Data Mining
1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms
