
Lect 2


Data Mining: Introduction (2)
Chapter 1
Index
1. What is Data Mining?
2. Data Mining Functionalities
   1. Characterization and Discrimination
   2. Mining Frequent Patterns
   3. Classification and Prediction
   4. Cluster Analysis
   5. Outlier Analysis
   6. Evolution Analysis
3. Are all Patterns Interesting?
4. Major Issues in Data Mining


1. What is Data Mining?
Data mining is the process of
discovering interesting patterns (or
knowledge) from large amounts of
data.
The data sources can include
databases, data warehouses, the Web,
other information repositories, or data
that are streamed into the system
dynamically.
Architecture of data mining system

The figure annotates the layers of the system, from the user down to the data sources:
- Communicates between users and the data mining system. Visualizes results or performs exploration on data and schemas.
- Tests for interestingness of a pattern.
- Performs functionalities like characterization, association, classification, prediction etc.
- Is responsible for fetching relevant data based on the user request.
- This is usually the source of data. The data may require cleaning and integration.

A side annotation describes the knowledge the system draws on: this is the information about the domain we are mining, like concept hierarchies, used to organize attributes onto various levels of abstraction. It also contains user beliefs, which can be used to assess the interestingness of a pattern, or thresholds.


2. Data Mining Functionalities
Data Mining functionalities are used to
specify the kind of patterns to be found in
data mining tasks.
Data Mining tasks can be classified into two
categories
- Descriptive: Characterize general
properties of data in the database
- Predictive: perform inference on data
to make predictions
There are a number of data mining functionalities:
1. Characterization and Discrimination.
2. Mining of frequent patterns, Association
and Correlation.
3. Classification and Regression.
4. Clustering analysis.
5. Outlier analysis
6. Evolution Analysis
1 Data Mining Functionalities: Characterization and Discrimination
Data can be associated with classes or
concepts that can be described in
summarized, concise, and yet precise, terms.
Such descriptions of a concept or class are
called class/concept descriptions.
These descriptions can be derived via
- Data Characterization
- Data Discrimination
1.1 Characterization
• Data characterization is a summarization of the general characteristics or features of a target class of data.
• The data corresponding to the user-specified class are typically collected by a query.
• Several methods can be used for data summarization, such as statistical measures, the data cube based approach, and the attribute-oriented induction technique, which works without step-by-step user interaction.
1.1 Characterization (cont.)
Example: To study the characteristics of software products whose sales increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
• The output of data characterization can be presented in various forms:
- Pie charts
- Bar charts
- Multidimensional data cubes
- Multidimensional tables etc.
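The characterization example above can be sketched in a few lines. This is only an illustration under stated assumptions: the `sales` table, its columns, the product names and the growth figures are all hypothetical, and an in-memory SQLite database stands in for the sales database.

```python
import sqlite3
from collections import Counter

# Hypothetical rows: (product, category, sales growth in % over the previous year).
ROWS = [
    ("EditPro", "office", 14.0),
    ("PixelLab", "graphics", 22.5),
    ("CalcSheet", "office", 3.0),
    ("SafeGuard", "security", 11.0),
    ("DrawFast", "graphics", -5.0),
]


def characterize(rows, min_growth=10.0):
    """Collect the target class with an SQL query, then summarize it by category."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, category TEXT, growth REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # The target class: products whose sales grew by at least min_growth percent.
    target = conn.execute(
        "SELECT product, category FROM sales WHERE growth >= ?", (min_growth,)
    ).fetchall()
    conn.close()
    if not target:
        return {}
    counts = Counter(category for _product, category in target)
    # Share of each category within the target class (the data behind a pie chart).
    return {category: n / len(target) for category, n in counts.items()}


summary = characterize(ROWS)
print(summary)
```

The returned shares are the kind of summary that a pie chart or bar chart would then display.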
1.2 Discrimination
• Data discrimination is a comparison
of the general features of the target
class data objects against the
general features of objects from one
or multiple contrasting classes with
respect to customers that share
specified generalized feature(s).
• Discrimination descriptions
expressed in the form of rules are
referred to as discriminant rules.
1.2 Data Mining Functionalities: Discrimination Example
Example: compare the change in sales of software products for customers that share a given generalized feature. 40% of “Youth” have sales that increased by more than 10% from last year; 10% of “Youth” have sales that decreased by at least 30% during the same period; for the remaining 50% of “Youth”, the change in sales fell in between. “Youth” describes the generalized tuple, while an increase in sales of more than 10% is the target class. The other two ranges of change in sales are the contrasting classes.

The forms of output presentation are similar to those for characteristic descriptions, although discrimination descriptions should include comparative measures that help to distinguish between the target and contrasting classes.
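A comparative measure of this kind can be computed directly. The sketch below assumes hypothetical records of the form (customer group, change in sales in %); the class boundaries follow the “Youth” example, and the data are made up so that the shares come out as 40% / 10% / 50%.

```python
from collections import Counter

# Hypothetical records: (customer group, change in sales in % versus last year).
RECORDS = [("Youth", c) for c in (25, 18, 12, 40, -35, 5, -10, 0, 8, -20)] + [
    ("Senior", c) for c in (2, -40, 15)
]


def sales_class(change_pct):
    """Map a change in sales to the target class or one of the contrasting classes."""
    if change_pct > 10:
        return "increased > 10%"       # target class
    if change_pct <= -30:
        return "decreased >= 30%"      # contrasting class
    return "in between"                # contrasting class


def class_shares(records, group):
    """Share of each class among the customers that belong to `group`."""
    changes = [change for g, change in records if g == group]
    if not changes:
        return {}
    counts = Counter(sales_class(c) for c in changes)
    return {cls: n / len(changes) for cls, n in counts.items()}


youth = class_shares(RECORDS, "Youth")
print(youth)
```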
2 Data Mining Functionalities: Mining Frequent Patterns
Frequent patterns are the patterns that occur frequently in the data.
Patterns can include itemsets, sequences and subsequences.
A frequent itemset refers to a set of items that often appear together in a transactional data set.
ex: bread and milk
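To make "often appear together" concrete, the sketch below counts every itemset of up to two items and keeps those whose relative frequency reaches a minimum threshold. It is a brute-force enumeration for illustration only, not an efficient mining algorithm; the transactions and the 60% threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: relative frequency} for itemsets meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives every itemset one canonical tuple representation.
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.6)
print(frequent)
```

With this data only bread, milk, and the pair (bread, milk) pass the threshold.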
2.1 Data Mining Functionalities: Mining Frequent Patterns, Example

Association Rules
If a customer buys a computer, there is a 50% chance that he will buy software as well.

Single Dimension Association Rule
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
1% of all the transactions under analysis show that computer and software are purchased together.

Multi-Dimension Association Rule
age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]

Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
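The two numbers attached to a rule, and the threshold test, can be sketched as follows. The transactions and thresholds are hypothetical and chosen so that the arithmetic is easy to check by hand; they do not reproduce the 1% / 50% figures of the slide.

```python
# Hypothetical transactions: "computer" appears in 4 of 8, together with "software" in 2.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"software"},
    {"paper"},
    {"paper", "printer"},
]


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def is_interesting(support, confidence, min_support, min_confidence):
    """A rule is kept only if it meets both thresholds."""
    return support >= min_support and confidence >= min_confidence


support, confidence = rule_stats(TRANSACTIONS, {"computer"}, {"software"})
print(support, confidence)
print(is_interesting(support, confidence, min_support=0.2, min_confidence=0.6))
```

Here the rule has enough support but fails a 60% confidence threshold, so it would be discarded.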
2.2 Association (correlation and causality)
• age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
• contains(T, “computer”) => contains(T, “software”) [1%, 75%]
• Confidence: for the second rule, 75% means that if a transaction contains a computer, there is a 75% chance that it contains software as well.
• Support: for the second rule, 1% means that 1% of all transactions under analysis show that computer and software are purchased together.
Multi-dimensional vs. single-dimensional association
• An association rule that contains a single predicate is referred to as a single-dimensional association rule.
• If it contains multiple predicates, it is known as a multi-dimensional association rule.
3 Data Mining Functionalities: Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived based on the analysis of a set of training data and is used to predict the class label of objects for which the class label is unknown.

Representation of Derived model
- IF-THEN Rules
- Decision Tree
- Neural Network
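As a minimal sketch of deriving a model from training data, the code below learns a single IF-THEN rule (a one-level decision tree) on one numeric attribute. The attribute (gas mileage), the labels and the training values are hypothetical, and a real classifier would use more attributes and a more careful learning procedure.

```python
def train_stump(samples):
    """Learn 'IF value < threshold THEN below ELSE above' from (value, label) pairs.

    Assumes at least two distinct values in the training data.
    """
    values = sorted({v for v, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    1 for v, label in samples
                    if (below if v < threshold else above) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def predict(model, value):
    """Apply the learned rule to an object whose class label is unknown."""
    threshold, below, above = model
    return below if value < threshold else above


# Hypothetical training data: (miles per gallon, mileage class).
TRAINING = [(15, "low"), (18, "low"), (22, "low"), (30, "high"), (35, "high"), (41, "high")]
model = train_stump(TRAINING)
print(model, predict(model, 28), predict(model, 20))
```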
3 Data Mining Functionalities: Classification and Prediction (cont.)

Prediction models continuous-valued functions, i.e. it is used to predict missing or unavailable numeric data values rather than class labels.

The term prediction can refer to both numeric prediction and class label prediction.

Regression analysis is a statistical method used for numeric prediction.

Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process. Such attributes will be selected for the classification and regression process. Other attributes, which are irrelevant, can then be excluded from consideration.
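Numeric prediction by regression can be illustrated with ordinary least squares on a single attribute. This is a sketch under assumptions: the data points are made up, and simple linear regression is only one of many regression methods.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on y = 2x + 1.
XS = [1, 2, 3, 4]
YS = [3, 5, 7, 9]
slope, intercept = fit_line(XS, YS)

# Predict the missing numeric value for x = 5.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```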
3 Classification and Prediction

• Classification is the process of finding a model or function that describes and distinguishes data classes.
• E.g., classify countries based on climate, or classify cars based on gas mileage.
• The model is derived based on the analysis of a set of training data.
• The model is used to predict the class label of objects for which the class label is not known.
• Presentation: decision tree, classification rule, neural network
• Prediction: predict some unknown or missing numerical values
4. Cluster analysis

• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
4 Data Mining Functionalities: Cluster Analysis
Clustering analyzes data objects without consulting class labels.
Clustering can be used to generate class labels, which did not exist at the beginning, for a group of data.
The objects are clustered or grouped
based on the principle of maximizing the
intra-class similarity and minimizing the
inter-class similarity.
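The grouping principle can be sketched with k-means on one numeric attribute. The choice of k-means is an assumption made for illustration (the slides do not name a clustering method), and the house prices and starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Group 1-D values around centroids; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins the cluster of its nearest centroid.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical house prices (in thousands) with no class labels.
PRICES = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(PRICES, centroids=[100, 320])
print(centroids, clusters)
```

The two resulting groups can then serve as newly generated class labels, for example "lower-priced" and "higher-priced" houses.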
5 Data Mining Functionalities: Outlier Analysis
Outliers are data objects that do not comply with the general behavior or model of the data.

Many data mining techniques discard outliers or exceptions as noise.

However, in some applications these rare events are the more interesting ones. The analysis of outlier data is referred to as outlier analysis.

ex: fraud detection.


Outlier analysis

• Outlier: a data object that does not comply with the general behavior of the data
• It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
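One simple way to flag such objects is a z-score test: a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean. The sketch below assumes this statistical approach (the slides do not prescribe a method); the transaction amounts and the threshold of 2 are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * std dev."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one of them is far from the rest.
AMOUNTS = [20, 22, 19, 21, 23, 20, 500]
outliers = zscore_outliers(AMOUNTS)
print(outliers)
```

In a fraud detection setting, the flagged amount would be passed on for closer inspection rather than discarded as noise.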
6 Data Mining Functionalities: Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

This may include characterization, discrimination, association and correlation analysis, classification, prediction or clustering of time related data.

Distinct features of such analysis include time series data analysis, sequence or periodicity pattern matching and similarity based data analysis.
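A very small instance of time series analysis is smoothing a series with a moving average so that the underlying trend becomes visible. The sketch below is an assumption-laden illustration: the monthly values and the window size are hypothetical.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical monthly sales figures with short-term fluctuations.
SALES = [10, 12, 11, 15, 14, 18]
smoothed = moving_average(SALES, window=3)

# The trend is upward if every smoothed value exceeds the previous one.
upward = all(b > a for a, b in zip(smoothed, smoothed[1:]))
print(smoothed, upward)
```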
3. Are all Patterns Interesting?

We need to answer three questions to say if patterns are interesting:
1. What makes a pattern interesting?
2. Can a data mining system generate all of the interesting patterns?
3. Can the system generate only the interesting ones?

What makes a pattern interesting?

A pattern is interesting if it is:
- Novel: not known before
- Potentially useful or desired: validates a hypothesis that the user sought to confirm
- Understandable: easily understood by humans
- Valid: holds on a new set of data with a degree of certainty
Interestingness
measures
• Interestingness measures: A
pattern is interesting if it is easily
understood by humans, valid on
new or test data with some degree
of certainty, potentially useful,
novel, or validates some
hypothesis that a user seeks to
confirm
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Suggested approach: human-centered, query-based, focused mining
• Objective vs. subjective interestingness measures:
- Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
- Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
Are all Patterns Interesting? (cont.)

Many patterns that are interesting by objective standards may represent common sense and, therefore, are actually uninteresting.

So objective measures are coupled with subjective measures that reflect the user’s needs and interests.

Subjective interestingness measures are based on user beliefs in the data.

These measures find patterns interesting if the patterns are unexpected (contradicting a user’s belief), actionable (offering strategic information on which the user can act) or expected (confirming a hypothesis).
Objective measures of interestingness (measurable)

Support: the percentage of transactions from a transaction database that the given rule satisfies
support(X => Y) = P(X U Y)
Confidence: the degree of certainty of the detected association
confidence(X => Y) = P(Y|X)
Objective measures of interestingness (measurable)
• Support: it gives the percentage of transactions from a transaction database that the given rule satisfies.
• It is the probability P(X U Y), where X U Y indicates that a transaction contains both X and Y.
• Confidence: it assesses the degree of certainty of the detected association.
• It is taken as the conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y.
• Hence,
• support(X => Y) = P(X U Y)
• confidence(X => Y) = P(Y|X)
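The two measures are linked through the definition of conditional probability. As a worked sketch, the 1% support and 50% confidence figures are reused from the computer/software rule earlier in these notes; the value of P(X) is derived here and is not stated in the source.

```latex
\[
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X)
  = \frac{P(X \cup Y)}{P(X)}
  = \frac{\mathrm{support}(X \Rightarrow Y)}{P(X)}
\]
\[
P(X) = \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{confidence}(X \Rightarrow Y)}
     = \frac{0.01}{0.50} = 0.02
\]
```

Under these figures, 2% of all transactions contain a computer, and half of those also contain software.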
Other interestingness measures
• Accuracy: Accuracy tells us the
percentage of data that are
correctly classified by a rule.
• Coverage: It gives the percentage
of data to which a rule applies.
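Both measures can be computed for a single IF-THEN rule as sketched below. One assumption is made explicit: accuracy is measured among the records the rule covers, which is a common reading of the definition above. The records, attribute names and the rule itself are hypothetical.

```python
# Hypothetical labeled records.
RECORDS = [
    {"mpg": 35, "label": "economy"},
    {"mpg": 32, "label": "economy"},
    {"mpg": 31, "label": "economy"},
    {"mpg": 30, "label": "standard"},
    {"mpg": 25, "label": "standard"},
    {"mpg": 22, "label": "economy"},
    {"mpg": 18, "label": "standard"},
    {"mpg": 15, "label": "standard"},
]


def rule_coverage_accuracy(records, condition, predicted_label):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted_label."""
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["label"] == predicted_label)
    # Assumption: accuracy is the share of covered records classified correctly.
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Rule: IF mpg >= 30 THEN class = "economy".
coverage, accuracy = rule_coverage_accuracy(
    RECORDS, lambda r: r["mpg"] >= 30, "economy"
)
print(coverage, accuracy)
```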
Can a data mining system generate all of the interesting patterns?
• A data mining algorithm is complete if it mines all interesting patterns.
• It is often unrealistic and inefficient for data mining systems to generate all possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search.
• For some mining tasks, such as association, this is often sufficient to ensure the completeness of the algorithm.
Can a data mining system generate only interesting patterns?

A data mining algorithm is consistent if it mines only interesting patterns. It is an optimization problem.
It is highly desirable for data mining systems to generate only interesting patterns. This would be efficient for users and data mining systems because neither would have to search through the patterns generated to identify the truly interesting ones.
Progress has been made in this direction, but it is still a challenging issue in data mining.
Can We Find All and Only Interesting Patterns?

• Find all the interesting patterns: Completeness
• Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
• Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns: mining query optimization
4. Major Issues in Data Mining
1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms
