02-Data Mining Functionalities-2
Introduction
precise terms
◼ Such descriptions of a class or concept are called class/concept
descriptions.
Data Characterization
◼ It is a summarization of the general characteristics or features of a target
class of data.
◼ The data corresponding to the user-specified class are typically collected by a
database query.
◼ Example: the user may like to study the characteristics of software products
whose sales increased by 10% in the last year.
Data Discrimination
◼ It is a comparison of the general features of the target class data objects with
the general features of objects from one or a set of contrasting classes.
◼ The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries.
◼ Example: the user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period.
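The two examples above can be sketched in a few lines of Python. This is only an illustration: the product records, the field layout, and the helper names `select` and `summarize` are hypothetical, a list comprehension stands in for the database query, and "increased by 10%" is read here as "increased by at least 10%".

```python
from statistics import mean

# Hypothetical product records: (name, category, price, sales_change_pct).
# All values are made up for illustration.
PRODUCTS = [
    ("EditPro", "productivity", 120.0, 12.0),
    ("CalcSuite", "productivity", 80.0, 15.0),
    ("PixelPaint", "graphics", 200.0, -35.0),
    ("GameBox", "games", 60.0, -40.0),
    ("NoteTaker", "productivity", 40.0, 3.0),
]


def select(rows, predicate):
    """Stand-in for the database query that collects one class of data."""
    return [row for row in rows if predicate(row)]


def summarize(rows):
    """Summarize general features of a class: size, mean price, categories."""
    return {
        "count": len(rows),
        "mean_price": mean(r[2] for r in rows) if rows else None,
        "categories": sorted({r[1] for r in rows}),
    }


# Target class: sales increased by at least 10% (assumed reading).
target = select(PRODUCTS, lambda r: r[3] >= 10.0)
# Contrasting class: sales decreased by at least 30%.
contrast = select(PRODUCTS, lambda r: r[3] <= -30.0)

# Characterization summarizes the target class alone.
characterization = summarize(target)
# Discrimination places the target summary next to the contrasting one.
discrimination = {"target": characterization, "contrasting": summarize(contrast)}
print(discrimination)
```

Characterization looks only at the target class, while discrimination compares the same summary features across the target and contrasting classes.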
Mining Frequent Patterns, Associations, and Correlations
[Figure: decision tree. Branch labels: youth; middle_aged, senior. Internal node: income? with branches high and low. Leaf nodes: class A, class B, class C.]
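The decision tree labels above can be read as a set of if-then rules. The sketch below assumes one plausible layout: the youth branch leads to the income? test (high gives class A, low gives class B), while middle_aged and senior lead directly to class C. The name of the root attribute does not appear in the labels, so `age_group` is a hypothetical name.

```python
def classify(age_group: str, income: str) -> str:
    """Classify one object with the assumed decision tree.

    `age_group` is a hypothetical name for the root attribute.
    """
    if age_group == "youth":
        # Assumed: only the youth branch tests income.
        if income == "high":
            return "class A"
        if income == "low":
            return "class B"
        raise ValueError(f"unknown income value: {income!r}")
    if age_group in ("middle_aged", "senior"):
        # Assumed: both branches end in the same leaf.
        return "class C"
    raise ValueError(f"unknown age group: {age_group!r}")


print(classify("youth", "high"))
print(classify("senior", "low"))
```

Each root-to-leaf path corresponds to one rule, for example "if youth and income is high, then class A".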
Classification and Prediction
◼ In cluster analysis, the class labels are not present in the training data
because they are not known to begin with (unsupervised learning).
◼ The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
◼ Outlier Analysis: A database may contain data objects that do not comply
with the general behavior or model of the data. These data objects are outliers.
◼ However, in some applications such as fraud detection, the rare events can be
more interesting than the more regularly occurring ones.
◼ In such applications, outlier values may be detected with respect to the
location, type, or frequency of purchases.
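As a minimal sketch of outlier detection on purchase frequency, the code below flags values that lie far from the mean in units of standard deviation (a z-score test). The z-score rule, the threshold of 2.0, and the purchase counts are assumptions chosen for illustration; the text above does not prescribe a particular method.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical number of purchases per month for one account.
monthly_purchases = [4, 5, 6, 5, 4, 6, 5, 40]
print(find_outliers(monthly_purchases))
```

With these made-up counts, only the month with 40 purchases is flagged. A single extreme value also inflates the standard deviation, so more robust statistics may be preferable on real data.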
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time.
Example: A data mining study of stock exchange data may identify stock
evolution regularities for overall stocks and for the stocks of particular
companies.
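One simple way to expose a trend in time-ordered data is a moving average. The sketch below is an assumption for illustration only: the function name `moving_average`, the window size, and the price series are hypothetical, and evolution analysis in general covers much more than smoothing.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily closing prices for one stock.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
print(moving_average(prices, 3))
```

The smoothed series is shorter than the input by `window - 1` points and makes the overall direction of the series easier to see.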
Interestingness of Patterns
\[ \mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{\text{number of tuples containing both } X \text{ and } Y}{\text{total number of tuples}} \]
\[ \mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) \]
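The support and confidence of a rule X => Y can be computed by counting tuples. In the sketch below, each tuple is modeled as a Python set of items, confidence is computed as the number of tuples containing both X and Y divided by the number containing X (the count form of P(Y|X)), and the transactions are made up for illustration.

```python
def support(transactions, x, y):
    """Fraction of all tuples that contain every item of both X and Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions, x, y):
    """Among tuples containing X, the fraction that also contain Y."""
    both = x | y
    with_x = sum(1 for t in transactions if x <= t)
    if with_x == 0:
        raise ValueError("no tuple contains X; confidence is undefined")
    return sum(1 for t in transactions if both <= t) / with_x


# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

x, y = {"computer"}, {"software"}
print(support(transactions, x, y))     # 2 of 4 tuples contain both
print(confidence(transactions, x, y))  # 2 of the 3 tuples with X contain Y
```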
◼ Objective measures
◼ Accuracy and coverage for if-then rules
◼ Accuracy: the percentage of data correctly classified by a rule.
◼ Coverage: similar to support; the percentage of data to which a rule
applies.
◼ Subjective Measures:
◼ Based on user beliefs in the data: these measures find patterns
interesting if the patterns are unexpected or provide strategic
information on which the user can act. Such patterns are referred to as
"actionable".
◼ Can a data mining system generate all of the interesting patterns? This is
the question of completeness.
◼ Can a data mining system generate only the interesting patterns? This is
an optimization problem.
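The accuracy and coverage measures for if-then rules listed above can be sketched as follows. Two assumptions are made: coverage is the fraction of all tuples that satisfy the rule's condition, and accuracy is the fraction of those covered tuples that the rule classifies correctly (a common convention, since the text leaves the denominator open). The records and the rule are hypothetical.

```python
def coverage(records, condition):
    """Fraction of all records to which the rule applies."""
    return sum(1 for r in records if condition(r)) / len(records)


def accuracy(records, condition, predicted_class):
    """Fraction of covered records whose label matches the rule's class."""
    covered = [r for r in records if condition(r)]
    if not covered:
        raise ValueError("rule covers no records; accuracy is undefined")
    correct = sum(1 for r in covered if r["label"] == predicted_class)
    return correct / len(covered)


# Hypothetical labeled records.
records = [
    {"age_group": "youth", "income": "high", "label": "A"},
    {"age_group": "youth", "income": "high", "label": "B"},
    {"age_group": "youth", "income": "low", "label": "B"},
    {"age_group": "senior", "income": "high", "label": "C"},
    {"age_group": "youth", "income": "high", "label": "A"},
]


# Hypothetical rule: IF age_group = youth AND income = high THEN class A.
def rule_condition(r):
    return r["age_group"] == "youth" and r["income"] == "high"


print(coverage(records, rule_condition))       # 3 of 5 records are covered
print(accuracy(records, rule_condition, "A"))  # 2 of the 3 covered are correct
```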
Classification of Data Mining Systems
Data mining is an interdisciplinary field that draws on database systems, statistics,
machine learning, visualization, and information science.
Classification according to the kinds of databases mined:
If classifying according to the special types of data handled, we may have time-series,
text, stream data, or multimedia data mining systems, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined:
◼ Based on data mining functionalities such as characterization, discrimination,
association and correlation analysis, classification, clustering, prediction, outlier
analysis, and evolution analysis.
◼ Based on levels of abstraction, including generalized knowledge (high level of
abstraction), primitive-level knowledge (raw data level), and knowledge at multiple
levels (several levels of abstraction).
Classification according to the kinds of techniques utilized:
◼ Data mining systems can be categorized according to the underlying data mining
techniques employed or degree of user interaction.
Classification according to the applications adapted:
◼ For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Data Mining Task Primitives
◼ These primitives allow the user to communicate interactively with the data
mining system during discovery in order to direct the mining process or
examine the findings from different angles or depths.
◼ The set of task-relevant data to be mined: This specifies the portions of the
database or the set of data in which the user is interested. This includes the
database attributes or data warehouse dimensions of interest.
◼ The kind of knowledge to be mined: This specifies the data mining functions
to be performed, such as characterization, discrimination, association or
correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
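The two primitives above can be pictured as fields of a task specification. The sketch below is purely illustrative: the class `MiningTask`, its field names, and the example values are hypothetical and do not correspond to any particular data mining system or query language.

```python
from dataclasses import dataclass
from typing import List

# Kinds of knowledge named in the text above.
KNOWLEDGE_KINDS = {
    "characterization", "discrimination", "association",
    "correlation analysis", "classification", "prediction",
    "clustering", "outlier analysis", "evolution analysis",
}


@dataclass
class MiningTask:
    """Hypothetical container for two task primitives."""
    source: str             # task-relevant data: database or data set
    attributes: List[str]   # attributes or dimensions of interest
    knowledge_kind: str     # the kind of knowledge to be mined

    def __post_init__(self):
        if self.knowledge_kind not in KNOWLEDGE_KINDS:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind!r}")


# Example: characterize products using three hypothetical attributes.
task = MiningTask(
    source="sales",
    attributes=["product", "region", "sales_change_pct"],
    knowledge_kind="characterization",
)
print(task)
```

Validating the kind of knowledge at construction time mirrors the idea that the user chooses from a fixed set of mining functions.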
The issues in data mining regarding mining methodology are given below.