DWDM R13 Unit 1 PDF
Introduction
What motivated Data Mining?
Why it is important?
Data Mining – On what kind of data
Data Mining Functionalities
What kinds of patterns can be mined?
Are all of the patterns interesting?
Classification of Data Mining Systems
Data Mining Task Primitives
Integration of a Data Mining System with a Database or Data Warehouse System
Major Issues in Data Mining
Data mining refers to extracting or "mining" knowledge from large amounts of data.
It is also known as knowledge mining from databases, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.
A typical data mining system has the following major components:
Database, data warehouse, World Wide Web, or other information repository: This is
one or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: It is responsible for fetching the relevant data from
the repository, based on the data mining task.
Knowledge Base: This is the domain knowledge that is used to guide the mining
process. It includes concept hierarchies, interestingness thresholds, and metadata.
Data Mining Engine: It contains functional modules for tasks such as:
Classification
Association
Cluster analysis
Evolution analysis
Outlier analysis
Pattern Evaluation Module: It employs interestingness measures and interacts with the
data mining engine to focus the search on interesting patterns.
Graphical User Interface: It allows the user to interact with the system by specifying a
data mining query or task.
Data Mining – On What Kind of Data?
1. Relational Databases
2. Data Warehouses – Data Cube
3. Transactional Databases – Transactional Data Set
4. Advanced Database Systems and Advanced Database Applications
a. Object Oriented Databases
b. Object Relational Databases
c. Spatial Databases
d. Temporal and Time Series Databases
e. Text Databases & Multimedia Databases
f. Heterogeneous Databases and Legacy Databases
g. World Wide Web
Data Mining Functionalities – What Kinds of Patterns Can Be Mined?
1. Concept/Class Description: Characterization and Discrimination:
Data can be associated with classes or concepts, and it is useful to describe individual
classes or concepts in summarized terms. Such descriptions are called class/concept
descriptions. They can be derived by:
(a) Data Characterization: summarizing the data of the class under study (the target class).
(b) Data Discrimination: comparing the target class with one or a set of
comparative classes.
The output of data characterization can be presented in various forms, such as pie
charts, bar charts, curves, multidimensional data cubes, and tables.
Discrimination descriptions expressed in rule form are referred to as discriminant
rules.
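The two kinds of description above can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the class names ("frequent", "rare"), the "age" attribute, and the helper names characterize and discriminate are assumptions made for the example, not part of the notes.

```python
from collections import Counter

def characterize(records, target_class, attribute):
    """Summarize the target class: share of each attribute value within it."""
    values = [r[attribute] for r in records if r["class"] == target_class]
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}

def discriminate(records, target_class, contrast_class, attribute):
    """Compare the target class with one contrasting class, value by value."""
    target = characterize(records, target_class, attribute)
    contrast = characterize(records, contrast_class, attribute)
    all_values = sorted(set(target) | set(contrast))
    return {v: (target.get(v, 0.0), contrast.get(v, 0.0)) for v in all_values}

# Hypothetical customer records, invented for illustration only.
records = [
    {"class": "frequent", "age": "young"},
    {"class": "frequent", "age": "young"},
    {"class": "frequent", "age": "senior"},
    {"class": "rare", "age": "senior"},
    {"class": "rare", "age": "senior"},
]

profile = characterize(records, "frequent", "age")
comparison = discriminate(records, "frequent", "rare", "age")
```

Here profile is the kind of summary that could feed a pie or bar chart, while comparison places the target and contrasting shares side by side, which is the raw material for a discriminant rule.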
2. Association Analysis:
It is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. Association analysis is widely used for market
basket or transaction analysis.
Rules are of the form X => Y.
Each rule is qualified by support and confidence values: support is the percentage of
transactions that contain both X and Y, and confidence is the percentage of transactions
containing X that also contain Y.
Ex: contains(T, “Computer”) => contains(T, “Software”)
[support=1%, confidence=50%]
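As a minimal sketch of how support and confidence are computed for a rule X => Y, the following Python code counts matching transactions. The four transactions and the function names are made up for illustration, so the resulting figures differ from the 1% and 50% quoted in the example above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Fraction of transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical transaction data set: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

With this toy data, two of the four transactions contain both items (support 0.5), and two of the three transactions containing a computer also contain software (confidence 2/3).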
3. Classification and Prediction:
Classification is the process of finding a set of models that describe and distinguish
data classes or concepts, so that the model can be used to predict the class of objects
whose class label is unknown. The model is derived from the analysis of a training dataset
(objects whose class label is known). It can be represented in forms such as simple
“IF-THEN” rules, decision trees, or neural networks.
A decision tree is a flowchart-like tree structure in which each node denotes a test on
an attribute value, each branch represents an outcome of the test, and the tree leaves
represent classes. A neural network, when used for classification, is typically a collection
of neuron-like processing units with weighted connections between the units.
Classification can be used for predicting the class label of data objects. In some
applications, users wish to predict missing or unavailable data values rather than
class labels. This is usually the case when the predicted values are numerical, and it is
often specifically referred to as prediction.
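The relationship between a decision tree and IF-THEN rules can be illustrated with a small hand-built tree in Python. The tree below is hypothetical (its attributes, outcomes, and class labels are invented for the example) and is written by hand rather than learned from a training dataset; the sketch only shows how a finished model classifies an object and how each root-to-leaf path reads as one rule.

```python
# Internal node: (attribute, {outcome: subtree}); leaf: a class label string.
# Hypothetical tree, written by hand for illustration.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"fair": "does_not_buy", "excellent": "buys"}),
})

def classify(node, obj):
    """Follow the branch matching each test until a leaf (class) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[obj[attribute]]
    return node

def rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of the tree."""
    if isinstance(node, str):
        premise = " AND ".join(conditions) or "TRUE"
        yield "IF " + premise + " THEN class = " + node
        return
    attribute, branches = node
    for outcome, child in branches.items():
        yield from rules(child, conditions + (f"{attribute} = {outcome}",))

label = classify(tree, {"age": "youth", "student": "yes"})
rule_list = list(rules(tree))
```

Each node of the tuple structure is a test on an attribute, each dictionary key is an outcome of that test, and each string leaf is a class, mirroring the description above.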
4. Cluster Analysis:
Clustering analyzes data objects without consulting a known class label. The
objects are clustered based on the principle of
“maximizing the intra-class similarity and minimizing the inter-class similarity”,
so that objects within a cluster are highly similar to one another but dissimilar to
objects in other clusters.
Each cluster can be viewed as a class of objects from which rules can
be derived.
Clustering also facilitates taxonomy formation, i.e. the organization of observations
into a hierarchy of classes.
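One common way to act on this principle is k-means clustering, which the notes do not name; it is used here purely as an assumed example. The sketch below works on one-dimensional points with caller-supplied starting centroids, and all names and data values are illustrative.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Group 1-D points around centroids by repeated assign-and-update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return clusters, centroids

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
clusters, centers = kmeans_1d(points, centroids=[1.0, 12.0])
```

No class labels are used at any point: the groups emerge only from the distances between objects, so points within a cluster end up close together and far from the other cluster.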
5. Outlier Analysis:
A database may contain objects that do not comply with the general behavior or model
of the data. Such objects are called outliers. Most mining methods discard outliers as
noise or exceptions. In some applications, such as fraud detection, the outliers are
more interesting than the regular data. The analysis of outlier data is referred to as
outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures. Alternatively, deviation-based methods
identify outliers by examining differences in the main characteristics of objects in a group.
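As a minimal sketch of the statistical approach, the following Python code flags values that lie more than a chosen number of standard deviations from the mean (a z-score test). The threshold of 2.0, the function name, and the sample amounts are assumptions chosen for illustration; the test implicitly assumes the data are roughly normally distributed.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [10, 12, 11, 13, 12, 11, 10, 95]
suspicious = zscore_outliers(amounts)
```

In a fraud detection setting, the flagged values would be the ones examined further rather than discarded as noise.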
6. Evolution Analysis:
Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. This may include characterization, discrimination, association,
classification, or clustering of time-related data.
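A very simple way to expose a trend in time-related data is a moving average, sketched below in Python. The notes do not prescribe this technique; it is assumed here as a basic illustration, and the series values and window size are invented.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values to smooth the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures with short-term fluctuations.
sales = [10, 12, 11, 15, 18, 17, 21]
smoothed = moving_average(sales, window=3)

# The trend is upward if every smoothed value exceeds the previous one.
upward = all(b > a for a, b in zip(smoothed, smoothed[1:]))
```

The raw series dips twice (12 to 11, 18 to 17), but the smoothed values rise steadily, which makes the underlying upward trend easier to see.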