Unit 1
The past two decades have seen a dramatic increase in the amount of information, or data,
being stored in electronic format. This accumulation of data has taken place at an explosive
rate.
Data storage became easier as large amounts of computing power became available at low
cost, i.e. the falling cost of processing power and storage made data cheap. There was also
the introduction of new machine learning methods for knowledge representation, based on
logic programming and so on, in addition to traditional statistical analysis of data. The new
methods tend to be computationally intensive, hence a demand for more processing power.
Having concentrated so much attention on the accumulation of data, the problem was
what to do with this valuable resource. It was recognized that information is at the heart of
business operations and that decision-makers could make use of the stored data to gain
valuable insight into the business.
Database management systems gave access to the data stored, but this was only a small
part of what could be gained from the data. Traditional on-line transaction processing
systems, OLTPs, are good at putting data into databases quickly, safely and efficiently but
are not good at delivering meaningful analysis in return.
Analyzing data can provide further knowledge about the business by going beyond the
data explicitly stored to derive knowledge about the business. This is where data mining, or
knowledge discovery in databases, comes in, drawing on work from areas such as statistics,
machine learning, databases and parallel computing.
YEAR& BRANCH: III –II CSE A & B
5. Data mining is the process of discovering meaningful new correlations, patterns and trends
by sifting through large amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques.
KDD vs. Data mining
Knowledge Discovery in Databases (KDD) was formalized in 1989, with reference to the general
concept of being broad and high level in the pursuit of seeking knowledge from data. Data
mining is only one of the many steps involved in knowledge discovery in databases. The
KDD process tends to be highly iterative and interactive. Data mining analysis tends to work
up from the data, and the best techniques are developed with an orientation towards large
volumes of data, making use of as much data as possible to arrive at reliable conclusions and
decisions. Fayyad et al. distinguish between KDD and data mining by giving the following
definitions:
Data mining is the process of discovering interesting knowledge from large amounts of data
stored either in databases, data warehouses or other information repositories. Based on this view,
the architecture of a typical data mining system may have the following major components:
*Database, data warehouse or other information repository: This is a single database, or a
collection of multiple databases, data warehouses, flat files, spreadsheets or other kinds of
information repositories. Data cleaning and data integration techniques may be performed on the
data.
*Database or data warehouse server: The database or data warehouse server fetches the relevant
data, based on the user’s data mining request.
*Knowledge base: This is the domain knowledge that is used to guide the search, or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attribute values into different levels of abstraction. Knowledge such as user beliefs,
thresholds and metadata can be used to assess a pattern's interestingness.
*Data mining engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association, classification, cluster analysis,
evolution and outlier analysis.
*Pattern evaluation module: This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search towards interesting patterns. It
may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module.
*Graphical user interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a task or data mining query
and to perform exploratory data mining based on intermediate data mining results. This module
also allows the user to browse database and data warehouse schemas or data structures, evaluate
mined patterns and visualize the patterns in different forms such as maps, charts, etc.
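The components above could be wired together roughly as follows. This is only a toy sketch to make the interactions concrete; all class and method names here are hypothetical, not part of any standard system.

```python
# Toy sketch of the data mining system architecture described above.
# All names are illustrative assumptions, not a real API.

class KnowledgeBase:
    """Domain knowledge used to judge how interesting a pattern is."""
    def __init__(self, threshold):
        self.threshold = threshold

class PatternEvaluator:
    """Pattern evaluation module: filters patterns by interestingness."""
    def __init__(self, knowledge_base):
        self.kb = knowledge_base
    def interesting(self, pattern):
        return pattern["score"] >= self.kb.threshold

class MiningEngine:
    """Data mining engine: produces candidate patterns from fetched data."""
    def mine(self, data):
        # Toy 'characterization': score each value by its relative frequency.
        counts = {}
        for value in data:
            counts[value] = counts.get(value, 0) + 1
        return [{"value": v, "score": c / len(data)} for v, c in counts.items()]

data = ["A", "A", "A", "B"]  # data fetched by the server component
engine = MiningEngine()
evaluator = PatternEvaluator(KnowledgeBase(threshold=0.5))
patterns = [p for p in engine.mine(data) if evaluator.interesting(p)]
```

In a real system the engine would run characterization, association or classification modules, and the evaluator would use richer interestingness measures than a single threshold.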
1.3 Data mining- on what kind of data?
Data mining can be applied to any kind of information repository, such as relational
databases. For example, a relational database may contain an Employee relation:
Employee
Emp-id  Name  Dept  Salary
A data warehouse is modeled by data cubes. Each dimension is an attribute and each cell
represents an aggregate measure. A data warehouse collects information about subjects that
span an entire organization, whereas a data mart focuses on selected subjects. The
multidimensional data views make online analytical processing (OLAP) easier.
Data characterization:
It is a summarization of the general characteristics of a target class of data.
The data corresponding to the user-specified class are collected by a database query.
Several methods, like the OLAP roll-up operation and the attribute-oriented induction
technique, are used for effective data summarization and characterization.
The output of data characterization can be presented in various forms like
Pie charts
Bar charts
Curves
Multidimensional cubes
Multidimensional tables, etc.
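A minimal sketch of data characterization in code: the records of a user-specified target class (here one department of Employee-style records) are collected by a "query" and then summarized. All data values here are hypothetical.

```python
# Sketch of data characterization: collect a target class, then summarize
# its general characteristics. The records are illustrative assumptions.

employees = [
    {"dept": "Sales", "salary": 30000},
    {"dept": "Sales", "salary": 34000},
    {"dept": "IT",    "salary": 50000},
    {"dept": "IT",    "salary": 54000},
]

def characterize(records, target_dept):
    """Select the target class (the 'database query'), then summarize it."""
    target = [r for r in records if r["dept"] == target_dept]
    salaries = [r["salary"] for r in target]
    return {
        "count": len(target),
        "avg_salary": sum(salaries) / len(salaries),
        "min_salary": min(salaries),
        "max_salary": max(salaries),
    }

summary = characterize(employees, "IT")
```

The resulting summary could then be presented as a pie chart, bar chart or multidimensional table, as listed above.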
Database systems themselves can be classified according to different criteria, such as the data
model or the types of data or applications involved, and a data mining system can be classified
accordingly, for example as a:
*Text or multimedia data mining system
*WWW mining system
Classification according to the kinds of knowledge mined
Neural networks have the remarkable ability to derive meaning from complicated or imprecise
data and can be used to extract patterns and detect trends that are too complex to be noticed by
either humans or other computer techniques. A trained neural network can be thought of as an
"expert" in the category of information it has been given to analyze. These experts can be used to
provide projections given new situations of interest and answer "what if" questions.
Neural networks use a set of processing elements (or nodes) analogous to neurons in
the brain. These processing elements are interconnected in a network that can identify
patterns in data once it is exposed to data, i.e., the network learns from experience just as people
do.
The bottom layer represents the input layer, in this case with five input nodes, X1 through X5. In
the middle is something called the hidden layer, with a variable number of nodes. The output
layer in this case has two nodes, Z1 and Z2, representing the output values we are trying to
determine from the inputs. Each node in the hidden layer is fully connected to the inputs, which
means that what is learned in a hidden node is based on all the inputs taken together.
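The forward pass through such a network can be sketched as follows: five inputs X1 through X5, one hidden layer, and two outputs Z1 and Z2. The weight values here are illustrative assumptions, not learned values.

```python
# Minimal sketch of a feedforward pass through the network described
# above; weights are arbitrary assumptions for illustration.
import math

def sigmoid(x):
    """Common activation function squashing any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden node is fully connected to ALL inputs.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # Each output node combines all hidden activations.
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_weights]

x = [1.0, 0.0, 1.0, 0.0, 1.0]                  # inputs X1..X5
hidden_w = [[0.1] * 5, [0.2] * 5, [-0.1] * 5]  # 3 hidden nodes
output_w = [[0.5, -0.5, 0.5], [0.3, 0.3, 0.3]] # outputs Z1, Z2
z = forward(x, hidden_w, output_w)
```

Training ("learning from experience") would adjust these weights from examples, for instance by backpropagation; the sketch shows only how a trained network maps inputs to outputs.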
Decision trees
Decision trees are a simple knowledge representation and they classify examples into
a finite number of classes. The nodes are labeled with attribute names, the edges are
labeled with possible values for these attributes and the leaves are labeled with the different
classes. Tree-shaped structures represent sets of decisions. These decisions
generate rules for the classification of a dataset. Decision trees produce rules that
are mutually exclusive and collectively exhaustive with respect to the training
database. Specific decision tree methods include classification and
regression trees (CART) and chi-square automatic interaction detection (CHAID).
The following is an example of objects that describe the weather at a given time.
The objects contain information on the outlook, humidity, etc. Some objects are
positive examples, denoted by P, and others are negative, i.e. N.
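A small sketch of such a tree over weather objects: internal nodes are labeled with attribute names, edges with attribute values, and leaves with the classes P or N. The particular tree and attribute values below are illustrative assumptions, not derived from the original example.

```python
# Sketch of a decision tree over weather objects (outlook, humidity).
# Internal node: (attribute, {value: subtree}); leaf: a class label.
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain":     "N",
})

def classify(obj, node):
    """Follow edges matching the object's attribute values down to a leaf."""
    if isinstance(node, str):        # leaf: return the class (P or N)
        return node
    attribute, branches = node       # internal node: test an attribute
    return classify(obj, branches[obj[attribute]])

label = classify({"outlook": "sunny", "humidity": "normal"}, tree)
```

Reading each root-to-leaf path as an if-then rule (e.g. if outlook = sunny and humidity = normal then P) yields rules that are mutually exclusive and collectively exhaustive, as described above.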
Nearest neighbor method: A technique that classifies each record in a dataset based on a
combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1).
Clustering
An object is described in terms of measurements or by relationships with other objects.
Clustering is sometimes used to mean segmentation. Clustering and segmentation basically
partition the database so that each partition or group is similar according to some criterion or
metric. Many data mining applications make use of clustering by similarity, for example to
segment a client/customer base. Some of the clustering algorithms are DBSCAN, CHAMELEON
and k-medoids.
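The nearest neighbor method described above can be sketched in a few lines: find the k most similar historical records and take a majority vote over their classes. The historical data points below are hypothetical.

```python
# Sketch of k-nearest-neighbor classification: classify a new point by
# the majority class among the k most similar historical records.
from collections import Counter

def knn_classify(history, point, k=3):
    """history: list of (features, class) pairs; point: feature tuple."""
    def dist(a, b):
        # Squared Euclidean distance as the similarity metric.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(history, key=lambda rec: dist(rec[0], point))[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]

history = [((1, 1), "P"), ((1, 2), "P"), ((2, 1), "P"),
           ((8, 8), "N"), ((8, 9), "N"), ((9, 8), "N")]
label = knn_classify(history, (2, 2), k=3)
```

The same distance computation underlies clustering by similarity: algorithms such as k-medoids group records so that members of each partition are close under the chosen metric.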
Rule induction
Rule induction is the process of extracting useful if-then rules from data based on statistical
significance. Rule induction on a database can be a massive undertaking in which all possible
patterns are systematically pulled out of the data and then their accuracy and significance are
calculated, telling users how strong a pattern is and how likely it is to occur again.
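A candidate if-then rule can be scored as sketched below: support measures how likely the pattern is to occur, and confidence measures how strong it is. The records and rule here are hypothetical.

```python
# Sketch of scoring one if-then rule pulled out of the data.
# Support: fraction of all records matching both sides of the rule.
# Confidence: fraction of records matching the IF side that also
# match the THEN side (the rule's strength).

records = [
    {"outlook": "sunny",    "play": "no"},
    {"outlook": "sunny",    "play": "no"},
    {"outlook": "rain",     "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
]

def rule_stats(records, if_attr, if_val, then_attr, then_val):
    """Score the rule: IF if_attr = if_val THEN then_attr = then_val."""
    matches = [r for r in records if r[if_attr] == if_val]
    hits = [r for r in matches if r[then_attr] == then_val]
    support = len(hits) / len(records)
    confidence = len(hits) / len(matches)
    return support, confidence

support, confidence = rule_stats(records, "outlook", "sunny", "play", "no")
```

A full rule induction run would enumerate every candidate rule this way and keep only those exceeding chosen support and confidence thresholds, which is why it can be a massive undertaking on a large database.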
Genetic Algorithms
Genetic algorithms refer to algorithms that dictate how populations of organisms should be
formed, evaluated and modified. Genetic algorithms are optimization techniques that use
processes such as genetic combination, mutation and natural selection. They have a variety of
forms, but in general their application is made on top of an existing data mining technique such
as neural networks or decision trees.
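The evaluate-select-combine-mutate loop can be sketched on a toy problem (maximizing the number of 1s in a bit string); the fitness function and all parameters below are illustrative assumptions.

```python
# Toy genetic algorithm: a population of bit strings is repeatedly
# evaluated, selected, recombined and mutated. Parameters are arbitrary.
import random

random.seed(0)  # fixed seed so the run is repeatable

def fitness(bits):
    return sum(bits)  # toy goal: maximize the number of 1s

def evolve(pop_size=20, length=10, generations=30):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # natural selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)     # genetic combination
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:             # mutation
                i = random.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

When layered on top of a data mining technique, the bit string would instead encode, say, a feature subset or decision tree parameters, and fitness would be the resulting model's accuracy.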
Data visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive
understanding of the data and as such can work well alongside data mining. Data mining allows
the analyst to focus on certain patterns and trends and explore them in depth using visualization.
On its own, data visualization can be overwhelmed by the volume of data in a database, but in
conjunction with data mining it can help with exploration.
It is unrealistic to expect one system to mine all kinds of data. Given the diversity of data types
and the different goals of data mining, specific data mining systems should be constructed for
mining specific kinds of data. Therefore, one may expect to have different data mining systems
for different kinds of data.
Mining information from heterogeneous databases and global information