BCA-404: Data Mining and Data Warehousing
Unit-1
What Motivated Data Mining
The major reason for using data mining techniques is the requirement for useful information
and knowledge from huge amounts of data. The information and knowledge gained can be
used in many applications such as business management, production control, etc. Data
mining came into existence as a result of the natural evolution of information technology.
Why is it Important?
Data mining starts with the client. Clients naturally collect data simply by doing business,
so that is where the entire process begins. But Customer Relationship Management (CRM)
data is only one part of the puzzle. The other part of the equation is competitive data,
industry survey data, blogs, and social media conversations. By themselves, CRM data and
survey data can provide very good information, but when combined with the other data
available they become powerful.
Data Mining is the process of analyzing and exploring that data to discover patterns and
trends.
The term Data Mining is one that is used frequently in the research world, but it is often
misunderstood by many people. Sometimes people misuse the term to mean any kind of
extraction of data or data processing. However, data mining is so much more than simple
data analysis. According to Doug Alexander at the University of Texas, data mining is, “the
computer-assisted process of digging through and analyzing enormous sets of data and then
extracting the meaning of the data. Data mining tools predict behaviours and future trends,
allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They scour
databases for hidden patterns, finding predictive information that experts may miss because
it lies outside their expectations.”
Data mining functions are used to define the trends or correlations contained in data
mining activities.
Data mining activities can be divided into two categories:
1. Descriptive Data Mining:
It describes what is happening within the data without any prior hypothesis, highlighting
the common features of the data set.
For example: count, average, etc.
2. Predictive Data Mining:
It estimates values of attributes that are not explicitly available. Based on previous
data, the model predicts the characteristics that are absent.
For example: judging from the findings of a patient's medical examinations whether he is
suffering from a particular disease (both categories are sketched in the code below).
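As a minimal illustration of the two categories, the sketch below computes simple descriptive summaries (count, average) over a small invented patient data set and then fits a tiny decision-tree classifier to estimate a missing diagnosis. The patient values and the use of the scikit-learn library are assumptions made only for this example.

```python
# Minimal sketch: descriptive vs. predictive data mining on an invented data set.
# Assumes scikit-learn is installed; all values are made up for illustration.
from statistics import mean
from sklearn.tree import DecisionTreeClassifier

# Each record: [age, blood_pressure]; label: 1 = has disease, 0 = healthy
examinations = [[45, 130], [50, 145], [23, 110], [60, 160], [35, 120]]
diagnoses = [1, 1, 0, 1, 0]

# Descriptive mining: summarize what is already in the data (count, average).
print("patients examined:", len(examinations))
print("average age:", mean(row[0] for row in examinations))
print("average blood pressure:", mean(row[1] for row in examinations))

# Predictive mining: learn from previous examinations and estimate the
# missing diagnosis of a new patient.
model = DecisionTreeClassifier().fit(examinations, diagnoses)
new_patient = [[55, 150]]
print("predicted diagnosis:", model.predict(new_patient)[0])
```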
1. Class/Concept Descriptions:
Data entries can be associated with classes or concepts. It can be useful to describe
individual classes and concepts in summarized, concise, and yet precise terms.
Such descriptions of a class or a concept are referred to as class/concept descriptions.
• Data Characterization:
This refers to summarizing the general characteristics or features of the class under
study. For example, to study the characteristics of a software product whose sales
increased by 15% two years ago, one can collect the data related to such products by
running SQL queries (a small sketch follows this list).
• Data Discrimination:
It compares the general features of the class under study with those of one or more
contrasting classes. The output of this process can be represented in many forms,
e.g., bar charts, curves, and pie charts.
• Frequent Substructure:
It refers to different kinds of structural forms, such as trees and graphs, that may be
combined with itemsets or subsequences.
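As noted under Data Characterization, such summaries can be collected by running SQL queries. The sketch below uses Python's built-in sqlite3 module on a hypothetical products table (the table name, columns, and values are assumptions) to characterize products whose sales increased by at least 15%.

```python
# Sketch of data characterization via SQL: summarize the general features of a
# target class (products whose sales grew by at least 15%). The table and its
# values are hypothetical and built in memory for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE products
               (name TEXT, category TEXT, price REAL, sales_growth REAL)""")
con.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    ("EditorX",  "software", 49.0, 0.18),
    ("PhotoPro", "software", 99.0, 0.22),
    ("GameY",    "software", 29.0, 0.05),
    ("OfficeZ",  "software", 79.0, 0.16),
])

# Characterize the target class: count and average price of products
# whose sales increased by 15% or more.
row = con.execute("""SELECT COUNT(*), AVG(price)
                     FROM products
                     WHERE sales_growth >= 0.15""").fetchone()
print("products in target class:", row[0])
print("average price of target class:", row[1])
```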
Association Analysis:
The process involves uncovering relationships in the data and deciding the rules of
association. It is a way of discovering relationships between various items. For
example, it can be used to determine which items are frequently purchased together.
Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly pairs
of attributes are related to each other. For example, taller people tend to weigh more.
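To make the height/weight example concrete, the sketch below computes the Pearson correlation coefficient between two small invented lists of heights and weights; both the data and the use of NumPy are assumptions for illustration.

```python
# Sketch of correlation analysis: how strongly are height and weight related?
# The measurements are invented for illustration.
import numpy as np

height_cm = np.array([150, 160, 165, 172, 180, 188])
weight_kg = np.array([52, 58, 63, 70, 79, 88])

# Pearson correlation coefficient: +1 = perfect positive linear relationship,
# 0 = no linear relationship, -1 = perfect negative relationship.
r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"correlation between height and weight: {r:.3f}")
```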
Interestingness Of Patterns
A data mining system has the potential to generate thousands or even millions of patterns,
or rules. Are all of the patterns interesting? Typically not; only a small fraction of
the patterns potentially generated would actually be of interest to any given user.
This raises some serious questions for data mining. You may wonder, “What makes
a pattern interesting? Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?”
To answer the first question, a pattern is interesting if it is (1) easily understood by
humans, (2) valid on new or test data with some degree of certainty, (3) potentially
useful, and (4) novel.
An objective measure for association rules of the form X => Y is rule support, representing
the percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X U Y), where X U Y indicates that a transaction contains
both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability
that a transaction containing X also contains Y. More formally, support and confidence are
defined as
support(X => Y) = P(X U Y)
confidence(X => Y) = P(Y | X) = support(X U Y) / support(X)
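A minimal sketch of these two measures, computed over a small hypothetical transaction database (the transactions are invented for illustration):

```python
# Sketch: compute support and confidence of the rule X => Y from a small
# hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

X, Y = {"bread"}, {"milk"}

n = len(transactions)
count_x = sum(1 for t in transactions if X <= t)         # transactions containing X
count_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X and Y

support = count_xy / n           # P(X U Y)
confidence = count_xy / count_x  # P(Y | X)

print(f"support(X => Y)    = {support:.2f}")
print(f"confidence(X => Y) = {confidence:.2f}")
```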
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
If a data mining system is not integrated with a database or a data warehouse system, then
there will be no system to communicate with. This scheme is known as the non-coupling
scheme. In this scheme, the main focus is on data mining design and on developing
efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
• No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is
stored in another file.
• Loose Coupling − In this scheme, the data mining system may use some of the
functions of the database and data warehouse system. It fetches the data from the
data repository managed by these systems and performs data mining on that data. It
then stores the mining result either in a file or in a designated place in a database or
in a data warehouse (a small sketch of this scheme follows the list).
• Semi−tight Coupling − In this scheme, the data mining system is linked with a
database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.
• Tight coupling − In this coupling scheme, the data mining system is smoothly
integrated into the database or data warehouse system. The data mining subsystem is
treated as one functional component of an information system.
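The difference between no coupling and loose coupling can be illustrated with a short sketch: the mining step fetches its input from a repository managed by a database system (here an in-memory SQLite database) and stores the mining result in a file. The table, the values, and the output file name are hypothetical.

```python
# Sketch of a loosely coupled design: fetch data from a database-managed
# repository, mine it, and store the result in a designated file.
import json
import sqlite3
from collections import Counter

# 1. Fetch data from the repository managed by the database system.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [("a", "bread"), ("a", "milk"), ("b", "bread"), ("c", "milk")])
rows = con.execute("SELECT item FROM purchases").fetchall()

# 2. Perform data mining on the fetched data (here, a trivial frequency count).
item_counts = Counter(item for (item,) in rows)

# 3. Store the mining result in a designated place (a file, in this sketch).
with open("mining_result.json", "w") as f:
    json.dump(item_counts.most_common(), f, indent=2)
```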
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
a) Data Cleaning
Data cleaning is the process of handling missing values, smoothing noisy data, and
correcting inconsistencies. Data in the real world is normally incomplete, noisy, and
inconsistent. The data available in data sources
might be lacking attribute values, data of interest etc. For example, you want the
demographic data of customers and what if the available data does not include
attributes for the gender or age of the customers? Then the data is of course
incomplete. Sometimes the data might contain errors or outliers. An example is an
age attribute with value 200. It is obvious that the age value is wrong in this case.
The data could also be inconsistent. For example, the name of an employee might be
stored differently in different data tables or documents. Here, the data is
inconsistent. If the data is not clean, the data mining results would be neither reliable
nor accurate.
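A minimal sketch of these cleaning steps, using pandas on an invented customer table (the library choice and all values are assumptions): missing ages are filled with the median, an impossible age of 200 is treated as an error, and the missing gender is given an explicit category.

```python
# Sketch of data cleaning with pandas on an invented customer table:
# handle a missing age, an impossible age value of 200, and a missing gender.
import pandas as pd

customers = pd.DataFrame({
    "name":   ["Asha", "Bala", "Chen", "Dev"],
    "gender": ["F", "M", None, "M"],    # missing attribute value
    "age":    [34, None, 200, 41],      # missing value and an obvious error
})

# Mark impossible ages as missing, then fill missing ages with the median.
customers.loc[customers["age"] > 120, "age"] = None
customers["age"] = customers["age"].fillna(customers["age"].median())

# Fill the missing gender with an explicit "unknown" category.
customers["gender"] = customers["gender"].fillna("unknown")

print(customers)
```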
b) Data Integration
Data integration is the process where data from different data sources are integrated
into one. Data lies in different formats in different locations. Data could be stored in
databases, text files, spreadsheets, documents, data cubes, Internet and so on. Data
integration is a really complex and tricky task because data from different sources
does not usually match. Suppose a table A contains an attribute named customer_id
whereas another table B contains an attribute named number. It is difficult to
ensure whether both these attributes refer to the same entity or not. Metadata can
be used effectively to reduce errors in the data integration process. Another issue
faced is data redundancy. The same data might be available in different tables in the
same database or even in different data sources. Data integration tries to reduce
redundancy to the maximum possible level without affecting the reliability of data.
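The customer_id/number mismatch mentioned above can be sketched with a small pandas merge: once metadata tells us that table B's "number" column refers to the same entity as table A's "customer_id", the columns are aligned and redundant rows removed. The tables and values are invented for illustration.

```python
# Sketch of data integration: align differently named key columns and
# remove redundant rows before merging two sources into one table.
import pandas as pd

table_a = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Pune", "Delhi", "Agra"]})
table_b = pd.DataFrame({"number": [1, 2, 2], "purchases": [5, 3, 3]})

# Metadata tells us that "number" in table B means the same as "customer_id".
table_b = table_b.rename(columns={"number": "customer_id"}).drop_duplicates()

# Integrate both sources into a single table.
integrated = table_a.merge(table_b, on="customer_id", how="left")
print(integrated)
```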
c) Data Selection
The data mining process requires large volumes of historical data for analysis. So,
usually the data repository with integrated data contains much more data than
actually required. From the available data, data of interest needs to be selected and
stored. Data selection is the process where the data relevant to the analysis is
retrieved from the database.
d) Data Transformation
Data transformation is the process of transforming and consolidating the data into
different forms that are suitable for mining. Data transformation normally involves
normalization, aggregation, generalization etc. For example, a data set available as "-
5, 37, 100, 89, 78" can be transformed as "-0.05, 0.37, 1.00, 0.89, 0.78". Here data
becomes more suitable for mining. After data transformation, the available data is
ready for data mining.
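The transformation of "-5, 37, 100, 89, 78" into "-0.05, 0.37, 1.00, 0.89, 0.78" divides every value by the maximum absolute value so the results fall within [-1, 1]. A minimal sketch of this normalization (plus the common min-max alternative) on the same values:

```python
# Sketch: reproduce the document's example by dividing each value by the
# maximum absolute value, so all values fall within [-1, 1].
values = [-5, 37, 100, 89, 78]

max_abs = max(abs(v) for v in values)
normalized = [round(v / max_abs, 2) for v in values]
print(normalized)  # [-0.05, 0.37, 1.0, 0.89, 0.78]

# Min-max normalization to [0, 1] is another common transformation.
lo, hi = min(values), max(values)
min_max = [round((v - lo) / (hi - lo), 2) for v in values]
print(min_max)
```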
e) Data Mining
Data mining is the core process where a number of complex and intelligent methods
are applied to extract patterns from data. The data mining process includes a number of
tasks such as association, classification, prediction, clustering, time series analysis
and so on.
f) Pattern Evaluation
The pattern evaluation identifies the truly interesting patterns representing
knowledge based on different types of interestingness measures. A pattern is
considered to be interesting if it is potentially useful, easily understandable by
humans, validates some hypothesis that someone wants to confirm or valid on new
data with some degree of certainty.
g) Knowledge Representation
The information mined from the data needs to be presented to the user in an
appealing way. Different knowledge representation and visualization techniques are
applied to provide the output of data mining to the users.
Summary
The data preparation methods along with the data mining tasks complete the data
mining process. The data mining process is not as simple as described here; each
data mining process faces a number of challenges and issues in real-life scenarios
while extracting potentially useful information.
Top-down discretization
If the process starts by first finding one or a few points (called split points or
cut points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, then it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split
points and removes some by merging neighboring values to form intervals, then it is called
bottom-up discretization or merging. Discretization can be performed recursively on an attribute
to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Typical methods
1. Binning
2. Histogram Analysis
3. Cluster Analysis
4. Entropy-Based Discretization
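A minimal sketch of the first of these methods, equal-width binning with pandas (the age values and the choice of three bins are assumptions), showing how a continuous attribute is replaced by interval labels that can serve as one level of a concept hierarchy:

```python
# Sketch of discretization by equal-width binning using pandas.cut.
# The ages and the choice of 3 bins are for illustration only.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 40, 45, 52, 70])

# Split the attribute range into 3 equal-width intervals and label them,
# producing a simple concept hierarchy level: young < middle_aged < senior.
age_groups = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

print(age_groups.value_counts())
```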
Measures of Central Tendency
• Mean: The most common and most effective numerical measure of the "center" of a
set of data is the (arithmetic) mean (sample vs. population).
• Trimmed mean: obtained by discarding values at the high and low extremes before
computing the mean.
– A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
– Even a small number of extreme values can corrupt the mean.
• Midrange
Measures of Dispersion
• Range: the difference between the highest and lowest observed values,
e.g., Range = L - S (largest value minus smallest value).
• Standard deviation
• Quartiles:
– First quartile (Q1): The first quartile is the value, where 25% of
the values are smaller than Q1 and 75% are larger.
– Third quartile (Q3): The third quartile is the value, where 75% of the
values are smaller than Q3 and 25% are larger.
• Outlier: usually, a value more than 1.5 x IQR above the third quartile (Q3) or below
the first quartile (Q1), where IQR = Q3 - Q1 is the interquartile range.
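A minimal sketch that computes Q1, Q3, the interquartile range, and flags outliers with the 1.5 x IQR rule; the observations are invented for illustration.

```python
# Sketch: quartiles, interquartile range (IQR), and the 1.5 * IQR outlier rule.
import numpy as np

data = np.array([12, 15, 14, 10, 13, 16, 14, 15, 95])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```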
Graphic Displays of Basic Descriptive Data Summaries
• There are many types of graphs for the display of data summaries
and distributions, such as:
– Bar charts
– Pie charts
– Line graphs
– Boxplot
– Histograms
– Quantile plots
Histogram Analysis
• Histograms or frequency histograms
– A univariate graphical method
– Consists of a set of rectangles that reflect the counts or frequencies of
the classes present in the given data
– If the attribute is categorical, such as automobile_model, then one
rectangle is drawn for each known value of the attribute, and the resulting graph is
more commonly referred to as a bar chart.
– If the attribute is numeric, the term histogram is preferred
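A minimal sketch of this distinction using matplotlib (an assumed plotting library, with invented values): a histogram for a numeric attribute and a bar chart for a categorical one.

```python
# Sketch: histogram for a numeric attribute vs. bar chart for a categorical one.
# Values are invented; matplotlib is assumed to be installed.
import matplotlib.pyplot as plt
from collections import Counter

prices = [40, 43, 47, 52, 55, 57, 60, 61, 65, 70, 72, 80, 85, 90, 110]
models = ["sedan", "sedan", "suv", "hatchback", "suv", "sedan", "suv"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Numeric attribute: frequency histogram over equal-width price intervals.
ax1.hist(prices, bins=5)
ax1.set_title("Histogram of unit prices")

# Categorical attribute: one rectangle per known value (a bar chart).
counts = Counter(models)
ax2.bar(list(counts.keys()), list(counts.values()))
ax2.set_title("Bar chart of automobile_model")

plt.tight_layout()
plt.show()
```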
Quantile Plot
• A quantile plot is a simple and effective way to have a first look at a
univariate data distribution.
• It displays all of the data, allowing the user to assess both the overall
behavior and unusual occurrences.
Scatter plot
• A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, clusters of points, or
outliers between two numerical attributes.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception
of the pattern of dependence