BCA-404: Data Mining and Data Warehousing
Unit-1
What Motivated Data Mining
The major reason for using data mining techniques is the requirement for useful information
and knowledge from huge amounts of data. The information and knowledge gained can be
used in many applications such as business management, production control, etc. Data
mining came into existence as a result of the natural evolution of information technology.
Why is it Important?
Data mining starts with the client. Clients naturally collect data simply by doing business,
so that is where the entire process begins. But Customer Relationship Management (CRM)
data is only one part of the puzzle. The other part of the equation is competitive data,
industry survey data, blogs, and social media conversations. By themselves, CRM data and
survey data can provide very good information, but when combined with the other data
available they become powerful.
Data Mining is the process of analyzing and exploring that data to discover patterns and
trends.
The term Data Mining is one that is used frequently in the research world, but it is often
misunderstood by many people. Sometimes people misuse the term to mean any kind of
extraction of data or data processing. However, data mining is so much more than simple
data analysis. According to Doug Alexander at the University of Texas, data mining is, “the
computer-assisted process of digging through and analyzing enormous sets of data and then
extracting the meaning of the data. Data mining tools predict behaviours and future trends,
allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They scour
databases for hidden patterns, finding predictive information that experts may miss because
it lies outside their expectations.”
Data mining functions are used to define the trends or correlations contained in data
mining activities.
Data mining activities can be divided into two categories:
1. Descriptive Data Mining:
It describes what is happening within the data without any prior hypothesis, highlighting
the common features of the data set.
For example: count, average, etc.
2. Predictive Data Mining:
It estimates values of attributes that are not explicitly available. Based on previous
data, the model predicts the characteristics that are absent.
For example: judging from the findings of a patient's medical examinations whether he is
suffering from a particular disease (both categories are sketched in the code below).
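As a minimal illustration of the two categories, the sketch below computes simple descriptive summaries (count, average) over a small invented patient data set and then fits a tiny decision-tree classifier to estimate a missing diagnosis. The patient values and the use of the scikit-learn library are assumptions made only for this example.

```python
# Minimal sketch: descriptive vs. predictive data mining on an invented data set.
# Assumes scikit-learn is installed; all values are made up for illustration.
from statistics import mean
from sklearn.tree import DecisionTreeClassifier

# Each record: [age, blood_pressure]; label: 1 = has disease, 0 = healthy
examinations = [[45, 130], [50, 145], [23, 110], [60, 160], [35, 120]]
diagnoses = [1, 1, 0, 1, 0]

# Descriptive mining: summarize what is already in the data (count, average).
print("patients examined:", len(examinations))
print("average age:", mean(row[0] for row in examinations))
print("average blood pressure:", mean(row[1] for row in examinations))

# Predictive mining: learn from previous examinations and estimate the
# missing diagnosis of a new patient.
model = DecisionTreeClassifier().fit(examinations, diagnoses)
new_patient = [[55, 150]]
print("predicted diagnosis:", model.predict(new_patient)[0])
```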
1. Class/Concept Descriptions:
Data entries can be associated with classes or concepts. It can be useful to describe
individual classes and concepts in summarized, concise, and yet precise terms.
Such descriptions of a class or a concept are referred to as class/concept descriptions.
• Data Characterization:
This refers to summarizing the general characteristics or features of the class under
study. For example, to study the characteristics of a software product whose sales
increased by 15% two years ago, one can collect the data related to such products by
running SQL queries (a small sketch follows this list).
• Data Discrimination:
It compares the general features of the class under study with those of one or more
contrasting classes. The output of this process can be represented in many forms,
e.g., bar charts, curves, and pie charts.
• Frequent Substructure:
It refers to different kinds of structural forms, such as trees and graphs, that may be
combined with itemsets or subsequences.
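As noted under Data Characterization, such summaries can be collected by running SQL queries. The sketch below uses Python's built-in sqlite3 module on a hypothetical products table (the table name, columns, and values are assumptions) to characterize products whose sales increased by at least 15%.

```python
# Sketch of data characterization via SQL: summarize the general features of a
# target class (products whose sales grew by at least 15%). The table and its
# values are hypothetical and built in memory for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE products
               (name TEXT, category TEXT, price REAL, sales_growth REAL)""")
con.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    ("EditorX",  "software", 49.0, 0.18),
    ("PhotoPro", "software", 99.0, 0.22),
    ("GameY",    "software", 29.0, 0.05),
    ("OfficeZ",  "software", 79.0, 0.16),
])

# Characterize the target class: count and average price of products
# whose sales increased by 15% or more.
row = con.execute("""SELECT COUNT(*), AVG(price)
                     FROM products
                     WHERE sales_growth >= 0.15""").fetchone()
print("products in target class:", row[0])
print("average price of target class:", row[1])
```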
Association Analysis:
The process involves uncovering relationships in the data and deciding the rules of
association. It is a way of discovering relationships between various items. For
example, it can be used to determine which items are frequently purchased together.
Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly pairs
of attributes are related to each other. For example, taller people tend to weigh more.
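To make the height/weight example concrete, the sketch below computes the Pearson correlation coefficient between two small invented lists of heights and weights; both the data and the use of NumPy are assumptions for illustration.

```python
# Sketch of correlation analysis: how strongly are height and weight related?
# The measurements are invented for illustration.
import numpy as np

height_cm = np.array([150, 160, 165, 172, 180, 188])
weight_kg = np.array([52, 58, 63, 70, 79, 88])

# Pearson correlation coefficient: +1 = perfect positive linear relationship,
# 0 = no linear relationship, -1 = perfect negative relationship.
r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"correlation between height and weight: {r:.3f}")
```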
Interestingness Of Patterns
A data mining system has the potential to generate thousands or even millions of patterns,
or rules. Are all of the patterns interesting? Typically not; only a small fraction of
the patterns potentially generated would actually be of interest to any given user.
This raises some serious questions for data mining. You may wonder, “What makes
a pattern interesting? Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?”
To answer the first question, a pattern is interesting if it is (1) easily understood by
humans, (2) valid on new or test data with some degree of certainty, (3) potentially
useful, and (4) novel.
An objective measure for association rules of the form X => Y is rule support, representing
the percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X U Y), where X U Y indicates that a transaction contains
both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability
that a transaction containing X also contains Y. More formally, support and confidence are
defined as
support(X => Y) = P(X U Y)
confidence(X => Y) = P(Y | X) = support(X U Y) / support(X)
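A minimal sketch of these two measures, computed over a small hypothetical transaction database (the transactions are invented for illustration):

```python
# Sketch: compute support and confidence of the rule X => Y from a small
# hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

X, Y = {"bread"}, {"milk"}

n = len(transactions)
count_x = sum(1 for t in transactions if X <= t)         # transactions containing X
count_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X and Y

support = count_xy / n           # P(X U Y)
confidence = count_xy / count_x  # P(Y | X)

print(f"support(X => Y)    = {support:.2f}")
print(f"confidence(X => Y) = {confidence:.2f}")
```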
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
If a data mining system is not integrated with a database or a data warehouse system, then
there will be no system to communicate with. This scheme is known as the non-coupling
scheme. In this scheme, the main focus is on data mining design and on developing
efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
• No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is
stored in another file.
• Loose Coupling − In this scheme, the data mining system may use some of the
functions of the database and data warehouse system. It fetches the data from the
data repository managed by these systems and performs data mining on that data. It
then stores the mining result either in a file or in a designated place in a database or
in a data warehouse (a small sketch of this scheme follows the list).
• Semi−tight Coupling − In this scheme, the data mining system is linked with a
database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.
• Tight coupling − In this coupling scheme, the data mining system is smoothly
integrated into the database or data warehouse system. The data mining subsystem is
treated as one functional component of an information system.
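The difference between no coupling and loose coupling can be illustrated with a short sketch: the mining step fetches its input from a repository managed by a database system (here an in-memory SQLite database) and stores the mining result in a file. The table, the values, and the output file name are hypothetical.

```python
# Sketch of a loosely coupled design: fetch data from a database-managed
# repository, mine it, and store the result in a designated file.
import json
import sqlite3
from collections import Counter

# 1. Fetch data from the repository managed by the database system.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [("a", "bread"), ("a", "milk"), ("b", "bread"), ("c", "milk")])
rows = con.execute("SELECT item FROM purchases").fetchall()

# 2. Perform data mining on the fetched data (here, a trivial frequency count).
item_counts = Counter(item for (item,) in rows)

# 3. Store the mining result in a designated place (a file, in this sketch).
with open("mining_result.json", "w") as f:
    json.dump(item_counts.most_common(), f, indent=2)
```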
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
a) Data Cleaning
Data cleaning is the process of handling missing values, smoothing noisy data, and
correcting inconsistencies. Data in the real world is normally incomplete, noisy, and
inconsistent. The data available in data sources
might be lacking attribute values, data of interest etc. For example, you want the
demographic data of customers and what if the available data does not include
attributes for the gender or age of the customers? Then the data is of course
incomplete. Sometimes the data might contain errors or outliers. An example is an
age attribute with value 200. It is obvious that the age value is wrong in this case.
The data could also be inconsistent. For example, the name of an employee might be
stored differently in different data tables or documents. Here, the data is
inconsistent. If the data is not clean, the data mining results would be neither reliable
nor accurate.
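A minimal sketch of these cleaning steps, using pandas on an invented customer table (the library choice and all values are assumptions): missing ages are filled with the median, an impossible age of 200 is treated as an error, and the missing gender is given an explicit category.

```python
# Sketch of data cleaning with pandas on an invented customer table:
# handle a missing age, an impossible age value of 200, and a missing gender.
import pandas as pd

customers = pd.DataFrame({
    "name":   ["Asha", "Bala", "Chen", "Dev"],
    "gender": ["F", "M", None, "M"],    # missing attribute value
    "age":    [34, None, 200, 41],      # missing value and an obvious error
})

# Mark impossible ages as missing, then fill missing ages with the median.
customers.loc[customers["age"] > 120, "age"] = None
customers["age"] = customers["age"].fillna(customers["age"].median())

# Fill the missing gender with an explicit "unknown" category.
customers["gender"] = customers["gender"].fillna("unknown")

print(customers)
```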
b) Data Integration
Data integration is the process where data from different data sources are integrated
into one. Data lies in different formats in different locations. Data could be stored in
databases, text files, spreadsheets, documents, data cubes, Internet and so on. Data
integration is a really complex and tricky task because data from different sources
does not usually match. Suppose a table A contains an attribute named customer_id
whereas another table B contains an attribute named number. It is difficult to
ensure whether both these attributes refer to the same entity or not. Metadata can
be used effectively to reduce errors in the data integration process. Another issue
faced is data redundancy. The same data might be available in different tables in the
same database or even in different data sources. Data integration tries to reduce
redundancy to the maximum possible level without affecting the reliability of data.
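The customer_id/number mismatch mentioned above can be sketched with a small pandas merge: once metadata tells us that table B's "number" column refers to the same entity as table A's "customer_id", the columns are aligned and redundant rows removed. The tables and values are invented for illustration.

```python
# Sketch of data integration: align differently named key columns and
# remove redundant rows before merging two sources into one table.
import pandas as pd

table_a = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Pune", "Delhi", "Agra"]})
table_b = pd.DataFrame({"number": [1, 2, 2], "purchases": [5, 3, 3]})

# Metadata tells us that "number" in table B means the same as "customer_id".
table_b = table_b.rename(columns={"number": "customer_id"}).drop_duplicates()

# Integrate both sources into a single table.
integrated = table_a.merge(table_b, on="customer_id", how="left")
print(integrated)
```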
c) Data Selection
The data mining process requires large volumes of historical data for analysis. So,
usually the data repository with integrated data contains much more data than
actually required. From the available data, data of interest needs to be selected and
stored. Data selection is the process where the data relevant to the analysis is
retrieved from the database.
d) Data Transformation
Data transformation is the process of transforming and consolidating the data into
different forms that are suitable for mining. Data transformation normally involves
normalization, aggregation, generalization etc. For example, a data set available as "-
5, 37, 100, 89, 78" can be transformed as "-0.05, 0.37, 1.00, 0.89, 0.78". Here data
becomes more suitable for mining. After data transformation, the available data is
ready for data mining.
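The transformation of "-5, 37, 100, 89, 78" into "-0.05, 0.37, 1.00, 0.89, 0.78" divides every value by the maximum absolute value so the results fall within [-1, 1]. A minimal sketch of this normalization (plus the common min-max alternative) on the same values:

```python
# Sketch: reproduce the document's example by dividing each value by the
# maximum absolute value, so all values fall within [-1, 1].
values = [-5, 37, 100, 89, 78]

max_abs = max(abs(v) for v in values)
normalized = [round(v / max_abs, 2) for v in values]
print(normalized)  # [-0.05, 0.37, 1.0, 0.89, 0.78]

# Min-max normalization to [0, 1] is another common transformation.
lo, hi = min(values), max(values)
min_max = [round((v - lo) / (hi - lo), 2) for v in values]
print(min_max)
```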
e) Data Mining
Data mining is the core process where a number of complex and intelligent methods
are applied to extract patterns from data. The data mining process includes a number of
tasks such as association, classification, prediction, clustering, time series analysis
and so on.
f) Pattern Evaluation
The pattern evaluation identifies the truly interesting patterns representing
knowledge based on different types of interestingness measures. A pattern is
considered to be interesting if it is potentially useful, easily understandable by
humans, validates some hypothesis that someone wants to confirm or valid on new
data with some degree of certainty.
g) Knowledge Representation
The information mined from the data needs to be presented to the user in an
appealing way. Different knowledge representation and visualization techniques are
applied to provide the output of data mining to the users.
Summary
The data preparation methods along with the data mining tasks complete the data
mining process. The data mining process is not as simple as described here; each
data mining process faces a number of challenges and issues in real-life scenarios
while extracting potentially useful information.
Top-down discretization
If the process starts by first finding one or a few points (called split points or
cut points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, then it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split
points and removes some by merging neighboring values to form intervals, then it is called
bottom-up discretization or merging. Discretization can be performed recursively on an attribute
to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Typical methods
1. Binning
2. Histogram Analysis
3. Cluster Analysis
4. Entropy-Based Discretization
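A minimal sketch of the first of these methods, equal-width binning with pandas (the age values and the choice of three bins are assumptions), showing how a continuous attribute is replaced by interval labels that can serve as one level of a concept hierarchy:

```python
# Sketch of discretization by equal-width binning using pandas.cut.
# The ages and the choice of 3 bins are for illustration only.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 40, 45, 52, 70])

# Split the attribute range into 3 equal-width intervals and label them,
# producing a simple concept hierarchy level: young < middle_aged < senior.
age_groups = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

print(age_groups.value_counts())
```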
Measures of Central Tendency
• Mean: The most common and most effective numerical measure of the "center" of a
set of data is the (arithmetic) mean (sample vs. population).
• Trimmed mean: obtained by discarding values at the high and low extremes before
computing the mean.
– A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
– Even a small number of extreme values can corrupt the mean.
• Midrange
Measures of Dispersion
• Range: the difference between the highest and lowest observed values,
e.g., Range = L - S (largest value minus smallest value).
• Standard deviation
• Quartiles:
– First quartile (Q1): The first quartile is the value, where 25% of
the values are smaller than Q1 and 75% are larger.
– Third quartile (Q3): The third quartile is the value, where 75% of the
values are smaller than Q3 and 25% are larger.
• Outlier: usually, a value more than 1.5 x IQR above the third quartile (Q3) or below
the first quartile (Q1), where IQR = Q3 - Q1 is the interquartile range.
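A minimal sketch that computes Q1, Q3, the interquartile range, and flags outliers with the 1.5 x IQR rule; the observations are invented for illustration.

```python
# Sketch: quartiles, interquartile range (IQR), and the 1.5 * IQR outlier rule.
import numpy as np

data = np.array([12, 15, 14, 10, 13, 16, 14, 15, 95])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```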
Graphic Displays of Basic Descriptive Data Summaries
• There are many types of graphs for the display of data summaries
and distributions, such as:
– Bar charts
– Pie charts
– Line graphs
– Boxplot
– Histograms
– Quantile plots
Histogram Analysis
• Histograms or frequency histograms
– A univariate graphical method
– Consists of a set of rectangles that reflect the counts or frequencies of
the classes present in the given data
– If the attribute is categorical, such as automobile_model, then one
rectangle is drawn for each known value of the attribute, and the resulting graph is
more commonly referred to as a bar chart.
– If the attribute is numeric, the term histogram is preferred
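A minimal sketch of this distinction using matplotlib (an assumed plotting library, with invented values): a histogram for a numeric attribute and a bar chart for a categorical one.

```python
# Sketch: histogram for a numeric attribute vs. bar chart for a categorical one.
# Values are invented; matplotlib is assumed to be installed.
import matplotlib.pyplot as plt
from collections import Counter

prices = [40, 43, 47, 52, 55, 57, 60, 61, 65, 70, 72, 80, 85, 90, 110]
models = ["sedan", "sedan", "suv", "hatchback", "suv", "sedan", "suv"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Numeric attribute: frequency histogram over equal-width price intervals.
ax1.hist(prices, bins=5)
ax1.set_title("Histogram of unit prices")

# Categorical attribute: one rectangle per known value (a bar chart).
counts = Counter(models)
ax2.bar(list(counts.keys()), list(counts.values()))
ax2.set_title("Bar chart of automobile_model")

plt.tight_layout()
plt.show()
```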
Quantile Plot
• A quantile plot is a simple and effective way to have a first look at a
univariate data distribution.
• It displays all of the data, allowing the user to assess both the overall
behavior and unusual occurrences.
Scatter plot
• A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, clusters of points, or
outliers between two numerical attributes.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception
of the pattern of dependence