Data Mining Techniques Unit-1
• Pre-requisites: Data Mining
• In data mining, pattern evaluation is the process
of assessing the quality of discovered patterns.
This process is important in order to determine
whether the patterns are useful and whether
they can be trusted.
• There are a number of different measures that
can be used to evaluate patterns, and the choice
of measure will depend on the application.
There are several ways to evaluate pattern mining algorithms:
• Accuracy
The accuracy of a data mining model is a measure of
how correctly the model predicts the target values. The
accuracy is measured on a test dataset, which is
separate from the training dataset that was used to
train the model.
• There are a number of ways to measure accuracy, but
the most common is to calculate the percentage of
correct predictions. This is known as the accuracy
rate.
• Other common measures, used mainly for numeric
prediction, are the root mean squared error (RMSE)
and the mean absolute error (MAE). The RMSE is the
square root of the mean squared error, and the MAE
is the mean of the absolute errors. The accuracy of a
data mining model is important, but it is not the only
thing that should be considered. The model should also
be robust and generalizable.
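• A minimal Python sketch of these measures, assuming NumPy is available and using made-up example values:

import numpy as np

# Hypothetical actual vs. predicted class labels
y_true_cls = np.array([1, 0, 1, 1, 0])
y_pred_cls = np.array([1, 0, 0, 1, 0])
accuracy_rate = np.mean(y_true_cls == y_pred_cls) * 100   # percentage of correct predictions
print("Accuracy rate: %.1f%%" % accuracy_rate)

# Hypothetical actual vs. predicted numeric values for RMSE and MAE
y_true_num = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_num = np.array([2.5, 5.0, 3.0, 8.0])
rmse = np.sqrt(np.mean((y_true_num - y_pred_num) ** 2))   # root mean squared error
mae = np.mean(np.abs(y_true_num - y_pred_num))            # mean absolute error
print("RMSE:", rmse, "MAE:", mae)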
Classification Accuracy
• This measures how accurately the patterns
discovered by the algorithm can be used to classify
new data. This is typically done by taking a set of
data that has been labeled with known class labels
and then using the discovered patterns to predict the
class labels of the data. The accuracy can then be
computed by comparing the predicted labels to the
actual labels.
Clustering Accuracy
• This measures how accurately the patterns
discovered by the algorithm can be used to cluster
new data. This is typically done by taking a set of
data that has been labeled with known cluster labels
and then using the discovered patterns to predict the
cluster labels of the data. The accuracy can then be
computed by comparing the predicted labels to the
actual labels.
• There are a few ways to evaluate the accuracy of a clustering
algorithm:
• External indices: these indices compare the clusters produced by the
algorithm to some known ground truth. For example, the Rand Index or
the Jaccard coefficient can be used if the ground truth is known.
• Internal indices: these indices assess the goodness of a clustering
without reference to any external information. A well-known
internal index is the Dunn index.
• Stability: this measures how robust the clustering is to small changes in
the data. A clustering algorithm is said to be stable if, when applied to
different samples of the same data, it produces the same results.
• Efficiency: this measures how quickly the algorithm converges to the
correct clustering.
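• A minimal Python sketch of one external index, the Rand Index, computed directly from hypothetical ground-truth and predicted cluster labels:

from itertools import combinations

def rand_index(true_labels, pred_labels):
    # Fraction of point pairs on which the two labelings agree
    # (both points in the same cluster, or both in different clusters).
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))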
Coverage
• This measures how many of the possible patterns in
the data are discovered by the algorithm. It can
be computed by dividing the number of patterns
discovered by the algorithm by the total number of
possible patterns. A Coverage Pattern is
a type of sequential pattern that is found by looking
for items that tend to appear together in sequential
order. For example, a coverage pattern might be
“customers who purchase item A also tend to
purchase item B within the next month.”
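• A minimal sketch of the coverage ratio, using made-up counts:

# Coverage = number of patterns discovered / total number of possible patterns
possible_patterns = 1200      # hypothetical count
discovered_patterns = 300     # hypothetical count
coverage = discovered_patterns / possible_patterns
print("Coverage: %.2f" % coverage)   # 0.25, i.e. 25% of the possible patterns were found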
Visual Inspection
• This is perhaps the most common method: the data miner simply
looks at the discovered patterns to see whether they make sense.
In visual inspection, the data is plotted in a graphical format and
the patterns are observed. This method is used when the data is
not too large and can be easily plotted, and it is also useful when
the data is categorical in nature. Inspection can be done by looking
at a graph or plot of the data, or at the raw data itself, and it is
often used to find outliers or unusual patterns.
Running Time
• This measures how long it takes for the
algorithm to find the patterns in the data. This
is typically measured in seconds or minutes.
There are a few different ways to measure the
performance of a machine learning algorithm,
but one of the most common is simply to
measure the time it takes to train the model
and make predictions. This is known as
running-time evaluation.
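• A minimal Python sketch of running-time measurement, with a placeholder train_model function standing in for a real training routine:

import time

def train_model(data):
    # Placeholder for an actual training routine
    return sum(data)

data = list(range(1_000_000))
start = time.perf_counter()
model = train_model(data)
elapsed = time.perf_counter() - start
print("Training took %.3f seconds" % elapsed)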
Support
• The support of a pattern is the percentage of the
total number of records that contain the pattern.
Support Pattern evaluation is a process of finding
interesting and potentially useful patterns in data.
The purpose of support pattern evaluation is to
identify interesting patterns that may be useful for
decision-making. Support pattern evaluation is
typically used in data mining and machine learning
applications.
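• A minimal sketch of support over a made-up set of transactions:

# Support of an itemset = fraction of transactions that contain it
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"bread", "milk"}))   # 2 of 4 transactions -> 0.5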
Confidence
The confidence of a pattern (for example, an association rule A => B) is
the percentage of records containing A that also contain B; informally, it
is the percentage of times that the pattern is found to be correct.
Confidence pattern evaluation is a method of data mining that is used to
assess the quality of patterns found in data.
This evaluation is typically performed by calculating the percentage of
times a pattern is found in a data set and comparing it to the percentage
of times the pattern would be expected to appear based on the overall
distribution of the data. If the observed percentage is significantly higher
than the expected percentage, the pattern is said to be a strong,
high-confidence pattern.
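• A minimal sketch of rule confidence on made-up transaction data:

# Confidence of a rule A => B = support(A and B together) / support(A)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"bread"}, {"milk"}))   # 2 of the 3 bread transactions also contain milk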
Lift
• The lift of an association rule A => B measures how much more often
A and B occur together than would be expected if they were
independent: lift(A => B) = support(A ∪ B) / (support(A) × support(B)),
which is the same as confidence(A => B) / support(B). A lift greater
than 1 indicates a positive correlation between A and B, a lift of
exactly 1 indicates independence, and a lift below 1 indicates a
negative correlation. Lift is therefore used to judge whether a rule
with high support and confidence is actually interesting.
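• A minimal sketch of lift on the same kind of made-up transaction data:

# Lift(A => B) = support(A and B together) / (support(A) * support(B))
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a, b):
    return support(a | b) / (support(a) * support(b))

print(lift({"bread"}, {"milk"}))   # > 1: positive correlation, = 1: independence, < 1: negative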
Prediction
• The prediction accuracy of a pattern is the percentage of times
that its predictions turn out to be correct. Prediction
pattern evaluation is a data mining technique used to
assess the accuracy of predictive models: it determines
how well a model can predict future outcomes based on
past data. It can be used to compare different models,
or to evaluate the performance of a single model.
Precision
• Precision measures the proportion of the patterns, or positive
predictions, reported by a model that are actually correct.
Precision pattern evaluation can be used to identify patterns
and trends in data collected from a variety of sources, to
evaluate the accuracy of the data, to identify errors in the
data and determine their cause, and to assess the impact of
those errors on the overall accuracy of the results.
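• For reference, a minimal sketch of the standard precision calculation on hypothetical labels, where precision is the share of reported positives that are truly positive:

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
precision = true_pos / (true_pos + false_pos)
print("Precision:", precision)   # 3 / (3 + 1) = 0.75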
Cross-Validation
• This method involves partitioning the data into two
sets, training the model on one set, and then testing
it on the other. This can be done multiple times, with
different partitions, to get a more reliable estimate of
the model’s performance. Cross-validation is a model
validation technique for assessing how the results of
a data mining analysis will generalize to an
independent data set.
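• A minimal 5-fold cross-validation sketch, assuming scikit-learn is installed and using a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 different train/test partitions
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())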
Test Set
• This method involves partitioning the data into two sets,
training the model on the training set, and then testing it on
the held-out test set. This is simpler and cheaper than
cross-validation, but the resulting estimate can be less reliable,
especially when the data set is small. There are a number of
ways to evaluate the performance of a model on a test set.
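• A minimal hold-out (train/test split) sketch, again assuming scikit-learn and a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out test accuracy:", model.score(X_test, y_test))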
Data Mining - Data Integration:
• INTRODUCTION:
Data integration in data mining refers to the process of
combining data from multiple sources into a single, unified
view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may
exist between the different sources. The goal of data
integration is to make the data more useful and meaningful
for the purposes of analysis and decision making.
Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and
data federation.
• Data Integration is a data preprocessing technique that
combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of
the data. These sources may include multiple data cubes,
databases, or flat files.
• The data integration approach is formally defined as a
triple <G, S, M> where,
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between queries over the source and
global schemas.
What is data integration
• Data integration is the process of combining
data from multiple sources into a cohesive and
consistent view. This process involves
identifying and accessing the different data
sources, mapping the data to a common
format, and reconciling any inconsistencies or
discrepancies between the sources.
• The goal of data integration is to make it
easier to access and analyze data that is
spread across multiple systems or platforms,
in order to gain a more complete and accurate
understanding of the data.
• Data integration can be challenging due to the
variety of data formats, structures, and
semantics used by different data sources.
• Different data sources may use different data
types, naming conventions, and schemas,
making it difficult to combine the data into a
single view. Data integration typically involves
a combination of manual and automated
processes, including data profiling, data
mapping, data transformation, and data
reconciliation.
• Data integration is used in a wide range of
applications, such as business intelligence, data
warehousing, master data management, and
analytics. Data integration can be critical to the
success of these applications, as it enables
organizations to access and analyze data that is
spread across different systems, departments, and
lines of business, in order to make better decisions,
improve operational efficiency, and gain a
competitive advantage.
• There are two major approaches to data integration: the “tight
coupling” approach and the “loose coupling” approach.
• Tight Coupling:
• This approach involves creating a centralized repository or data
warehouse to store the integrated data. The data is extracted
from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner,
meaning that the data is integrated at a high level, such as at the
level of the entire dataset or schema. This approach is also
known as data warehousing, and it enables data consistency and
integrity, but it can be inflexible and difficult to change or update.
• Here, a data warehouse is treated as an
information retrieval component.
• In this coupling, data is combined from
different sources into a single physical location
through the process of ETL – Extraction,
Transformation, and Loading.
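• A toy ETL sketch of the tight-coupling idea: records are extracted from two hypothetical sources with different schemas, transformed to a common schema, and loaded into a single list standing in for the warehouse:

source_a = [{"cust_id": 1, "amount_usd": 100.0}]
source_b = [{"customer": 1, "amount_cents": 2500}]

def transform_a(rec):
    return {"customer_id": rec["cust_id"], "amount": rec["amount_usd"]}

def transform_b(rec):
    return {"customer_id": rec["customer"], "amount": rec["amount_cents"] / 100}

warehouse = []                                        # stands in for the central data warehouse
warehouse.extend(transform_a(r) for r in source_a)    # extract + transform + load
warehouse.extend(transform_b(r) for r in source_b)
print(warehouse)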
Loose Coupling:
• In this approach, the data remains in the original source databases. An
interface takes the query from the user, transforms it into a form each
source database can understand, and sends it directly to the sources;
the returned results are then combined into a unified view, and the data
itself stays in the actual source databases rather than being copied into
a central store.
Issues in Data Integration
• There are several issues that can arise when integrating data from
multiple sources, including:
• Data Quality: Inconsistencies and errors in the data can make it difficult
to combine and analyze.
• Data Semantics: Different sources may use different terms or
definitions for the same data, making it difficult to combine and
understand the data.
• Data Heterogeneity: Different sources may use different data formats,
structures, or schemas, making it difficult to combine and analyze the
data.
• Data Privacy and Security: Protecting sensitive information and
maintaining security can be difficult when integrating data from
multiple sources.
• Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
• Data Governance: Managing and maintaining the integration of data from
multiple sources can be difficult, especially when it comes to ensuring
data accuracy, consistency, and timeliness.
• Performance: Integrating data from multiple sources can also affect the
performance of the system.
• Integration with existing systems: Integrating new data sources with
existing systems can be a complex task, requiring significant effort and
resources.
• Complexity: The complexity of integrating data from multiple sources can
be high, requiring specialized skills and knowledge.
There are three issues to consider during data integration:
Schema Integration, Redundancy Detection, and resolution of
data value conflicts. These are explained in brief below.
• 1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to
as the entity identification problem.
• 2. Redundancy Detection:
• An attribute may be redundant if it can be derived or
obtained from another attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in
the resulting data set.
• Some redundancies can be detected by correlation analysis (see the
sketch after this list).
• 3. Resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the
“same” attribute in another.
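• A minimal sketch of correlation analysis for redundancy detection, using NumPy and made-up attributes; a near-perfect correlation flags a candidate redundant attribute:

import numpy as np

age = np.array([23, 35, 45, 52, 61])
birth_year = 2024 - age                    # derivable from age, so redundant
income = np.array([30, 48, 52, 70, 66])

print(np.corrcoef(age, birth_year)[0, 1])  # -1.0: perfectly correlated -> redundant
print(np.corrcoef(age, income)[0, 1])      # weaker correlation -> likely not redundant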
• Data Mining challenges
• Security and Social Challenges.
• Noisy and Incomplete Data.
• Distributed Data.
• Complex Data.
• Performance.
• Scalability and Efficiency of the Algorithms.
• Improvement of Mining Algorithms.
• Incorporation of Background Knowledge.
Data Preprocessing
• 1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this,
data cleaning is done. It involves handling missing data, noisy data,
etc.
• (a). Missing Data:
This situation arises when some values are missing from the data. It can
be handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
– Fill the Missing values:
There are various ways to do this task. You can
choose to fill the missing values manually, by
attribute mean or the most probable value.
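• A minimal sketch of filling missing values with the attribute mean, assuming pandas is installed:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None]})
df["age"] = df["age"].fillna(df["age"].mean())   # missing ages replaced by the mean of the observed ones
print(df)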
• (b). Noisy Data:
Noisy data is meaningless data that cannot be
interpreted by machines. It can be generated due
to faulty data collection, data entry errors, etc. It
can be handled in the following ways:
– Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size, and then various methods are performed
to complete the task. Each segment is handled separately: one can replace all
data in a segment by its mean, or boundary values can be used (a minimal
sketch follows this list).
– Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
– Clustering:
This approach groups similar data into clusters. Outliers either fall outside the
clusters or may go undetected.
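• A minimal sketch of smoothing by bin means on made-up sorted data:

# Sort the data, split it into equal-size bins, replace each value by its bin mean
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))

print(smoothed)   # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]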
• 2. Data Transformation:
This step is taken in order to transform the data into forms appropriate
for the mining process. This involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (e.g.,
-1.0 to 1.0 or 0.0 to 1.0); a minimal sketch appears after this list.
• Attribute Selection:
In this strategy, new attributes are constructed from the given
set of attributes to help the mining process.
• Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be converted to “country”.
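• A minimal sketch of min-max normalization of an attribute to the range 0.0 to 1.0, using made-up values:

values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.125, 0.25, 0.5, 1.0]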
• 3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. To deal with this, data reduction techniques
are used. They aim to increase storage efficiency and reduce data storage and
analysis costs.
• The various steps of data reduction are:
• Data Cube Aggregation:
Aggregation operations are applied to the data in the construction of the data cube.
Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing
attribute selection, one can use the level of significance and the p-value of the attribute:
an attribute having a p-value greater than the significance level can be discarded.
Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression
models.
Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless. If
the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component
Analysis).
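• A minimal dimensionality-reduction sketch using PCA from scikit-learn (assumed installed) on synthetic 10-dimensional data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 records, 10 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2): reduced to 2 principal components
print(pca.explained_variance_ratio_)     # variance retained by each component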