Data Mining Techniques Unit-1


Data Mining Techniques

Subject Code : C-214


By Dr. Sanjay Kumar
Unit-1
Data Mining Tutorial
• Data mining is one of the most useful techniques that
help entrepreneurs, researchers, and individuals to
extract valuable information from huge sets of data.
• Data mining is also called Knowledge Discovery in
Database (KDD). The knowledge discovery process
includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern
evaluation, and Knowledge presentation.
What is Data Mining?

• The process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow a business to make data-driven decisions is called Data Mining.
• In other words, Data Mining is the process of examining data from various perspectives to uncover hidden patterns and categorize it into useful information. This information is collected and assembled in particular areas such as data warehouses, analysed with efficient data mining algorithms, and used to support decision-making, cut costs, and generate revenue.
• Data mining is the act of automatically searching large stores of data to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD).
• Data Mining is a process used by organizations to
extract specific data from huge databases to solve
business problems. It primarily turns raw data into
useful information.
Types of Data Used in Data Mining

• Data mining can be performed on the following types of data:
• Relational Database:
• A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
• A Data Warehouse is the technology that collects the
data from various sources within the organization to
provide meaningful business insights. The huge
amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized
for analytical purposes and helps in decision- making
for a business organization. The data warehouse is
designed for the analysis of data rather than
transaction processing.
Data Repositories:

• The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.
• Object-Relational Database:
• A combination of an object-oriented database
model and relational database model is called an
object-relational model. It supports Classes,
Objects, Inheritance, etc.
Transactional Database:

• A transactional database refers to a database management system (DBMS) that has the ability to undo a database transaction if it is not performed appropriately. Although this was a unique capability long ago, today most relational database systems support transactional database activities.
Advantages of Data Mining

• The Data Mining technique enables organizations to obtain knowledge-based information.
• Data mining enables organizations to make lucrative modifications in operation and production.
• Compared with other statistical data applications, data mining is cost-efficient.
• Data mining helps the decision-making process of an organization.
• It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
• It can be introduced into new systems as well as existing platforms.
Disadvantages of Data Mining

• There is a probability that organizations may sell useful data about their customers to other organizations for money. For example, American Express has reportedly sold credit card purchase data of its customers to other organizations.
• Many data mining analytics software packages are difficult to operate and need advanced training to work with.
• Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very challenging task.
• Data mining techniques are not always accurate, so they may lead to severe consequences in certain conditions.
Data Mining Applications

• Data mining is primarily used by organizations with intense consumer demand, such as retail, communication, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits.
• Data mining enables a retailer to use point-of-sale
records of customer purchases to develop products
and promotions that help the organization to attract
the customer.
Data Mining in Healthcare:

• Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics to gain better insights and to identify best practices that will enhance health care services and reduce costs.
• Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can also be used to forecast the volume of patients in each category.
Data Mining in Market Basket Analysis:

• Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer.
• This data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. It also allows an analytical comparison of results between different stores and between customers in different demographic groups.
Data mining in Education:

• Educational data mining is a newly emerging field concerned with developing techniques that extract knowledge from the data generated in educational environments.
• EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science.
• An organization can use data mining to make precise decisions and to predict students' results. With the results, the institution can concentrate on what to teach and how to teach it.
Data Mining in Manufacturing Engineering:

• Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial to find patterns in a complex manufacturing process.
• Data mining can be used in system-level designing to
obtain the relationships between product architecture,
product portfolio, and data needs of the customers.
• It can also be used to forecast the product development
period, cost, and expectations among the other tasks.
Data Mining in CRM (Customer Relationship Management):

• Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies.
• To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud detection:

• Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining helps provide meaningful patterns and turn data into information.
• An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records that are classified as fraudulent or non-fraudulent.
• A model is constructed using this data, and the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:

• Apprehending a criminal is not the hard part; bringing out the truth from a suspect is the real challenge. Law enforcement agencies may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on.
• This also includes text mining, which seeks meaningful patterns in data that is usually unstructured text. Information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining Financial Banking:

• The digitalization of the banking system generates an enormous amount of data with every new transaction.
• Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market costs that are not immediately evident to managers or executives, because the volume of data is too large or it is produced too rapidly for experts to examine.
Challenges of Implementation in Data mining

• Although data mining is very powerful, it faces many challenges during its execution. The various challenges can relate to performance, data, methods, techniques, and so on.
• The process of data mining becomes effective when the challenges or problems are correctly recognized and adequately resolved.
Incomplete and noisy data:

• Data mining is the process of extracting useful information from large volumes of data. Real-world data is heterogeneous, incomplete, and noisy, and data in huge quantities will usually be inaccurate or unreliable.
• These problems may occur due to errors in the measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500 and the accounting employees key the information into their system; they may enter some values incorrectly, leaving the data incomplete or wrong.
Data Distribution:

• Real-world data is usually stored on various platforms in distributed computing environments. It might be in databases, individual systems, or even on the internet.
• Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server.
Complex Data:

• Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images, complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information is a tough task.
• Most of the time, new technologies, new tools, and methodologies would have to be refined to obtain specific information.
Performance:

• The data mining system's performance relies primarily on the efficiency of the algorithms and techniques used. If the designed algorithms and techniques are not up to the mark, then the efficiency of the data mining process will be adversely affected.
• Data Privacy and Security:
• Data mining usually leads to serious issues in terms of data
security, governance, and privacy. For example, if a retailer
analyzes the details of the purchased items, then it reveals
data about buying habits and preferences of the customers
without their permission.
Data Visualization:

• In data mining, data visualization is a very important process because it is the primary method by which the output is shown to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express.
• But many times, representing the information to the end user in a precise and easy way is difficult. Because the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented to present them well.
Types of Data Mining


• Each of the following data mining techniques addresses a different kind of business problem and provides a different insight into it.
• Understanding the type of business problem you need to solve will help in knowing which technique is best to use and will yield the best results. Data mining types can be divided into two basic parts, as follows:
• Predictive Data Mining Analysis
• Descriptive Data Mining Analysis
1. Predictive Data Mining

• As the name signifies, predictive data mining analysis works on data to help understand what may happen later (in the future) in a business. Predictive data mining can be further divided into four types, listed below:
• Classification Analysis
• Regression Analysis
• Time Series Analysis
• Prediction Analysis
2. Descriptive Data Mining

• The main goal of the descriptive data mining tasks is to summarize or turn the given data into relevant information. The descriptive data mining tasks can be further divided into four types, as follows:
• Clustering Analysis
• Summarization Analysis
• Association Rules Analysis
• Sequence Discovery Analysis
1. CLASSIFICATION ANALYSIS
• This type of data mining technique is generally used for fetching or retrieving important and relevant information about data and metadata.
• It is also used to categorize different types of data into different classes. Classification and clustering are similar data mining types, since clustering also categorizes data segments into groups of data records known as classes; the difference is that in classification the classes are predefined.
REGRESSION ANALYSIS

• In statistical terms, regression analysis is a process used to identify and analyze the relationship among variables: one variable is dependent on another, but not vice versa.
• It is generally used for prediction and forecasting purposes. It can also help you understand how the characteristic value of the dependent variable changes when any of the independent variables is varied.
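• As a minimal sketch of the idea (assuming Python with NumPy; the advertising and sales figures below are invented for illustration), a least-squares regression line can be fitted and then used to forecast the dependent variable:

import numpy as np

# Hypothetical data: advertising spend (independent) vs. sales (dependent)
ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales = np.array([25, 41, 62, 78, 101], dtype=float)

# Fit a straight line sales = slope * ad_spend + intercept by least squares
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

# Predict the dependent variable for a new value of the independent variable
new_spend = 60
predicted_sales = slope * new_spend + intercept
print(f"sales ~ {slope:.2f} * spend + {intercept:.2f}; forecast at 60: {predicted_sales:.1f}")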
Time Series Analysis

• A time series is a sequence of data points that are recorded at specific points in time, most often at regular time intervals (seconds, hours, days, months, etc.).
• Almost every organization generates a high volume
of data every day, such as sales figures, revenue,
traffic, or operating cost. Time series data mining can
help in generating valuable information for long-term
business decisions, yet they are underutilized in most
organizations.
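• As a small illustration (Python with NumPy; the monthly sales figures are invented), a simple moving average can smooth a time series recorded at regular intervals and expose its underlying trend:

import numpy as np

# Hypothetical monthly sales figures (a time series at regular monthly intervals)
monthly_sales = np.array([120, 135, 128, 150, 160, 155, 172, 168, 180, 195, 190, 210], dtype=float)

window = 3  # smooth over 3-month windows
# mode="valid" keeps only positions where the full window fits
moving_avg = np.convolve(monthly_sales, np.ones(window) / window, mode="valid")
print("3-month moving average:", np.round(moving_avg, 1))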
Prediction Analysis
• This technique is generally used to predict the
relationship that exists between both the
independent and dependent variables as well as the
independent variables alone.
• It can also be used to predict the profit that can be achieved in the future depending on sales. Let us imagine that profit and sales are the dependent and independent variables, respectively. Then, on the basis of past sales data, we can make a profit prediction for the future using a regression curve.
Clustering Analysis

• In data mining, this technique is used to create clusters of meaningful objects that share the same characteristics. People often confuse it with classification, but there is no issue once they properly understand how both techniques actually work.
• Unlike classification, which assigns objects to predefined classes, clustering groups objects into classes that it defines itself.
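• A minimal clustering sketch (assuming Python with scikit-learn; the customer records are invented) showing how k-means groups objects into classes that are not predefined:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (annual spend, visits per month)
customers = np.array([
    [200, 2], [220, 3], [250, 2],      # low spenders
    [900, 10], [950, 12], [880, 9],    # frequent high spenders
    [500, 5], [520, 6],                # mid-range customers
])

# Ask k-means for 3 groups; the class labels are produced by the algorithm itself
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("cluster label per customer:", kmeans.labels_)
print("cluster centres:")
print(kmeans.cluster_centers_)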
Functionalities of Data Mining

• Data mining tasks are designed to be semi-automatic or fully automatic and to run on large data sets in order to uncover patterns such as groups or clusters, unusual or exceptional records (anomaly detection), and dependencies such as association and sequential patterns.
• Once patterns are uncovered, they can be thought of as a summary of the input data, and further analysis may be carried out using machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data that a decision support system can use. Note that data collection, preparation, and reporting are not part of data mining.
• Descriptive Data Mining: It provides knowledge for understanding what is happening within the data without any previous idea of it. It highlights the common features of the data set, for example, count, average, etc.
• Predictive Data Mining: It helps developers provide definitions for unlabeled attributes. Using previously available or historical data, data mining can make predictions about critical business metrics based on the data's linearity, for example, predicting the volume of business next quarter based on performance in previous quarters over several years, or judging from the findings of a patient's medical examinations whether he is suffering from a particular disease.
Functionalities of Data Mining

• Data mining functionalities are used to represent the types of patterns that have to be discovered in data mining tasks. Data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks define the common features of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
• Data mining is extensively used in many areas and sectors. It is used to predict and characterize data, but the ultimate objective of data mining functionalities is to observe the various trends in data mining. There are several data mining functionalities that organized and scientific methods offer, such as:
Class/Concept Descriptions

• A class or concept implies there is a data set or a set of features that defines the class or concept. A class can be a category of items on a shop floor, and a concept could be the abstract idea on which data may be categorized, such as products to be put on clearance sale versus non-sale products.
• There are two concepts here: one that helps with grouping and one that helps in differentiating.
• Data Characterization: This refers to a summary of the general characteristics or features of the target class, resulting in specific rules that define the class. A data analysis technique called attribute-oriented induction is employed on the data set to achieve characterization.
• Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in attribute values. It compares the features of a class with the features of one or more contrasting classes, and the output can be presented as, e.g., bar charts, curves, and pie charts.
Mining Frequent Patterns

• One of the functions of data mining is finding data patterns. Frequent patterns are things that are discovered to be most common in the data. Various types of frequency can be found in a dataset:
• Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.
• Frequent substructure: This refers to the various types of data structures, such as trees and graphs, that may be combined with an item set or subsequence.
• Frequent subsequence: A regular pattern series, such as buying a phone followed by a cover.
Association Analysis
• It analyses the set of items that generally occur
together in a transactional dataset. It is also known
as Market Basket Analysis for its wide use in retail
sales. Two parameters are used for determining the
association rules:
• Support identifies the common item sets in the database.
• Confidence is the conditional probability that an item occurs when another item occurs in a transaction.
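• To make the two parameters concrete, here is a small sketch (plain Python; the transactions are invented) that computes support and confidence for the rule {milk} => {sugar}:

# Hypothetical market-basket transactions
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
    {"milk", "sugar", "tea"},
]

antecedent, consequent = {"milk"}, {"sugar"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total          # how common the item set is overall
confidence = n_both / n_antecedent  # conditional probability of sugar given milk
print(f"support(milk => sugar) = {support:.2f}")        # 3/5 = 0.60
print(f"confidence(milk => sugar) = {confidence:.2f}")  # 3/4 = 0.75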
Classification
• Classification is a data mining technique that
categorizes items in a collection based on some
predefined properties. It uses methods like if-then,
decision trees or neural networks to predict a class or
essentially classify a collection of items.
• A training set containing items whose properties are
known is used to train the system to predict the
category of items from an unknown collection of
items.
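• A minimal sketch of this idea (assuming Python with scikit-learn; the training set is invented) trains a decision tree on items whose classes are known and uses it to classify new items:

from sklearn.tree import DecisionTreeClassifier

# Training set: items with known properties (price, weight) and a known class
X_train = [[10, 0.5], [12, 0.6], [90, 2.0], [95, 2.2], [50, 1.0]]
y_train = ["budget", "budget", "premium", "premium", "standard"]

# Train the classifier on the labelled items
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the category of items from an unknown collection
X_new = [[11, 0.55], [88, 2.1]]
print(clf.predict(X_new))  # expected: ['budget' 'premium']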
Prediction
• Prediction is used to estimate unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes.
• It can be a prediction of missing numerical values or of increasing or decreasing trends in time-related information. There are primarily two types of predictions in data mining: numeric and class predictions.
• Numeric predictions are made by creating a
linear regression model that is based on historical
data. Prediction of numeric values helps
businesses ramp up for a future event that might
impact the business positively or negatively.
• Class predictions are used to fill in missing class
information for products using a training data set
where the class for products is known.
Cluster Analysis
• In image processing, pattern recognition and
bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the
classes are not predefined. Data attributes represent
the classes. Similar data are grouped together, with
the difference being that a class label is not known.
Clustering algorithms group data based on similar
features and dissimilarities.
Outlier Analysis

• Outlier analysis is important for understanding the quality of data. If there are too many outliers, you cannot trust the data or draw patterns from it. An outlier analysis determines whether there is something out of turn in the data and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data that cannot be grouped into any class by the algorithms is pulled up for outlier analysis.
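• A small sketch (Python with NumPy; the amounts are invented) of a common outlier check using the interquartile range (IQR), where points far outside the IQR fences are flagged for review:

import numpy as np

# Hypothetical daily transaction amounts; 9800 is clearly out of turn
amounts = np.array([52, 47, 55, 60, 49, 51, 58, 9800, 53, 50], dtype=float)

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print("outliers:", outliers)  # -> [9800.]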
Evolution and Deviation Analysis

• Evolution analysis pertains to the study of data sets that change over time. Evolution analysis models are designed to capture evolutionary trends in data, helping to characterize, classify, cluster, or discriminate time-related data.
Correlation Analysis

• Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another.
• It determines how well two numerically measured continuous variables are linked. Researchers can use this type of analysis to see whether there are any possible correlations between variables in their study.
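• As a minimal sketch (Python with NumPy; the measurements are invented), the Pearson correlation coefficient quantifies how strongly two continuous attributes are linearly related:

import numpy as np

# Hypothetical attributes measured on the same customers
age = np.array([22, 30, 35, 42, 50, 58], dtype=float)
spend = np.array([180, 240, 270, 310, 330, 400], dtype=float)

# Pearson correlation: close to +1 strong positive, 0 none, -1 strong negative
r = np.corrcoef(age, spend)[0, 1]
print(f"correlation(age, spend) = {r:.2f}")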
Interestingness patterns in data mining:

• In data mining, pattern evaluation is the process
of assessing the quality of discovered patterns.
This process is important in order to determine
whether the patterns are useful and whether
they can be trusted.
• There are a number of different measures that
can be used to evaluate patterns, and the choice
of measure will depend on the application.
There are several ways to evaluate pattern mining algorithms:

• Accuracy
The accuracy of a data mining model is a measure of
how correctly the model predicts the target values. The
accuracy is measured on a test dataset, which is
separate from the training dataset that was used to
train the model.
• There are a number of ways to measure accuracy, but
the most common is to calculate the percentage of
correct predictions. This is known as the accuracy
rate.
• Other measures of accuracy include the root mean
squared error (RMSE) and the mean absolute error
(MAE). The RMSE is the square root of the mean
squared error, and the MAE is the mean of the
absolute errors. The accuracy of a data mining model
is important, but it is not the only thing that should
be considered. The model should also be robust and
generalizable.
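• A short sketch (Python with NumPy; the predictions are invented) of the three measures named above, computed on a held-out test set:

import numpy as np

# Classification: accuracy rate = fraction of correct predictions
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
accuracy = np.mean(y_true == y_pred)
print(f"accuracy rate = {accuracy:.2f}")  # 5 of 6 correct, about 0.83

# Regression: RMSE and MAE on numeric targets
t_true = np.array([10.0, 12.0, 15.0, 20.0])
t_pred = np.array([11.0, 11.5, 14.0, 22.0])
errors = t_pred - t_true
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")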
Classification Accuracy
• This measures how accurately the patterns
discovered by the algorithm can be used to classify
new data. This is typically done by taking a set of
data that has been labeled with known class labels
and then using the discovered patterns to predict the
class labels of the data. The accuracy can then be
computed by comparing the predicted labels to the
actual labels.
Clustering Accuracy
• This measures how accurately the patterns
discovered by the algorithm can be used to cluster
new data. This is typically done by taking a set of
data that has been labeled with known cluster labels
and then using the discovered patterns to predict the
cluster labels of the data. The accuracy can then be
computed by comparing the predicted labels to the
actual labels.
• There are a few ways to evaluate the accuracy of a clustering
algorithm:
• External indices: these indices compare the clusters produced by the
algorithm to some known ground truth. For example, the Rand Index or
the Jaccard coefficient can be used if the ground truth is known.
• Internal indices: these indices assess the goodness of clustering
without reference to any external information. The most popular
internal index is the Dunn index.
• Stability: this measures how robust the clustering is to small changes in
the data. A clustering algorithm is said to be stable if, when applied to
different samples of the same data, it produces the same results.
• Efficiency: this measures how quickly the algorithm converges to the
correct clustering.
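• A small sketch of one external index (plain Python; the labels are hypothetical): the Rand index counts the pairs of points on which the produced clustering and the ground truth agree:

from itertools import combinations

# Ground-truth classes vs. cluster labels produced by some algorithm
truth = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]

pairs = list(combinations(range(len(truth)), 2))
agree = 0
for i, j in pairs:
    same_truth = truth[i] == truth[j]
    same_cluster = clusters[i] == clusters[j]
    # A pair counts as agreement if both labelings group it the same way
    if same_truth == same_cluster:
        agree += 1

rand_index = agree / len(pairs)
print(f"Rand index = {rand_index:.2f}")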
Coverage
• This measures how many of the possible patterns in the data are discovered by the algorithm. It can be computed by dividing the number of patterns discovered by the algorithm by the total number of possible patterns. A coverage pattern is a type of sequential pattern that is found by looking for items that tend to appear together in sequential order. For example, a coverage pattern might be "customers who purchase item A also tend to purchase item B within the next month."
Visual Inspection
• This is perhaps the most common method, where the
data miner simply looks at the patterns to see if they
make sense. In visual inspection, the data is plotted in a
graphical format and the pattern is observed. This method
is used when the data is not too large and can be easily
plotted. It is also used when the data is categorical in
nature. Visual inspection is a pattern evaluation method
in data mining where the data is visually inspected for
patterns. This can be done by looking at a graph or plot of
the data, or by looking at the raw data itself. This method
is often used to find outliers or unusual patterns.
Running Time
• This measures how long it takes for the
algorithm to find the patterns in the data. This
is typically measured in seconds or minutes.
There are a few different ways to measure the
performance of a machine learning algorithm,
but one of the most common is to simply
measure the amount of time it takes to train
the model and make predictions. This is
known as the running time pattern evaluation.
Support
• The support of a pattern is the percentage of the
total number of records that contain the pattern.
Support Pattern evaluation is a process of finding
interesting and potentially useful patterns in data.
The purpose of support pattern evaluation is to
identify interesting patterns that may be useful for
decision-making. Support pattern evaluation is
typically used in data mining and machine learning
applications.
Confidence
The confidence of a pattern is the percentage of times that
the pattern is found to be correct. Confidence Pattern
evaluation is a method of data mining that is used to assess
the quality of patterns found in data.
This evaluation is typically performed by calculating the
percentage of times a pattern is found in a data set and
comparing this percentage to the percentage of times the
pattern is expected to be found based on the overall
distribution of data. If the percentage of times a pattern is
found is significantly higher than the expected percentage,
then the pattern is said to be a strong confidence pattern.
Lift
• Lift measures how much more often the items in a pattern occur together than would be expected if they were statistically independent. For an association rule A => B, lift is the observed support of A and B together divided by the product of their individual supports (equivalently, the confidence of A => B divided by the support of B). A lift greater than 1 indicates a positive correlation between A and B, a lift of exactly 1 indicates independence, and a lift below 1 indicates a negative correlation. The lift measure can therefore be used to evaluate how far a discovered pattern departs from what would be expected by chance.
Prediction
• The prediction of a pattern is the percentage of times
that the pattern is found to be correct. Prediction
Pattern evaluation is a data mining technique used to
assess the accuracy of predictive models. It is used to
determine how well a model can predict future
outcomes based on past data. Prediction Pattern
evaluation can be used to compare different models,
or to evaluate the performance of a single model.  
Precision
• Precision pattern evaluation measures what fraction of the patterns (or positive predictions) reported by a model are actually correct. It can be used to identify patterns and trends in the data and to evaluate the accuracy of the discovered patterns. Precision pattern evaluation can also be used to identify errors in the data, to determine the cause of those errors, and to determine their impact on the overall accuracy of the data.
Cross-Validation
•  This method involves partitioning the data into two
sets, training the model on one set, and then testing
it on the other. This can be done multiple times, with
different partitions, to get a more reliable estimate of
the model’s performance. Cross-validation is a model
validation technique for assessing how the results of
a data mining analysis will generalize to an
independent data set.
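• A minimal sketch of k-fold cross-validation (assuming Python with scikit-learn and its bundled iris data set): the data is partitioned several times and the accuracy is averaged over the held-out folds:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))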
Test Set
•  This method involves partitioning the data into two sets, training the model on the training set, and then testing it on the held-out test set. It is less expensive than cross-validation because the model is trained only once, but the performance estimate can be less reliable when the data set is small. There are a number of ways to evaluate the performance of a model on a test set.
Data Mining - Classification & Prediction:
 

• There is a large variety of data mining systems available. Data mining systems may integrate techniques from the following −
• Spatial Data Analysis
• Information Retrieval
• Pattern Recognition
• Image Analysis
• Signal Processing
• Computer Graphics
• Web Technology
• Business
• Bioinformatics
Data Mining System Classification

• A data mining system can be classified according to the following criteria −
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Classification Based on the Databases Mined

• We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.
• For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
Classification Based on the kind of Knowledge Mined

• We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such as −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
Classification Based on the Techniques Utilized

• We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted

• We can classify a data mining system according to the applications adapted. These applications are as follows −
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Integrating a Data Mining System with a DB/DW
System

• If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.
•The list of Integration Schemes is as follows −
•No Coupling − In this scheme, the data mining system does not utilize any of the database
or data warehouse functions. It fetches the data from a particular source and processes
that data using some data mining algorithms. The data mining result is stored in another
file.
•Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.
•Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a
data warehouse system and in addition to that, efficient implementations of a few data
mining primitives can be provided in the database.
•Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data Mining Task Primitives
• A data mining task can be specified in the form of
a data mining query, which is input to the data
mining system. A data mining query is defined in
terms of data mining task primitives. These
primitives allow the user to interactively
communicate with the data mining system during
discovery to direct the mining process or examine
the findings from different angles or depths. The
data mining primitives specify the following
• Set of task-relevant data to be mined.
• Kind of knowledge to be mined.
• Background knowledge to be used in the
discovery process.
• Interestingness measures and thresholds for
pattern evaluation.
• Representation for visualizing the discovered
patterns.
• A data mining query language can be designed to
incorporate these primitives, allowing users to interact
with data mining systems flexibly. Having a data
mining query language provides a foundation on
which user-friendly graphical interfaces can be built.
• Designing a comprehensive data mining language
is challenging because data mining covers a wide
spectrum of tasks, from data characterization to
evolution analysis. Each task has different
requirements.
List of Data Mining Task Primitives

• 1. The set of task-relevant data to be mined
This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions).
The data collection process results in a new data relation, called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.
2. The kind of knowledge to be mined

• This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process

• This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.
• A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.
• Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and explicit levels of abstraction and makes it easier to understand. It also compresses the data, so fewer input/output operations are required.
• Drilling Down - Specialization of data: Concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.
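• A small sketch of rolling up along a concept hierarchy (assuming Python with pandas; the sales table and the city-to-country mapping are invented): city values are generalized to the higher-level concept country and the measure is re-aggregated:

import pandas as pd

# Hypothetical sales recorded at the city level (low-level concept)
sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Paris", "Lyon"],
    "amount": [120, 200, 150, 90],
})

# Concept hierarchy: city -> country
city_to_country = {"Delhi": "India", "Mumbai": "India", "Paris": "France", "Lyon": "France"}

# Rolling up: generalize city to country and aggregate the measure
sales["country"] = sales["city"].map(city_to_country)
rolled_up = sales.groupby("country", as_index=False)["amount"].sum()
print(rolled_up)  # France 240, India 320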
4. The interestingness measures and thresholds for pattern evaluation

• Different kinds of knowledge may have different interestingness measures. These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
• Simplicity: A factor contributing to the interestingness of a pattern is the
pattern's overall simplicity for human comprehension. For example, the
more complex the structure of a rule is, the more difficult it is to interpret,
and hence, the less interesting it is likely to be. Objective measures of
pattern simplicity can be viewed as functions of the pattern structure,
defined in terms of the pattern size in bits or the number of attributes or
operators appearing in the pattern.
• Certainty (Confidence): Each discovered pattern should have a measure of
certainty associated with it that assesses the validity or "trustworthiness" of
the pattern. A certainty measure for association rules of the form "A =>B"
where A and B are sets of items is confidence. Confidence is a certainty
measure. Given a set of task-relevant data tuples, the confidence of "A => B"
is defined as
Confidence (A=>B) = # tuples containing both A and B /# tuples containing A
• Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true.
Utility (support): usefulness of a pattern
Support (A=>B) = # tuples containing both A and B / total # of tuples
• Novelty: Novel patterns are those that contribute new
information or increased performance to the given pattern set.
For example -> A data exception. Another strategy for detecting
novelty is to remove redundant patterns.
5. The expected representation for visualizing the discovered patterns

• This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
• Users must be able to specify the forms of presentation to be
used for displaying the discovered patterns. Some
representation forms may be better suited than others for
particular kinds of knowledge.
• For example, generalized relations and their corresponding
cross tabs or pie/bar charts are good for presenting
characteristic descriptions, whereas decision trees are
common for classification.
Example of Data Mining Task Primitives

• Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000 and who have bought more than $1,000 worth of items, each of which is priced at no less than $100.
• In particular, you are interested in the customers' age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules. This data mining query is expressed in DMQL as follows, where each clause of the query is listed to aid our discussion.
• use database AllElectronics_db
• use hierarchy location_hierarchy for T.branch,
age_hierarchy for C.age
• mine classification as promising_customers
• in relevance to C.age, C.income, I.type, I.place_made,
T.branch
• from customer C, item I, transaction T
• where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID
and C.income ≥ 40,000 and I.price ≥ 100
• group by T.cust_ID
Data Integration in Data Mining

• INTRODUCTION :
Data integration in data mining refers to the process of
combining data from multiple sources into a single, unified
view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may
exist between the different sources. The goal of data
integration is to make the data more useful and meaningful
for the purposes of analysis and decision making.
Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and
data federation.
• Data Integration is a data preprocessing technique that
combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of
the data. These sources may include multiple data cubes,
databases, or flat files.
• The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between the queries of the source schemas and the global schema.
What is data integration
• Data integration is the process of combining
data from multiple sources into a cohesive and
consistent view. This process involves
identifying and accessing the different data
sources, mapping the data to a common
format, and reconciling any inconsistencies or
discrepancies between the sources.
• The goal of data integration is to make it
easier to access and analyze data that is
spread across multiple systems or platforms,
in order to gain a more complete and accurate
understanding of the data.
• Data integration can be challenging due to the
variety of data formats, structures, and
semantics used by different data sources.
• Different data sources may use different data
types, naming conventions, and schemas,
making it difficult to combine the data into a
single view. Data integration typically involves
a combination of manual and automated
processes, including data profiling, data
mapping, data transformation, and data
reconciliation.
•    Data integration is used in a wide range of
applications, such as business intelligence, data
warehousing, master data management, and
analytics. Data integration can be critical to the
success of these applications, as it enables
organizations to access and analyze data that is
spread across different systems, departments, and
lines of business, in order to make better decisions,
improve operational efficiency, and gain a
competitive advantage.
• There are mainly 2 major approaches for data integration – one is
the “tight coupling approach” and another is the “loose coupling
approach”. 
• Tight Coupling: 
• This approach involves creating a centralized repository or data
warehouse to store the integrated data. The data is extracted
from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner,
meaning that the data is integrated at a high level, such as at the
level of the entire dataset or schema. This approach is also
known as data warehousing, and it enables data consistency and
integrity, but it can be inflexible and difficult to change or update.
• Here, a data warehouse is treated as an
information retrieval component.
• In this coupling, data is combined from
different sources into a single physical location
through the process of ETL – Extraction,
Transformation, and Loading.
Loose Coupling:  

• This approach involves integrating data at the lowest level, such as at the level of individual data elements or records. Data is integrated in a loosely coupled manner, meaning that the data is integrated at a low level, and it allows data to be integrated without having to create a central repository or data warehouse. This approach is also known as data federation, and it enables data flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple data sources.
• Here, an interface is provided that takes the
query from the user, transforms it in a way the
source database can understand, and then
sends the query directly to the source
databases to obtain the result.
• And the data only remains in the actual source
databases.
Issues in Data Integration:

•  There are several issues that can arise when integrating data from
multiple sources, including:
• Data Quality: Inconsistencies and errors in the data can make it difficult
to combine and analyze.
• Data Semantics: Different sources may use different terms or
definitions for the same data, making it difficult to combine and
understand the data.
• Data Heterogeneity: Different sources may use different data formats,
structures, or schemas, making it difficult to combine and analyze the
data.
• Data Privacy and Security: Protecting sensitive information and
maintaining security can be difficult when integrating data from
multiple sources.
• Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
• Data Governance: Managing and maintaining the integration of data from
multiple sources can be difficult, especially when it comes to ensuring
data accuracy, consistency, and timeliness.
• Performance: Integrating data from multiple sources can also affect the
performance of the system.
• Integration with existing systems: Integrating new data sources with
existing systems can be a complex task, requiring significant effort and
resources.
• Complexity: The complexity of integrating data from multiple sources can
be high, requiring specialized skills and knowledge.
There are three issues to consider during data integration:
Schema Integration, Redundancy Detection, and resolution of
data value conflicts. These are explained in brief below.
• 1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to as the entity identification problem.
• 2. Redundancy Detection:
• An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
• 3. Resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
• Data Mining challenges
• Security and Social Challenges.
• Noisy and Incomplete Data.
• Distributed Data.
• Complex Data.
• Performance.
• Scalability and Efficiency of the Algorithms.
• Improvement of Mining Algorithms.
• Incorporation of Background Knowledge.
Data Mining - Issues

• Data mining is not an easy task, as the algorithms used can get very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −

• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search
for patterns, providing and refining data mining requests based on the returned
results.
• Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used.
Background knowledge may be used to express the discovered patterns not only
in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and
flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods
are required to handle the noise and incomplete objects while
mining the data regularities. If the data cleaning methods are
not there then the accuracy of the discovered patterns will be
poor.
• Pattern evaluation − The patterns discovered should be interesting; a pattern may be uninteresting if it represents common knowledge or lacks novelty.
Performance Issues

• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mining results as the database is updated, without mining the entire data again from scratch.
Diverse Data Types Issues

• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
 
Data Pre-processing in Data Mining

• Data pre-processing is an important step in the data mining process. It refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of data pre-processing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:

• Data cleaning: this step involves identifying and removing missing, inconsistent, or irrelevant data. This can include removing duplicate records, filling in missing values, and handling outliers.
• Data integration: this step involves combining data from multiple sources, such as databases, spreadsheets, and text files. The goal of integration is to create a single, consistent view of the data.
• Data transformation: this step involves converting the data into a format that is more suitable for the data mining task. This can include normalizing numerical data, creating dummy variables, and encoding categorical data.
• Data reduction: this step is used to select a subset
of the data that is relevant to the data mining task.
This can include feature selection (selecting a
subset of the variables) or feature extraction
(extracting new variables from the data).
• Data discretization: this step is used to convert
continuous numerical data into categorical data,
which can be used for decision tree and other
categorical data mining techniques.
Preprocessing in Data Mining: 
Steps Involved in Data Preprocessing: 

• 1. Data Cleaning: 
The data can have many irrelevant and missing parts. To handle this
part, data cleaning is done. It involves handling of missing data, noisy
data etc. 
 
• (a). Missing Data: 
This situation arises when some data is missing in the data. It can be
handled in various ways. 
Some of them are: 
• Ignore the tuples: 
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple. 
– Fill the Missing values: 
There are various ways to do this task. You can
choose to fill the missing values manually, by
attribute mean or the most probable value. 
 
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
– Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task (a short example follows at the end of this list).
 
– Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

– Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
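• A short sketch of two of the cleaning steps above (assuming Python with pandas and NumPy; all values are invented): filling missing values with the attribute mean, and smoothing noisy values with equal-size bins replaced by their bin means:

import numpy as np
import pandas as pd

# Missing data: fill the gaps with the attribute mean
prices = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])
prices_filled = prices.fillna(prices.mean())
print(prices_filled.tolist())  # the two gaps become 11.5

# Noisy data: binning method, smoothing by bin means
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26], dtype=float))
bins = values.reshape(3, 3)                 # three equal-size bins over the sorted data
smoothed = np.repeat(bins.mean(axis=1), 3)  # replace each value by its bin mean
print(smoothed)  # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]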
• 2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a short example follows after this list.
 
• Attribute Selection: 
In this strategy, new attributes are constructed from the given
set of attributes to help the mining process. 
• Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
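• A minimal sketch of two of the transformations above (assuming Python with NumPy and pandas; the incomes and interval boundaries are invented): min-max normalization to the 0.0 to 1.0 range, and discretization of a numeric attribute into conceptual levels:

import numpy as np
import pandas as pd

incomes = np.array([20000, 35000, 50000, 80000, 120000], dtype=float)

# Normalization: rescale values to the range 0.0 - 1.0 (min-max)
normalized = (incomes - incomes.min()) / (incomes.max() - incomes.min())
print(np.round(normalized, 2))  # [0.   0.15 0.3  0.6  1.  ]

# Discretization: replace raw numbers by interval / conceptual levels
levels = pd.cut(incomes, bins=[0, 40000, 90000, np.inf], labels=["low", "medium", "high"])
print(list(levels))  # ['low', 'low', 'medium', 'medium', 'high']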
• 3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. To deal with this, data reduction techniques are used. They aim to increase storage efficiency and reduce data storage and analysis costs.
• The various steps of data reduction are:
• Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
• Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute whose p-value is greater than the significance level can be discarded.
• Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
• Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
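• A minimal PCA sketch (assuming Python with NumPy and scikit-learn; random numbers stand in for a real table) showing how four correlated attributes can be compressed into two principal components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 records with 4 correlated numeric attributes
base = rng.normal(size=(100, 2))
data = np.hstack([base, base + 0.05 * rng.normal(size=(100, 2))])  # 4 columns

# Reduce 4 dimensions to 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print("reduced shape:", reduced.shape)  # (100, 2)
print("variance explained:", pca.explained_variance_ratio_.round(3))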