DM Module1
Data Mining
Introduction -
Data mining is the process of extracting useful information from large sets of data.
It involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data.
This information can then be used to make data-driven decisions and solve business
problems.
Applications -
Data mining tools and technologies are widely used in various industries, including
finance, healthcare, retail, and telecommunications.
Means of Mining -
It is basically the process carried out for the extraction of useful information
from large volumes of data or from data warehouses.
In that sense, we can think of Data Mining as a step in the process of Knowledge
Discovery or Knowledge Extraction.
Nowadays, data mining is used in almost all places where a large amount of data
is stored and processed.
For example -
banks typically use data mining to identify prospective customers who might be
interested in credit cards, personal loans, or insurance. Since banks have the
transaction details and detailed profiles of their customers, they analyze all this
data and try to find patterns that help them predict which customers are likely to
be interested in personal loans, and so on.
Basically, data mining has been integrated with many techniques from other domains
such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, and visualization. These are used to
gather more information about the data, to uncover hidden patterns, and to predict
future trends and behaviors, which helps businesses make better decisions.
Definitions – Data Mining
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis. Data
mining techniques and tools help enterprises to predict future trends and make
more informed business decisions.
Data mining is a key part of data analytics and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in
data sets. At a more granular level, data mining is a step in the knowledge
discovery in databases (KDD) process, a data science methodology for gathering,
processing and analyzing data. Data mining and KDD are sometimes referred to
interchangeably, but they're more commonly seen as distinct things.
Steps in the KDD Process –
1. Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
i. Cleaning in case of missing values.
ii. Cleaning noisy data, where noise is a random or variance error.
iii. Cleaning with data discrepancy detection and data transformation tools.
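Below is a minimal sketch of the cleaning step in Python (assuming the pandas
library is available); the "age" column and its values are purely illustrative.

import pandas as pd
import numpy as np

# A hypothetical column with missing values and an obvious outlier
df = pd.DataFrame({"age": [25, 32, np.nan, 41, 29, 250, np.nan, 38]})

# i. Missing values: fill with the column median (one common strategy)
df["age"] = df["age"].fillna(df["age"].median())

# ii. Noisy data: clip extreme values to the 1st/99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(lower=low, upper=high)

print(df)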
2. Data Integration
Data integration is defined as the process of combining heterogeneous data from
multiple sources into a common store, such as a data warehouse.
3. Data Selection
Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection. For this we can use Neural
network, Decision Trees, Clustering, and Regression methods.
4. Data Transformation
Data Transformation is defined as the process of transforming data into
appropriate form required by mining procedure. Data Transformation is a
Two step process:
1. Data Mapping: Assigning elements from source base to destination to
capture transformations.
2. Code generation: Creation of the actual transformation program.
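A minimal sketch of this two-step idea in Python (pandas assumed); the source and
destination field names are hypothetical examples only.

import pandas as pd

source = pd.DataFrame({"cust_name": ["Asha", "Ravi"],
                       "dob": ["1990-05-01", "1985-11-23"]})

# 1. Data mapping: which source field goes to which destination field
field_map = {"cust_name": "customer_name", "dob": "date_of_birth"}

# 2. Code generation: the actual transformation program built from the map
def transform(df):
    out = df.rename(columns=field_map)
    out["date_of_birth"] = pd.to_datetime(out["date_of_birth"])  # type conversion
    return out

print(transform(source))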
5. Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms the task-relevant data into patterns and decides
the purpose of the model using classification or characterization.
Challenges –
1] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT).
The complexity of the data may make it challenging to process, analyze,
and understand. In addition, the data may be in different formats, making
it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced
techniques such as clustering, classification, and association rule mining.
These techniques help to identify patterns and relationships in the data,
which can then be used to gain insights and make predictions.
2] Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations also
increase. Moreover, the algorithms must be able to handle streaming
data, which is generated continuously and must be processed in real-
time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it
possible to process large datasets quickly and efficiently.
3] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data.
Data Mining Tasks –
a) Classification
Classification derives a model to determine the class of an object based on its
attributes. A collection of records will be available, each record with a set of attributes.
One of the attributes will be the class attribute, and the goal of the classification task is
to assign a class label to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by
targeting a set of customers who are likely to buy a new product. Using the available
data, it is possible to know which customers purchased similar products and who did not
purchase in the past. Hence, {purchase, don’t purchase} decision forms the class
attribute in this case.
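A minimal classification sketch in Python (scikit-learn assumed): a decision tree is
trained on a few made-up customer records and then assigns the {purchase, don't
purchase} class to new records.

from sklearn.tree import DecisionTreeClassifier

# Attributes: [age, annual income in thousands]; class: 1 = purchase, 0 = don't purchase
X_train = [[25, 30], [40, 80], [35, 60], [22, 25], [50, 90], [28, 35]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Assign a class to new customer records
new_customers = [[30, 70], [23, 28]]
print(model.predict(new_customers))  # e.g. [1 0]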
b) Prediction
Prediction task predicts the possible values of missing or future data. Prediction
involves developing a model based on the available data and this model is used in
predicting future values of a new data set of interest. For example, a model can predict
the income of an employee based on education, experience and other demographic
factors like place of stay, gender, etc. Prediction analysis is also used in different
areas, including medical diagnosis, fraud detection, etc.
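A minimal prediction sketch in Python (scikit-learn assumed): a regression model
estimates income from years of education and experience; the numbers are
illustrative only.

from sklearn.linear_model import LinearRegression

# Attributes: [years of education, years of experience]; target: income (in thousands)
X_train = [[12, 2], [16, 5], [16, 10], [18, 7], [12, 15], [14, 3]]
y_train = [30, 55, 75, 80, 60, 38]

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the income for a new employee profile
print(model.predict([[16, 8]]))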
c) Association
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing, etc. A retailer can identify
the products that customers normally purchase together, or even find the customers who
respond to promotions of the same kind of products. If a retailer finds that beer and
nappies are mostly bought together, they can put nappies on sale to promote the sale of
beer.
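A minimal association sketch in plain Python: computing the support and confidence
of the rule {beer} -> {nappies} from a handful of made-up transactions.

transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "nappies", "milk"},
    {"bread", "nappies"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"beer", "nappies"} <= t)
beer = sum(1 for t in transactions if "beer" in t)

support = both / n        # fraction of all transactions containing both items
confidence = both / beer  # fraction of beer transactions that also contain nappies
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.60 and 1.00 here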
d) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity
can be decided based on a number of factors like purchase behavior, responsiveness to
certain actions, geographical locations and so on. For example, an insurance company
can cluster its customers based on age, residence, income etc. This group information
will be helpful to understand the customers better and hence provide better customized
services.
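A minimal clustering sketch in Python (scikit-learn assumed): k-means groups
hypothetical customers by age and income so that similar customers fall in the same
cluster.

import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [22, 25], [25, 30], [23, 28],   # younger, lower income
    [45, 80], [50, 95], [48, 85],   # older, higher income
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                    # one cluster label per customer, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)   # the representative centroid of each cluster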
e) Summarization
Summarization is the generalization of data. A set of relevant data is summarized,
which results in a smaller set that gives aggregated information about the data. For
example, the shopping done by a customer can be summarized into total products, total
spending, offers used, etc. Such high-level summarized information can be useful for
sales or customer relationship teams for detailed customer and purchase behavior
analysis. Data
can be summarized in different abstraction levels and from different angles.
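A minimal summarization sketch in Python (pandas assumed): per-transaction records
are aggregated into one row per customer with total products and total spending.

import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "products": [2, 1, 3, 1, 2],
    "amount":   [20.0, 15.0, 45.0, 10.0, 30.0],
})

summary = orders.groupby("customer").agg(
    total_products=("products", "sum"),
    total_spending=("amount", "sum"),
)
print(summary)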
Data pre-processing -
Data Integration: This involves combining data from multiple sources to create
a unified dataset. Data integration can be challenging as it requires handling
data with different formats, structures, and semantics. Techniques such as
record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories; a short sketch of these three techniques
follows this list.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved through
techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while
preserving the important information.
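A minimal sketch of the transformation techniques listed above, in Python
(scikit-learn assumed); the small "age" column is illustrative only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

X = np.array([[18.0], [25.0], [37.0], [44.0], [60.0]])  # e.g. customer ages

# Normalization: scale to a common range, here [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: zero mean and unit variance
print(StandardScaler().fit_transform(X).ravel())

# Discretization: convert the continuous values into 3 discrete bins
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(X).ravel())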
1. Data Cleaning:
This step handles missing values and smooths noisy data; the main smoothing
approaches are listed below, followed by a short binning sketch.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are
performed to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected,
or they may fall outside the clusters.
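A minimal sketch of smoothing by bin means in Python (NumPy assumed): sorted data is
split into equal-size bins and each value is replaced by the mean of its bin. The
numbers are illustrative only.

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bin_size = 4

smoothed = data.astype(float)
for start in range(0, len(data), bin_size):
    end = start + bin_size
    smoothed[start:end] = data[start:end].mean()

print(smoothed)  # [9. 9. 9. 9. 22.75 ... 29.25]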
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g. -1.0 to
1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels
or conceptual levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This
is done to improve the efficiency of data analysis and to avoid overfitting of the
model. Some common steps involved in data reduction are listed below, with a short
feature extraction and sampling sketch after the list:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove
irrelevant or redundant features from the dataset. It can be done using
various techniques such as correlation analysis, mutual information, and
principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional
and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving
the important information. Compression is often used to reduce the size
of the dataset for storage and transmission purposes. It can be done
using techniques such as wavelet compression, JPEG compression, and
gzip compression.
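A minimal data reduction sketch in Python (scikit-learn and NumPy assumed),
combining two of the techniques above: feature extraction with PCA, followed by
simple random sampling of the records.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # 1000 records with 10 features

# Feature extraction: project the data into a lower-dimensional space
X_reduced = PCA(n_components=3).fit_transform(X)

# Sampling: keep a random subset of the records
idx = rng.choice(len(X_reduced), size=100, replace=False)
X_sample = X_reduced[idx]

print(X.shape, "->", X_sample.shape)   # (1000, 10) -> (100, 3)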