
Module – 1

Data Mining

Introduction -

Data mining is the process of extracting useful information from large sets of data.
It involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data.

This information can then be used to make data-driven decisions and solve business
problems.

Applications-

A few applications of data mining include:

i- Customer profiling and segmentation

ii- Market basket analysis

iii- Anomaly detection

iv- Predictive modeling

Data mining tools and technologies are widely used in various industries, including
finance, healthcare, retail, and telecommunications.

Meaning of Mining -

In general terms, “mining” is the process of extracting some valuable material
from the earth, e.g. coal mining, diamond mining, etc.

In the context of computer science, “Data Mining” can be referred to as knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology,
and data dredging.

It is basically the process of extracting useful information from large volumes
of data or data warehouses.

In that sense, we can think of Data Mining as a step in the process of Knowledge
Discovery or Knowledge Extraction.
Nowadays, data mining is used in almost all places where a large amount of data
is stored and processed.

For example:

Banks typically use data mining to find prospective customers who could be
interested in credit cards, personal loans, or insurance. Since banks have the
transaction details and detailed profiles of their customers, they analyze all
this data and try to find patterns that help them predict which customers could
be interested in personal loans, etc.

Main Purpose of Data Mining

Data mining has been integrated with many techniques from other domains, such
as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, and visualization, to gather more
information about the data, to help predict hidden patterns, future trends, and
behaviors, and to allow businesses to make informed decisions.

Definitions – Data Mining

Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis. Data
mining techniques and tools help enterprises to predict future trends and make
more informed business decisions.

Data mining is a key part of data analytics and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in
data sets. At a more granular level, data mining is a step in the knowledge
discovery in databases (KDD) process, a data science methodology for gathering,
processing and analyzing data. Data mining and KDD are sometimes referred to
interchangeably, but they're more commonly seen as distinct things.

KDD (Knowledge Discovery in Databases)

KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.

The KDD process includes the following steps:

1. Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
i. Cleaning in case of missing values.
ii. Cleaning noisy data, where noise is a random error or variance in the data.
iii. Cleaning with data discrepancy detection and data transformation tools.

2. Data Integration
Data integration is defined as combining heterogeneous data from multiple
sources into a common source (data warehouse). Data integration can be carried
out using:
Data migration tools,
Data synchronization tools, and
the ETL (Extract, Transform, Load) process, as sketched below.
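
A minimal pandas sketch of the integration idea, combining two hypothetical
customer tables into one; the table and column names are invented, not from
these notes:

import pandas as pd

# Hypothetical customer data from two sources.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4],
                        "monthly_spend": [120.0, 85.5, 40.0]})

# A toy ETL pass: extract (already loaded), transform (standardize a
# column name), and load into one combined table.
billing = billing.rename(columns={"monthly_spend": "spend"})
combined = crm.merge(billing, on="customer_id", how="outer")
print(combined)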
3. Data Selection
Data selection is defined as the process where data relevant to the analysis
is decided on and retrieved from the data collection. Methods such as neural
networks, decision trees, clustering, and regression can be used at this stage.
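
A minimal selection sketch with pandas; the columns and the filter condition
are assumptions for illustration only:

import pandas as pd

# Hypothetical transactions table.
df = pd.DataFrame({"age": [25, 40, 31, 58],
                   "income": [30000, 72000, 48000, 91000],
                   "notes": ["a", "b", "c", "d"]})

# Keep only the attributes and rows relevant to the analysis task.
relevant = df.loc[df["age"] >= 30, ["age", "income"]]
print(relevant)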

4. Data Transformation
Data transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to
capture transformations.
2. Code generation: creation of the actual transformation program.
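
A minimal sketch of the two-step idea in plain Python: the mapping is declared
as data, and a small transformation "program" is generated from it; the field
names are hypothetical:

mapping = {"cust_name": "name", "amt": "amount"}  # source -> destination

def make_transformer(field_map):
    """Generate a function that renames source fields per the mapping."""
    def transform(record):
        return {field_map.get(key, key): value for key, value in record.items()}
    return transform

transform = make_transformer(mapping)
print(transform({"cust_name": "Asha", "amt": 120.0}))
# -> {'name': 'Asha', 'amount': 120.0}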

5. Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model, using classification or characterization.
Challenges –

Data mining, the process of extracting knowledge from data, has become
increasingly important as the amount of data generated by individuals,
organizations, and machines has grown exponentially. However, data mining is
not without its challenges.
1] Data Quality
The quality of data used in data mining is one of the most significant
challenges. The accuracy, completeness, and consistency of the data
affect the accuracy of the results obtained. The data may contain errors,
omissions, duplications, or inconsistencies, which may lead to inaccurate
results.

Data quality issues can arise for a variety of reasons, including:

a) Data entry errors

b) Data storage issues

c) Data integration problems

d) Data transmission errors

To address these challenges, data mining practitioners must apply data cleaning
and data pre-processing techniques to improve the quality of the data. Data
cleaning involves detecting and correcting errors, while data pre-processing
involves transforming the data to make it suitable for data mining.

2] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT).
The complexity of the data may make it challenging to process, analyze,
and understand. In addition, the data may be in different formats, making
it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced
techniques such as clustering, classification, and association rule mining.
These techniques help to identify patterns and relationships in the data,
which can then be used to gain insights and make predictions.

3] Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As
more data is collected, stored, and analyzed, the risk of data breaches and
cyber-attacks increases. The data may contain personal, sensitive, or
confidential information that must be protected.
To address this challenge, data mining practitioners must apply data encryption
techniques to protect the privacy and security of the data; data encryption
involves using algorithms to encode the data so that it is unreadable to
unauthorized users.
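
A minimal sketch of symmetric encryption using the cryptography package; the
notes name no specific library, so this choice and the sample record are
assumptions:

# pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # the key must itself be kept secret
cipher = Fernet(key)

record = b"name=Asha;card=XXXX-1234"  # hypothetical sensitive record
token = cipher.encrypt(record)        # unreadable without the key
print(cipher.decrypt(token))          # original bytes recovered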

4] Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations also
increase. Moreover, the algorithms must be able to handle streaming data, which
is generated continuously and must be processed in real time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it
possible to process large datasets quickly and efficiently.
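
A minimal PySpark sketch of distributed aggregation, assuming pyspark is
installed; the input file transactions.csv and its customer_id column are
hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalable-mining").getOrCreate()
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# The groupBy/count is executed in parallel across the cluster's nodes.
df.groupBy("customer_id").count().show()
spark.stop()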

5] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data.

Moreover, data mining algorithms may not be transparent, making it challenging
to detect biases or discrimination.

DATA MINING TASKS & FUNCTIONALITY-

Data mining activities can be divided into two categories:

1] Descriptive Data Mining: This category is concerned with summarizing or
exploring the data, and it can be used to answer questions such as: What are
the most common patterns or relationships in the data?
2] Predictive Data Mining: This category of data mining is concerned
with developing models that can predict future behaviour or outcomes
based on historical data. Predictive data mining is often used for
classification or regression tasks, and it can be used to answer questions
such as: What is the likelihood that a customer will churn? What is the
expected revenue for a new product launch?

Data Mining Tasks –

a) Classification
Classification derives a model to determine the class of an object based on its
attributes. A collection of records is available, each record with a set of attributes.
One of the attributes is the class attribute, and the goal of the classification task is
to assign a class label to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by
targeting a set of customers who are likely to buy a new product. Using the available
data, it is possible to know which customers purchased similar products and which did
not purchase in the past. Hence, the {purchase, don’t purchase} decision forms the
class attribute in this case.
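
A toy classification sketch with scikit-learn's decision tree; the features,
labels, and values are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Each record: [age, income]; label 1 = purchase, 0 = don't purchase.
X = [[25, 30000], [40, 72000], [31, 48000], [58, 91000]]
y = [0, 1, 0, 1]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[35, 65000]]))  # predicted class for a new customer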

b) Prediction
The prediction task predicts possible values of missing or future data. Prediction
involves developing a model based on the available data, and this model is used to
predict future values for a new data set of interest. For example, a model can predict
the income of an employee based on education, experience, and other demographic
factors like place of stay, gender, etc. Prediction analysis is also used in areas
such as medical diagnosis and fraud detection.
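
A toy regression sketch with scikit-learn; the experience and income figures
are invented:

from sklearn.linear_model import LinearRegression

X = [[1], [3], [5], [10]]           # years of experience
y = [30000, 42000, 55000, 90000]    # annual income

model = LinearRegression().fit(X, y)
print(model.predict([[7]]))         # predicted income at 7 years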

c) Time-Series Analysis
Time series is a sequence of events where the next event is determined by one or more
of the preceding events. Time series reflects the process being measured and there are
certain components that affect the behavior of a process. Time series analysis includes
methods to analyze time-series data in order to extract useful patterns, trends, rules and
statistics. Stock market prediction is an important application of time- series analysis.
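
A minimal time-series sketch with pandas: a moving average smooths a made-up
daily price series to expose the underlying trend:

import pandas as pd

prices = pd.Series([100, 102, 101, 105, 107, 106, 110],
                   index=pd.date_range("2024-01-01", periods=7))
print(prices.rolling(window=3).mean())  # 3-day moving average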

d) Association
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing etc. A retailer can identify
the products that normally customers purchase together or even find the customers who
respond to the promotion of the same kind of products. If a retailer finds that beer
and nappies are often bought together, he can put nappies on sale to promote the sale
of beer.
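
A minimal association sketch in plain Python that counts pairwise co-occurrence
(support) over a few made-up market baskets:

from collections import Counter
from itertools import combinations

baskets = [{"beer", "nappies"}, {"beer", "nappies", "chips"},
           {"milk", "bread"}, {"beer", "chips"}]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(baskets))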

e) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity
can be decided based on a number of factors like purchase behavior, responsiveness to
certain actions, geographical locations and so on. For example, an insurance company
can cluster its customers based on age, residence, income etc. This group information
will be helpful to understand the customers better and hence provide better customized
services.
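
A toy clustering sketch with scikit-learn's k-means, grouping invented
customers by age and income:

from sklearn.cluster import KMeans

X = [[22, 25000], [25, 27000], [47, 80000], [52, 85000]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # cluster assignment for each customer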

f) Summarization
Summarization is the generalization of data. A set of relevant data is summarized,
which results in a smaller set that gives aggregated information about the data. For
example, the shopping done by a customer can be summarized into total products, total
spending, offers used, etc. Such high-level summarized information can be useful for
sales or customer relationship teams for detailed customer and purchase behavior
analysis. Data can be summarized at different abstraction levels and from different
angles.
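
A minimal summarization sketch with pandas, aggregating fabricated purchases
into per-customer totals:

import pandas as pd

purchases = pd.DataFrame({"customer": ["A", "A", "B", "B", "B"],
                          "amount": [12.0, 30.0, 7.5, 20.0, 5.0]})
summary = purchases.groupby("customer")["amount"].agg(["count", "sum"])
print(summary)  # products bought and total spending per customer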

Data pre-processing -

Data pre-processing is an important step in the data mining process. It refers to
the cleaning, transforming, and integrating of data in order to make it ready for
analysis. The goal of data pre-processing is to improve the quality of the data
and to make it more suitable for the specific data mining task.

Some common steps in data pre-processing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation, removal,
and transformation.

Data Integration: This involves combining data from multiple sources to create
a unified dataset. Data integration can be challenging as it requires handling
data with different formats, structures, and semantics. Techniques such as
record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved through
techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while
preserving the important information.

Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and machine
learning algorithms that require categorical data. It can be achieved through
techniques such as equal-width binning, equal-frequency binning, and clustering.
A short transformation and discretization sketch follows.
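
A minimal sketch of normalization, standardization, and equal-width
discretization with scikit-learn; the single-feature toy matrix is invented:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [7.0], [10.0]])

print(MinMaxScaler().fit_transform(X))    # normalize to [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(KBinsDiscretizer(n_bins=2, encode="ordinal",
                       strategy="uniform").fit_transform(X))  # equal-width bins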
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:

1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing
values manually, with the attribute mean, or with the most probable value.
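
A minimal missing-value sketch with pandas; the small table is fabricated for
illustration:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 58],
                   "income": [30000, 72000, None, 91000]})

dropped = df.dropna()          # ignore incomplete tuples
filled = df.fillna(df.mean())  # fill with the attribute mean
print(filled)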

(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated by faulty data collection, data entry errors, etc. It can be
handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size, and the values in each segment are then
smoothed, for example by replacing them with the bin mean, median, or
boundaries (see the sketch after this list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Values that fall outside the
clusters can be treated as outliers, though some outliers may go undetected.
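
A minimal sketch of the binning method, smoothing a made-up sorted series by
bin means with pandas:

import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 28])
bins = pd.cut(values, bins=3)                      # equal-width segments
smoothed = values.groupby(bins).transform("mean")  # replace by bin mean
print(smoothed.tolist())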

2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:

1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval
labels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
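
A minimal concept-hierarchy sketch: a made-up city attribute rolled up to
country via a mapping:

import pandas as pd

hierarchy = {"Mumbai": "India", "Pune": "India", "Paris": "France"}
cities = pd.Series(["Mumbai", "Paris", "Pune"])
print(cities.map(hierarchy))  # city -> country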

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This
is done to improve the efficiency of data analysis and to avoid overfitting of the
model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove
irrelevant or redundant features from the dataset. It can be done using
various techniques such as correlation analysis, mutual information, and
principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional
space while preserving the important information. Feature extraction is often
used when the original features are high-dimensional and complex. It can be
done using techniques such as PCA, linear discriminant analysis (LDA), and
non-negative matrix factorization (NMF); see the sketch after this list.
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving
the important information. Compression is often used to reduce the size
of the dataset for storage and transmission purposes. It can be done
using techniques such as wavelet compression, JPEG compression, and
gzip compression.
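
A minimal data-reduction sketch combining PCA (feature extraction) with random
sampling; the toy data is generated purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                   # 100 records, 4 features

X_small = PCA(n_components=2).fit_transform(X)  # extract: 4 -> 2 dimensions
sample = X_small[rng.choice(len(X_small), size=20, replace=False)]  # sample rows
print(X.shape, "->", X_small.shape, "-> sample", sample.shape)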
