DM Module1
Data Mining
Introduction -
Data mining is the process of extracting useful information from large sets of data.
It involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data.
This information can then be used to make data-driven decisions and solve business
problems.
Applications -
Data mining tools and technologies are widely used in various industries, including
finance, healthcare, retail, and telecommunications.
Means of Mining -
It is basically the process carried out for the extraction of useful information
from large volumes of data or from data warehouses.
In that sense, we can think of Data Mining as a step in the process of Knowledge
Discovery or Knowledge Extraction.
Nowadays, data mining is used in almost all places where a large amount of data
is stored and processed.
For example -
banks typically use data mining to identify prospective customers who might be
interested in credit cards, personal loans, or insurance. Since banks have the
transaction details and detailed profiles of their customers, they analyze all this
data and try to find patterns that help them predict which customers are likely to
be interested in personal loans, and so on.
Basically, data mining has been integrated with many techniques from other domains
such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, and visualization. These are used to
gather more information about the data, to uncover hidden patterns, and to predict
future trends and behaviors, which helps businesses make better decisions.
Definitions – Data Mining
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis. Data
mining techniques and tools help enterprises to predict future trends and make
more informed business decisions.
Data mining is a key part of data analytics and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in
data sets. At a more granular level, data mining is a step in the knowledge
discovery in databases (KDD) process, a data science methodology for gathering,
processing and analyzing data. Data mining and KDD are sometimes referred to
interchangeably, but they're more commonly seen as distinct things.
Steps in the KDD Process –
1. Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
i. Cleaning in case of missing values.
ii. Cleaning noisy data, where noise is a random or variance error.
iii. Cleaning with data discrepancy detection and data transformation tools.
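Below is a minimal sketch of the cleaning step in Python (assuming the pandas
library is available); the "age" column and its values are purely illustrative.

import pandas as pd
import numpy as np

# A hypothetical column with missing values and an obvious outlier
df = pd.DataFrame({"age": [25, 32, np.nan, 41, 29, 250, np.nan, 38]})

# i. Missing values: fill with the column median (one common strategy)
df["age"] = df["age"].fillna(df["age"].median())

# ii. Noisy data: clip extreme values to the 1st/99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(lower=low, upper=high)

print(df)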
2. Data Integration
Data integration is defined as the process of combining heterogeneous data from
multiple sources into a common store, such as a data warehouse.
3. Data Selection
Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection. For this we can use Neural
network, Decision Trees, Clustering, and Regression methods.
4. Data Transformation
Data Transformation is defined as the process of transforming data into
appropriate form required by mining procedure. Data Transformation is a
Two step process:
1. Data Mapping: Assigning elements from source base to destination to
capture transformations.
2. Code generation: Creation of the actual transformation program.
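A minimal sketch of this two-step idea in Python (pandas assumed); the source and
destination field names are hypothetical examples only.

import pandas as pd

source = pd.DataFrame({"cust_name": ["Asha", "Ravi"],
                       "dob": ["1990-05-01", "1985-11-23"]})

# 1. Data mapping: which source field goes to which destination field
field_map = {"cust_name": "customer_name", "dob": "date_of_birth"}

# 2. Code generation: the actual transformation program built from the map
def transform(df):
    out = df.rename(columns=field_map)
    out["date_of_birth"] = pd.to_datetime(out["date_of_birth"])  # type conversion
    return out

print(transform(source))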
5. Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms the task-relevant data into patterns and decides
the purpose of the model using classification or characterization.
Challenges –
1] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT).
The complexity of the data may make it challenging to process, analyze,
and understand. In addition, the data may be in different formats, making
it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced
techniques such as clustering, classification, and association rule mining.
These techniques help to identify patterns and relationships in the data,
which can then be used to gain insights and make predictions.
2] Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations also
increase. Moreover, the algorithms must be able to handle streaming
data, which is generated continuously and must be processed in real-
time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it
possible to process large datasets quickly and efficiently.
3] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data.
Data Mining Tasks –
a) Classification
Classification derives a model to determine the class of an object based on its
attributes. A collection of records will be available, each record with a set of attributes.
One of the attributes will be the class attribute, and the goal of the classification task is
to assign a class label to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by
targeting a set of customers who are likely to buy a new product. Using the available
data, it is possible to know which customers purchased similar products and who did not
purchase in the past. Hence, {purchase, don’t purchase} decision forms the class
attribute in this case.
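A minimal classification sketch in Python (scikit-learn assumed): a decision tree is
trained on a few made-up customer records and then assigns the {purchase, don't
purchase} class to new records.

from sklearn.tree import DecisionTreeClassifier

# Attributes: [age, annual income in thousands]; class: 1 = purchase, 0 = don't purchase
X_train = [[25, 30], [40, 80], [35, 60], [22, 25], [50, 90], [28, 35]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Assign a class to new customer records
new_customers = [[30, 70], [23, 28]]
print(model.predict(new_customers))  # e.g. [1 0]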
b) Prediction
Prediction task predicts the possible values of missing or future data. Prediction
involves developing a model based on the available data and this model is used in
predicting future values of a new data set of interest. For example, a model can predict
the income of an employee based on education, experience and other demographic
factors like place of stay, gender, etc. Prediction analysis is also used in different
areas, including medical diagnosis, fraud detection, etc.
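A minimal prediction sketch in Python (scikit-learn assumed): a regression model
estimates income from years of education and experience; the numbers are
illustrative only.

from sklearn.linear_model import LinearRegression

# Attributes: [years of education, years of experience]; target: income (in thousands)
X_train = [[12, 2], [16, 5], [16, 10], [18, 7], [12, 15], [14, 3]]
y_train = [30, 55, 75, 80, 60, 38]

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the income for a new employee profile
print(model.predict([[16, 8]]))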
c) Association
Association discovers the association or connection among a set of items. Association
identifies the relationships between objects. Association analysis is used for commodity
management, advertising, catalog design, direct marketing, etc. A retailer can identify
the products that customers normally purchase together, or even find the customers who
respond to promotions of the same kind of products. If a retailer finds that beer and
nappies are mostly bought together, they can put nappies on sale to promote the sale of
beer.
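A minimal association sketch in plain Python: computing the support and confidence
of the rule {beer} -> {nappies} from a handful of made-up transactions.

transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "nappies", "milk"},
    {"bread", "nappies"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"beer", "nappies"} <= t)
beer = sum(1 for t in transactions if "beer" in t)

support = both / n        # fraction of all transactions containing both items
confidence = both / beer  # fraction of beer transactions that also contain nappies
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.60 and 1.00 here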
d) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity
can be decided based on a number of factors like purchase behavior, responsiveness to
certain actions, geographical locations and so on. For example, an insurance company
can cluster its customers based on age, residence, income etc. This group information
will be helpful to understand the customers better and hence provide better customized
services.
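A minimal clustering sketch in Python (scikit-learn assumed): k-means groups
hypothetical customers by age and income so that similar customers fall in the same
cluster.

import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [22, 25], [25, 30], [23, 28],   # younger, lower income
    [45, 80], [50, 95], [48, 85],   # older, higher income
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                    # one cluster label per customer, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)   # the representative centroid of each cluster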
e) Summarization
Summarization is the generalization of data. A set of relevant data is summarized,
which results in a smaller set that gives aggregated information about the data. For
example, the shopping done by a customer can be summarized into total products, total
spending, offers used, etc. Such high-level summarized information can be useful for
sales or customer relationship teams for detailed customer and purchase behavior
analysis. Data
can be summarized in different abstraction levels and from different angles.
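A minimal summarization sketch in Python (pandas assumed): per-transaction records
are aggregated into one row per customer with total products and total spending.

import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "products": [2, 1, 3, 1, 2],
    "amount":   [20.0, 15.0, 45.0, 10.0, 30.0],
})

summary = orders.groupby("customer").agg(
    total_products=("products", "sum"),
    total_spending=("amount", "sum"),
)
print(summary)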
Data pre-processing -
Data Integration: This involves combining data from multiple sources to create
a unified dataset. Data integration can be challenging as it requires handling
data with different formats, structures, and semantics. Techniques such as
record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories; a short sketch of these three techniques
follows this list.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved through
techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while
preserving the important information.
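A minimal sketch of the transformation techniques listed above, in Python
(scikit-learn assumed); the small "age" column is illustrative only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

X = np.array([[18.0], [25.0], [37.0], [44.0], [60.0]])  # e.g. customer ages

# Normalization: scale to a common range, here [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: zero mean and unit variance
print(StandardScaler().fit_transform(X).ravel())

# Discretization: convert the continuous values into 3 discrete bins
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(X).ravel())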
1. Data Cleaning:
This step handles missing values and smooths noisy data; the main smoothing
approaches are listed below, followed by a short binning sketch.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are
performed to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected,
or they may fall outside the clusters.
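A minimal sketch of smoothing by bin means in Python (NumPy assumed): sorted data is
split into equal-size bins and each value is replaced by the mean of its bin. The
numbers are illustrative only.

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bin_size = 4

smoothed = data.astype(float)
for start in range(0, len(data), bin_size):
    end = start + bin_size
    smoothed[start:end] = data[start:end].mean()

print(smoothed)  # [9. 9. 9. 9. 22.75 ... 29.25]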
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g. -1.0 to
1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels
or conceptual levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This
is done to improve the efficiency of data analysis and to avoid overfitting of the
model. Some common steps involved in data reduction are listed below, with a short
feature extraction and sampling sketch after the list:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove
irrelevant or redundant features from the dataset. It can be done using
various techniques such as correlation analysis, mutual information, and
principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional
and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving
the important information. Compression is often used to reduce the size
of the dataset for storage and transmission purposes. It can be done
using techniques such as wavelet compression, JPEG compression, and
gzip compression.
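A minimal data reduction sketch in Python (scikit-learn and NumPy assumed),
combining two of the techniques above: feature extraction with PCA, followed by
simple random sampling of the records.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # 1000 records with 10 features

# Feature extraction: project the data into a lower-dimensional space
X_reduced = PCA(n_components=3).fit_transform(X)

# Sampling: keep a random subset of the records
idx = rng.choice(len(X_reduced), size=100, replace=False)
X_sample = X_reduced[idx]

print(X.shape, "->", X_sample.shape)   # (1000, 10) -> (100, 3)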