Data Mining Chapter 1

Data mining is the process of automatically searching large datasets to discover patterns and trends. It uses sophisticated mathematical algorithms to segment data and evaluate the probability of future events. The key aspects of data mining are automatic discovery of patterns, prediction of outcomes, creation of actionable information, and focusing on large datasets. Data mining can answer questions that simple queries cannot by discovering non-obvious relationships in the data.

Uploaded by

Rony saha


What Is Data Mining?

Data mining is the practice of automatically searching large stores of data
to discover patterns and trends that go beyond simple analysis. Data
mining uses sophisticated mathematical algorithms to segment the data
and evaluate the probability of future events. Data mining is also known as
Knowledge Discovery in Data (KDD).

The key properties of data mining are:

• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large data sets and databases

Data mining can answer questions that cannot be addressed through
simple query and reporting techniques.

Why is data mining important?

You’ve seen the staggering numbers: the volume of data produced is doubling
every two years. Unstructured data alone makes up 90 percent of the digital
universe. But more information does not necessarily mean more knowledge.

Data mining allows you to:

• Sift through all the chaotic and repetitive noise in your data.
• Understand what is relevant and then make good use of that
information to assess likely outcomes.
• Accelerate the pace of making informed decisions.
1.3 What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of
data as long as the data are meaningful for a target application. The
most basic forms of data for mining applications are database data
(Section 1.3.1), data warehouse data (Section 1.3.2), and
transactional data (Section 1.3.3). The concepts and techniques
presented in this book focus on such data. Data mining can also be
applied to other forms of data (e.g., data streams, ordered/sequence
data, graph or networked data, spatial data, text data, multimedia
data, and the WWW). We present an overview of such data in Section
1.3.4. Techniques for mining of these kinds of data are briefly
introduced in Chapter 13.
1.4 What Kinds of Patterns Can Be Mined?
We have observed various types of data and information repositories on
which data mining can be performed. Let us now examine the kinds of
patterns that can be mined. There are a number of data mining
functionalities. These include characterization and discrimination
(Section 1.4.1); the mining of frequent patterns, associations, and
correlations (Section 1.4.2); classification and regression (Section 1.4.3);
clustering analysis (Section 1.4.4); and outlier analysis (Section 1.4.5). Data
mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks. In general, such tasks can be classified into
two categories: descriptive and predictive. Descriptive mining tasks
characterize properties of the data in a target data set. Predictive mining
tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are
described below. In addition, Section 1.4.6 looks at what makes a pattern
interesting. Interesting patterns represent knowledge.

KDD Process in Data Mining
Last Updated: 20-08-2019
Data Mining – Knowledge Discovery in Databases (KDD).
Why do we need Data Mining?
The volume of information we have to handle is increasing every day, coming
from business transactions, scientific data, sensor data, pictures, videos, etc.
So we need a system that is capable of extracting the essence of the
available information and that can automatically generate reports,
views, or summaries of the data for better decision-making.
Why is Data Mining used in Business?
Data mining is used in business to make better managerial decisions by:
• Automatically summarizing data.
• Extracting the essence of the stored information.
• Discovering patterns in raw data.
Data mining, also known as Knowledge Discovery in Databases, refers
to the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data stored in databases.
Steps Involved in the KDD Process:
[Figure: KDD process]

1. Data Cleaning: Data cleaning is defined as the removal of noisy and
irrelevant data from the collection.
• Cleaning in the case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a
measured variable.
• Cleaning with data discrepancy detection and data
transformation tools.
2. Data Integration: Data integration is defined as combining
heterogeneous data from multiple sources into a common
source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load)
process.
3. Data Selection: Data selection is defined as the process where data
relevant to the analysis is decided on and retrieved from the data
collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the
process of transforming data into the appropriate form required by
the mining procedure.
Data transformation is a two-step process:
• Data mapping: Assigning elements from the source base to the
destination to capture transformations.
• Code generation: Creation of the actual transformation
program.
5. Data Mining: Data mining is defined as the application of clever
techniques to extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or
characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying
the truly interesting patterns representing knowledge, based on given
measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data
understandable to the user.
7. Knowledge Representation: Knowledge representation is defined
as a technique that uses visualization tools to represent data
mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification
rules, characterization rules, etc.
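The preprocessing steps listed above (cleaning, integration, selection, and transformation) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the two record sets, the field names, and the helper functions `clean`, `integrate`, `select`, and `min_max_scale` are hypothetical, and mean imputation and min-max scaling are just one possible choice for each step.

```python
from statistics import mean

# Two hypothetical sources with different schemas (illustrative data only).
crm_rows = [
    {"customer_id": 1, "age": 34, "city": "Dhaka"},
    {"customer_id": 2, "age": None, "city": "Khulna"},  # missing value
    {"customer_id": 3, "age": 51, "city": "Dhaka"},
]
sales_rows = [
    {"cust": 1, "amount": 120.0},
    {"cust": 2, "amount": 80.0},
    {"cust": 3, "amount": 200.0},
]


def clean(rows, field):
    """Step 1: fill missing values of `field` with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def integrate(left, right, left_key, right_key):
    """Step 2: combine two sources into one record set by joining on a key."""
    index = {r[right_key]: r for r in right}
    merged = []
    for row in left:
        match = index.get(row[left_key])
        if match is not None:
            extra = {k: v for k, v in match.items() if k != right_key}
            merged.append({**row, **extra})
    return merged


def select(rows, fields):
    """Step 3: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]


def min_max_scale(rows, field):
    """Step 4: rescale `field` to the range [0, 1] (min-max normalization)."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in rows]


cleaned = clean(crm_rows, "age")
integrated = integrate(cleaned, sales_rows, "customer_id", "cust")
selected = select(integrated, ["customer_id", "age", "amount"])
prepared = min_max_scale(selected, "amount")
for row in prepared:
    print(row)
```

Each function returns new records instead of modifying its input, so any single step can be rerun or swapped out, which suits the iterative nature of the process.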
Note:
• KDD is an iterative process in which evaluation measures can be
enhanced, mining can be refined, and new data can be integrated and
transformed in order to get different and more appropriate results.
• Preprocessing of databases consists of data cleaning and data
integration.
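To make the idea of evaluation measures more concrete, here is a small sketch that scores association rules by support, confidence, and lift, three commonly used interestingness measures. The source does not say which measures it has in mind, so this choice is an assumption, and the transactions, candidate rules, and thresholds are hypothetical.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Assumed thresholds; in practice these are tuned between KDD iterations.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7

candidate_rules = [({"bread"}, {"milk"}), ({"butter"}, {"bread"}), ({"milk"}, {"butter"})]
interesting = [
    (a, c) for a, c in candidate_rules
    if support(a | c, transactions) >= MIN_SUPPORT
    and confidence(a, c, transactions) >= MIN_CONFIDENCE
]
print(interesting)
```

Raising or lowering the two thresholds is one simple way in which evaluation measures can be adjusted from one iteration to the next.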

What is Machine Learning?

“Machine Learning is a subset of artificial intelligence. It focuses mainly on
the designing of systems, thereby allowing them to learn and make
predictions based on some set of matrices in machines”.

How does Machine Learning work?

One approach is to train a machine learning (ML) algorithm on a labelled or
unlabelled training data set to produce a model. New input data is then given
to the model, which makes a prediction. The prediction is evaluated for
accuracy, and if the accuracy is acceptable, the model is deployed.

If the accuracy is not acceptable, the ML algorithm is trained again, repeatedly,
with an augmented training data set. This is only a high-level example, as
other steps are involved. Let’s move on and look at the different types of
machine learning, how each of them works, and how each is used in
various fields.
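The train, evaluate, and retrain cycle described above can be sketched as a short loop. This is a minimal illustration, not a production workflow: the 1-nearest-neighbour model, the one-dimensional data, the extra training batches, and the `TARGET_ACCURACY` threshold are all assumptions made for the example.

```python
def train(training_set):
    """Training a 1-nearest-neighbour model just stores the labelled examples."""
    return list(training_set)


def predict(model, x):
    """Return the label of the stored example closest to x."""
    nearest = min(model, key=lambda example: abs(example[0] - x))
    return nearest[1]


def accuracy(model, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for x, label in test_set if predict(model, x) == label)
    return correct / len(test_set)


# Hypothetical 1-D data: values below 5 are "low", the rest are "high".
training_set = [(1.0, "low"), (7.0, "high")]
extra_batches = [[(4.0, "low"), (6.0, "high")], [(4.8, "low"), (5.2, "high")]]
test_set = [(2.0, "low"), (4.5, "low"), (5.5, "high"), (8.0, "high"), (4.9, "low")]

TARGET_ACCURACY = 0.9  # assumed acceptance threshold

model = train(training_set)
score = accuracy(model, test_set)
rounds = 0
while score < TARGET_ACCURACY and extra_batches:
    training_set = training_set + extra_batches.pop(0)  # augment the training data
    model = train(training_set)                          # retrain
    score = accuracy(model, test_set)                    # re-evaluate
    rounds += 1

print(f"accuracy={score:.2f} after {rounds} retraining round(s)")
```

In a real project the data used to decide when to stop retraining would be kept separate from the data used to report final accuracy; the sketch reuses one test set only to stay short.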

Types of Machine Learning

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Supervised Machine Learning

In supervised learning, you train your model on a labelled dataset, which means
you have both the raw input data and its results. The data is split into a
training dataset and a test dataset: the training dataset is used to train
the model, whereas the test dataset acts as new data for predicting results
and checking the accuracy of the model.

Hence, in supervised learning, the model learns from known results, much as
a teacher who already knows the answers teaches students. Because the true
results are known, the accuracy of the model can be measured directly, and it
is usually high.

Some algorithms for supervised learning

1. Linear Regression
2. Random Forest
3. Support Vector Machines (SVM)
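As an illustration of the first algorithm in this list, the sketch below fits a straight line by ordinary least squares on a training split and measures the error on a held-out test split. The data points are made up for the example, and the function names `fit_line` and `mean_squared_error` are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(model, points):
    """Average squared difference between predictions and known results."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


# Hypothetical labelled data roughly following y = 2x + 1 (illustrative only).
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]

# Simple hold-out split; a real split would normally be shuffled first.
train_set, test_set = data[:6], data[6:]

model = fit_line(train_set)
test_mse = mean_squared_error(model, test_set)
print("slope, intercept:", model)
print("test MSE:", test_mse)
```

The model only ever sees `train_set`; the test points play the role of new data whose known results are used to check the predictions.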

Unsupervised Learning

In unsupervised learning, the information used to train is neither classified nor
labelled in the dataset. Unsupervised learning studies how systems can
infer a function to describe hidden structure from unlabelled data. The main
task of unsupervised learning is to find patterns in the data.

Once a model has learned these patterns, it can assign any new data to them
in the form of clusters. The system does not figure out the right
output, but it explores the data and can draw inferences from datasets to
describe hidden structures in unlabelled data.

Some algorithms available for unsupervised learning are

1. Principal Component Analysis Algorithm
2. K-means Algorithm
3. Singular Value Decomposition Algorithm
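The k-means algorithm from this list can be sketched in plain Python to show how clusters emerge from unlabelled points. This is a simplified version under stated assumptions: it works on 2-D points, takes the first k points as the starting centroids instead of a random or smarter initialization, and the sample points are invented for the example.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points, starting from the first k points as centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # leave the centroid of an empty cluster in place
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabelled points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping comes only from the distances between points.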

Reinforcement Learning

Reinforcement learning is a type of machine learning that allows software agents
and machines to automatically determine the ideal behaviour within a specific
context in order to maximize their performance. There is no labelled dataset or
result associated with the data, so the only way to perform a given task is to
learn from experience.

For every correct action or decision, the algorithm is rewarded with positive
reinforcement, whereas for every incorrect action it is penalized with negative
reinforcement. In this way, it learns which actions to perform and
which to avoid. Reinforcement learning is therefore used primarily in industrial
automation and in the gaming sector.
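The reward mechanism described above can be illustrated with a deliberately tiny example: an agent that repeatedly picks one of two actions and learns from +1 and -1 rewards which one is correct. This is a single-state, epsilon-greedy sketch and not a full reinforcement learning setup with states and transitions; the action names, the environment rule, and the parameter values are all hypothetical.

```python
import random

ACTIONS = ["left", "right"]
CORRECT_ACTION = "right"  # hypothetical environment rule, hidden from the agent


def reward_for(action):
    """Positive reinforcement for the correct action, negative otherwise."""
    return 1.0 if action == CORRECT_ACTION else -1.0


def run_agent(steps=200, epsilon=0.1, seed=0):
    """Learn an estimated reward for each action purely from experience."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}  # estimated reward per action
    counts = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore occasionally
        else:
            action = max(values, key=values.get)  # exploit the best estimate so far
        reward = reward_for(action)
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = run_agent()
best = max(values, key=values.get)
print(values, "-> learned action:", best)
```

The agent is never told which action is correct; it only sees the rewards that follow its own choices.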

Major Issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very
complex and the data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. In this tutorial, we discuss the major issues
regarding −

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, it is necessary for data
mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive so that users can focus the
search for patterns, providing and refining data mining requests based on the
returned results.
• Incorporation of background knowledge − Background knowledge can be used
to guide the discovery process and to express the discovered patterns.
It allows the discovered patterns to be expressed not
only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient and
flexible data mining.
• Presentation and visualization of data mining results − Once patterns
are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining data regularities.
Without data cleaning methods, the accuracy of the discovered
patterns will be poor.
• Pattern evaluation − Discovered patterns may be uninteresting because
they either represent common knowledge or lack novelty, so the interestingness
of the patterns needs to be evaluated.

Performance Issues
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such
as the huge size of databases, the wide distribution of data, and the complexity
of data mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
then processed in parallel. The results from the partitions are then
merged. Incremental algorithms incorporate database updates without mining the
data again from scratch.
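The partition, process, merge, and incremental-update pattern described in the last item can be sketched with item counting as the mining task. This is only a structural illustration: the transactions are made up, the function names are hypothetical, and a thread pool is used for brevity, whereas CPU-bound mining in Python would more likely use separate processes or a distributed framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count in how many transactions each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def partitioned_count(transactions, n_partitions):
    """Split the data, count each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # merge step
    return total


# Hypothetical transactions (illustrative data only).
transactions = [["bread", "milk"], ["bread"], ["milk", "butter"], ["bread", "butter"]]
counts = partitioned_count(transactions, n_partitions=2)

# Incremental update: fold in new transactions without recounting the old ones.
new_transactions = [["milk"], ["bread", "milk"]]
counts.update(count_items(new_transactions))
print(counts)
```

Because the per-partition counts simply add up, the merged result is the same as counting the whole data set in one pass, and the same property is what makes the incremental update possible.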

Diverse Data Types Issues

• Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data, temporal
data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured, or unstructured.
Therefore, mining knowledge from them adds challenges to data mining.

What is data mining?

In your answer, address the following:
Data mining is the process of extracting (“mining”) interesting information or
patterns from large amounts of data so that decisions can be based on them.
(a) Is it another hype?
Data mining is not another hype. The need for data mining arises from the
wide availability of huge amounts of data and the need to transform such data into
useful information on which decisions or analysis can be based. Data
mining is thus the result of the evolution of information technology.
(b) Is it a simple transformation of technology developed from databases, statistics,
and machine learning?
No. Data mining is more than a simple transformation of technology developed from
databases, statistics, and machine learning. It involves integration, rather than a
simple transformation, of techniques from multiple disciplines such as database
technology, statistics, machine learning, high-performance computing, pattern
recognition, neural networks, data visualization, information retrieval, image and
signal processing, and spatial data analysis.
(c) We have presented a view that data mining is the result of the evolution of
database technology. Do you think that data mining is also the result of the
evolution of machine learning research? Can you present such views based on the
historical progress of this discipline? Address the same for the fields of statistics
and pattern recognition.
Database technology began with the development of data collection and database
creation mechanisms that led to the development of effective mechanisms for data
management, including data storage, retrieval, query, and transaction processing. The
large number of database systems offering query and transaction processing
eventually and naturally led to the need for data analysis and understanding. Hence,
data mining began its development out of this necessity.
(d) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.
The steps involved in data mining, viewed as a process of knowledge discovery,
are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the
database
• Data transformation, where data are transformed or consolidated into forms
appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are
applied in order to extract patterns
