Data Mining - An Overview
IIM Udaipur
In this session we shall learn
• Data Mining - who cares? What is it? Where is it used?
• Some concepts in Data Mining
• Learning types
• Typical steps in Data Mining
What is Data Mining
• “Extracting useful information from large data
sets” – Hand, Mannila, and Smyth (2001)
• “Data mining is the process of exploration and
analysis, by automatic or semi-automatic
means, of large quantities of data in order to
discover meaningful patterns and rules” –
Berry and Linoff (1997)
What is Data Mining
• “[Data mining is] statistics at scale and speed”
- Pregibon (1999)
• “[Data Mining is] the process of discovering
meaningful correlations, patterns and trends
by sifting through large amounts of data
stored in repositories. Data mining employs
pattern recognition technologies, as well as
statistical and mathematical techniques” –
Gartner Group
Where is it used
• Medical research (or broadly, Health research)
• Science and Engineering research
• Military
• Intelligence
• Security
• Business research
• Sports
• And many more….
In the Business World
• From a list of prospective customers, which are most likely
to respond?
• Which customers are most likely to commit fraud?
• Which loan applications are likely to default?
• Which customers are most likely to abandon a subscription
service (telephone, magazine etc.)?
In the Business World
All the questions above can be answered
through classification techniques – logistic
regression or classification trees!
• Individuals whose data best matches that of existing
customers are most likely to respond
• Those with a higher probability of being involved in fraud
• Those with a higher probability of leaving
How did they get here
• Statistical tools – linear regression, logistic
regression, discriminant analysis, principal
components analysis, clustering techniques,
time series analysis and forecasting
• Computer Science tools (machine learning
techniques) – classification trees, artificial
neural networks (ANN), support vector
machines (SVM)
How did they get here
• Shmueli, Patel and Bruce’s (2010) extension of
Pregibon’s (1999) idea of data mining –
“statistics at scale, speed, and simplicity”
“Big” Data and Data Mining
• Walmart captured 20 million transactions per
day in a 10-terabyte database in 2003
• Lyman and Varian (2003) estimated that 5
exabytes of information were produced in
2002 (1 exabyte = 1 million terabytes)
• Scannable bar codes, POS devices, GPS
• Growth of Internet
• Advancement in computational facilities
“Big” Data and Data Mining
• Data warehouses – Central repositories of
integrated data from one or more disparate
sources.
• Data marts – subsets of a data warehouse;
focus on single subjects such as sales, finance
or marketing
Useful Books on Data Mining
Techniques for this course
• “Data Mining for Business Intelligence” –
Shmueli, Patel, and Bruce (Textbook)
• “Data Mining and Business Analytics with R” –
Johannes Ledolter
• “Data Mining Techniques” – Linoff and Berry
• “An Introduction to Statistical Learning” –
James, Witten, Hastie, and Tibshirani
Core ideas in Data Mining
• Data exploration - Reviewing and examining the
data to see what messages they hold
- Full understanding of the data may require a
reduction in its scale or dimension
- Data transformations
- Missing data
- Dealing with outliers
- Dealing with predictors of different types
Core ideas in Data Mining
• Data visualization – graphical exploration of the data
to see what information they hold
- Looking at each variable separately, as well as at
relationships between variables
- For numerical variables - histograms, boxplots
- For categorical variables - bar charts, dot plots
- For pairs of numerical variables, to look for possible
relationships and their type - scatter plots
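As a minimal sketch of this kind of exploration (the age values below are purely hypothetical), the summary statistics a boxplot conveys and a crude text histogram can be computed with Python's standard library alone:

```python
import statistics
from collections import Counter

# Toy numerical variable (e.g., customer ages) -- illustrative values only
ages = [23, 25, 31, 35, 35, 38, 41, 44, 47, 52, 55, 61, 64, 70, 88]

# The kind of summary a boxplot conveys
summary = {
    "min": min(ages),
    "median": statistics.median(ages),
    "mean": round(statistics.mean(ages), 1),
    "max": max(ages),
}
print(summary)

# A crude text histogram: bin ages by decade and draw bars
bins = Counter((a // 10) * 10 for a in ages)
for lo in sorted(bins):
    print(f"{lo:3d}-{lo + 9}: {'#' * bins[lo]}")
```

In practice a plotting library would draw the histogram and boxplot, but the underlying computation is no more than this.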
Core ideas in Data Mining
• Data reduction - Reduction of complex data
into simpler data. Instead of dealing with
thousands of product types, we might want
to put them in a smaller number of groups.
Core ideas in Data Mining
• Prediction - Predict the value of a numerical
(more specifically, continuous) variable
- Examples - sales, revenue, performance
- Each row is a case (unit, subject)
- Each column is a variable
- Technique: Multiple linear regression
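To make the technique concrete, here is a bare-bones multiple linear regression fit by solving the normal equations with Gaussian elimination; the "ads"/"price" data are fabricated to follow sales = 2 + 3·ads + 0.5·price exactly, so the fitted coefficients recover those values:

```python
# Multiple linear regression: solve the normal equations (X'X) b = X'y
# with plain Gaussian elimination -- no external libraries.
def fit_linear(rows, y):
    X = [[1.0] + list(r) for r in rows]   # prepend an intercept column
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]
    return b   # [intercept, slope_1, slope_2, ...]

# Hypothetical data generated from sales = 2 + 3*ads + 0.5*price
rows = [(1, 10), (2, 8), (3, 12), (4, 6), (5, 14), (6, 9)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in rows]
coef = fit_linear(rows, y)
print([round(b, 3) for b in coef])
```

Statistical software does this (and much more, such as standard errors) for you; the sketch only shows what "fitting" means mechanically.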
Core ideas in Data Mining
• Classification – classifying units according to their
characteristics.
- Most basic form of data analysis
- Examples: (a) a loan applicant can repay on time, repay late,
or declare bankruptcy, (b) the recipient of an offer can respond
or not respond, (c) purchase / no purchase, (d) fraud / no fraud
- Each row of data is a case (customer, tax return, applicant)
- Each column is a variable
- Target variable is often binary (yes / no)
- Technique: Logistic regression; Discriminant analysis; k-
Nearest neighbors; Classification trees; Artificial Neural
Networks
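Of the techniques listed, logistic regression is the easiest to sketch from scratch. The toy data below are invented (one predictor, say loan amount, with 1 = default), and the model is fit by simple gradient descent on the negative log-likelihood:

```python
import math

# Logistic regression for a binary target via gradient descent.
# Toy data: one predictor (e.g., loan amount); target 1 = default.
x = [1.0, 2.0, 2.5, 3.0, 5.0, 6.0, 6.5, 7.0]
t = [0,   0,   0,   0,   1,   1,   1,   1]

w, b = 0.0, 0.0            # weight and intercept
lr = 0.5                   # learning rate
for _ in range(2000):
    gw = gb = 0.0          # gradient of the negative log-likelihood
    for xi, ti in zip(x, t):
        p = 1.0 / (1.0 + math.exp(-(w * xi + b)))   # predicted probability
        gw += (p - ti) * xi
        gb += (p - ti)
    w -= lr * gw / len(x)
    b -= lr * gb / len(x)

def classify(xi):
    p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
    return 1 if p >= 0.5 else 0

print([classify(v) for v in x])
```

Real analyses use a statistics package, many predictors, and a validation partition to judge the classifier; the point here is only that the fitted model turns a probability into a yes/no classification via a cutoff.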
Core ideas in Data Mining
• Association rules – Analysis of associations among items
purchased.
- Also called “affinity analysis”
- Data on transactions
- “What goes with what?”
- The “recommender” system of Amazon.com or
Netflix.com
- “Our records show you bought X, you may also like Y”
- Market Basket Analysis - Based on simple conditional
probability concept
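Since market basket analysis rests on simple conditional probability, a rule such as "bread → milk" can be scored directly from transaction data. The five transactions below are made up for illustration:

```python
from itertools import combinations
from collections import Counter

# Market-basket sketch: support and confidence for a rule,
# computed from a small list of hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

# Rule "bread -> milk":
#   support    = P(bread and milk)
#   confidence = P(milk | bread)
support = pair_counts[("bread", "milk")] / n
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here bread and milk appear together in 3 of 5 baskets (support 0.60), and in 3 of the 4 baskets containing bread (confidence 0.75).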
Core ideas in Data Mining
• Predictive analytics - Combination of
classification, prediction and (to some extent)
association rules.
Learning Types
• Supervised learning algorithms
• Unsupervised learning algorithms
Supervised Learning Algorithms
• used in classification and prediction
• must have data available in which value of the
outcome of interest is known
• partitioning the data into two (sometimes
three) parts - training data, validation data,
and (optionally) test data
Supervised Learning Algorithms
Training partition:
• typically the largest partition
• contains the data used to build various models
we are examining
• this is the data from which the classification or
prediction algorithm “learns”, or is “trained”,
about the relationships between the outcome
and predictor variables
Supervised Learning Algorithms
Validation partition:
• after the algorithm has learned from the
training data, it is applied to the validation
data, to see how well it does
• used to assess the performance of each
model, so that we can compare the models,
and pick the best one
• sometimes also used to fine-tune, and hence
improve, the model
Supervised Learning Algorithms
Test partition:
• If many different models are being examined,
then we may save this third partition to see
the performance of the finally chosen model
on new data
• Also called a “holdout”, or “evaluation”
partition
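The three-way partition can be sketched in a few lines; the 60/20/20 fractions and the fixed random seed below are common conventions, not rules:

```python
import random

# Randomly partition a data set into training / validation / test parts.
def partition(rows, seed=42, frac_train=0.6, frac_valid=0.2):
    rng = random.Random(seed)        # fixed seed makes the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    i = int(len(shuffled) * frac_train)
    j = i + int(len(shuffled) * frac_valid)
    return shuffled[:i], shuffled[i:j], shuffled[j:]

data = list(range(100))              # stand-in for 100 data rows
train, valid, test = partition(data)
print(len(train), len(valid), len(test))   # 60 20 20
```

Shuffling before splitting matters: if the rows are ordered (say, by date or by outcome), a naive head/tail split would give partitions with different characteristics.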
Supervised Learning Algorithms
Examples:
• Simple and Multiple Linear Regression
• Logistic Regression
• Discriminant Analysis
• k-Nearest Neighbors
• Classification and Regression Trees
• Artificial Neural Networks
• Support Vector Machines
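Of the methods listed, k-nearest neighbors is the simplest to sketch: classify a new case by a majority vote among its k closest training cases. The two-dimensional points and labels below are invented for illustration:

```python
from collections import Counter

# k-nearest-neighbors classification in two dimensions.
def knn_classify(train, query, k=3):
    # train: list of ((x, y), label) pairs; query: an (x, y) point
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority label among the k nearest

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (1.5, 1.5)))   # "A"
print(knn_classify(train, (8.5, 8.5)))   # "B"
```

Note the supervised pattern: the algorithm needs training cases whose outcome labels are known before it can classify a new case.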
Unsupervised Learning Algorithms
• used where there is no outcome variable to
predict or to classify
• no “learning” from cases where such an
outcome variable is known
Unsupervised Learning Algorithms
Examples:
• Association Rules
• Dimension Reduction Methods (such as
principal component analysis)
• Clustering Techniques
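As a sketch of unsupervised learning, here is a minimal k-means clustering loop. No outcome labels are used; the algorithm alternates between assigning points to their nearest center and moving each center to its cluster's mean. The points are made up, and the naive initialization (first k points as starting centers) is a simplification:

```python
import math

# A minimal k-means clustering sketch (unsupervised: no outcome labels).
def kmeans(points, k=2, iters=20):
    centers = [points[i] for i in range(k)]   # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 2: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points)
print(sorted(len(c) for c in clusters))   # [3, 3]
```

The two tight groups in the toy data are recovered without any labels, which is exactly the "no outcome variable" setting described above.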
Some typical steps in Data Mining
• Develop an understanding of the purpose of the
data mining project
• Obtain the data set to be used in the analysis
• Explore, clean and preprocess the data
• Reduce the data (if necessary). If supervised Data
Mining, then separate the data into training,
validation and test data sets
• Determine the data mining task (classification,
prediction, clustering etc.)
Some typical steps in Data Mining
• Choose the data mining techniques to be used
• Use algorithms to perform the task
• Interpret the results of the algorithms, and
compare the models (in case there are many)
• Deploy the model that performs the best
But most importantly:
The Understanding!