Data Mining - An Overview
IIM Udaipur
In this session we shall learn
• Data Mining - who cares? What is it? Where is it used?
• Some concepts in Data Mining
• Learning types
• Typical steps in Data Mining
What is Data Mining
• “Extracting useful information from large data
sets” – Hand, Mannila, and Smyth (2001)
• “Data mining is the process of exploration and
analysis, by automatic or semi-automatic
means, of large quantities of data in order to
discover meaningful patterns and rules” –
Berry and Linoff (1997)
What is Data Mining
• “[Data mining is] statistics at scale and speed”
- Pregibon (1999)
• “[Data Mining is] the process of discovering
meaningful correlations, patterns and trends
by sifting through large amounts of data
stored in repositories. Data mining employs
pattern recognition technologies, as well as
statistical and mathematical techniques” –
Gartner Group
Where is it used
• Medical research (or broadly, Health research)
• Science and Engineering research
• Military
• Intelligence
• Security
• Business research
• Sports
• And many more….
In the Business World
• From a list of prospective customers, which are most likely
to respond?
• Which customers are most likely to commit fraud?
• Which loan applications are likely to default?
• Which customers are most likely to abandon a subscription
service (telephone, magazine etc.)?
In the Business World
All the questions above can be answered
through classification techniques – logistic
regression or classification trees!
• Individuals whose data best matches that of existing
customers are most likely to respond
• Those with a higher probability of being involved in fraud
• Those with a higher probability of leaving
How did they get here
• Statistical tools – linear regression, logistic
regression, discriminant analysis, principal
components analysis, clustering techniques,
time series analysis and forecasting
• Computer Science tools (machine learning
techniques) – classification trees, artificial
neural networks (ANN), support vector
machines (SVM)
How did they get here
• Shmueli, Patel and Bruce’s (2010) extension of
Pregibon’s (1999) idea of data mining –
“statistics at scale, speed, and simplicity”
“Big” Data and Data Mining
• Walmart captured 20 million transactions per
day in a 10-terabyte database in 2003
• Lyman and Varian (2003) estimated that 5
exabytes of information were produced in
2002 (1 exabyte = 1 million terabytes)
• Scannable bar codes, POS devices, GPS
• Growth of Internet
• Advancement in computational facilities
“Big” Data and Data Mining
• Data warehouses – Central repositories of
integrated data from one or more disparate
sources.
• Data marts – subsets of a data warehouse;
focus on single subjects such as sales, finance
or marketing
Useful Books on Data Mining
Techniques for this course
• “Data Mining for Business Intelligence” –
Shmueli, Patel, and Bruce (Textbook)
• “Data Mining and Business Analytics with R” –
Johannes Ledolter
• “Data Mining Techniques” – Linoff and Berry
• “An Introduction to Statistical Learning” –
James, Witten, Hastie, and Tibshirani
Core ideas in Data Mining
• Data exploration - Reviewing and examining the
data to see what messages they hold
- Full understanding of the data may require a
reduction in its scale or dimension
- Data transformations
- Missing data
- Dealing with outliers
- Dealing with predictors of different types
Core ideas in Data Mining
• Data visualization – graphical exploration of the data
to see what information they hold
- Looking at each variable separately, as well as at
relationships between variables
- For numerical variables - histograms, boxplots
- For categorical variables - bar charts, dot plots
- For pairs of numerical variables, to look for possible
relationships and their type - scatter plots
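As a minimal sketch of this kind of exploration (the age values below are purely hypothetical), the summary statistics a boxplot conveys and a crude text histogram can be computed with Python's standard library alone:

```python
import statistics
from collections import Counter

# Toy numerical variable (e.g., customer ages) -- illustrative values only
ages = [23, 25, 31, 35, 35, 38, 41, 44, 47, 52, 55, 61, 64, 70, 88]

# The kind of summary a boxplot conveys
summary = {
    "min": min(ages),
    "median": statistics.median(ages),
    "mean": round(statistics.mean(ages), 1),
    "max": max(ages),
}
print(summary)

# A crude text histogram: bin ages by decade and draw bars
bins = Counter((a // 10) * 10 for a in ages)
for lo in sorted(bins):
    print(f"{lo:3d}-{lo + 9}: {'#' * bins[lo]}")
```

In practice a plotting library would draw the histogram and boxplot, but the underlying computation is no more than this.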
Core ideas in Data Mining
• Data reduction - Reduction of complex data
into simpler data. Instead of dealing with
thousands of product types, we might want
to put them in a smaller number of groups.
Core ideas in Data Mining
• Prediction - Predict the value of a numerical
(more specifically, continuous) variable
- Examples - sales, revenue, performance
- Each row is a case (unit, subject)
- Each column is a variable
- Technique: Multiple linear regression
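To make the technique concrete, here is a bare-bones multiple linear regression fit by solving the normal equations with Gaussian elimination; the "ads"/"price" data are fabricated to follow sales = 2 + 3·ads + 0.5·price exactly, so the fitted coefficients recover those values:

```python
# Multiple linear regression: solve the normal equations (X'X) b = X'y
# with plain Gaussian elimination -- no external libraries.
def fit_linear(rows, y):
    X = [[1.0] + list(r) for r in rows]   # prepend an intercept column
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]
    return b   # [intercept, slope_1, slope_2, ...]

# Hypothetical data generated from sales = 2 + 3*ads + 0.5*price
rows = [(1, 10), (2, 8), (3, 12), (4, 6), (5, 14), (6, 9)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in rows]
coef = fit_linear(rows, y)
print([round(b, 3) for b in coef])
```

Statistical software does this (and much more, such as standard errors) for you; the sketch only shows what "fitting" means mechanically.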
Core ideas in Data Mining
• Classification – classifying units according to their
characteristics.
- Most basic form of data analysis
- Examples: (a) a loan applicant can repay on time, repay late,
or declare bankruptcy, (b) the recipient of an offer can respond
or not respond, (c) purchase / no purchase, (d) fraud / no fraud
- Each row of data is a case (customer, tax return, applicant)
- Each column is a variable
- Target variable is often binary (yes / no)
- Technique: Logistic regression; Discriminant analysis; k-
Nearest neighbors; Classification trees; Artificial Neural
Networks
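Of the techniques listed, logistic regression is the easiest to sketch from scratch. The toy data below are invented (one predictor, say loan amount, with 1 = default), and the model is fit by simple gradient descent on the negative log-likelihood:

```python
import math

# Logistic regression for a binary target via gradient descent.
# Toy data: one predictor (e.g., loan amount); target 1 = default.
x = [1.0, 2.0, 2.5, 3.0, 5.0, 6.0, 6.5, 7.0]
t = [0,   0,   0,   0,   1,   1,   1,   1]

w, b = 0.0, 0.0            # weight and intercept
lr = 0.5                   # learning rate
for _ in range(2000):
    gw = gb = 0.0          # gradient of the negative log-likelihood
    for xi, ti in zip(x, t):
        p = 1.0 / (1.0 + math.exp(-(w * xi + b)))   # predicted probability
        gw += (p - ti) * xi
        gb += (p - ti)
    w -= lr * gw / len(x)
    b -= lr * gb / len(x)

def classify(xi):
    p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
    return 1 if p >= 0.5 else 0

print([classify(v) for v in x])
```

Real analyses use a statistics package, many predictors, and a validation partition to judge the classifier; the point here is only that the fitted model turns a probability into a yes/no classification via a cutoff.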
Core ideas in Data Mining
• Association rules – Analysis of associations among items
purchased.
- Also called “affinity analysis”
- Data on transactions
- “What goes with what?”
- The “recommender” system of Amazon.com or
Netflix.com
- “Our records show you bought X, you may also like Y”
- Market Basket Analysis - Based on simple conditional
probability concept
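Since market basket analysis rests on simple conditional probability, a rule such as "bread → milk" can be scored directly from transaction data. The five transactions below are made up for illustration:

```python
from itertools import combinations
from collections import Counter

# Market-basket sketch: support and confidence for a rule,
# computed from a small list of hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

# Rule "bread -> milk":
#   support    = P(bread and milk)
#   confidence = P(milk | bread)
support = pair_counts[("bread", "milk")] / n
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here bread and milk appear together in 3 of 5 baskets (support 0.60), and in 3 of the 4 baskets containing bread (confidence 0.75).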
Core ideas in Data Mining
• Predictive analytics - Combination of
classification, prediction and (to some extent)
association rules.
Learning Types
• Supervised learning algorithms
• Unsupervised learning algorithms
Supervised Learning Algorithms
• used in classification and prediction
• must have data available in which value of the
outcome of interest is known
• partitioning the data into two (sometimes
three) parts - training data, validation data,
and (optionally) test data
Supervised Learning Algorithms
Training partition:
• typically the largest partition
• contains the data used to build various models
we are examining
• this is the data from which the classification or
prediction algorithm “learns”, or is “trained”,
about the relationships between the outcome
and predictor variables
Supervised Learning Algorithms
Validation partition:
• after the algorithm has learned from the
training data, it is applied to the validation
data, to see how well it does
• used to assess the performance of each
model, so that we can compare the models,
and pick the best one
• sometimes also used to fine-tune, and hence
improve, the model
Supervised Learning Algorithms
Test partition:
• If many different models are being examined,
then we may save this third partition to see
the performance of the finally chosen model
on new data
• Also called a “holdout”, or “evaluation”
partition
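The three-way partition can be sketched in a few lines; the 60/20/20 fractions and the fixed random seed below are common conventions, not rules:

```python
import random

# Randomly partition a data set into training / validation / test parts.
def partition(rows, seed=42, frac_train=0.6, frac_valid=0.2):
    rng = random.Random(seed)        # fixed seed makes the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    i = int(len(shuffled) * frac_train)
    j = i + int(len(shuffled) * frac_valid)
    return shuffled[:i], shuffled[i:j], shuffled[j:]

data = list(range(100))              # stand-in for 100 data rows
train, valid, test = partition(data)
print(len(train), len(valid), len(test))   # 60 20 20
```

Shuffling before splitting matters: if the rows are ordered (say, by date or by outcome), a naive head/tail split would give partitions with different characteristics.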
Supervised Learning Algorithms
Examples:
• Simple and Multiple Linear Regression
• Logistic Regression
• Discriminant Analysis
• k-Nearest Neighbors
• Classification and Regression Trees
• Artificial Neural Networks
• Support Vector Machines
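Of the methods listed, k-nearest neighbors is the simplest to sketch: classify a new case by a majority vote among its k closest training cases. The two-dimensional points and labels below are invented for illustration:

```python
from collections import Counter

# k-nearest-neighbors classification in two dimensions.
def knn_classify(train, query, k=3):
    # train: list of ((x, y), label) pairs; query: an (x, y) point
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority label among the k nearest

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (1.5, 1.5)))   # "A"
print(knn_classify(train, (8.5, 8.5)))   # "B"
```

Note the supervised pattern: the algorithm needs training cases whose outcome labels are known before it can classify a new case.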
Unsupervised Learning Algorithms
• used where there is no outcome variable to
predict or to classify
• no “learning” from cases where such an
outcome variable is known
Unsupervised Learning Algorithms
Examples:
• Association Rules
• Dimension Reduction Methods (such as
principal component analysis)
• Clustering Techniques
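As a sketch of unsupervised learning, here is a minimal k-means clustering loop. No outcome labels are used; the algorithm alternates between assigning points to their nearest center and moving each center to its cluster's mean. The points are made up, and the naive initialization (first k points as starting centers) is a simplification:

```python
import math

# A minimal k-means clustering sketch (unsupervised: no outcome labels).
def kmeans(points, k=2, iters=20):
    centers = [points[i] for i in range(k)]   # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 2: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points)
print(sorted(len(c) for c in clusters))   # [3, 3]
```

The two tight groups in the toy data are recovered without any labels, which is exactly the "no outcome variable" setting described above.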
Some typical steps in Data Mining
• Develop an understanding of the purpose of the
data mining project
• Obtain the data set to be used in the analysis
• Explore, clean and preprocess the data
• Reduce the data (if necessary). If supervised Data
Mining, then separate the data into training,
validation and test data sets
• Determine the data mining task (classification,
prediction, clustering etc.)
Some typical steps in Data Mining
• Choose the data mining techniques to be used
• Use algorithms to perform the task
• Interpret the results of the algorithms, and
compare the models (in case there are many)
• Deploy the model that performs the best
But most importantly:
The Understanding!