09-Datamining Concepts
Wipro Technologies
Agenda
► The Data mining Technology
► Data mining Process
Data Preparation
Data Mining Models
► Data Mining Techniques
► Data Mining Applications & Tools
► Data Mining Methodologies
The Data mining Technology
A Problem...
► You are a marketing manager for a brokerage company
Problem: Churn is too high
► Turnover (after the six-month introductory period ends) is 40%
Customers receive incentives (average cost: $160)
when account is opened
Giving new incentives to everyone who might leave is very
expensive (as well as wasteful)
Bringing back a customer after they leave is both difficult
and costly
… A Solution
► One month before the introductory period ends,
predict which customers will leave
If you want to keep a customer that is predicted to churn, offer
them something based on their predicted value
►The ones that are not predicted to churn need no attention
If you don’t want to keep the customer, do nothing
► Drivers
Focus on the customer, competition, and data assets
► Enablers
Increased data hoarding
Cheaper and faster hardware
[Figure: DM at the centre, enabled by a growing base of data, increased computing power, statistical and learning algorithms, and improved data collection and management]
Motivation for doing Data Mining
► Investment in Data Collection/Data Warehouse
► Add value to the data holding
► Competitive advantage
► More effective decision making
► OLTP → Data Warehouse → Decision Support
► Work to add value to the data holding
► Support high level and long term decision making
► Fundamental move in use of Databases
Data Mining - Definition
► Data mining is the automated detection of new, valuable and
non-trivial information in large volumes of data.
► It predicts future trends and finds behavior that the experts
may miss because it lies outside their expectations
Data mining lets you be proactive
Prospective rather than Retrospective
► Data mining leads to simplification and automation of the
overall statistical process of deriving information from huge
volumes of data.
Data Mining Introduction
► DM - what it can do
Exploit patterns & relationships in data to produce
models
Two uses for models:
►Predictive
►Descriptive
► DM - what it can’t do
Automatically find relationships
►without user intervention
►when no relationships exist
Data Mining Introduction
► Data Mining and Data Warehousing
Data preparation for DM may be part of the Data
Warehousing
Data Warehouse not a requirement for Data Mining
► DM and OLAP
OLAP = Classic descriptive model
Requires significant user input
Example : Beer and diaper sales
►An OLAP tool shows reports giving the sales of different items
►A data mining tool analyses the data and predicts how
many times beer and diapers are sold together
Data Mining Introduction
► DM and Classical Statistics
Classical statistics based on elegant theory
and restrictive data assumptions
Fine if data sets small and assumptions met
Modeler plays active role - specifying model
form, interactions, etc
In newer tools, pattern finding is data-driven
rather than user-driven
Data Mining : Introduction
Data Mining is Not ...
► Data warehousing
► SQL / Ad Hoc Queries / Reporting
► Software Agents
► Online Analytical Processing (OLAP)
► Data Visualization
Data Mining Environment
Data mining - Analysis
Examples of Data Mining
► Cross Validation
► Model monitoring
Determine if the model still ‘works’
Data Mining Process
Data Mining Models
Types of Models
Clustering
► Divide the data into a
number of different groups
► Automatically determine the attributes
that characterise a group
► Can be used for
classification of new cases
K-Means Clustering
► User starts by specifying the number of
clusters (K)
► K data points are randomly selected
► Repeat until no change in specific clustering statistics
[Figure: clustering example plot; axis: Age]
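The slide gives only the start and stop conditions. Below is a minimal sketch of the loop in between, assuming the usual assign-and-update steps (each point goes to its nearest centre, each centre moves to the mean of its points) and squared Euclidean distance. The function name `kmeans` and the sample points are illustrative, not taken from the deck.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # K data points are randomly selected as the initial centres.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assumed step: assign every point to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Repeat until the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Assumed step: move each centre to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Illustrative data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the random starting points, so in practice the procedure is often run several times with different seeds.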
Association
► Identifies the items that occur
together in a given event
If ‘A’ occurs, then ‘B’ occurs x% of the time
(the confidence factor). This is found in y%
(the support) of the data.
► Used for ‘Market Basket Analysis’
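Support and confidence for a rule "if A then B" can be counted directly from the transactions. The sketch below assumes support is the fraction of all baskets containing both items and confidence is the fraction of baskets with A that also contain B. The baskets and the function name `support_confidence` are made up for illustration.

```python
def support_confidence(baskets, a, b):
    """Return (support, confidence) of the rule 'if a then b' as fractions."""
    n_a = sum(1 for basket in baskets if a in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = n_ab / len(baskets)              # y: how often A and B occur together
    confidence = n_ab / n_a if n_a else 0.0    # x: how often B occurs when A occurs
    return support, confidence

# Hypothetical market baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
support, confidence = support_confidence(baskets, "diapers", "beer")
print(f"support={support:.0%} confidence={confidence:.0%}")
```

Here diapers and beer appear together in 2 of 5 baskets, and beer appears in 2 of the 3 baskets that contain diapers.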
Association Rules
► Finds relations among attributes in the
data that frequently co-occur.
E.g., association rules. Popular for basket
analysis
If: Buy diapers on Friday night
Then: Buy beer
► Response curves
How does the response rate of a targeted
selection compare to a random selection?
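One way to read a response curve at a single point is to rank customers by model score, take the top fraction, and compare its response rate with the overall rate that a random selection would be expected to achieve. The sketch below assumes that reading. The scores, responses and the function name `response_rates` are hypothetical.

```python
def response_rates(scores, responded, fraction):
    """Response rate in the top-scored `fraction` versus the overall rate."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(len(ranked) * fraction))
    targeted = sum(r for _, r in ranked[:n_selected]) / n_selected
    baseline = sum(responded) / len(responded)  # expected rate of a random selection
    return targeted, baseline

# Hypothetical model scores and actual responses (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
targeted, baseline = response_rates(scores, responded, fraction=0.3)
```

Repeating the calculation for fractions from 0 to 1 traces out the full curve.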
Classification
► Goal of classification is to build structures from
examples of past decisions that can be used to make
decisions for unseen cases.
► Predicts the group (class) into which a new case fits
► The characteristics of the groups can be defined by
an expert or fed from historic data
► Often referred to as supervised learning.
► Decision Tree and Rule induction are popular
techniques
► Neural Networks also used
Regression
► Forecasts the future values based on
existing values
► Types
Simple - one independent variable
Multiple - more than one independent
variable
Example 1: Regression
Name   Income   Age   Order Amt
a      23000    30    83
b      51100    40    131
c      68000    55    178
d      74000    46    166
e      23000    47    117
Pattern:
[Figure: plot with Dose (cc’s) on the horizontal axis, 0 to 1000]
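As an illustration of multiple regression on the Example 1 table, the sketch below fits Order Amt against Income and Age by ordinary least squares, solving the normal equations with Gaussian elimination. Income is expressed in thousands only to keep the arithmetic well-conditioned, and `fit_linear` is an illustrative name. On these five rows the fit comes out close to Order Amt ≈ Income/1000 + 2 × Age.

```python
def fit_linear(rows, target):
    """Least squares with an intercept via the normal equations (X'X) b = X'y."""
    X = [[1.0, *map(float, row)] for row in rows]
    n = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, target))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (A[i][n] - rest) / A[i][i]
    return coef  # [intercept, coefficient per column]

# Rows a to e from the table: (income in thousands, age) and order amount.
rows = [(23.0, 30), (51.1, 40), (68.0, 55), (74.0, 46), (23.0, 47)]
order_amt = [83, 131, 178, 166, 117]
intercept, b_income, b_age = fit_linear(rows, order_amt)
print(f"Order Amt ~ {intercept:.2f} + {b_income:.3f}*Income(k) + {b_age:.3f}*Age")
```

With only five rows this shows the mechanics rather than a reliable model. Dropping the Age column would give the simple (one independent variable) case.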
Time series
► Forecasts the future trends
► Model includes time hierarchy like
year, quarter, month, week etc.
► Considers impact of seasonality,
calendar effects such as holidays
► What is the expected price for
Microsoft’s stock by the end of this
year?
Data Mining
Techniques
Techniques
► Neural Networks
► Decision Trees
► Rule Induction
► K nearest neighbour
Neural Networks
► Parameter adjustment systems
► Interconnected elements (neurons)
► Train the net on a training data set
► Use Trained net to make predictions
► Can deal only with numeric data
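The train-then-predict cycle can be sketched with the smallest possible network: one neuron with a sigmoid output and no middle layer, trained by gradient descent. This is a simplification chosen for brevity, not the architecture of any particular tool. The names `train_neuron` and `predict`, the learning rate and the OR-function training set are all assumptions for illustration.

```python
import math

def predict(weights, bias, x):
    """Feed forward: weighted sum of the inputs passed through a sigmoid."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-total))

def train_neuron(samples, targets, epochs=5000, rate=0.5):
    """Train one sigmoid neuron on numeric data; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            # Compare the prediction with the actual value...
            error = t - predict(weights, bias, x)
            # ...and feed the error back to adjust the weights.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

# Illustrative training set: the logical OR of two numeric inputs.
or_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
or_targets = [0, 1, 1, 1]
weights, bias = train_neuron(or_inputs, or_targets)
```

After training, `predict(weights, bias, (0, 0))` falls below 0.5 and the other three cases fall above it. A network with a middle layer follows the same cycle but propagates the error back through each layer.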
Neural Networks - functioning
[Figure: input layer, middle layer (functions) and output layer; the prediction is compared with actual data, and a learning module feeds back weight adjustments]
(Feed Forward) Neural Networks
[Figure: decision tree on Dose, with splits Dose < 100 / Dose ≥ 100 and Dose < 160 / Dose ≥ 160]
Decision Trees : Examples
Advantages & Limitations
► Advantages
Models can be built very quickly
Suitable for large data sets
Easy to understand
Gives reasons for a decision taken
Handle non-numeric data very well
Minimum amount of data transformation
► Limitations
Leads to an artificial sense of clarity
Trees left to grow without bound take longer to build and become unintelligible
May overfit the data
Algorithms used for splitting are generally univariate - using single independent
variable at a time
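A univariate split can be sketched as a search over candidate thresholds on one variable. The example below scores each threshold with weighted Gini impurity, which is an assumption: the slides do not name a splitting criterion. The dose values, labels and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Threshold on a single variable that minimises weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold

# Hypothetical doses and outcomes.
doses = [50, 80, 120, 150, 200, 300]
outcome = ["no", "no", "yes", "yes", "yes", "yes"]
threshold = best_split(doses, outcome)
```

A full tree builder would apply `best_split` to every variable, keep the best one, and recurse on each side, with a depth or size limit to avoid the unbounded growth and overfitting noted above.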
Rule Induction
► Extraction of useful if-then rules from
data based on statistical significance.
► A completely machine-driven process.
► Can discover very general rules which
deal with both numeric and non-numeric
data.
► Translating the rules into a usable model
must be done either by the user or by a
decision tree interface
Rule Induction Examples
► If Car = Ford and Age = 30…40
Then Defaults = Yes, Weight = 3.7
► If Age = 25…35 and Prior_purchase = No
Then Defaults = No, Weight = 1.2
► Not necessarily exclusive (overlap)
► Start by considering single item rules
If A then B
►A = Missed Payment, B = Defaults on Credit Card
Is observed probability of A & B combination greater than
expected (assuming independence)?
►If it is, the rule describes a predictable pattern
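The independence check for a single-item rule can be sketched as the ratio of the observed probability of A and B together to the product P(A) × P(B). A ratio above 1 means the combination occurs more often than independence would predict. The records and the function name `observed_vs_expected` are invented for illustration.

```python
def observed_vs_expected(records):
    """records: list of (a, b) booleans. Returns (observed, expected, ratio)."""
    n = len(records)
    p_a = sum(1 for a, _ in records if a) / n
    p_b = sum(1 for _, b in records if b) / n
    observed = sum(1 for a, b in records if a and b) / n
    expected = p_a * p_b  # probability of A and B if they were independent
    ratio = observed / expected if expected else 0.0
    return observed, expected, ratio

# Hypothetical accounts: (missed payment, defaults on credit card).
records = (
    [(True, True)] * 3      # missed a payment and defaulted
    + [(True, False)] * 1   # missed a payment, did not default
    + [(False, True)] * 1   # no missed payment, defaulted
    + [(False, False)] * 5  # neither
)
observed, expected, ratio = observed_vs_expected(records)
```

Here the pair is observed in 30% of the records against an expected 16%, so the rule "if Missed Payment then Defaults" would be worth keeping on this sample.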
Rule Induction: Important points
► Look at all possible variable combinations
► Compute probabilities of combinations
► Expensive!
► Look only at rules that predict relevant behavior
► Limit calculations to those with sufficient support
► When moving on to larger combinations of variables (n^3, n^4, n^5, ...),
support decreases dramatically, limiting the calculations
K Nearest Neighbor
►A classification technique
► Decides which class to put a new
case in
► Finds up to k neighbors whose
properties are most similar to the
new case
► Assigns a new case to the same class
to which most of the neighbors belong
K Nearest Neighbor
► Select a set of already classified cases to use
as a basis.
► Decide the neighborhood size in which to do the
comparisons.
► Decide how to count the neighbors (for example, by assigning
weights so that a nearer neighbor counts for more than
a farther one).
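These choices can be sketched as a small classifier: a set of labelled cases, a neighborhood size k, and an optional distance weighting of the votes. Euclidean distance and inverse-distance weights are assumptions, since the slides do not fix either. The data points and the function name `knn_classify` are hypothetical.

```python
from collections import defaultdict

def knn_classify(training, new_case, k=3, weighted=False):
    """training: list of ((x, y), label). Returns the winning label among the k nearest."""
    def distance(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    # Keep the k already classified cases closest to the new case.
    nearest = sorted(training, key=lambda item: distance(item[0], new_case))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        # Optional weighting: a nearer neighbor counts for more than a farther one.
        votes[label] += 1.0 / (distance(point, new_case) + 1e-9) if weighted else 1.0
    # Assign the new case to the class with the most (weighted) votes.
    return max(votes, key=votes.get)

# Hypothetical cases: (dose in cc's, age) with a class label.
training = [
    ((100, 30), "A"), ((120, 35), "A"), ((110, 40), "A"),
    ((800, 60), "B"), ((850, 65), "B"), ((900, 70), "B"),
]
first = knn_classify(training, (130, 33))
second = knn_classify(training, (820, 62), weighted=True)
```

Because the distance mixes variables on different scales, the inputs are usually rescaled first so that no single variable dominates.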
K Nearest Neighbor Model
[Figure: scatter plot of Age (0 to 100) against Dose in cc’s (0 to 1000)]
► Supervised Learning:
1. As the name implies, these techniques are supervised while they learn. The models
are first TRAINED using data whose RESPONSE variable or result is already
KNOWN.
2. Predictive models (classification, regression, etc.) fall under this category. They have
to be trained and then tested: the estimated value is compared with the known value.
► Unsupervised Learning:
•Assign a churn score to all customers in order to identify those who are
most likely to churn (quarter, etc.).
•Clearly define segments that are strongly differentiated by their churn-related
behavior
CHURN ANALYSIS
► Basic Understanding
► Information Sources
► Suggested Analysis
Pareto analysis
Also called 80/20 analysis. It has been observed that 80%
of the revenue or profit comes from 20% of the
customers. The key business improvement is to identify
those 20% and serve them better.
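A Pareto check on customer revenue can be sketched by sorting customers by revenue and measuring the share contributed by the top 20%. The revenue figures and the function name `top_customers_share` are made up for illustration.

```python
def top_customers_share(revenues, customer_fraction=0.2):
    """Share of total revenue contributed by the top `customer_fraction` of customers."""
    ranked = sorted(revenues, reverse=True)
    n_top = max(1, round(len(ranked) * customer_fraction))
    return sum(ranked[:n_top]) / sum(ranked)

# Hypothetical revenue per customer.
revenues = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = top_customers_share(revenues)
print(f"Top 20% of customers contribute {share:.0%} of revenue")
```

Real data rarely splits at exactly 80/20, so the useful output is the list of customers in the top group rather than the ratio itself.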
Techniques/ Reports/Algorithms
Customer profiling
active accounts, light user, risky customer, active accounts, loss-making and
profit-making accounts. This segmentation helps in mapping with the predictive
segment.
Called Lifetime Value Analysis. Revenue is projected over 25 years, together with
the projected churn loss and churn rate.
K.Hazard technique
CHURN ANALYSIS
Suggested Approach
Note
► CIBIL
1. It is the first credit information bureau to be established in India (2003). CIBIL will obtain and
share data on both consumer and commercial borrowers to support sound credit decisions,
thereby helping to avoid adverse selection.
2. The availability of credit information facilitates credit scoring, and credit risk
analysis will play an important role in that.
► Basel II
1. The Basel II Accord serves as a guideline for banks across 32 countries to
reduce credit risk, operational risk and business risk. By 2007, all banks should have a data
warehouse in place so that information is available for risk-related analysis.
Data Mining
Methodology
Data Mining Methodology
Methodologies
CRISP-DM methodology
Wipro - Data mining methodology
CRISP - DM
► CRoss-Industry Standard Process
for Data Mining
► Hierarchical approach
Phase
Generic Task
Specialised Task
Process Instance
CRISP - DM
Mapping Generic to Specific
Model
► Application domain
The specific area in which the data mining project
takes place
► Data mining problem type
The specific class(es) of objective(s) which the
data mining project deals with
► Technical aspect
Covers specific technical issues in data mining
► Tool and technique
The Cycle
Phases
► Business Understanding
► Data Understanding
► Data Preparation
► Modeling
► Evaluation
► Deployment
Business Understanding
► Understanding the project objectives
and requirements from a business
perspective
► Converting these requirements into a
data mining problem definition
► Preliminary plan designed to achieve
the objectives.
Data Understanding
► Initial data collection
► Identifying data quality problems
► Detecting interesting subsets to form
hypotheses about hidden information.
Data Preparation
► Identification of data at Table, Record,
and Attribute level
► Transformation and cleaning of data
for modeling tools.
► Data preparation tasks are performed
multiple times
Modeling
► Identification of the suitable modeling
techniques for the requirement
► Applying the various possible models
on the data set
► Often requires stepping back to the data
preparation phase because of
model-specific requirements
Evaluation
► Evaluate the model, and review the
steps executed to construct the model,
from a business perspective
► Determining the business issues that
have not been sufficiently considered.
► Deciding on the use of the data mining
results
Deployment
► This can be as simple as generating a report
or as complex as implementing a repeatable
data mining process, depending on the
requirements.
► Normally carried out by the customer and
not the data analyst.
► Customer has to understand up front what
actions will need to be carried out in order to
actually make use of the created models.
Wipro’s - DM Process
Model
Business Understanding
•Domain Understanding
•Problem Selection
•Solution Selection