DATA MINING


Data that has relevance for managerial decisions is accumulating at an incredible rate due to a
host of technological advances. Electronic data capture has become inexpensive and ubiquitous
as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-
sale devices, bar-code readers, and intelligent machines.
Such data is often stored in data warehouses and data marts specifically intended for
management decision support. Data mining is a rapidly growing field concerned with
developing techniques that help managers make intelligent use of these repositories.
A number of successful applications have been reported in areas such as credit rating, fraud
detection, database marketing, customer relationship management, and stock market
investments.
The field of data mining has evolved from the disciplines of statistics and artificial intelligence.


Definition

Data mining is the term for the confluence of ideas from statistics and computer
science (machine learning and database methods) applied to large databases in
science, engineering and business.



Gartner Group
• “Data mining is the process of discovering meaningful new
correlations, patterns and trends by sifting through large amounts of
data stored in repositories, using pattern recognition technologies as well
as statistical and mathematical techniques.”
Drivers
• Market: From focus on product/service to focus on customer

• IT: From focus on up-to-date balances to focus on patterns in transactions
(data warehouses, OLAP)

• Dramatic drop in storage costs: huge databases, e.g. Walmart: 20 million
transactions/day, 10 terabyte database; Blockbuster: 36 million households

• Automatic Data Capture of Transactions
– e.g. bar codes, POS devices, mouse clicks, location data (GPS, cell phones)
• Internet: Personalized interactions, longitudinal data
Process
1. Develop understanding of application, goals
2. Create dataset for study (often from Data
Warehouse)
3. Data Cleaning and Preprocessing
4. Data Reduction and projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithms to perform task
8. Interpret and iterate through steps 1-7 if necessary
9. Deploy: integrate into operational systems.
Steps 4-8 make up the data mining step proper.
SEMMA Methodology
• Sample from data sets, Partition into Training, Validation and Test
datasets

• Explore data set statistically and graphically

• Modify: Transform variables, impute missing values

• Model: Fit models, e.g. regression, classification tree, neural net

• Assess: Compare models using the validation and test partitions
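
The Sample step above can be sketched in a few lines of Python. This is only an illustration: the 60/20/20 split, the function name `partition` and the fixed seed are assumptions, not part of SEMMA itself.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test sets.

    The proportions are illustrative assumptions; whatever is left after
    the training and validation shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Models are fitted on the training set, tuned and compared on the validation set, and the test set is held back for a final, unbiased estimate of performance.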

Customer Relationship Management
• Target Marketing
• Attrition Prediction/Churn Analysis
• Fraud Detection
• Credit Scoring
Target marketing
• Business problem: Use list of prospects for direct mailing campaign
• Solution: Use Data Mining to identify the most promising respondents by
combining demographic and geographic data with data on past purchase
behavior
• Benefit: Better response rate, savings in campaign cost
Example: Fleet Financial Group
• Redesign of customer service infrastructure, including $38 million investment in
data warehouse and marketing automation

• Used logistic regression to predict response probabilities to home-equity product
for sample of 20,000 customer profiles from 15 million customer base

• Used to predict profitable customers and customers who would be unprofitable
even if they respond
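
A minimal sketch of how logistic regression produces response probabilities, assuming two made-up, pre-scaled customer features and toy labels (1 = responded). The data, feature meaning, learning rate and function names are all hypothetical; this is not Fleet's model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def response_probability(w, x):
    """P(response) for feature vector x; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the average log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = response_probability(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical training sample: two scaled features per customer.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.4, 0.3], [0.5, 0.5],
     [0.5, 0.5], [0.6, 0.7], [0.7, 0.6], [0.8, 0.9], [0.9, 0.8]]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
w = fit_logistic(X, y)

# Score new prospects and mail the most promising ones first.
prospects = {"p1": [0.9, 0.9], "p2": [0.1, 0.1], "p3": [0.5, 0.5]}
ranked = sorted(prospects,
                key=lambda name: response_probability(w, prospects[name]),
                reverse=True)
print(ranked)
```

Ranking by predicted probability is what allows a campaign to contact only the top slice of a large customer base.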
Churn Analysis: Telcos
• Business Problem: Prevent loss of customers, avoid adding churn-prone
customers

• Solution: Use neural nets, time series analysis to identify typical patterns of
telephone usage of likely-to-defect and likely-to-churn customers

• Benefit: Retention of customers, more effective promotions

Example: IDEA CELLULAR
• CHURN/Customer Profiling System implemented as part of major custom data
warehouse solution

• Preventive action: customer characteristics and known cases of churning and
non-churning customers are used to identify significant characteristics for churn

• Early detection: customer profiling system based on matching usage patterns
against known cases of churned customers.

Fraud Detection
• Business problem: Fraud increases costs or reduces revenue
• Solution: Use logistic regression, neural nets to identify characteristics
of fraudulent cases to prevent in future or prosecute more vigorously
• Benefit: Increased profits by reducing undesirable customers
Risk Analysis
• Business problem: Reduce risk of loans to delinquent customers
• Solution: Use credit scoring models using discriminant analysis to
create score functions that separate out risky customers
• Benefit: Decrease in cost of bad debts
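
As a rough illustration of a discriminant score function, the sketch below computes a two-variable Fisher linear discriminant. The applicant features (income in thousands, debt ratio), the toy data and the midpoint threshold are assumptions made for the example.

```python
def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def fisher_discriminant(good, bad):
    """Return (weights, threshold) of a linear score for two variables."""
    m_good, m_bad = mean(good), mean(bad)
    # Pooled within-class scatter matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((good, m_good), (bad, m_bad)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m_good[0] - m_bad[0], m_good[1] - m_bad[1]]
    # Weights = inverse scatter times the difference of class means.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold = score of the midpoint between the two class means.
    mid = [(m_good[0] + m_bad[0]) / 2, (m_good[1] + m_bad[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]

def score(w, applicant):
    return w[0] * applicant[0] + w[1] * applicant[1]

# Hypothetical past loans: (income in thousands, debt ratio).
good = [[60, 0.2], [75, 0.3], [80, 0.25], [90, 0.1]]
bad = [[30, 0.6], [40, 0.5], [35, 0.7], [50, 0.55]]
w, threshold = fisher_discriminant(good, bad)

applicant = [45, 0.6]
print("risky" if score(w, applicant) < threshold else "acceptable")
```

Applicants scoring below the threshold fall on the risky side; in practice the cut-off would be chosen from the relative costs of bad debt and lost business rather than fixed at the midpoint.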

Finance
• Business problem: Pricing of corporate bonds depends on several
factors: risk profile of the company, seniority of debt, dividends, prior
history, etc.
• Solution Approach: Through Data Mining, develop more accurate
models for predicting prices.

E-commerce and Internet
• Collaborative Filtering
• From Clicks to Customers


Recommendation Systems
• Business opportunity: Users rate items (Amazon.com, CDNOW.com,
MovieFinder.com) on the web. How to use information from other users to infer
ratings for a particular user?
• Solution: Use of a technique known as collaborative filtering
• Benefit: Increase revenues by cross selling, up selling
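
One simple form of collaborative filtering is user-based: predict a user's rating for an item as a similarity-weighted average of other users' ratings. The sketch below assumes cosine similarity over co-rated items and a made-up ratings table; real systems use many refinements.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None   # None: nobody comparable rated it

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1, "C": 4},
    "bob": {"A": 5, "B": 1, "C": 5, "D": 5},
    "cat": {"A": 1, "B": 5, "D": 1},
}
pred = predict_rating(ratings, "ann", "D")
print(round(pred, 2))
```

Ann's tastes resemble Bob's far more than Cat's, so the prediction for item D is pulled toward Bob's high rating.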

Clicks to Customers
• Business problem: 50% of Dell’s clients order their computer through the web.
However, the retention rate is 0.5%, i.e. only 0.5% of visitors to Dell’s web page
become customers.
• Solution Approach: Through the sequence of their clicks, cluster customers and
design website interventions to maximize the number of customers who eventually
buy.
• Benefit: Increase revenues
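
A minimal sketch of the clustering idea, assuming each visitor's click sequence has already been summarized into two hypothetical features (pages viewed, minutes spent configuring a product). It uses plain k-means with caller-supplied starting centroids; this is an illustration, not Dell's method.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from given starting centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical sessions: (pages viewed, minutes in the configurator).
sessions = [(2, 0), (3, 1), (1, 0), (12, 9), (15, 11), (10, 8)]
labels, centers = kmeans(sessions, centroids=[sessions[0], sessions[3]])
print(labels)
```

Each resulting cluster can then be given its own intervention, for example a prompt aimed at casual browsers and a different one for visitors who have already configured a machine.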
Emerging Major Data Mining applications
• Spam
• Bioinformatics/Genomics
• Medical History Data – Insurance Claims
• Personalization of services in e-commerce
• RF Tags
• Security:
– Container Shipments
– Network Intrusion Detection




DATA STORAGE

Core Concepts
• Types of Data:
– Numeric
• Continuous – ratio and interval
• Discrete
• Need for binning
– Categorical – ordered and unordered
– Binary

• Overfitting and Generalization
• Regularization: Penalty for model complexity
• Distance
• Curse of Dimensionality
• Random and stratified sampling, resampling
• Loss Functions
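
Three of the concepts above are small enough to sketch directly: binning a numeric variable, stratified sampling, and a loss function with a regularization penalty. Function names, the equal-width binning rule and the L2 penalty are illustrative choices, not the only options.

```python
import random
from collections import defaultdict

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # avoid zero width if all equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(row)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def penalized_squared_loss(y_true, y_pred, weights, lam):
    """Mean squared error plus an L2 penalty on the model weights."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse + lam * sum(w * w for w in weights)

ages = [18, 25, 33, 47, 52, 68]
bins = equal_width_bins(ages, 5)           # → [0, 0, 1, 2, 3, 4]

# Hypothetical mailing list: 10 responders, 90 non-responders.
rows = [("resp", i) for i in range(10)] + [("nonresp", i) for i in range(90)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.2)
print(bins, len(sample))
```

Stratifying keeps rare classes such as responders represented in the sample in their original proportion, which a simple random sample of the same size does not guarantee. The penalty term grows with the size of the weights, so of two models with equal error the simpler one scores a lower loss.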




Typical characteristics of mining data
• “Standard” format is spreadsheet:
– Row=observation unit, Column=variable
• Many rows, many columns
• Many rows, moderate number of columns (e.g. telephone calls)
• Many columns, moderate number of rows (e.g. genomics)
• Opportunistic (often by-product of transactions)
– Not from designed experiments
– Often has outliers, missing data




Techniques
• Supervised Techniques
– Classification:
• k-Nearest Neighbors, Naïve Bayes, Classification Trees
• Discriminant Analysis, Logistic Regression, Neural Nets
– Prediction (Estimation):
• Regression, Regression Trees, k-Nearest Neighbors


• Unsupervised Techniques
– Cluster Analysis, Principal Components
– Association Rules, Collaborative Filtering
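
k-Nearest Neighbors appears above under both classification and prediction; the sketch below shows why. The same neighbor search supports a majority vote (classification) or an average (prediction). Euclidean distance, k = 3 and the toy data are assumptions for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(train, query, k):
    """The k training rows (features, target) closest to `query`."""
    return sorted(train, key=lambda row: euclidean(row[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_predict(train, query, k=3):
    """Prediction (estimation): average target of the k nearest neighbors."""
    neighbors = nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)

# Hypothetical customers described by two numeric features.
labelled = [((1, 1), "stay"), ((1, 2), "stay"), ((2, 1), "stay"),
            ((8, 8), "churn"), ((9, 8), "churn"), ((8, 9), "churn")]
spend = [((1, 1), 10.0), ((1, 2), 12.0), ((2, 1), 14.0),
         ((8, 8), 50.0), ((9, 8), 55.0), ((8, 9), 60.0)]

print(knn_classify(labelled, (2, 2)), knn_predict(spend, (8.5, 8.5)))
```

Because the method depends entirely on distance, variables usually need to be put on comparable scales first, and performance degrades as the number of dimensions grows.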
