Module 4: Advanced Analytics - Theory and Methods: Lesson 6: Linear Regression
2
EMC PROVEN PROFESSIONAL
Copyright © 2012 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Regression
• Regression focuses on the relationship between an outcome and
its input variables.
In other words, we don't just predict the outcome, we also have a
sense of how changes in individual drivers affect the outcome.
• The outcome can be continuous or discrete.
When it's discrete, we are predicting the probability that the
outcome will occur.
Example Questions:
I want to predict the life time value (LTV) of this customer (and
understand what drives LTV).
I want to predict the probability that this loan will default (and
understand what drives default).
• Our examples: Linear Regression, Logistic Regression
Linear Regression - What is it?
• Used to estimate a continuous value as a linear (additive)
function of other variables
Income as a function of years of education, age, gender
House price as function of median home price in neighborhood,
square footage, number of bedrooms/bathrooms
Neighborhood house sales in the past year based on
unemployment, stock price etc.
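A minimal sketch of fitting such a linear function with ordinary least squares, restricted to one input variable so that it needs only the standard library. The data values and the helper name `fit_line` are made up for illustration and do not come from the course material.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (one input variable)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Made-up data: years of education vs. income in units of $10K.
years_education = [10, 12, 14, 16, 18]
income = [3.0, 4.0, 5.0, 6.0, 7.0]

b0, b1 = fit_line(years_education, income)
predicted = b0 + b1 * 15        # estimated income for 15 years of education
print(b0, b1, predicted)
```

With several input variables the same idea applies, but the coefficients are usually obtained from a statistics package rather than by hand.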
Linear Regression - Use Cases
• The preferred method for almost any problem where we are
predicting a continuous outcome
Try this first; if it fails, then try something more complicated
• Examples:
Customer lifetime value
Home value
Loss given default on loan
Income as a function of demographics
Example: Predict Mortgage Foreclosure/Delinquency Rates
Technical Description
Representing Categorical Variable
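The body of this slide is not reproduced here. One common approach, assumed here rather than taken from the slide, is to replace a categorical variable with 0/1 indicator (dummy) columns, one per level, leaving out a reference level. The helper `dummy_encode` and the sample states are hypothetical.

```python
def dummy_encode(values, reference):
    """Return (levels, rows): one 0/1 column per level except the reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical "state of residence" values; "CA" is the reference level.
states = ["MA", "NY", "CA", "NY"]
levels, rows = dummy_encode(states, reference="CA")
print(levels)   # column order
print(rows)     # one row of indicators per observation
```

Under this coding, the coefficient of each indicator column is read as the difference from the reference level, all other inputs being equal.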
What do the Coefficients b_i Mean?
• Change in y for a unit change in x_i,
all other things being equal
• Example: income in units of $10K, age in years, b_age = 2
For the same gender, years of education, and state of residence, a
person's income increases by 2 units ($20K) for every additional year of age
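The arithmetic behind this reading can be checked with a small sketch. Only b_age = 2 comes from the example; the intercept, the education coefficient, and the function name are assumptions made up for illustration.

```python
# Hypothetical coefficients; only B_AGE = 2 comes from the example above.
B0 = 1.0
B_AGE = 2.0          # income units ($10K) per year of age
B_EDUCATION = 0.5    # assumed value, for illustration only

def predicted_income(age, years_education):
    """Predicted income in units of $10K."""
    return B0 + B_AGE * age + B_EDUCATION * years_education

# Two people who differ only by one year of age.
diff_units = predicted_income(41, 16) - predicted_income(40, 16)
diff_dollars = diff_units * 10_000
print(diff_units, diff_dollars)
```

Because every other input is held fixed, the difference equals the age coefficient regardless of the assumed values.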
Diagnostics
• Hold-out data
Does the model predict well on data it hasn't seen?
• N-fold cross-validation
Partition the data into N groups.
Fit N models, holding out each group, and calculate the residuals
on the group.
Estimated prediction error is the average over all the held-out residuals
(in practice their squares or absolute values, so that errors do not cancel).
• R²: The fraction of the variance in the output variable that the
model can explain.
It is also the square of the correlation between the true output and
the predicted output. You want it close to 1.
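A rough sketch of N-fold cross-validation and R² for a one-variable linear model, using only the standard library. The data are made up, the folds are assigned by index modulo N, and the error is averaged as squared residuals; all three are assumptions for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

def r_squared(ys, preds):
    """Fraction of the variance in ys explained by preds."""
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def cross_validate(xs, ys, n_folds):
    """Fit n_folds models, each holding out one group; return mean squared residual."""
    squared_residuals = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_residuals += [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
    return mean(squared_residuals)

# Made-up data: roughly y = 2x + 1 with small noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(ys, [b0 + b1 * x for x in xs])
cv_mse = cross_validate(xs, ys, n_folds=3)
print(r2, cv_mse)
```

R² here is computed on the training data, while the cross-validated error uses only points the model did not see during fitting.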
Diagnostics (Continued)
• Plot it!
Prediction vs. true outcome
• Look for:
Systematic over/under prediction
Non-consistent variance
The data cloud should be symmetric about the line of true prediction
Glaring outliers
• You will see other diagnostic plots in the lab
[First plot annotation: Overpredicts for low true values, underpredicts at higher values. Improve the model.]
[Second plot annotation: Not quite consistent variance, but much better.]
Linear Regression - Reasons to Choose (+) and Cautions (-)
Reasons to Choose (+)
• Concise representation (the coefficients)
• Robust to redundant variables, correlated variables
Lose some explanatory value
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
Variable transformations and modeling variable interactions can alleviate this
A good idea to take the log of monetary amounts or any variable with a wide dynamic range
Module 4: Advanced Analytics – Theory and Methods
Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
The probability that a borrower will default as a function of his credit
score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
Assign the class label with the highest probability
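A minimal sketch of how a fitted logistic regression model turns inputs into a probability and then into a class label. The coefficients, the input scaling, and the 0.5 threshold are hypothetical values chosen for illustration, not fitted results.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Logistic model: P(y=1) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """For two classes, picking the more probable label is a 0.5 threshold."""
    return 1 if probability >= threshold else 0

# Hypothetical coefficients for (credit score / 100, loan size in $100K).
coefficients = [-1.2, 0.8]
intercept = 6.0

p_default = predict_probability([7.0, 2.5], coefficients, intercept)
label = classify(p_default)
print(p_default, label)
```

Keeping the probability alongside the label is useful when the cost of a wrong "yes" differs from the cost of a wrong "no", since the threshold can then be moved.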
Logistic Regression Use Cases
• The preferred method for many binary classification problems:
Especially if you are interested in the probability of an event, not
just predicting the "yes or no"
Try this first; if it fails, then try something more complicated
• Binary Classification examples:
The probability that a borrower will default
The probability that a customer will churn
• Multi-class example
The probability that a politician will vote yes/vote no/not show up
to vote on a given bill
Logistic Regression Model - Example
Logistic Regression - Visualizing the Model
[Figure legend: blue = defaulters]
Technical Description (Binary Case)
What do the Coefficients b_i Mean?
• Invert the logit expression:
odds(y=1) = P(y=1) / (1 - P(y=1)) = exp(b_0 + b_1 x_1 + ... + b_n x_n)
• exp(b_j) is the odds ratio: the factor by which the odds of y=1 are multiplied for every unit change in x_j
• Example: b_creditScore = -0.69
• exp(b_creditScore) ≈ 0.5 = 1/2
• for the same income, loan, and existing debt, the odds of default are
halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the same way
as in linear regression
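The credit score example can be verified numerically. In this sketch only the coefficient -0.69 comes from the slide; the starting odds of 0.25 are an assumed value for illustration.

```python
import math

b_credit_score = -0.69                        # coefficient from the example
odds_multiplier = math.exp(b_credit_score)    # factor applied to the odds per point

def odds_to_probability(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1 + odds)

odds_before = 0.25                            # assumed: probability of default 0.2
odds_after = odds_before * odds_multiplier    # one point higher credit score

print(odds_multiplier)
print(odds_to_probability(odds_before), odds_to_probability(odds_after))
```

Note that it is the odds that are roughly halved, not the probability; the two are close only when the probability is small.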
An Interesting Fact About Logistic Regression
Diagnostics: ROC Curve
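The figure for this slide is not reproduced here. As a sketch of the standard construction, which is assumed rather than taken from the slide, an ROC curve plots the true positive rate against the false positive rate as the score threshold is lowered, and the area under it summarizes how well the scores separate the classes. The scores and labels below are made up.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up model scores and true labels (1 = event occurred).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```

An area of 0.5 corresponds to random scoring and 1.0 to perfect separation.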
Diagnostics: Plot the Histograms of Scores
[Figure annotation: good separation]
Logistic Regression - Reasons to Choose (+) and Cautions (-)
Reasons to Choose (+)
• Explanatory value:
Relative impact of each variable on the outcome,
in a more complicated way than linear regression
• Robust with redundant variables, correlated variables
Lose some explanatory value
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
Variable transformations and modeling variable interactions can alleviate this
A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way
Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
For example, ZIP code
Module 4: Advanced Analytics – Theory and Methods
Lesson 4: Text Analysis
Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks
• High-dimensionality
Every distinct term is a dimension
Green Eggs and Ham: A 50-D problem!
• Data is unstructured
Text Analysis – Problem-solving Tasks
• Parsing
Impose a structure on the unstructured/semi-structured text for
downstream analysis
• Search/Retrieval
Which documents have this word or phrase?
Which documents are about this topic or this entity?
• Text-mining
Buzz Tracking: The Process
1. Monitor social networks, review sites for mentions of our products.
• Parse the data feeds to get actual content.
• Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews.
• Extract the relevant raw text.
• Convert the raw text into a suitable document representation.
• Index into our review corpus.
3. Sort the reviews by product.
• Classification (or "Topic Tagging")
4. Are they good reviews or bad reviews?
• Classification (sentiment analysis)
• We can keep a simple count here, for trend analysis.
5. Marketing calls up and reads selected reviews in full, for greater insight.
• Search/Information Retrieval
Parsing the Feeds [Parsing]
• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.
Regular Expressions [Parsing]
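As a small illustration of filtering raw text for product names with a regular expression, the sketch below looks for mentions of the fictional bPhone models used in this lesson. The pattern itself is an assumption about how such product names are written.

```python
import re

# Assumed naming scheme: "bPhone-" followed by a model code such as 5X or 4G.
PRODUCT_PATTERN = re.compile(r"\bbPhone-\w+\b", re.IGNORECASE)

text = ("The bPhone-5X has coverage everywhere. "
        "It's much less flaky than my old bPhone-4G.")

mentions = PRODUCT_PATTERN.findall(text)
print(mentions)
```

A real feed would need a broader pattern to catch abbreviations and misspellings of the product names.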
Extract and Represent Text [Parsing]
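The slide body is not reproduced here. One widely used document representation, assumed here for illustration, is the bag of words: the text is lower-cased, split into terms, and reduced to a count per term. The tokenizing pattern and sample text are made up.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into terms, and count each term."""
    # Terms are runs of letters/digits, optionally joined by "-" or "'".
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return Counter(tokens)

review = "Great phone. Great coverage, great battery!"
bag = bag_of_words(review)
print(bag)
```

Word order is discarded in this representation, and each distinct term becomes one dimension of the document.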
Document Representation - Other Features [Parsing]
• Feature:
Anything about the document that is used for search or
analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities
Representing a Corpus (Collection of Documents) [Parsing]
Text Classification (I) - "Topic Tagging" [Text Mining]
"The bPhone-5X has coverage everywhere. It's much less flaky than
my old bPhone-4G."
"Topic Tagging" [Text Mining]
3. Sort the Reviews by Product
Judicious choice of features
Product mentioned in title?
Tweet, or review?
Term frequency
Canonicalize abbreviations
"5X" = "bPhone-5X"
Text Classification (II) - Sentiment Analysis [Text Mining]
Search and Information Retrieval [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
Quality of Search Results [Search & Retrieval]
• Relevance
Is this document what I wanted?
Used to rank search results
• Precision
What % of documents in the result are relevant?
• Recall
Of all the relevant documents in the corpus, what % were returned
to me?
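Precision and recall can be computed directly from the set of returned documents and the set of relevant ones. The document identifiers below are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned docs that are relevant.
    Recall: share of relevant docs that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up search result and ground truth.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]

precision, recall = precision_recall(returned_docs, relevant_docs)
print(precision, recall)
```

Returning more documents tends to raise recall and lower precision, so the two are usually reported together.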
Computing Relevance [Search & Retrieval]
Call up all the documents that have any of the terms from the
query, and count how many times each term occurs:
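The counting step described above might look like the following sketch. The whitespace tokenizer, the tiny corpus, and the choice to rank by the summed counts are simplifying assumptions.

```python
from collections import Counter

def term_count_scores(query, documents):
    """Score each document by how often the query terms occur in it."""
    query_terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:                 # keep only documents with any query term
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up review corpus.
documents = {
    "r1": "great battery great screen",
    "r2": "battery died",
    "r3": "nice screen",
}

ranking = term_count_scores("great battery", documents)
print(ranking)
```

Raw counts favor long documents and very common terms, which is the weakness that inverse document frequency is meant to address.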
Inverse Document Frequency (idf) [Search & Retrieval]
TF-IDF and Modified Retrieval Algorithm [Search & Retrieval]
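The formulas for this slide are not reproduced here. A common formulation, assumed for this sketch, weights each term count by idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other variants (smoothing, length normalization) exist; the corpus is made up.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def inverse_document_frequency(documents):
    """idf(t) = log(N / df(t)); terms found in every document get weight 0."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(tokenize(text)))   # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_search(query, documents):
    """Rank documents by the sum of tf * idf over the query terms."""
    idf = inverse_document_frequency(documents)
    results = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] * idf.get(t, 0.0) for t in tokenize(query))
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)

documents = {
    "r1": "the bphone battery is great",
    "r2": "the battery died the battery is bad",
    "r3": "the screen is great",
}

idf = inverse_document_frequency(documents)
ranking = tfidf_search("the battery", documents)
print(ranking)
```

In this corpus "the" appears in every document, so it contributes nothing to the score and only "battery" drives the ranking.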
Other Relevance Metrics [Search & Retrieval]
• "Authoritativeness" of source
PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users
Effectiveness of Search and Retrieval [Search & Retrieval]
• Relevance metric
important for precision, user experience
• Effective crawl, extraction, indexing
important for recall (and precision)
more important, often, than retrieval algorithm
• MapReduce
Reverse index, corpus term frequencies, idf
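To show why a reverse index fits the MapReduce pattern, here is a single-process sketch: the map step emits (term, document id) pairs, a dictionary stands in for the shuffle, and the reduce step collects the ids per term. The function names and corpus are hypothetical, and no actual MapReduce framework is involved.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(term, doc_ids):
    """Collect the ids of all documents containing the term."""
    return term, sorted(doc_ids)

def build_reverse_index(documents):
    grouped = defaultdict(list)        # stands in for the shuffle/group-by-key step
    for doc_id, text in documents.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())

documents = {
    "r1": "battery is great",
    "r2": "battery died",
    "r3": "screen is great",
}

index = build_reverse_index(documents)
print(index["battery"], index["great"])
```

The length of each posting list is the term's document frequency, so the same pass yields the counts needed for idf.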
Challenges - Text Analysis