Module 4: Advanced Analytics - Theory and Methods: Lesson 6: Linear Regression

Uploaded by

Babu Rao

Module 4: Advanced Analytics – Theory and Methods

Lesson 6: Linear Regression

During this lesson the following topics are covered:


• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear
regression model

2
EMC PROVEN PROFESSIONAL
Copyright © 2012 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Regression
• Regression focuses on the relationship between an outcome and
its input variables.
 In other words, we don't just predict the outcome, we also have a
sense of how changes in individual drivers affect the outcome.
• The outcome can be continuous or discrete.
 When it's discrete, we are predicting the probability that the
outcome will occur.
Example Questions:
 I want to predict the lifetime value (LTV) of this customer (and
understand what drives LTV).
 I want to predict the probability that this loan will default (and
understand what drives default).
• Our examples: Linear Regression, Logistic Regression

Linear Regression - What is it?
• Used to estimate a continuous value as a linear (additive)
function of other variables
 Income as a function of years of education, age, gender
 House price as a function of median home price in neighborhood,
square footage, number of bedrooms/bathrooms
 Neighborhood house sales in the past year based on
unemployment, stock prices, etc.

• Input variables can be continuous or discrete.


• Output:
 A set of coefficients that indicate the relative impact of each
driver.
 A linear expression for predicting outcome as a function of drivers.

Linear Regression - Use Cases
• The preferred method for almost any problem where we are
predicting a continuous outcome
 Try this first; if it fails, then try something more complicated
• Examples:
 Customer lifetime value
 Home value
 Loss given default on loan
 Income as a function of demographics

Example: Predict Mortgage Foreclosure/Delinquency Rates

fdq_rate = -0.9 + 0.66 CurrentUnemp + 1.06 ChgInUnem1yr + 0.22 hicost_mort_rate
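Scoring a new record with this model is a single weighted sum. The sketch below copies the coefficients from the equation above; the function name `predict_fdq_rate` and the sample input values are made up for illustration and are not part of the original example.

```python
# Coefficients copied from the fitted equation on this slide.
INTERCEPT = -0.9
COEFFICIENTS = {
    "CurrentUnemp": 0.66,
    "ChgInUnem1yr": 1.06,
    "hicost_mort_rate": 0.22,
}

def predict_fdq_rate(record):
    """Return intercept + sum of coefficient * input value."""
    return INTERCEPT + sum(coef * record[name] for name, coef in COEFFICIENTS.items())

# Hypothetical inputs, chosen only to show the arithmetic.
example = {"CurrentUnemp": 5.0, "ChgInUnem1yr": 1.0, "hicost_mort_rate": 10.0}
print(round(predict_fdq_rate(example), 2))
```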

Technical Description

• Solve for the b_i in y = b_0 + b_1 x_1 + ... + b_n x_n
 Ordinary Least Squares
 storage quadratic in number of variables
 must invert a matrix
• Categorical variables are expanded to a set of indicator
variables, one for each possible value.
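To make the ordinary least squares step concrete, here is a minimal pure-Python sketch. It builds the matrix X^T X (whose size is quadratic in the number of variables) and solves the resulting linear system by Gaussian elimination rather than forming the inverse explicitly. The function names and the toy data are assumptions made for illustration; a standard statistics package would normally do this for you.

```python
def solve_linear_system(a, rhs):
    """Solve a * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [a | rhs], copied so the inputs are left untouched.
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    # X^T X is k-by-k: storage is quadratic in the number of variables.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the toy outcome is an exact linear function of the inputs, the recovered coefficients should match the generating values up to rounding error.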

Representing Categorical Variables

• State is a categorical variable: 50 possible values.


• Expand it to 49 indicator (0/1) variables:
 The remaining level is the "default level"
 This is done automatically by standard packages
• Gender is categorical, too, but binary
 so one variable: genderMale, which is 0 for females
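The expansion described above can be sketched in a few lines. The helper `expand_categorical` is hypothetical, and choosing the first level in sorted order as the default level is an assumption of this sketch; real packages have their own conventions for picking the default.

```python
def expand_categorical(values, name):
    """Expand a categorical column into 0/1 indicator columns.

    The first level in sorted order is treated as the default level and gets
    no column of its own, so k levels produce k - 1 indicators.
    """
    levels = sorted(set(values))
    default_level, other_levels = levels[0], levels[1:]
    columns = {f"{name}{level}": [1 if v == level else 0 for v in values]
               for level in other_levels}
    return default_level, columns

gender = ["Female", "Male", "Male", "Female"]
default_level, columns = expand_categorical(gender, "gender")
print(default_level, columns)
```

With two levels the result is a single column, `genderMale`, which is 0 for females, matching the slide.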

What do the Coefficients bi Mean?
• Change in y as a function of unit change in xi
 all other things being equal
• Example: income in units of $10K, age in years, b_age = 2
 For the same gender, years of education, and state of residence, a
person's income increases by 2 units ($20K) for every year older
• Standard packages also report the significance of the b_i as a p-value:
roughly, the probability of seeing an estimate this far from 0 if, in
reality, b_i = 0
 b_i is "significant" if this p-value is small

Diagnostics
• Hold-out data
 Does the model predict well on data it hasn't seen?
• N-fold cross-validation
 Partition the data into N groups.
 Fit N models, each time holding out one group, and calculate the
residuals on the held-out group.
 Estimated prediction error is the average of the squared residuals
over all the groups.
• R^2: The fraction of the variance in the output variable that the
model can explain.
 It is also the square of the correlation between the true output and
the predicted output. You want it close to 1.
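A small sketch of these two diagnostics for a one-variable model. The data are made up, the folds are formed by simple striding (an assumption; random partitioning is more common in practice), and the prediction error is taken as the mean squared residual on the held-out points.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for a single input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def r_squared(ys, predictions):
    """Fraction of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def cross_validated_mse(xs, ys, n_folds):
    """Hold out each fold in turn and average the squared residuals."""
    squared_residuals = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_residuals += [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test]
    return sum(squared_residuals) / len(squared_residuals)

# Made-up data: roughly y = 2x with small alternating noise.
xs = list(range(10))
ys = [2 * x + (0.5 if x % 2 else -0.5) for x in xs]
b0, b1 = fit_line(xs, ys)
print(r_squared(ys, [b0 + b1 * x for x in xs]))
print(cross_validated_mse(xs, ys, n_folds=5))
```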

Diagnostics (Continued)
• Plot it!
 Prediction vs. true outcome
• Look for:
 Systematic over/under prediction
 Non-consistent variance
 The data cloud should be symmetric about the line of true prediction
 Glaring outliers
• You will see other diagnostic plots in the lab
[Figure: two plots of prediction vs. true outcome. First plot: "Overpredicts for low true values, underpredicts at higher values. Improve the model." Second plot: "Not quite consistent variance, but much better."]

Linear Regression - Reasons to Choose (+) and Cautions (-)
Reasons to Choose (+)
• Concise representation (the coefficients)
• Robust to redundant variables, correlated variables
 Lose some explanatory value
• Explanatory value
 Relative impact of each variable on the outcome
• Easy to score data
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
 Variable transformations and modeling variable interactions can alleviate this
 A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Can't handle variables that affect the outcome in a discontinuous way
 Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
 For example, ZIP code

Module 4: Advanced Analytics – Theory and Methods

Lesson 6: Logistic Regression


During this lesson the following topics are covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model

Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
 The probability that a borrower will default as a function of his credit
score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
 Assign the class label with the highest probability

• Input variables can be continuous or discrete


• Output:
 A set of coefficients that indicate the relative impact of each driver
 A linear expression for predicting the log-odds of the outcome as a
function of drivers. (Binary classification case)
 Log-odds easily converted to the probability of the outcome

Logistic Regression Use Cases
• The preferred method for many binary classification problems:
 Especially if you are interested in the probability of an event, not
just predicting the "yes or no"
 Try this first; if it fails, then try something more complicated
• Binary Classification examples:
 The probability that a borrower will default
 The probability that a customer will churn
• Multi-class example
 The probability that a politician will vote yes/vote no/not show up
to vote on a given bill

Logistic Regression Model - Example

• Training data: default is 0/1


 default=1 if loan defaulted
• The model will return the probability that a loan with given
characteristics will default
• If you only want a "yes/no" answer, you need a threshold
 The standard threshold is 0.5

Logistic Regression - Visualizing the Model
• Overall fraction of default: ~20%
• Logistic regression returns a score that estimates the probability that a borrower will default
• The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1
 Blue = defaulters

Technical Description (Binary Case)

• y=1 is the case of interest: 'TRUE'
ln( P(y=1) / (1 - P(y=1)) ) = b_0 + b_1 x_1 + ... + b_n x_n
• LHS is called logit(P(y=1))
 hence, "logistic regression"
• logit(P(y=1)) is inverted by the sigmoid function
 standard packages can return probability for you
• Categorical variables are expanded as with linear regression
• Iterative, not closed form solution
 "Iteratively re-weighted least squares"
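The relationship between the logit and the sigmoid can be checked directly. This is a minimal sketch; the intercept, coefficients, and inputs are hypothetical values chosen only to show the conversion from log-odds to probability, and the 0.5 cut-off is the standard threshold mentioned earlier.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value z back to a probability."""
    return 1 / (1 + math.exp(-z))

def predict_probability(intercept, coefficients, inputs):
    """Evaluate the linear expression, then convert log-odds to probability."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return sigmoid(log_odds)

# Hypothetical coefficients and inputs, for illustration only.
p = predict_probability(-1.0, [0.8, -0.5], [2.0, 1.0])
print(p, p >= 0.5)  # compare against a 0.5 threshold for a yes/no answer
```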

What do the Coefficients bi Mean?
• Invert the logit expression:
P(y=1) / (1 - P(y=1)) = exp(b_0 + b_1 x_1 + ... + b_n x_n)
• exp(b_j) tells us how the odds of y=1 change for every unit change in x_j
• Example: b_creditScore = -0.69
 exp(b_creditScore) ≈ 0.5 = 1/2
 for the same income, loan, and existing debt, the odds of default are
halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the same way
as in linear regression
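The arithmetic behind the credit score example can be reproduced as follows. The coefficient is the one quoted above; the starting odds of 1:4 are a hypothetical value used only to show how the multiplier compounds.

```python
import math

b_credit_score = -0.69              # coefficient from the example above
odds_multiplier = math.exp(b_credit_score)
print(round(odds_multiplier, 3))    # close to 0.5, since ln(1/2) is about -0.693

# Starting from hypothetical odds of 1:4 (probability 0.2), apply the
# multiplier once per one-point increase in credit score.
odds = 0.25
for points in range(1, 4):
    odds *= odds_multiplier
    print(points, round(odds / (1 + odds), 4))  # odds converted back to probability
```

Note that it is the odds, not the probability, that are halved at each step.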

An Interesting Fact About Logistic Regression

"The probability mass equals the counts"

• If 13% of our loan risk training set defaults


 The sum of all the training set scores will be 13% of the number of
training examples

• If 40% of applicants with income < $50,000 default


 The sum of all the training set scores of people in this income
category will be 40% of the number of examples in this income
category
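This property can be verified numerically. The sketch below fits a one-variable logistic regression with an intercept by Newton's method on made-up data and compares the sum of the training scores with the number of positive examples. Two caveats: the identity relies on the model including an intercept, and the subgroup version relies on the model including an indicator variable for that subgroup. The function name `fit_logistic` and the data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by Newton's method (a form of IRLS)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the 2x2 matrix X^T W X with weights w = p * (1 - p).
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up, non-separable training data: x is a single driver, y is 0/1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
scores = [sigmoid(b0 + b1 * x) for x in xs]
print(sum(scores), sum(ys))  # the two totals agree up to rounding error
```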

Diagnostics: ROC Curve
• Area under the curve (AUC) tells you how well the model predicts. (Ideal AUC = 1)
• For logistic regression, ROC curve can help set classifier threshold
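An ROC curve can be traced by sweeping the classification threshold over the observed scores, and the AUC follows from the trapezoidal rule. This is a minimal sketch with made-up labels and scores; the function names are assumptions.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, hit in zip(labels, predicted) if hit and y == 1)
        fp = sum(1 for y, hit in zip(labels, predicted) if hit and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = defaulted) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(auc(roc_points(labels, scores)))
```

Each point on the curve corresponds to one candidate threshold, which is why the curve is useful for choosing one: pick the point whose trade-off between false positives and true positives suits the problem.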

Diagnostics: Plot the Histograms of Scores
[Figure: histograms of scores, annotated "good separation"]

Logistic Regression - Reasons to Choose (+) and Cautions (-)
Reasons to Choose (+)
• Explanatory value
 Relative impact of each variable on the outcome, in a more complicated way than linear regression
• Robust with redundant variables, correlated variables
 Lose some explanatory value
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
• Preserves the summary statistics of the training data
 "The probabilities equal the counts"
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
 Variable transformations and modeling variable interactions can alleviate this
 A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way
 Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
 For example, ZIP code

Module 4: Advanced Analytics – Theory and Methods
Lesson 4: Text Analysis

During this lesson the following topics are covered:


• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks

• High-dimensionality
 Every distinct term is a dimension
 Green Eggs and Ham: A 50-D problem!
• Data is unstructured

Text Analysis – Problem-solving Tasks
• Parsing
 Impose a structure on the unstructured/semi-structured text for
downstream analysis
• Search/Retrieval
 Which documents have this word or phrase?
 Which documents are about this topic or this entity?
• Text-mining
 "Understand" the content
 Clustering, classification
• Tasks are not an ordered list
 Does not represent process
 Set of tasks used appropriately depending on the problem
addressed
[Diagram: Parsing, Search & Retrieval, Text Mining]

Example: Brand Management

• Acme currently makes two products


 bPhone
 bEbook
• They have lots of competition. They want to maintain their
reputation for excellent products and keep their sales high.
• What is the buzz on Acme?
 Search for mentions of Acme products
 Twitter, Facebook, Review Sites, etc.
 What do people say?
 Positive or negative?
 What do people think is good or bad about the products?

Buzz Tracking: The Process
1. Monitor social networks, review sites for mentions of our products.
 Parse the data feeds to get actual content.
 Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews.
 Extract the relevant raw text.
 Convert the raw text into a suitable document representation.
 Index into our review corpus.
3. Sort the reviews by product.
 Classification (or "Topic Tagging")
4. Are they good reviews or bad reviews?
 Classification (sentiment analysis)
 We can keep a simple count here, for trend analysis.
5. Marketing calls up and reads selected reviews in full, for greater insight.
 Search/Information Retrieval.

Parsing the Feeds

1. Monitor social networks, review sites for mentions of our products

• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.

Regular Expressions

1. Monitor social networks, review sites for mentions of our products

• Regular Expressions (regexp) are a means for finding words,
strings or particular patterns in text.
• A match is a Boolean response. The basic use is to ask "does this
regexp match this string?"

regexp       matches                        Note
b[Pp]hone    bPhone, bphone                 "[Pp]" matches either "P" or "p"; b(P|p)hone is equivalent, since pipe "|" means "or"
bEb.*k       bEbook, bEbk, bEback …         "." matches any character; "*" repeats the preceding item zero or more times
^I love      A line starting with "I love"  "^" means start of a string
Acme$        A line ending with "Acme"      "$" means the end of a string
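The four kinds of pattern in the table can be tried out with Python's `re` module, used here as an assumed stand-in for whatever regexp engine is in use. In this syntax a character class such as `[Pp]` matches either letter, and "any run of characters" is written `.*`. The review strings are made up for illustration.

```python
import re

reviews = [
    "I love my bPhone",
    "my bphone is slow",
    "The bEbook is terrible",
    "Nothing beats Acme",
]

patterns = {
    "b[Pp]hone": "bPhone or bphone",
    "bEb.*k": "bEb, then any characters, then k",
    "^I love": "starts with 'I love'",
    "Acme$": "ends with 'Acme'",
}

for pattern, meaning in patterns.items():
    # re.search returns a match object or None; bool() gives the yes/no answer.
    hits = [text for text in reviews if re.search(pattern, text)]
    print(f"{pattern!r:14} ({meaning}): {hits}")

# Inside a character class "|" is a literal character, so the class [P|p]
# also accepts a pipe symbol, which is usually not what is wanted.
print(bool(re.search("b[P|p]hone", "b|hone")))
```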

Extract and Represent Text

2. Collect the reviews

Document Representation: A structure for analysis
• "Bag of words"
 common representation
 A vector with one dimension for every unique term in space
 term-frequency (tf): number of times a term occurs
 Good for basic search, classification
• Reduce Dimensionality
 Term Space – not ALL terms
 no stop words: "the", "a"
 often no pronouns
 Stemming
 "phone" = "phones"

Example: "I love LOVE my bPhone!"
Convert this to a vector in the term space:
acme       0
bebook     0
bPhone     1
fantastic  0
love       2
slow       0
terrible   0
terrific   0
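The bag-of-words conversion for the example sentence can be sketched as follows. Lowercasing every token (so that "love" and "LOVE" count as the same term) and the simple alphanumeric tokenizer are assumptions of this sketch; the term space is the one listed on the slide.

```python
import re
from collections import Counter

# Term space from the slide, lowercased so that "love" and "LOVE" coincide.
TERM_SPACE = ["acme", "bebook", "bphone", "fantastic",
              "love", "slow", "terrible", "terrific"]

def bag_of_words(text, term_space=TERM_SPACE):
    """Return the term-frequency vector of text over term_space."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Tokens outside the term space ("i", "my") are simply dropped.
    return [counts[term] for term in term_space]

vector = bag_of_words("I love LOVE my bPhone!")
print(dict(zip(TERM_SPACE, vector)))
```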

Document Representation - Other Features

2. Collect the reviews

• Feature:
 Anything about the document that is used for search or
analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities

Representing a Corpus (Collection of Documents)

2. Collect the reviews


• Reverse index
 For every possible feature, a list of all the documents that contain
that feature
• Corpus metrics
 Volume
 Corpus-wide term frequencies
 Inverse Document Frequency (IDF)
 more on this later
• Challenge: a Corpus is dynamic
 Index, metrics must be updated continuously
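A reverse index over terms can be built with a dictionary of sets. This is a minimal sketch on a made-up three-document corpus; whitespace tokenization and the function name `build_reverse_index` are assumptions.

```python
from collections import defaultdict

def build_reverse_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up review corpus keyed by document id.
documents = {
    1: "love my bphone",
    2: "bphone battery is terrible",
    3: "bebook is slow",
}
index = build_reverse_index(documents)
print(index["bphone"], index["is"])
```

Rebuilding the whole index on every change is only workable for a toy corpus; a dynamic corpus needs incremental updates, which this sketch does not attempt.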

Text Classification (I) - "Topic Tagging"

3. Sort the Reviews by Product


Not as straightforward as it seems

"The bPhone-5X has coverage everywhere. It's much less flaky than
my old bPhone-4G."

"While I love Acme's bPhone series, I've been quite disappointed by
the bEbook. The text is illegible, and it makes even the Kindle
look blazingly fast."

"Topic Tagging"
3. Sort the Reviews by Product
Judicious choice of features
 Product mentioned in title?
 Tweet, or review?
 Term frequency
 Canonicalize abbreviations
 "5X" = "bPhone-5X"

Text Classification (II) - Sentiment Analysis

4. Are they good reviews or bad reviews?

• Naïve Bayes is a good first attempt


• But you need tagged training data!
 THE major bottleneck in text classification
• What to do?
 Hand-tagging
 Clues from review sites
 thumbs-up or down, # of stars
 Cluster documents, then label the clusters

Search and Information Retrieval

5. Marketing calls up and reads selected reviews in full, for greater insight.

• Marketing calls up documents with queries:


 Collection of search terms
 "bPhone battery life"
 Can also be represented as "bag of words"
 Possibly restricted by other attributes
 within the last month
 from This Review Site

Quality of Search Results

5. Marketing calls up and reads selected reviews in full, for greater insight.
• Relevance
 Is this document what I wanted?
 Used to rank search results
• Precision
 What % of documents in the result are relevant?
• Recall
 Of all the relevant documents in the corpus, what % were returned
to me?
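Precision and recall are both ratios over the same set of hits (documents that are both returned and relevant). A minimal sketch, with hypothetical document ids:

```python
def precision_recall(returned, relevant):
    """Compute precision and recall from sets of document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 5 relevant documents in the corpus.
returned = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7]
print(precision_recall(returned, relevant))
```

Here two of the four returned documents are relevant (precision 0.5), and two of the five relevant documents were returned (recall 0.4).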

Computing Relevance

5. Marketing calls up and reads selected reviews in full, for greater insight.

 Call up all the documents that have any of the terms from the
query, and count how many times each term occurs:

Inverse Document Frequency (idf)

5. Marketing calls up and reads selected reviews in full, for greater insight.

idf_i = log(N / df_i)
 N: Number of documents in corpus
 df_i: Number of documents in the corpus in which term i occurs
• Measures term uniqueness in corpus
 "phone" vs. "brick"
• Indicates the importance of the term
 Search (relevance)
 Classification (discriminatory power)

TF-IDF and Modified Retrieval Algorithm
5. Marketing calls up and reads selected reviews in full, for greater insight.

• Term frequency – inverse document frequency (tf-idf)
 tf_document(term) * idf(term)
• Example query: "unbrick phone"
 A document with "unbrick" a few times is more relevant than a
document with "phone" many times
• Measure of Relevance with tf-idf
• Call up all the documents that have any of the terms from the
query, and sum up the tf-idf of each term:
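The tf-idf relevance score can be sketched as below: idf is the log of the corpus size over the number of documents containing the term, and a document's relevance is the sum of tf * idf over the query terms. The corpus is made up, documents are assumed to be already tokenized, and returning an idf of 0.0 for terms that appear nowhere is a choice of this sketch.

```python
import math
from collections import Counter

def idf(term, corpus):
    """log(N / number of documents containing term); 0.0 if no document has it."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def relevance(query, doc, corpus):
    """Sum of tf(term in doc) * idf(term) over the query terms."""
    tf = Counter(doc)
    return sum(tf[term] * idf(term, corpus) for term in query)

# Made-up corpus; each document is already tokenized into a list of terms.
corpus = [
    ["phone", "phone", "phone", "battery", "phone"],
    ["unbrick", "my", "phone", "unbrick"],
    ["phone", "case"],
    ["ebook", "screen"],
]
query = ["unbrick", "phone"]
ranked = sorted(range(len(corpus)),
                key=lambda i: relevance(query, corpus[i], corpus),
                reverse=True)
print(ranked)
```

The document that mentions "unbrick" twice ranks above the one that mentions "phone" four times, because "unbrick" is rare in the corpus and so carries a much larger idf.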

Other Relevance Metrics

5. Marketing calls up and reads selected reviews in full, for greater insight.

• "Authoritativeness" of source
 PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users

Effectiveness of Search and Retrieval

• Relevance metric
 important for precision, user experience
• Effective crawl, extraction, indexing
 important for recall (and precision)
 more important, often, than retrieval algorithm
• MapReduce
 Reverse index, corpus term frequencies, idf

Challenges - Text Analysis

• Challenge: finding the right structure for your unstructured data


• Challenge: very high dimensionality
• Challenge: thinking about your problem the right way
