
Exploratory Data Analysis For Business

Module 3

Linear Regression and Variable Selection

Meaning- Review Expectation, Variance, Frequentist Basics, Parameter Estimation, Linear Methods,
Point Estimate, Example Results, Theoretical Justification, R Scripts. Variable Selection- Variable
Selection for the Linear Model, R Scripts

Linear Regression
Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between a dependent variable and one or more independent features.
When there is a single independent feature, it is known as univariate (simple) linear
regression, and when there is more than one feature, it is known as multivariate (multiple)
linear regression.
Why is Linear Regression Important?
The interpretability of linear regression is a notable strength. The model’s equation
provides clear coefficients that elucidate the impact of each independent variable on the
dependent variable, facilitating a deeper understanding of the underlying dynamics. Its
simplicity is a virtue, as linear regression is transparent, easy to implement, and serves as a
foundational concept for more complex algorithms.
Linear regression is not merely a predictive tool; it forms the basis for various advanced
models. Techniques like regularization and support vector machines draw inspiration from
linear regression, expanding its utility. Additionally, linear regression is a cornerstone in
assumption testing, enabling researchers to validate key assumptions about the data.
Types of Linear Regression
There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:

Y = β0 + β1X

where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:

Y = β0 + β1X1 + β2X2 + … + βpXp

where:
 Y is the dependent variable
 X1, X2, …, Xp are the independent variables
 β0 is the intercept
 β1, β2, …, βp are the slopes
The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to
learn a function; if you then want to predict Y from a new, unknown X, this learned function can
be used. Since the target in regression is continuous, we need a function that predicts a
continuous Y given X as the independent features.
What is the best Fit Line?
Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a
minimum. There will be the least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).
Here Y is called the dependent or target variable and X is called the independent variable, also known as
the predictor of Y. There are many types of functions or models that can be used for regression. A
linear function is the simplest type of function. Here, X may be a single feature or multiple features
representing the problem.

Linear regression performs the task of predicting a dependent variable value (y) based on a given
independent variable (x); hence the name linear regression. In this example, X (input) is the
work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our
model.

Since different values of the weights (the coefficients of the line) produce different regression
lines, we use a cost function to compute the values that yield the best-fit line.

Hypothesis function in Linear Regression

As assumed earlier, the independent feature is the experience, i.e., X, and the respective
salary Y is the dependent variable. Let's assume there is a linear relationship between X and Y;
then the salary can be predicted using the hypothesis function:

ŷ = θ1 + θ2x

The model gets the best regression fit line by finding the best θ1 and θ2 values.

θ1: intercept

θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using our
model for prediction, it will predict the value of y for the input value of x.
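As a quick illustration (this is not one of the module's own R scripts; the data below are simulated and the variable names are made up), the following R sketch fits this model with lm() and reads θ1 and θ2 off the fitted coefficients:

set.seed(1)
experience <- runif(30, 0, 20)                               # years of work experience (X)
salary <- 25000 + 3000 * experience + rnorm(30, sd = 4000)   # salary (Y) with random noise
fit <- lm(salary ~ experience)                               # least squares fit
coef(fit)                                                    # (Intercept) = theta1, experience = theta2
predict(fit, newdata = data.frame(experience = 10))          # predicted salary for x = 10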
Input vector: X = (X1, X2, …, Xp)

Output Y is real-valued.

Predict Y from X by f(X) so that the expected loss function E[L(Y, f(X))] is minimized.

Expectation
Intuitively, the expectation of a random variable is its "average" value under its distribution.

Formally, the expectation of a random variable X, denoted E[X], is its Lebesgue integral with respect
to its distribution.

Lebesgue's theory defines integrals for a class of functions called measurable functions.

The expectation is monotone: if X ≥ Y, then E(X) ≥ E(Y).

Variance
The term variance refers to a statistical measurement of the spread between numbers in a data set.
More specifically, variance measures how far each number in the set is from the mean (average), and
thus from every other number in the set.

The variance of a random variable X is defined as:

Var(X) = E[(X − E[X])²]

and the variance obeys, for example, Var(X) = E[X²] − (E[X])² and Var(aX + b) = a² Var(X) for
constants a and b.
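A small R sketch (simulated data, purely illustrative) that approximates the expectation and variance by sampling, and checks the scaling property above:

set.seed(1)
x <- rnorm(1e5, mean = 2, sd = 3)   # draws from a distribution with E(X) = 2, Var(X) = 9
mean(x)                             # approximately E(X)
var(x)                              # approximately Var(X)
var(5 * x + 1)                      # approximately 25 * Var(X), since Var(aX + b) = a^2 Var(X)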


Frequentist Basics
The frequentist view defines the probability of an event as the proportion of times that the event
occurs in a sequence of possibly hypothetical trials.

The data x1, ... , xn is generally assumed to be independent and identically distributed (i.i.d.).

We would like to estimate some unknown value θ associated with the distribution from which the
data was generated.

In general, our estimate will be a function of the data (i.e., a statistic)

Example: Given the results of n independent flips of a coin, determine the probability p with which it
lands on heads.

Parameter Estimation
Parameter estimation is the process of computing a model's parameter values from measured data.
There are two types of estimates for each population parameter: the point estimate and confidence
interval (CI) estimate.

In practice, we often seek to select a distribution (model) corresponding to our data.

If our model is parameterized by some set of values, then this problem is that of parameter
estimation.

How can we obtain estimates in general? One answer: maximize the likelihood; the resulting estimate is
called the maximum likelihood estimate (MLE).
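A minimal R sketch of the coin-flip example (the flip data are made up for illustration): the numerical maximizer of the likelihood agrees with the closed-form MLE, the proportion of heads.

flips <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)                  # 1 = heads, 0 = tails
negloglik <- function(p) -sum(dbinom(flips, size = 1, prob = p, log = TRUE))
optimize(negloglik, interval = c(0.001, 0.999))$minimum   # numerical MLE of p
mean(flips)                                               # closed form: number of heads / n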

Discussion

Let's look at the setup for linear regression. We have an input vector:

This vector is p dimensional.

The output Y is a real value and is ordered.

We want to predict Y from X.

Before we actually do the prediction we have to train the function f(X). By the end of the training, I
would have a function f(X) to map every X into an estimated Y. Then, we need some way to measure
how good this predictor function is. This is measured by the expectation of a loss.

Why do we have a loss in the estimation?


Y is actually a random variable given X. For instance, consider predicting someone's weight based on
the person's height. People can have different weights given the same height. If you think of the
weight as Y and the height as X, Y is random given X. We, therefore, cannot have a perfect prediction
for every subject because f(X) is a fixed function, impossible to be correct all the time. The loss
measures how different the true Y is from your prediction.

Why do we have the overall loss expressed as an expectation?

The loss may be different for different subjects. In statistics, a common thing to do is to average the
losses over the entire population.

Squared loss: L(Y, f(X)) = (Y − f(X))²

We simply measure the difference between the true value and the prediction and square it so that
negative and positive differences are handled symmetrically.

Suppose the distribution of Y given X is known; then, under squared loss, the optimal predictor is:

f(X) = E(Y | X)
This is the conditional expectation of Y given X. The function E(Y | X) is called the regression
function.
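A rough simulation sketch (the model Y = 2X + noise is assumed purely for illustration) showing that the regression function E(Y | X) attains a smaller expected squared loss than another predictor:

set.seed(2)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)   # here the true regression function is E(Y | X) = 2x
mean((y - 2 * x)^2)       # average squared loss of the regression function (about 1)
mean((y - 1.5 * x)^2)     # any other predictor gives a larger average squared loss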

Linear Methods
A linear function is a function that represents a straight line on the coordinate plane. A linear
function is of the form f(x) = mx + b where 'm' and 'b' are real numbers.

 'm' is the slope of the line


 'b' is the y-intercept of the line
 'x' is the independent variable
 'y' (or f(x)) is the dependent variable

A linear function is an algebraic function. This is because it involves only algebraic operations.

Example: A movie streaming service charges a monthly fee of $4.50 and an additional fee of $0.35
for every movie downloaded. Now, the total monthly fee is represented by the linear function f(x) =
0.35x + 4.50, where x is the number of movies downloaded in a month.
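Written as a tiny R sketch (illustrative only):

monthly_fee <- function(x) 0.35 * x + 4.50   # x = number of movies downloaded
monthly_fee(10)                              # total monthly fee for 10 movies: 8 dollars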
The linear regression model:

f(X) = β0 + β1X1 + β2X2 + … + βpXp

This is just a linear combination of the measurements used to make predictions, plus a constant
(the intercept term). This is a simple approach; however, the true regression function may be quite
close to a linear function, in which case the model is a good approximation.

What if the model is not true?

1. It still might be a good approximation - the best we can do.


2. Sometimes, because of a lack of training data or of better algorithms, this is the most we
can estimate robustly from the data.

Comments on Xj

 We assume that these are quantitative inputs [or dummy indicator variables representing
levels of a qualitative input]
 We can also perform transformations of the quantitative inputs, e.g., log(•), √(•). In this case,
this linear regression model is still a linear function in terms of the coefficients to be
estimated. However, instead of using the original Xj, we have replaced them or augmented
them with the transformed values. Regardless of the transformations performed on Xj, f(X) is
still a linear function of the unknown parameters.
 Some basic expansions: squared or higher-order terms such as Xj² and Xj³, and interaction terms such as XjXk (see the R sketch below).
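An illustrative R sketch (simulated data, hypothetical variable names): even with log, squared, and interaction terms, the model is still fitted as a linear model in its coefficients.

set.seed(3)
d <- data.frame(x1 = runif(50, 1, 10), x2 = runif(50, 1, 10))
d$y <- 1 + 2 * log(d$x1) + 0.5 * d$x2^2 + rnorm(50)
fit <- lm(y ~ log(x1) + I(x2^2) + x1:x2, data = d)   # transformed inputs and an interaction
coef(fit)                                            # still linear in the unknown coefficients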

Below is a geometric interpretation of a linear regression.

For instance, if we have two variables, X1 and X2, and we predict Y by a linear combination of X1 and
X2, the predictor function corresponds to a plane (hyperplane) in the three-dimensional space of X1,
X2, Y. Given a pair of X1 and X2 we could find the corresponding point on the plane to decide Y by
drawing a perpendicular line to the hyperplane, starting from the point in the plane spanned by the
two predictor variables.
For accurate prediction, hopefully, the data will lie close to this hyperplane, but they won't lie exactly
on the hyperplane (unless perfect prediction is achieved). In such a plot, the red points are the
actual data points; they do not lie on the plane but are close to it.

How should we choose this hyperplane?

We choose a plane such that the total squared distance from the red points (real data points) to the
corresponding predicted points in the plane is minimized. Graphically, if we add up the squares of
the lengths of the line segments drawn from the red points to the hyperplane, the optimal
hyperplane should yield the minimum sum of squared lengths.

Estimation

The issue of finding the regression function E(Y|X) is converted to estimating βj, j = 0, 1, 2, …, p.

Remember in earlier discussions we talked about the trade-off between model complexity and
accurate prediction on training data. In this case, we start with a linear model, which is relatively
simple. The model complexity issue is taken care of by using a simple linear function. In basic linear
regression, there is no explicit action taken to restrict model complexity.

With the model complexity under check, the next thing we want to do is to have a predictor that fits
the training data well.

Let the training data be: {(x1, y1), (x2, y2), …, (xN, yN)}

Without knowing the true distribution for X and Y, we cannot directly minimize the expected loss.
Instead, we minimize the empirical (training) loss:

RSS(β) = Σ (yi − f(xi))², summed over the i = 1, …, N training samples.

This empirical loss measures how well the model fits the training data. It is called
the residual sum of squares, RSS.

The x's are known numbers from the training data.
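A small R sketch (simulated data) computing the residual sum of squares in a few equivalent ways:

set.seed(4)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)
sum((y - fitted(fit))^2)   # RSS computed directly from the fitted values
sum(resid(fit)^2)          # the same, via the residuals
deviance(fit)              # lm() reports the same quantity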

Notation
Here is the input matrix X of dimension N × (p + 1):

X = [ 1  x11  x12  …  x1p
      1  x21  x22  …  x2p
      …
      1  xN1  xN2  …  xNp ]

Earlier we mentioned that our training data had N points. In the example where we were
predicting the number of doctors, 101 metropolitan areas were investigated, so N = 101, and the
dimension is p = 3. The input matrix is augmented with a column of 1's (for the intercept term),
so the first column contains all 1's. Each row corresponds to one sample point, with entries
running over the p features. Hence, the input matrix X is of dimension N × (p + 1).

Output vector: y = (y1, y2, …, yN)

Again, this is taken from the training data set.
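An illustrative R sketch (simulated data, hypothetical variable names) assembling the N × (p + 1) input matrix, including the column of 1's, with model.matrix():

set.seed(5)
d <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
d$y <- 1 + d$x1 - d$x2 + 2 * d$x3 + rnorm(10)
X <- model.matrix(y ~ x1 + x2 + x3, data = d)   # first column is all 1's (intercept term)
y <- d$y                                        # output vector taken from the training data
dim(X)                                          # N = 10 rows, p + 1 = 4 columns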


Point Estimate

A point estimate is a single value calculated from a sample statistic that is used to estimate or
approximate an unknown population parameter. For example, the average height of a random
sample can be used to estimate the average height of a larger population.

Geometric Interpretation
Each column of X is a vector in an N-dimensional space (not the p + 1 dimensional feature
vector space). Here, we take out columns in matrix X, and this is why they live in N-
dimensional space. Values for the same variable across all of the samples are put in a vector.
I represent this input matrix as the matrix formed by the column vectors:

X = (x0, x1, …, xp)

Here x0 is the column of 1's for the intercept term. It turns out that the fitted output vector ŷ
is a linear combination of the column vectors x0, x1, …, xp. This means that ŷ lies in the
subspace spanned by these column vectors.

The dimension of the column vectors is N, the number of samples. Usually, the number of
samples is much bigger than the dimension p. The true y can be any point in this N-
dimensional space. What we want to find is an approximation constrained to the (p + 1)-
dimensional subspace such that the distance between the true y and the approximation is
minimized. It turns out that the residual sum of squares is equal to the squared
Euclidean distance between y and ŷ.

Geometrically speaking, let's look at a really simple example. What we want to find is a ŷ that lies
in the hyperplane defined or spanned by the column vectors x0, x1, …, xp. You would draw a
perpendicular line from y to that plane to find ŷ. This comes from a basic geometric fact: in
general, if you want to find some point in a subspace to represent some point in a higher-
dimensional space, the best you can do is to project that point onto your subspace.

The difference between your approximation and the true vector has to be perpendicular to
the subspace.
The geometric interpretation is very helpful for understanding coefficient shrinkage and
subset selection.
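A brief R sketch (simulated data) of this projection view: the least squares coefficients come from the normal equations, the fitted vector is the projection of y onto the column space of X, and the residual is orthogonal to every column.

set.seed(6)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))   # column of 1's plus two predictors
y <- drop(X %*% c(1, 2, -1) + rnorm(n))
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # least squares solution of the normal equations
y_hat <- X %*% beta_hat                     # fitted vector: projection of y onto span of X's columns
crossprod(y - y_hat, X)                     # numerically zero: residual is orthogonal to each column
cbind(beta_hat, coef(lm(y ~ X - 1)))        # matches lm() on the same design matrix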
Example Results
Let's take a look at some results for our earlier example about the number of active
physicians in a Standard Metropolitan Statistical Area. If I do the optimization using these
equations, I obtain the estimated coefficient values.

Let's take a look at some scatter plots. We plot one variable versus another. For instance, in
the upper left-hand plot, we plot the pairs of x1 and y. These are two-dimensional plots, each
variable plotted individually against any other variable.
In these plots you can see that x3 is almost a perfectly linear function of x1. This indicates
that there might be some problems when you do the optimization: if x3 were a perfectly linear
function of x1, then when you solve the linear equations to determine the β's, there would be
no unique solution. The scatter plots help to discover such potential problems.
In practice, because there is always measurement error, you rarely get a perfect linear
relationship. However, you might get something very close. In this case, the matrix XᵀX will
be close to singular, causing large numerical errors in computation. Therefore, we would like
to have predictor variables that are not so strongly correlated.
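A short R sketch (simulated data) of this situation, where one predictor is almost an exact linear function of another and XᵀX becomes nearly singular:

set.seed(7)
x1 <- rnorm(100)
x3 <- 2 * x1 + rnorm(100, sd = 0.01)   # x3 is almost a perfect linear function of x1
cor(x1, x3)                            # correlation very close to 1
X <- cbind(1, x1, x3)
kappa(crossprod(X))                    # huge condition number: X'X is close to singular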

Theoretical Justification
If the Linear Model Is True
Here is some theoretical justification for why we do parameter estimation using least
squares.
If the linear model is true, i.e., if the conditional expectation of Y given X indeed is a linear
function of the Xj's, and Y is the sum of that linear function and an independent Gaussian
noise, we have the following properties for least squares estimation.

1. The least squares estimate of β is unbiased: E(β̂) = β.

2. The variance of β̂ is Var(β̂) = σ²(XᵀX)⁻¹, where σ² is the variance of the Gaussian noise.

You should see that the higher σ² is, the higher the variance of β̂ will be. This is very natural.
Basically, if the noise level is high, you're bound to have a large variance in your estimation.
But then, of course, it also depends on XᵀX. This is why in experimental design, methods are
developed to choose X so that the variance tends to be small.

Note that β̂ is a vector and hence its variance is a covariance matrix of size (p + 1) × (p + 1).
The covariance matrix not only gives the variance of every individual β̂j, but also the
covariance for any pair β̂j and β̂k, j ≠ k.
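In R, this covariance matrix can be inspected with vcov(); a sketch on simulated data:

set.seed(8)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50, sd = 2)
fit <- lm(y ~ x1 + x2, data = d)
vcov(fit)               # estimate of sigma^2 * (X'X)^(-1), a 3 x 3 matrix here
sqrt(diag(vcov(fit)))   # standard errors of the individual coefficients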
Gauss-Markov Theorem
This theorem says that the least squares estimator is the best linear unbiased estimator.
Assume that the linear model is true. Any linear combination of the parameters β0, …, βp
gives a new parameter, denoted θ = aᵀβ.

We want to estimate θ, and the least squares estimate of θ is:

θ̂ = aᵀβ̂

which is linear in y. The Gauss-Markov theorem states that, for any other linear unbiased
estimator cᵀy of θ, the estimator obtained from the least squares fit is guaranteed to have a
variance no larger than that of cᵀy:

Var(aᵀβ̂) ≤ Var(cᵀy)

Keep in mind that you're only comparing with linear unbiased estimators. If the estimator is
not linear, or is not unbiased, then it is possible to do better in terms of squared loss.

Variable Selection for the Linear Model


So in linear regression, the more features Xj the better (since RSS keeps going down)? NO!
Carefully selected features can improve model accuracy. But adding too many can lead to
overfitting:

 Overfitted models describe random error or noise instead of any underlying relationship.
 They generally have poor predictive performance on test data.
 For instance, we can use a 15th-degree polynomial function to fit a set of data points so
that the fitted curve goes nicely through all of them. However, a brand-new
dataset collected from the same population may not fit this particular curve well at
all.
 Sometimes when we do prediction we may not want to use all of the predictor
variables (sometimes p is too big). For example, a DNA array expression example has
a sample size (N) of 96 but a dimension (p) of over 4000!
In such cases, we would select a subset of predictor variables to perform regression or
classification, e.g., choosing the k predictor variables (out of the total of p) that yield the
minimum RSS.
Variable Selection for the Linear Regression Model

When prediction is of interest, common criteria for comparing candidate models include:

 F-test;
 Likelihood ratio test;
 AIC, BIC, etc.;
 Cross-validation.
F-test

The residual sum-of-squares RSS(β) is defined as:

RSS(β) = Σ (yi − f(xi))², summed over the i = 1, …, N training samples.

Let RSS1 correspond to the bigger model with p1 + 1 parameters, and RSS0 correspond to the
nested smaller model with p0 + 1 parameters.
The F statistic measures the reduction of RSS per additional parameter in the bigger model:

F = [(RSS0 − RSS1) / (p1 − p0)] / [RSS1 / (N − p1 − 1)]

Under the normal error assumption, the F statistic has an F distribution with (p1 − p0) and
(N − p1 − 1) degrees of freedom.
For linear regression models, an individual t-test is equivalent to an F-test for dropping a
single coefficient βj from the model.
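A minimal R sketch (simulated data) of the F-test comparing a smaller nested model against a bigger one with anova():

set.seed(9)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1 + 2 * d$x1 + rnorm(80)
fit0 <- lm(y ~ x1, data = d)             # smaller nested model (p0 + 1 parameters)
fit1 <- lm(y ~ x1 + x2 + x3, data = d)   # bigger model (p1 + 1 parameters)
anova(fit0, fit1)                        # F statistic and p-value for adding x2 and x3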

Likelihood Ratio Test (LRT)


Let L1 be the maximum value of the likelihood of the bigger model.
Let L0 be the maximum value of the likelihood of the nested smaller model.
The likelihood ratio λ = L0 / L1 is always between 0 and 1; the less plausible the restrictive
assumptions underlying the smaller model, the smaller λ will be.

The likelihood ratio test statistic (deviance), −2 log(λ), approximately follows a chi-squared
distribution with degrees of freedom equal to the difference in the number of parameters
between the two models.
So we can test the fit of the 'null' model M0 against a more complex model M1.
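A rough R sketch (simulated data) of the likelihood ratio test for two nested linear models, using logLik() and the chi-squared approximation:

set.seed(10)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 1 + 2 * d$x1 + rnorm(80)
fit0 <- lm(y ~ x1, data = d)                          # smaller 'null' model M0
fit1 <- lm(y ~ x1 + x2, data = d)                     # bigger model M1
lr <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))   # deviance -2 log(lambda)
pchisq(lr, df = 1, lower.tail = FALSE)                # approximate p-value; df = 1 extra parameter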

Akaike Information Criterion (AIC)


Use of the LRT requires that our models are nested. Akaike (1971/74) proposed a more
general measure of "model badness:"

AIC = −2 log L + 2p

where L is the maximized likelihood and p is the number of parameters.


Faced with a collection of putative models, the 'best' (or 'least bad') one can be chosen by
seeing which has the lowest AIC.

The scale is statistical, not scientific, but the trade-off is clear; we must improve the log-
likelihood by one unit for every extra parameter.
AIC is asymptotically equivalent to leave-one-out cross-validation.

Bayes Information Criterion (BIC)


AIC tends to select models that overfit.
Another information criterion, which penalizes complex models more severely, is:

BIC = −2 log L + p log(N)

where N is the number of observations. This is also known as the Schwarz criterion, due to
Schwarz (1978), where an approximate Bayesian derivation is given.
Lowest BIC is taken to identify the 'best model', as before.
BIC tends to favor simpler models than those chosen by AIC.
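An illustrative R sketch (simulated data) comparing candidate models by AIC and BIC:

set.seed(11)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)
fits <- list(m1 = lm(y ~ x1, data = d),
             m2 = lm(y ~ x1 + x2, data = d),
             m3 = lm(y ~ x1 + x2 + x3, data = d))
sapply(fits, AIC)   # the lowest AIC marks the 'least bad' model
sapply(fits, BIC)   # BIC penalizes the extra parameters more severely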

Stepwise Selection
AIC and BIC also allow stepwise model selection.
An exhaustive search for the subset may not be feasible if p is very large. There are two main
alternatives:
Forward stepwise selection:
First, we approximate the response variable y with a constant (i.e., an intercept-only
regression model).
Then we gradually add one more variable at a time (or add main effects first, then
interactions).
At each step we choose, from the remaining variables, the one that yields the best prediction
accuracy when added to the pool of already selected variables. This accuracy
can be measured by the F-statistic, LRT, AIC, BIC, etc.
For example, if we have 10 predictor variables, first we would approximate y with a constant,
and then use one variable out of the 10 (I would perform 10 regressions, each time using a
different predictor variable; for every regression I have a residual sum of squares; the
variable that yields the minimum residual sum of squares is chosen and put in the pool of
selected variables). We then proceed to choose the next variable from the 9 left, etc.
Backward stepwise selection: This is similar to forward stepwise selection, except that we
start with the full model using all the predictors and gradually delete variables one at a time.
There are various methods developed to choose the number of predictors, for instance, the
F-ratio test. We stop forward or backward stepwise selection when no predictor produces an
F-ratio statistic greater than some threshold.
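A sketch of forward and backward stepwise selection in R with step(), which uses AIC by default (simulated data, hypothetical variable names):

set.seed(12)
d <- data.frame(matrix(rnorm(100 * 5), 100, 5))
names(d) <- paste0("x", 1:5)
d$y <- 1 + 2 * d$x1 - d$x3 + rnorm(100)
null_fit <- lm(y ~ 1, data = d)   # start from the intercept-only model
full_fit <- lm(y ~ ., data = d)   # model with all five predictors
forward  <- step(null_fit, scope = formula(full_fit), direction = "forward", trace = 0)
backward <- step(full_fit, direction = "backward", trace = 0)
formula(forward)                  # variables chosen by forward selection
formula(backward)                 # variables kept by backward elimination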
