Essentials of Linear Regression in Python
Learn what constitutes a regression problem and how a
linear regression algorithm works in Python.
The field of Data Science has progressed like nothing before. It incorporates
many different domains, such as Statistics, Linear Algebra, Machine Learning,
and Databases, and merges them in the most meaningful way possible. But at its
core, what makes this domain one of the most exciting ones?
- The powerful statistical algorithms
As the title of this tutorial suggests, you will cover Linear Regression in
detail here. Specifically, you will cover:
What a regression problem is and how a hypothesis is defined for it
Cost functions
Training a linear regression model with Ordinary Least Squares, gradient descent, and regularization
Implementing linear regression in Python with Statsmodels and scikit-learn
Before diving deep into the theory behind Linear Regression, let's get a
clear view of the term regression.
So, consider a dataset of house living areas and the corresponding prices: X is
the set of values that correspond to the living areas of various houses (also
considered the space of input values), and y is the set of prices of the
respective houses. Note that predicted values of y are produced by h, the
function that maps the X values to y (often called the predictor).
For historical reasons, this h is referred to as a hypothesis function. Keep in
mind that this dataset has only one feature, i.e., the living areas of various
houses, and consider it a toy dataset for the sake of understanding.
Note that the predicted values here are continuous in nature. So, your
ultimate goal is, given a training set, to learn a function h : X → Y so
that h(x) is a "good" predictor for the corresponding value of y. Also, keep in
mind that the domain of values that both X and Y accept are all real numbers;
you can write this as X = Y = ℝ, where ℝ is the set of all real numbers.
A pair (x(i), y(i)) is called a training example. You can define the training set as
{(x(i), y(i)) ; i = 1,...,m} (in case the training set contains m instances and there
is only one feature x in the dataset).
That was a bit of mathematics so that you don't go wrong even in the
simplest of things. According to Han, Kamber, and Pei:
"In general, these methods are used to predict the value of a response
(dependent) variable from one or more predictor (independent) variables,
where the variables are numeric." - Data Mining: Concepts and Techniques
(3rd edn.)
As simple as that!
So, in the course of understanding a typical regression problem, you also saw
how to define a hypothesis for it. Great going; you have set the mood just
right! Now, you will dive straight into the mechanics of Linear Regression.
Linear regression is perhaps one of the most well-known and well-understood
algorithms in statistics and machine learning. It was developed in the field of
statistics and is studied as a model for understanding the relationship between
numerical input and output variables, but over the course of time it has become
an integral part of the modern machine learning toolbox.
Let's use a toy dataset for it. You will use the same house price prediction
dataset to investigate this, but this time with two features. The task remains
the same, i.e., predicting the house price.
(Dataset source: Andrew Ng's lecture notes.)
As mentioned earlier, now the x’s are two-dimensional which means your
dataset contains two features. For instance, x1(i) is the living area of the i-th
house in the training set, and x2(i) is its number of bedrooms.
To perform regression, you must decide the way you are going to represent h.
As an initial choice, let’s say you decide to approximate y as a linear function
of x:
hθ(x) = θ0 + θ1x1 + θ2x2
Here, the θi's are the parameters (also called weights) parameterizing the
space of linear functions mapping from X to Y. In a simpler sense, these
parameters are used for accurately mapping X to Y. But to keep things
simple for your understanding, you will drop the θ subscript in hθ(x) and
write it simply as h(x). To simplify your notation even further, you will also
introduce the convention of letting x0 = 1 (this is the intercept term), so that
h(x) = θ0x0 + θ1x1 + ... + θnxn = θᵀx,
where on the right-hand side above you are considering θ and x both as
vectors, and here n is the number of input features (not counting x0).
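To make this concrete, here is a minimal NumPy sketch of the vectorized hypothesis, using made-up θ values and a single input instance with two features (living area and number of bedrooms); the numbers are purely illustrative:

import numpy as np

# Purely illustrative parameter vector: theta = [theta_0, theta_1, theta_2]
theta = np.array([50.0, 0.1, 20.0])

# One input instance: x_0 = 1 (intercept term), x_1 = living area, x_2 = bedrooms
x = np.array([1.0, 2104.0, 3.0])

# h(x) = theta^T x
h_x = theta.dot(x)
print(h_x)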
But the main question that arises at this point is: how do you pick or
learn the parameters θ? You cannot change your input instances in order to
predict the prices; you have only these θ parameters to tune/adjust.
One prominent method seems to be to make h(x) close to y, at least for the
training examples you have. To understand this more formally, let's try
defining a function that determines, for each value of the θ’s, how close the
h(x(i))’s are to the corresponding y(i) ’s. The function should look like the
following:
J(θ) = ½ Σ (hθ(x(i)) − y(i))², summed over all m training examples. (Source: StackOverflow)
To understand the reason behind taking the squared value instead of the
absolute value, consider the squared term an advantage for the operations
performed later while training the regression model. But if you want
to dig deeper, feel free to help yourself.
You just saw one of the most important formulas in the world of Data
Science/Machine Learning/Statistics. It is called the cost function.
This is an essential formulation because not only does it give rise to the next
evolution of linear regression (Ordinary Least Squares), but it also lays the
foundations of a whole class of linear modeling algorithms
(remember the term Generalized Linear Models).
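To make the formula concrete, here is a small NumPy sketch of this least-squares cost for a toy training set; the data and θ values are made up for illustration:

import numpy as np

def cost(theta, X, y):
    # J(theta) = 0.5 * sum over i of (h_theta(x_i) - y_i)^2
    errors = X.dot(theta) - y
    return 0.5 * np.sum(errors ** 2)

# Toy training set: the first column is x_0 = 1 (the intercept term)
X = np.array([[1.0, 2104.0],
              [1.0, 1600.0],
              [1.0, 2400.0]])
y = np.array([400.0, 330.0, 369.0])

print(cost(np.array([0.0, 0.2]), X, y))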
It is important to note that linear regression can often be divided into two
basic forms:
Simple Linear Regression (SLR), which deals with just two variables
(the one you saw at first)
Multiple Linear Regression (MLR), which deals with more than two variables
(the house-price example with two features)
These things are very straightforward but can often cause confusion.
You have already laid your foundations of linear regression. Now you will
study the ways of estimating the parameters you saw in the above section.
This estimation of parameters is essentially known as training a linear
regression model. There are many methods to train a linear regression
model, Ordinary Least Squares (OLS) being the most popular among them.
So, it is common to refer to a linear regression model trained using OLS as
Ordinary Least Squares Linear Regression, or just Least Squares Regression.
Note that the parameters in this context are also called model coefficients.
In this section, you will take a brief look at some techniques to train a
linear regression model. The first is Gradient Descent, which iteratively
updates each parameter θj in the direction that reduces the cost J(θ):
θj := θj − α ∂J(θ)/∂θj
(Source: ml-cheatsheet)
Intuitively speaking, the above formula denotes the small change that
happens in J w.r.t the θj parameter and how it affects the initial value of θj.
But look carefully: there is a partial derivative to deal with here, and the
full derivation process is outside the scope of this tutorial.
Just note that for a single training example, this gives the update rule:
θj := θj + α (y(i) − hθ(x(i))) xj(i)
(Source: ml-cheatsheet)
The rule is called the LMS update rule (LMS stands for “least mean squares”)
and is also known as the Widrow-Hoff learning rule.
"The Ordinary Least Squares procedure seeks to minimize the sum of the
squared residuals. This means that given a regression line through the data
we calculate the distance from each data point to the regression line, square
it, and sum all of the squared errors together. This is the quantity that
ordinary least squares seeks to minimize." - Jason Brownlee
More briefly speaking, it works by starting with random values for each
coefficient. The sum of the squared errors is calculated for each pair of input
and output values. A learning rate is used as a scale factor, and the
coefficients are updated in the direction towards minimizing the error. The
process is repeated until a minimum sum squared error is achieved or no
further improvement is possible.
The term α (learning rate) is very important here since it determines the size
of the improvement step to take on each iteration of the procedure.
The method that looks at every example in the entire training set on
every step is called batch gradient descent.
The method where you repeatedly run through the training set, and
each time you encounter a training example, you update the parameters
according to the gradient of the error with respect to that single training
example only. This algorithm is called stochastic gradient
descent (also incremental gradient descent).
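To see the difference in code, here is a minimal sketch of both update schemes for linear regression, using made-up data and a hand-picked learning rate; it illustrates the LMS update rule above and is not meant to be a production-grade optimizer:

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, epochs=1000):
    # Update theta using the gradient over the *whole* training set each step
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = X.dot(theta) - y          # h_theta(x) - y for all examples
        theta -= alpha * X.T.dot(errors)   # batch LMS update
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=500):
    # Update theta after looking at a *single* training example at a time
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = x_i.dot(theta) - y_i
            theta -= alpha * error * x_i   # LMS (Widrow-Hoff) update
    return theta

# Made-up data: y = 2 + 3x, with a column of ones for the intercept term
X = np.c_[np.ones(5), np.arange(5)]
y = 2 + 3 * np.arange(5)
print(batch_gradient_descent(X, y))
print(stochastic_gradient_descent(X, y))

On this toy data, both versions recover parameters close to the true intercept (2) and slope (3); the difference is that batch gradient descent uses the whole training set for every update, while stochastic gradient descent updates after each individual example.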
That is all for gradient descent in this tutorial. Now, take a look at
another way of optimizing a linear regression model, i.e., Regularization.
Regularization:
DataCamp already has a good introductory article on Regularization. You
might want to check that out before proceeding with this one.
There are two variants of regularization procedures for linear regression:
Lasso Regression: adds a penalty term which is equivalent to the absolute
value of the magnitude of the coefficients (also called L1 regularization).
Ridge Regression: adds a penalty term which is equivalent to the square of
the magnitude of the coefficients (also called L2 regularization).
In both cases, λ is the constant factor that you add to control how strongly
the penalty term is weighted (the regularization strength).
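As a quick illustration of both variants, here is a hedged sketch with scikit-learn on made-up data; note that scikit-learn calls the regularization constant alpha rather than λ:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: the second feature is irrelevant (true coefficient 0)
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.dot(np.array([1.5, 0.0, -2.0])) + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can shrink some coefficients exactly to 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: shrinks coefficients towards 0

print(lasso.coef_)
print(ridge.coef_)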
Now, you will implement a simple linear regression in Python for yourself.
It should be fun!
from sklearn import datasets
data = datasets.load_boston()
Printing data.DESCR shows the dataset description: 13 predictor attributes
(per-capita crime rate by town, nitric oxides concentration in parts per 10
million, average number of rooms per dwelling, proportion of owner-occupied
units built prior to 1940, weighted distances to Boston employment centres,
proportion of blacks by town, % lower status of the population, and so on) and
the target MEDV, the median value of owner-occupied homes in $1000's. The data
come from http://archive.ics.uci.edu/ml/datasets/Housing and were used in
Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean
air', J. Environ. Economics & Management, and in Quinlan, R. (1993),
Combining Instance-Based and Model-Based Learning, Morgan Kaufmann.
Now, before applying linear regression, you will have to prepare the data and
segregate the features and the label of the dataset. MEDV (median home value)
is the label in this case. You can access the features of the dataset
using the feature_names attribute.
import numpy as np
import pandas as pd

df = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.DataFrame(data.target, columns=["MEDV"])   # the label (median home value)
At this point, you need to consider a few important things about linear
regression before applying it to the data. You could have studied this earlier
in this tutorial, but studying these factors at this particular point of time will
help you get the real feel.
Linear Assumption: Linear regression is best employed to capture the
relationship between the input variables and the outputs. In order to do
so, linear regression assumes this relationship to be linear (which might
not be the case all the time). But you can always transform your data so
that a linear relationship is maintained. For example, if your data has an
exponential relationship, you can apply log-transform to make the
relationship linear.
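For instance, here is a tiny hypothetical sketch of that idea: the raw values grow exponentially, but after a log-transform the relationship with x is roughly linear:

import numpy as np

# Hypothetical data: y grows roughly exponentially with x (y ≈ e**x)
x = np.arange(1, 6)
y = np.array([2.72, 7.39, 20.09, 54.60, 148.41])

# After the log-transform, log(y) is roughly a linear function of x
print(np.round(np.log(y), 2))   # approximately [1., 2., 3., 4., 5.]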
Let's get hands-on now. To keep things simple, you will just take RM,
the average number of rooms feature, for now. Note
that Statsmodels does not add a constant term (recall the factor θ0) by
default. Let’s see it first without the constant term in your regression model:
import statsmodels.api as sm

X = df["RM"]
y = target["MEDV"]
model = sm.OLS(y, X).fit()   # fit OLS without a constant (intercept) term
predictions = model.predict(X)
# Print out the statistics
model.summary()
What an output! It is way too big to digest when you are seeing it for
the first time. Let's go through the most critical points step by step:
There is a 95% confidence interval for the RM coefficient, which means the
model predicts, with 95% confidence, that the true coefficient of RM lies
between 3.548 and 3.759.
These are the most important points you should take care of for the time
being (and you can ignore the warning as well).
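If you want to extract those intervals programmatically rather than reading them off the summary table, the fitted statsmodels results object exposes them directly (a small sketch, assuming the model fitted above):

# Estimated coefficients and their 95% confidence intervals
print(model.params)
print(model.conf_int(alpha=0.05))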
A constant term can easily be added to the linear regression model. You can
do it with X = sm.add_constant(X) (where X is the name of the dataframe
containing the input, i.e., the independent variables).
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()   # refit with the intercept term included
predictions = model.predict(X)
model.summary()
It can be clearly seen that the addition of the constant term has a direct effect
on the coefficient term. Without the constant term, your model was passing
through the origin, but now you have a y-intercept at -34.67. The slope
of the RM predictor has also changed, from 3.634 to 9.1021 (the coef of RM).
Now you will fit a regression model with more than one variable: you will
add LSTAT (the percentage of lower status of the population) along with
the RM variable. The model training (fitting) procedure remains exactly the
same as before:
X = df[["RM", "LSTAT"]]
y = target["MEDV"]
model = sm.OLS(y, X).fit()   # same fitting procedure, now with two features
predictions = model.predict(X)
model.summary()
Let's interpret this one now:
Houses having a small number of rooms are likely to have low price
values.
In areas where the status of the population is lower, the house prices
are likely to be low.
That was an example of both single and multiple linear regression
in Statsmodels. Your homework will be to investigate and interpret the
results with further features.
Next, let's see how linear regression can be implemented using your very
own scikit-learn. You already have the dataset imported, but you will
have to import the linear_model module.
from sklearn import linear_model

X = df
y = target["MEDV"]
lm = linear_model.LinearRegression()
model = lm.fit(X, y)
predictions = lm.predict(X)   # predictions for the training data
print(predictions[0:5])
If you want to know some more details (such as the R-squared, coefficients,
etc.) of your model, you can easily do so.
lm.score(X,y)
0.7406077428649427
lm.coef_
array([ ...,  2.68856140e+00, ...,
       -1.47575880e+00, ...,
        9.39251272e-03, ...,
       -5.25466633e-01])
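The full array holds 13 coefficients, one per feature, but on its own it is unlabeled. A small sketch like the following (assuming the df and lm objects from above) pairs each coefficient with its feature name:

import pandas as pd

# Pair each coefficient with the feature it belongs to, and show the intercept
coefficients = pd.Series(lm.coef_, index=df.columns)
print(coefficients)
print("Intercept:", lm.intercept_)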
Wrap up!
Beautiful! You have made it to the end. Covering one of the simplest and the
most fundamental algorithms was not that easy, but you did it pretty well.
You not only got familiar with simple linear regression but also studied
many fundamental aspects, terms, and factors of machine learning. You did an
in-depth case study in Python as well.
This tutorial can also be treated as a motivation for you to implement Linear
Regression from scratch. Following are the brief steps if anyone wants to do
it for real:
Calculate covariance
Estimate coefficients
Make predictions
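If you take up that challenge, here is a minimal sketch of those steps for simple linear regression (one feature), on made-up data; the coefficient estimates use the classic covariance-over-variance formula:

import numpy as np

def fit_simple_linear_regression(x, y):
    # Estimate slope as covariance(x, y) / variance(x), and the intercept from the means
    x_mean, y_mean = np.mean(x), np.mean(y)
    covariance = np.sum((x - x_mean) * (y - y_mean))
    variance = np.sum((x - x_mean) ** 2)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return intercept, slope

def predict(intercept, slope, x):
    return intercept + slope * x

# Made-up data: y is roughly 1 + 2x plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)                              # estimated intercept and slope
print(predict(b0, b1, np.array([6.0])))    # prediction for a new x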
Following are some references that were used in order to prepare this tutorial: