How to Perform Simple Linear Regression in Python (Step-by-Step)
Simple linear regression is a technique that we can use to understand the relationship between a
single explanatory variable and a single response variable.
This technique finds a line that best “fits” the data and takes on the following form:
ŷ = b0 + b1x
where:
ŷ: the predicted value of the response variable
b0: the intercept (the predicted value of y when x is 0)
b1: the slope (the average change in y for a one-unit increase in x)
x: the value of the explanatory variable
This equation can help us understand the relationship between the explanatory and response
variable, and (assuming it’s statistically significant) it can be used to predict the value of a
response variable given the value of the explanatory variable.
This tutorial provides a step-by-step explanation of how to perform simple linear regression in
Python.
Step 1: Create the Data
For this example, we’ll create a fake dataset that contains the following two variables for 15
students:
hours: the total number of hours studied
score: the exam score received
We’ll attempt to fit a simple linear regression model using hours as the explanatory variable
and exam score as the response variable.
The following code shows how to create this fake dataset in Python:
import pandas as pd
#create dataset
df = pd.DataFrame({'hours': [1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11,
12, 12, 14],
'score': [64, 66, 76, 73, 74, 81, 83, 82, 80,
88, 84, 82, 91, 93, 89]})
#view first six rows of dataset
df[0:6]
hours score
0 1 64
1 2 66
2 4 76
3 5 73
4 5 74
5 6 81
Step 2: Visualize the Data
Before we fit a simple linear regression model, we should first visualize the data to gain an
understanding of it.
First, we want to make sure that the relationship between hours and score is roughly linear, since
that is an underlying assumption of simple linear regression.
We can create a simple scatterplot to view the relationship between the two variables:
import matplotlib.pyplot as plt

#create scatterplot of hours studied vs. exam score
plt.scatter(df.hours, df.score)
plt.title('Hours studied vs. Exam Score')
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()
From the plot we can see that the relationship does appear to be linear. As hours increases, score
tends to increase as well in a linear fashion.
Next, we can create a boxplot to visualize the distribution of exam scores and check for outliers.
By default, a boxplot flags an observation as an outlier if it lies more than 1.5 times the
interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range
below the first quartile (Q1).
df.boxplot(column=['score'])
There are no tiny circles in the boxplot, which means there are no outliers in our dataset.
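If we wanted to double-check this numerically, we could compute the 1.5 × IQR fences directly (a minimal sketch; it assumes the df created above and pandas’ default quantile interpolation):

#compute quartiles and interquartile range of the exam scores
q1 = df['score'].quantile(0.25)
q3 = df['score'].quantile(0.75)
iqr = q3 - q1

#flag any scores outside the 1.5 * IQR fences
outliers = df[(df['score'] < q1 - 1.5 * iqr) | (df['score'] > q3 + 1.5 * iqr)]
print(outliers)   #empty DataFrame, so no outliers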
Step 3: Perform Simple Linear Regression
Once we’ve confirmed that the relationship between our variables is linear and that there are no
outliers present, we can proceed to fit a simple linear regression model using hours as the
explanatory variable and score as the response variable:
Note: We’ll use the OLS() function from the statsmodels library to fit the regression model.
import statsmodels.api as sm
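Building on the dataset from Step 1 and the import above, the fitting step might look something like this (a minimal sketch; the variable names y and x are simply illustrative):

#define response variable
y = df['score']

#define explanatory variable
x = df[['hours']]

#add a constant column so the model estimates an intercept
x = sm.add_constant(x)

#fit the simple linear regression model with OLS
model = sm.OLS(y, x).fit()

#view the model summary
print(model.summary())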
From the model summary we can see that the fitted regression equation is:
Exam score = 65.334 + 1.9824*(hours)
This means that each additional hour studied is associated with an average increase in exam
score of 1.9824 points. And the intercept value of 65.334 tells us the average expected exam
score for a student who studies zero hours.
We can also use this equation to find the expected exam score based on the number of hours that
a student studies. For example, a student who studies for 10 hours is expected to receive an exam
score of 85.158:
Exam score = 65.334 + 1.9824*(10) = 85.158
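The same prediction could also be obtained from the fitted model object rather than by hand (a sketch; the 'const' column matches the constant added in the fitting sketch in Step 3):

#predict the exam score for a student who studies 10 hours
new_x = pd.DataFrame({'const': [1.0], 'hours': [10]})
print(model.predict(new_x))   #about 85.16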
Here is how to interpret a few other key values from the model summary:
P>|t|: This is the p-value associated with the model coefficients. Since the p-value
for hours (0.000) is well below .05, we can say that there is a statistically
significant association between hours and score.
R-squared: This number tells us the percentage of the variation in the exam scores that can
be explained by the number of hours studied. In general, the larger the R-squared value of
a regression model, the better the explanatory variables are able to predict the value of the
response variable. In this case, 83.1% of the variation in scores can be explained by hours
studied.
F-statistic & p-value: The F-statistic (63.91) and the corresponding p-value (2.25e-06)
tell us the overall significance of the regression model, i.e. whether explanatory variables
in the model are useful for explaining the variation in the response variable. Since the p-
value in this example is less than .05, our model is statistically significant and hours is
deemed to be useful for explaining the variation in score.
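These values don’t have to be read off the printed summary; they are also available as attributes of the fitted results object (a quick sketch, assuming the model from Step 3):

#p-values for the intercept and hours coefficients
print(model.pvalues)

#R-squared of the model
print(model.rsquared)

#F-statistic and its p-value
print(model.fvalue)
print(model.f_pvalue)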
Step 4: Create Residual Plots
After we’ve fit the simple linear regression model to the data, the last step is to create residual
plots.
One of the key assumptions of linear regression is that the residuals of a regression model are
roughly normally distributed and are homoscedastic at each level of the explanatory variable. If
these assumptions are violated, then the results of our regression model could be misleading or
unreliable.
To verify that these assumptions are met, we can create the following residual plots:
Residual vs. fitted values plot: This plot is useful for confirming homoscedasticity. The x-axis
displays the fitted values and the y-axis displays the residuals. As long as the residuals appear to
be randomly and evenly distributed throughout the chart around the value zero, we can assume
that homoscedasticity is not violated:
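One simple way to build this plot is to scatter the model’s residuals against its fitted values (a sketch using matplotlib; the fittedvalues and resid attributes come from the fitted statsmodels results object):

import matplotlib.pyplot as plt

#plot residuals against fitted values
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()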
Since the residuals appear to be randomly scattered around zero, this is an indication that
heteroscedasticity is not a problem in this model.
Q-Q plot: This plot is useful for determining if the residuals follow a normal distribution. If the
data values in the plot fall along a roughly straight line at a 45-degree angle, then the data is
normally distributed:
#define residuals
res = model.resid
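From these residuals, the Q-Q plot itself can be drawn with statsmodels’ qqplot helper (a sketch; line='45' adds the 45-degree reference line mentioned above):

#create Q-Q plot of the residuals
fig = sm.qqplot(res, fit=True, line='45')
plt.show()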
Since the residuals are normally distributed and homoscedastic, we’ve verified that the
assumptions of the simple linear regression model are met. Thus, the output from our model is
reliable.