How to Perform Simple Linear Regression in Python (Step-by-Step)
Simple linear regression is a technique that we can use to understand the relationship between a
single explanatory variable and a single response variable.
This technique finds a line that best “fits” the data and takes on the following form:
ŷ = b0 + b1x
where:
ŷ: the predicted value of the response variable
b0: the intercept (the predicted value of y when x is 0)
b1: the slope (the average change in y for a one-unit increase in x)
x: the value of the explanatory variable
This equation can help us understand the relationship between the explanatory and response
variable, and (assuming it’s statistically significant) it can be used to predict the value of a
response variable given the value of the explanatory variable.
This tutorial provides a step-by-step explanation of how to perform simple linear regression in
Python.
Step 1: Create the Data
For this example, we’ll create a fake dataset that contains the following two variables for 15
students:
hours: the total number of hours studied
score: the exam score received
We’ll attempt to fit a simple linear regression model using hours as the explanatory variable
and exam score as the response variable.
The following code shows how to create this fake dataset in Python:
import pandas as pd
#create dataset
df = pd.DataFrame({'hours': [1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11,
12, 12, 14],
'score': [64, 66, 76, 73, 74, 81, 83, 82, 80,
88, 84, 82, 91, 93, 89]})
#view first six rows of dataset
df[0:6]
hours score
0 1 64
1 2 66
2 4 76
3 5 73
4 5 74
5 6 81
Step 2: Visualize the Data
Before we fit a simple linear regression model, we should first visualize the data to gain an
understanding of it.
First, we want to make sure that the relationship between hours and score is roughly linear, since
that is an underlying assumption of simple linear regression.
We can create a simple scatterplot to view the relationship between the two variables:
import matplotlib.pyplot as plt

#create scatterplot of hours studied vs. exam score
plt.scatter(df.hours, df.score)
plt.title('Hours studied vs. Exam Score')
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()
From the plot we can see that the relationship does appear to be linear. As hours increases, score
tends to increase as well in a linear fashion.
Next, we can create a boxplot to visualize the distribution of exam scores and check for outliers.
By default, a boxplot flags an observation as an outlier if it lies more than 1.5 times the
interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range
below the first quartile (Q1).
df.boxplot(column=['score'])
There are no tiny circles in the boxplot, which means there are no outliers in our dataset.
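If we wanted to double-check this numerically, we could compute the 1.5 × IQR fences directly (a minimal sketch; it assumes the df created above and pandas’ default quantile interpolation):

#compute quartiles and interquartile range of the exam scores
q1 = df['score'].quantile(0.25)
q3 = df['score'].quantile(0.75)
iqr = q3 - q1

#flag any scores outside the 1.5 * IQR fences
outliers = df[(df['score'] < q1 - 1.5 * iqr) | (df['score'] > q3 + 1.5 * iqr)]
print(outliers)   #empty DataFrame, so no outliers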
Step 3: Perform Simple Linear Regression
Once we’ve confirmed that the relationship between our variables is linear and that there are no
outliers present, we can proceed to fit a simple linear regression model using hours as the
explanatory variable and score as the response variable:
Note: We’ll use the OLS() function from the statsmodels library to fit the regression model.
import statsmodels.api as sm
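Building on the dataset from Step 1 and the import above, the fitting step might look something like this (a minimal sketch; the variable names y and x are simply illustrative):

#define response variable
y = df['score']

#define explanatory variable
x = df[['hours']]

#add a constant column so the model estimates an intercept
x = sm.add_constant(x)

#fit the simple linear regression model with OLS
model = sm.OLS(y, x).fit()

#view the model summary
print(model.summary())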
From the model summary we can see that the fitted regression equation is:
Exam score = 65.334 + 1.9824*(hours)
This means that each additional hour studied is associated with an average increase in exam
score of 1.9824 points. And the intercept value of 65.334 tells us the average expected exam
score for a student who studies zero hours.
We can also use this equation to find the expected exam score based on the number of hours that
a student studies. For example, a student who studies for 10 hours is expected to receive an exam
score of 85.158:
Exam score = 65.334 + 1.9824*(10) = 85.158
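The same prediction could also be obtained from the fitted model object rather than by hand (a sketch; the 'const' column matches the constant added in the fitting sketch in Step 3):

#predict the exam score for a student who studies 10 hours
new_x = pd.DataFrame({'const': [1.0], 'hours': [10]})
print(model.predict(new_x))   #about 85.16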
Here is how to interpret a few other key values from the model summary:
P>|t|: This is the p-value associated with the model coefficients. Since the p-value
for hours (0.000) is well below .05, we can say that there is a statistically
significant association between hours and score.
R-squared: This number tells us the percentage of the variation in the exam scores that can
be explained by the number of hours studied. In general, the larger the R-squared value of
a regression model, the better the explanatory variables are able to predict the value of the
response variable. In this case, 83.1% of the variation in scores can be explained by hours
studied.
F-statistic & p-value: The F-statistic (63.91) and the corresponding p-value (2.25e-06)
tell us the overall significance of the regression model, i.e. whether explanatory variables
in the model are useful for explaining the variation in the response variable. Since the p-
value in this example is less than .05, our model is statistically significant and hours is
deemed to be useful for explaining the variation in score.
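These values don’t have to be read off the printed summary; they are also available as attributes of the fitted results object (a quick sketch, assuming the model from Step 3):

#p-values for the intercept and hours coefficients
print(model.pvalues)

#R-squared of the model
print(model.rsquared)

#F-statistic and its p-value
print(model.fvalue)
print(model.f_pvalue)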
Step 4: Create Residual Plots
After we’ve fit the simple linear regression model to the data, the last step is to create residual
plots.
One of the key assumptions of linear regression is that the residuals of a regression model are
roughly normally distributed and are homoscedastic at each level of the explanatory variable. If
these assumptions are violated, then the results of our regression model could be misleading or
unreliable.
To verify that these assumptions are met, we can create the following residual plots:
Residual vs. fitted values plot: This plot is useful for confirming homoscedasticity. The x-axis
displays the fitted values and the y-axis displays the residuals. As long as the residuals appear to
be randomly and evenly distributed throughout the chart around the value zero, we can assume
that homoscedasticity is not violated:
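One simple way to build this plot is to scatter the model’s residuals against its fitted values (a sketch using matplotlib; the fittedvalues and resid attributes come from the fitted statsmodels results object):

import matplotlib.pyplot as plt

#plot residuals against fitted values
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()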
Since the residuals appear to be randomly scattered around zero, this is an indication that
heteroscedasticity is not a problem in this model.
Q-Q plot: This plot is useful for determining if the residuals follow a normal distribution. If the
data values in the plot fall along a roughly straight line at a 45-degree angle, then the data is
normally distributed:
#define residuals
res = model.resid
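From these residuals, the Q-Q plot itself can be drawn with statsmodels’ qqplot helper (a sketch; line='45' adds the 45-degree reference line mentioned above):

#create Q-Q plot of the residuals
fig = sm.qqplot(res, fit=True, line='45')
plt.show()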
Since the residuals are normally distributed and homoscedastic, we’ve verified that the
assumptions of the simple linear regression model are met. Thus, the output from our model is
reliable.