Python Sklearn Linear Regression
This tutorial is a part of Zero to Data Science Bootcamp by Jovian and Machine Learning with Python: Zero to GBMs.
Jupyter Notebooks: This tutorial is a Jupyter notebook - a document made of cells. Each cell can contain
code written in Python or explanations in plain English. You can execute code cells and view the results,
e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful
platform for experimentation and analysis. Don't be afraid to mess around with the code & break things -
you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output"
menu option to clear all outputs and start again from the top.
Problem Statement
This tutorial takes a practical and coding-focused approach. We'll define the terms machine learning and linear
regression in the context of a problem, and later generalize their definitions. We'll work through a typical machine
learning problem step-by-step:
QUESTION: ACME Insurance Inc. offers affordable health insurance to thousands of customers all over
the United States. As the lead data scientist at ACME, you're tasked with creating an automated system
to estimate the annual medical expenditure for new customers, using information such as their age,
sex, BMI, children, smoking habits and region of residence.
Estimates from your system will be used to determine the annual insurance premium (amount paid every
month) offered to the customer. Due to regulatory requirements, you must be able to explain why your
system outputs a certain prediction.
You're given a CSV file containing verified historical data, consisting of the aforementioned information
and the actual medical charges incurred by over 1300 customers.
EXERCISE: Before proceeding further, take a moment to think about how you can approach this problem. List five or
more ideas that come to your mind below:
1. ???
2. ???
3. ???
4. ???
5. ???
Downloading the Data
To begin, let's download the data using the urlretrieve function from urllib.request .
from urllib.request import urlretrieve

medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/d
urlretrieve(medical_charges_url, 'medical.csv')
We can now create a Pandas dataframe using the downloaded file, to view and analyze the data.
import pandas as pd
medical_df = pd.read_csv('medical.csv')
medical_df
The dataset contains 1338 rows and 7 columns. Each row of the dataset contains information about one
customer.
Our objective is to find a way to estimate the value in the "charges" column using the values in the other columns. If
we can do so for the historical data, then we should be able to estimate charges for new customers too, simply by
asking for information like their age, sex, BMI, no. of children, smoking habits and region.
medical_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Looks like "age", "children", "bmi" (body mass index) and "charges" are numbers, whereas "sex", "smoker" and
"region" are strings (possibly categories). None of the columns contain any missing values, which saves us a fair
bit of work!
medical_df.describe()
The ranges of values in the numerical columns seem reasonable too (no negative ages!), so we may not have to do
much data cleaning or correction. The "charges" column seems to be significantly skewed, however, as the median
(50th percentile) is much lower than the maximum value.
EXERCISE: What other inferences can you draw by looking at the table above? Add your inferences
below:
1. ???
2. ???
3. ???
4. ???
5. ???
import jovian
jovian.commit()
'https://jovian.ai/aakashns/python-sklearn-linear-regression'
We'll use the libraries Matplotlib, Seaborn and Plotly for visualization. Follow these tutorials to learn how to use
them:
https://jovian.ai/aakashns/python-matplotlib-data-visualization
https://jovian.ai/aakashns/interactive-visualization-plotly
https://jovian.ai/aakashns/dataviz-cheatsheet
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The following settings will improve the default style and font sizes for our charts.
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Age
Age is a numeric column. The minimum age in the dataset is 18 and the maximum age is 64. Thus, we can visualize
the distribution of age using a histogram with 47 bins (one for each year) and a box plot. We'll use plotly to make
the chart interactive, but you can create similar charts using Seaborn.
medical_df.age.describe()
count 1338.000000
mean 39.207025
std 14.049960
min 18.000000
25% 27.000000
50% 39.000000
75% 51.000000
max 64.000000
Name: age, dtype: float64
fig = px.histogram(medical_df,
x='age',
marginal='box',
nbins=47,
title='Distribution of Age')
fig.update_layout(bargap=0.1)
fig.show()
The distribution of ages in the dataset is almost uniform, with 20-30 customers at every age, except for the ages
18 and 19, which seem to have over twice as many customers as other ages. The uniform distribution might arise
from the fact that there isn't a big variation in the number of people of any given age (between 18 & 64) in the USA.
EXERCISE: Can you explain why there are over twice as many customers with ages 18 and 19, compared
to other ages?
???
Body Mass Index
fig = px.histogram(medical_df,
x='bmi',
marginal='box',
color_discrete_sequence=['red'],
title='Distribution of BMI (Body Mass Index)')
fig.update_layout(bargap=0.1)
fig.show()
The measurements of body mass index seem to form a Gaussian distribution centered around the value 30, with a
few outliers towards the right. Here's how BMI values are typically interpreted (source): a BMI under 18.5 is
considered underweight, 18.5 to 24.9 healthy, 25 to 29.9 overweight, and 30 or above obese.
EXERCISE: Can you explain why the distribution of ages forms a uniform distribution while the
distribution of BMIs forms a Gaussian distribution?
???
Charges
Let's visualize the distribution of "charges" i.e. the annual medical charges for customers. This is the column we're
trying to predict. Let's also use the categorical column "smoker" to distinguish the charges for smokers and non-
smokers.
fig = px.histogram(medical_df,
x='charges',
marginal='box',
color='smoker',
color_discrete_sequence=['green', 'grey'],
title='Annual Medical Charges')
fig.update_layout(bargap=0.1)
fig.show()
We can make the following observations from the above graph:
For most customers, the annual medical charges are under $10,000. Only a small fraction of customers have
higher medical expenses, possibly due to accidents, major illnesses and genetic diseases. The distribution
follows a "power law".
There is a significant difference in medical expenses between smokers and non-smokers. While the median for
non-smokers is $7300, the median for smokers is close to $35,000.
EXERCISE: Visualize the distribution of medical charges in connection with other factors like "sex" and
"region". What do you observe?
Smoker
Let's visualize the distribution of the "smoker" column (containing values "yes" and "no") using a histogram.
medical_df.smoker.value_counts()
no 1064
yes 274
Name: smoker, dtype: int64
It appears that 20% of customers have reported that they smoke. Can you verify whether this matches the
national average, assuming the data was collected in 2010? We can also see that smoking appears to be a more
common habit among males. Can you verify this?
EXERCISE: Visualize the distributions of the "sex", "region" and "children" columns and report your
observations.
Having looked at individual columns, we can now visualize the relationship between "charges" (the value we wish
to predict) and other columns.
Age and Charges
Let's visualize the relationship between "age" and "charges" using a scatter plot. Each point in the scatter plot
represents one customer. We'll also use values in the "smoker" column to color the points.
fig = px.scatter(medical_df,
x='age',
y='charges',
color='smoker',
opacity=0.8,
hover_data=['sex'],
title='Age vs. Charges')
fig.update_traces(marker_size=5)
fig.show()
The general trend seems to be that medical charges increase with age, as we might expect. However, there is
significant variation at every age, and it's clear that age alone cannot be used to accurately determine medical
charges.
We can see three "clusters" of points, each of which seems to form a line with an increasing slope:
1. The first and the largest cluster consists primarily of presumably "healthy non-smokers" who have
relatively low medical charges compared to others.
2. The second cluster contains a mix of smokers and non-smokers. It's possible that these are actually two
distinct but overlapping clusters: "non-smokers with medical issues" and "smokers without major medical
issues".
3. The final cluster consists exclusively of smokers, presumably smokers with major medical issues that are
possibly related to or worsened by smoking.
EXERCISE: What other inferences can you draw from the above chart?
???
fig = px.scatter(medical_df,
x='bmi',
y='charges',
color='smoker',
opacity=0.8,
hover_data=['sex'],
title='BMI vs. Charges')
fig.update_traces(marker_size=5)
fig.show()
It appears that for non-smokers, an increase in BMI doesn't seem to be related to an increase in medical charges.
However, medical charges seem to be significantly higher for smokers with a BMI greater than 30.
What other insights can you gather from the above graph?
EXERCISE: Create some more graphs to visualize how the "charges" column is related to other columns
("children", "sex", "region" and "smoker"). Summarize the insights gathered from these graphs.
Correlation
As you can tell from the analysis, the values in some columns are more closely related to the values in "charges"
than others. E.g. "age" and "charges" seem to grow together, whereas "bmi" and "charges" don't.
This relationship is often expressed numerically using a measure called the correlation coefficient, which can be
computed using the .corr method of a Pandas series.
medical_df.charges.corr(medical_df.age)
0.2990081933306476
medical_df.charges.corr(medical_df.bmi)
0.19834096883362895
To compute the correlation for categorical columns, they must first be converted into numeric columns.
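For instance, here's one way to produce the correlation value shown below (a minimal sketch; mapping "no" to 0 and "yes" to 1 is the natural encoding for a binary column):
# Convert the "smoker" column to numbers and compute the correlation with "charges"
smoker_numeric = medical_df.smoker.map({'no': 0, 'yes': 1})
medical_df.charges.corr(smoker_numeric)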
0.787251430498478
Strength: The greater the absolute value of the correlation coefficient, the stronger the relationship.
The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is
accompanied by a perfectly consistent change in the other. For these relationships, all of the data points
fall on a line. In practice, you won't see either type of perfect relationship.
A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in
the other variable to either increase or decrease.
When the value is in between 0 and +1/-1, there is a relationship, but the points don't all fall on a line. As r
approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a
line.
Direction: The sign of the correlation coefficient represents the direction of the relationship.
Positive coefficients indicate that when the value of one variable increases, the value of the other variable
also tends to increase. Positive relationships produce an upward slope on a scatterplot.
Negative coefficients indicate that when the value of one variable increases, the value of the other
variable tends to decrease. Negative relationships produce a downward slope.
Pandas dataframes also provide a .corr method to compute the correlation coefficients between all pairs of
numeric columns.
medical_df.corr()
The result of .corr is called a correlation matrix and is often visualized using a heatmap.
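Here's a minimal sketch of drawing such a heatmap with Seaborn (the color map choice is ours):
# Visualize the correlation matrix as a heatmap
sns.heatmap(medical_df.corr(), cmap='Reds', annot=True)
plt.title('Correlation Matrix');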
While this may seem obvious, computers can't differentiate between correlation and causation, and decisions
based on automated systems can often have major consequences for society, so it's important to study why
automated systems lead to a given result. Determining cause-effect relationships requires human insight.
jovian.commit()
'https://jovian.ai/aakashns/python-sklearn-linear-regression'
Apart from a few exceptions, the points seem to form a line. We'll try to "fit" a line using these points, and use the
line to predict charges for a given age. A line on the X-Y plane has the following formula:
y = wx + b
The line is characterized by two numbers: w (called "slope") and b (called "intercept").
Model
In the above case, the x axis shows "age" and the y axis shows "charges". Thus, we're assuming the following
relationship between the two:
charges = w × age + b
We'll try to determine w and b for the line that best fits the data.
This technique is called linear regression, and we call the above equation a linear regression model, because it
models the relationship between "age" and "charges" as a straight line.
The numbers w and b are called the parameters or weights of the model.
The values in the "age" column of the dataset are called the inputs to the model and the values in the charges
column are called "targets".
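The code below works with just the subset of non-smokers, whose points form the line discussed above. Here's a minimal sketch of creating the non_smoker_df dataframe used throughout this section:
# Select only the rows where the customer is a non-smoker
non_smoker_df = medical_df[medical_df.smoker == 'no']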
def estimate_charges(age, w, b):
    return w * age + b
w = 50
b = 100
ages = non_smoker_df.age
estimated_charges = estimate_charges(ages, w, b)
We can overlay this line on the actual data, to see how well our model fits the data.
target = non_smoker_df.charges
def try_parameters(w, b):
    ages = non_smoker_df.age
    target = non_smoker_df.charges
    estimated_charges = estimate_charges(ages, w, b)
    # Overlay the estimated line on a scatter plot of the actual data
    plt.plot(ages, estimated_charges, 'r', alpha=0.9)
    plt.scatter(ages, target, s=8, alpha=0.8)
    plt.legend(['Estimate', 'Actual'])
try_parameters(60, 200)
try_parameters(400, 5000)
EXERCISE: Try various values of w and b to find a line that best fits the data. What is the effect of
changing the value of w? What is the effect of changing b?
As we change the values of w and b manually, trying to move the line visually closer to the points, we are learning
the approximate relationship between "age" and "charges".
Wouldn't it be nice if a computer could try several different values of w and b and learn the relationship between
"age" and "charges"? To do this, we need to solve a couple of problems:
1. We need a way to measure numerically how well the line fits the points.
2. Once the "measure of fit" has been computed, we need a way to modify w and b to improve the fit.
If we can solve the above problems, it should be possible for a computer to determine w and b for the best fit
line, starting from a random guess.
Loss/Cost Function
We can compare our model's predictions with the actual targets using the following method:
Calculate the difference between the targets and predictions (the difference is called the "residual").
Square all elements of the difference matrix to remove negative values.
Calculate the average of the elements in the resulting matrix.
Take the square root of the result.
The result is a single number, known as the root mean squared error (RMSE). The above description can be stated
mathematically as follows:
RMSE = sqrt(mean((targets - predictions)²))
import numpy as np
def rmse(targets, predictions):
    return np.sqrt(np.mean(np.square(targets - predictions)))
Let's compute the RMSE for our model with a sample set of weights:
w = 50
b = 100
try_parameters(w, b)
targets = non_smoker_df['charges']
predicted = estimate_charges(non_smoker_df.age, w, b)
rmse(targets, predicted)
8461.949562575493
Here's how we can interpret the above number: On average, each element in the prediction differs from the actual
target by $8461.
The result is called the loss because it indicates how bad the model is at predicting the target variables. It
represents information loss in the model: the lower the loss, the better the model.
Let's modify the try_parameters function to also display the loss.
def try_parameters(w, b):
    ages = non_smoker_df.age
    target = non_smoker_df.charges
    predictions = estimate_charges(ages, w, b)
    plt.plot(ages, predictions, 'r', alpha=0.9)
    plt.scatter(ages, target, s=8, alpha=0.8)
    plt.legend(['Prediction', 'Actual'])
    # Compute and display the loss
    loss = rmse(target, predictions)
    print("RMSE Loss:", loss)
try_parameters(50, 100)
EXERCISE: Try different values of w and b to minimize the RMSE loss. What's the lowest value of loss you
are able to achieve? Can you come up with a general strategy for finding better values of w and b by trial
and error?
Optimizer
Next, we need a strategy to modify the weights w and b to reduce the loss and improve the "fit" of the line to the
data.
Two common strategies are ordinary least squares and stochastic gradient descent. Both have the same objective:
to minimize the loss. However, while ordinary least squares directly computes the best values for w and b using
matrix operations, gradient descent uses an iterative approach, starting with random values of w and b and slowly
improving them using derivatives.
Doesn't it look similar to our own strategy of gradually moving the line closer to the points?
Let's use the LinearRegression class from scikit-learn to find the best fit line for "age" vs. "charges" using
the ordinary least squares optimization technique.
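Here's how a model object can be created (standard scikit-learn usage):
from sklearn.linear_model import LinearRegression

# Create a new (untrained) linear regression model
model = LinearRegression()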
Next, we can use the fit method of the model to find the best fit line for the inputs and targets.
help(model.fit)
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training data
.. versionadded:: 0.17
parameter *sample_weight* support to LinearRegression.
Returns
-------
self : returns an instance of self.
Note that the input X must be a 2-d array, so we'll need to pass a dataframe, instead of a single column.
inputs = non_smoker_df[['age']]
targets = non_smoker_df.charges
print('inputs.shape :', inputs.shape)
print('targets.shape :', targets.shape)
inputs.shape : (1064, 1)
targets.shape : (1064,)
model.fit(inputs, targets)
LinearRegression()
We can now make predictions using the model. Let's try predicting the charges for the ages 23, 37 and 61:
model.predict(np.array([[23],
[37],
[61]]))
Do these values seem reasonable? Compare them with the scatter plot above.
predictions = model.predict(inputs)
predictions
rmse(targets, predictions)
4662.505766636395
Seems like our prediction is off by around $4,600 on average, which is not too bad considering the fact that there
are several outliers.
The parameters of the model are stored in the coef_ and intercept_ properties.
# w
model.coef_
array([267.24891283])
# b
model.intercept_
-2091.4205565650864
try_parameters(model.coef_, model.intercept_)
EXERCISE: Use the SGDRegressor class from scikit-learn to train a model using the stochastic
gradient descent technique. Make predictions and compute the loss. Do you see any difference in the
result?
EXERCISE: Repeat the steps in this section to train a linear regression model to estimate medical charges
for smokers. Visualize the targets and predictions, and compute the loss.
Machine Learning
Congratulations, you've just trained your first machine learning model! Machine learning is simply the process of
computing the best parameters to model the relationship between features and targets. Every machine learning
problem has three components:
1. Model
2. Cost Function
3. Optimizer
We'll look at several examples of each of the above in future tutorials. Here's how the relationship between these
three components can be visualized:
As we've seen above, it takes just a few lines of code to train a machine learning model using scikit-learn .
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 4662.505766636395
jovian.commit()
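Let's now train a model using both "age" and "bmi" as inputs. Here's a minimal sketch of the training step, assuming the non_smoker_df, rmse and LinearRegression setup from earlier:
# Create inputs with two features and train the model
inputs, targets = non_smoker_df[['age', 'bmi']], non_smoker_df.charges
model = LinearRegression().fit(inputs, targets)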
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 4662.3128354612945
As you can see, adding BMI doesn't seem to reduce the loss by much, as BMI has a very weak correlation with
charges, especially for non-smokers.
non_smoker_df.charges.corr(non_smoker_df.bmi)
0.08403654312833268
You can see that it's harder to interpret a 3D scatter plot compared to a 2D scatter plot. As we add more features,
it becomes impossible to visualize all features at once, which is why we use measures like correlation and loss.
model.coef_, model.intercept_
(array([266.87657817, 7.07547666]), -2293.6320906488727)
Clearly, BMI has a much lower weightage, and you can see why: it has a tiny contribution, and even that is
probably accidental. This is an important thing to keep in mind: you can't find a relationship that doesn't exist, no
matter what machine learning technique or optimization algorithm you apply.
EXERCISE: Train a linear regression model to estimate charges using BMI alone. Do you expect it to be
better or worse than the previously trained models?
Let's go one step further and add the final numeric column: "children", which seems to have some correlation with
"charges".
non_smoker_df.charges.corr(non_smoker_df.children)
0.13892870453542194
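Here's a sketch of the training step with all three numeric features, under the same assumptions as before:
# Create inputs with three features and train the model
inputs, targets = non_smoker_df[['age', 'bmi', 'children']], non_smoker_df.charges
model = LinearRegression().fit(inputs, targets)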
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 4608.470405038246
Once again, we don't see a big reduction in the loss, even though the correlation of "children" with "charges" is greater than that of BMI.
EXERCISE: Repeat the steps in this section to train a linear regression model to estimate medical charges
for smokers. Visualize the targets and predictions, and compute the loss.
EXERCISE: Repeat the steps in this section to train a linear regression model to estimate medical charges
for all customers. Visualize the targets and predictions, and compute the loss. Is the loss lower or
higher?
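Here's a sketch of training on all customers (smokers included) using just the numeric columns:
# Train on the full dataset with the numeric columns only
inputs, targets = medical_df[['age', 'bmi', 'children']], medical_df.charges
model = LinearRegression().fit(inputs, targets)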
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 11355.317901125973
jovian.commit()
'https://jovian.ai/aakashns/python-sklearn-linear-regression'
To use the categorical columns, we simply need to convert them to numbers. There are three common techniques
for doing this:
1. If a categorical column has just two categories (it's called a binary category), then we can replace their values
with 0 and 1.
2. If a categorical column has more than 2 categories, we can perform one-hot encoding i.e. create a new
column for each category with 1s and 0s.
3. If the categories have a natural order (e.g. cold, neutral, warm, hot), then they can be converted to numbers
(e.g. 1, 2, 3, 4) preserving the order. These are called ordinals.
Binary Categories
The "smoker" category has just two values "yes" and "no". Let's create a new column "smoker_code" containing 0 for
"no" and 1 for "yes".
medical_df.charges.corr(medical_df.smoker_code)
0.787251430498478
medical_df
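Here's a sketch of retraining the model with smoker_code added to the numeric inputs:
# Train using the numeric columns plus the new smoker_code column
inputs = medical_df[['age', 'bmi', 'children', 'smoker_code']]
targets = medical_df.charges
model = LinearRegression().fit(inputs, targets)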
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 6056.439217188081
The loss reduces from 11355 to 6056, by almost 50%! This is an important lesson: never ignore categorical
data.
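Let's also check whether the "sex" column has any relationship with "charges". Here's a sketch using a Seaborn bar plot, which produces the axes object shown below:
sns.barplot(data=medical_df, x='sex', y='charges')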
<AxesSubplot:xlabel='sex', ylabel='charges'>
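The sex_codes mapping used below can be defined as follows (which value maps to 0 and which to 1 is an assumption):
sex_codes = {'female': 0, 'male': 1}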
medical_df['sex_code'] = medical_df.sex.map(sex_codes)
medical_df.charges.corr(medical_df.sex_code)
0.057292062202025484
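Here's a sketch of retraining with sex_code included in the inputs:
inputs = medical_df[['age', 'bmi', 'children', 'smoker_code', 'sex_code']]
targets = medical_df.charges
model = LinearRegression().fit(inputs, targets)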
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 6056.100708754546
As you might expect, this doesn't have a significant impact on the loss.
One-hot Encoding
The "region" column contains 4 values, so we'll need to use hot encoding and create a new column for each region.
one_hot = enc.transform(medical_df[['region']]).toarray()
one_hot
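The one-hot columns can then be added to the dataframe. The column order below follows enc.categories_, which lists the regions alphabetically:
medical_df[['northeast', 'northwest', 'southeast', 'southwest']] = one_hot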
medical_df
age | sex | bmi | children | smoker | region | charges | smoker_code | sex_code | northeast | northwest | southeast | southwest
... (1338 rows × 13 columns)
Let's include the region columns into our linear regression model.
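Here's a sketch of the training step with all the columns included (this also defines the input_cols list used later in this tutorial):
input_cols = ['age', 'bmi', 'children', 'smoker_code', 'sex_code',
              'northeast', 'northwest', 'southeast', 'southwest']
inputs, targets = medical_df[input_cols], medical_df.charges
model = LinearRegression().fit(inputs, targets)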
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 6041.6796511744515
EXERCISE: Are two separate linear regression models, one for smokers and one for non-smokers, better
than a single linear regression model? Why or why not? Try it out and see if you can justify your answer
with data.
jovian.commit()
'https://jovian.ai/aakashns/python-sklearn-linear-regression'
Model Improvements
Let's discuss and apply some more improvements to our model.
Feature Scaling
Recall that due to regulatory requirements, we also need to explain the rationale behind the predictions of our
model. To compare the importance of each feature in the model, our first instinct might be to compare their
weights.
model.coef_
model.intercept_
-12525.547811195444
weights_df = pd.DataFrame({
'feature': np.append(input_cols, 1),
'weight': np.append(model.coef_, model.intercept_)
})
weights_df
feature weight
0 age 256.856353
1 bmi 339.193454
2 children 475.500545
3 smoker_code 23848.534542
4 sex_code -131.314359
5 northeast 587.009235
6 northwest 234.045336
7 southeast -448.012814
8 southwest -373.041756
9 1 -12525.547811
While it seems like BMI and the "northeast" have a higher weight than age, keep in mind that the range of values for
BMI is limited (15 to 40) and the "northeast" column only takes the values 0 and 1.
Because different columns have different ranges, we run into two issues:
1. We can't compare the weights of different columns to identify which features are important.
2. A column with a larger range of inputs may disproportionately affect the loss and dominate the optimization
process.
For this reason, it's common practice to scale (or standardize) the values in numeric columns by subtracting the
mean and dividing by the standard deviation.
medical_df
age | sex | bmi | children | smoker | region | charges | smoker_code | sex_code | northeast | northwest | southeast | southwest
... (1338 rows × 13 columns)
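We can use the StandardScaler class from scikit-learn; here's a minimal sketch that produces the scaler used below:
from sklearn.preprocessing import StandardScaler

# Learn the mean and variance of each numeric column
numeric_cols = ['age', 'bmi', 'children']
scaler = StandardScaler()
scaler.fit(medical_df[numeric_cols])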
StandardScaler()
scaler.mean_
scaler.var_
scaled_inputs = scaler.transform(medical_df[numeric_cols])
scaled_inputs
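Here's a sketch of combining the scaled numeric features with the binary and one-hot columns, and retraining the model:
# Combine scaled numeric features with the categorical/binary columns
cat_cols = ['smoker_code', 'sex_code', 'northeast', 'northwest', 'southeast', 'southwest']
inputs = np.concatenate((scaled_inputs, medical_df[cat_cols].values), axis=1)
targets = medical_df.charges
model = LinearRegression().fit(inputs, targets)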
# Generate predictions and compute the loss
predictions = model.predict(inputs)
print('Loss:', rmse(targets, predictions))

Loss: 6041.6796511744515
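Here's a sketch of recomputing the weights dataframe for the retrained model and sorting it to produce the table below:
# Collect the weights (and the intercept, labelled 1) and sort by weight
weights_df = pd.DataFrame({
    'feature': np.append(numeric_cols + cat_cols, 1),
    'weight': np.append(model.coef_, model.intercept_)
})
weights_df.sort_values('weight', ascending=False)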
feature weight
3 smoker_code 23848.534542
9 1 8466.483215
0 age 3607.472736
1 bmi 2067.691966
5 northeast 587.009235
2 children 572.998210
6 northwest 234.045336
4 sex_code -131.314359
8 southwest -373.041756
7 southeast -448.012814
As you can see, the most important features are:
1. Smoker
2. Age
3. BMI
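To check how well the model generalizes, we can set aside a portion of the data as a test set before training. Here's a minimal sketch using train_test_split (the 10% split fraction is an assumption):
from sklearn.model_selection import train_test_split

# Hold out 10% of the data for evaluation and train on the rest
inputs_train, inputs_test, targets_train, targets_test = train_test_split(
    inputs, targets, test_size=0.1)
model = LinearRegression().fit(inputs_train, targets_train)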
# Generate predictions and compute the loss on the test set
predictions_test = model.predict(inputs_test)
rmse(targets_test, predictions_test)

# Generate predictions and compute the loss on the training set
predictions_train = model.predict(inputs_train)
rmse(targets_train, predictions_train)
Can you explain why the training loss is lower than the test loss? We'll discuss this in a lot more detail in future
tutorials.
Remember to pick the right model, loss functions and optimizer for the problem at hand.
jovian.commit()
Here are some datasets you can try applying linear regression to:
https://www.kaggle.com/vikrishnan/boston-house-prices
https://www.kaggle.com/sohier/calco
https://www.kaggle.com/budincsevity/szeged-weather
Check out the following links to learn more about linear regression:
https://jovian.ai/aakashns/02-linear-regression
https://www.kaggle.com/hely333/eda-regression
https://www.youtube.com/watch?v=kHwlB_j7Hkc
Revision Questions
1. Why do we have to perform EDA before fitting a model to the data?
2. What is a parameter?
3. What is correlation?
6. What is causation? Explain difference between correlation and causation with an example.
7. Define Linear Regression.
15. What is an Optimizer? What are different types of optimizers? Explain each with an example.
25. How does loss value help in determining whether the model is good or not?
27. How do we handle categorical variables in Machine Learning? What are the common techniques?
32. How do we split data for model tting (training and testing) in Python?